The following description relates to an earbud and a related method that support voice activity detection (VAD) by eliminating noise and malfunctions due to noise.
An earbud is a device that is connected, by wire or wirelessly, to various types of electronic devices, such as portable media players, smartphones, tablet computers, laptop computers, and stereo systems, to provide sound output from a corresponding electronic device to a user.
A wired earbud includes one or more small speakers positioned above, inside, or near the ears of a user, structural components that hold the speakers in position, and a cable that electrically connects the earbud to an electronic device. A wireless earbud may be a wireless device that does not include a cable but instead wirelessly receives a stream of audio data from a wireless sound source.
An object of the present disclosure is to provide an earbud, a head mounted display (HMD), and a related method that perform voice activity detection (VAD) by simultaneously using a voice pick up (VPU) sensor and an earbud internal microphone.
According to an embodiment, an earbud for supporting voice activity detection (VAD) includes a first filter unit configured to filter a first signal input through a microphone, a first VAD unit configured to perform VAD on a signal passing through the first filter unit, a second filter unit configured to filter a second signal input through a bone conduction voice pick up (VPU) sensor, a second VAD unit configured to perform VAD on a signal passing through the second filter unit, and a determination unit configured to compare a detection result of the first VAD unit and a detection result of the second VAD unit to determine whether there is utterance.
According to an embodiment, a method of determining voice activity detection (VAD) includes filtering a first signal input through a microphone, performing VAD on the filtered first signal, filtering a second signal input through a bone conduction voice pick up (VPU) sensor, performing VAD on the filtered second signal, and comparing a VAD detection result related to the first signal and a VAD detection result related to the second signal to determine whether there is utterance.
The first VAD unit and the second VAD unit may perform VAD simultaneously.
The detection result of the first VAD unit and the detection result of the second VAD unit may be either detection of utterance or non-detection of utterance, and the determination unit may determine that there is utterance when both the detection result of the first VAD unit and the detection result of the second VAD unit are detection of utterance.
The first filter unit and the second filter unit may each include a high pass filter (HPF).
The microphone may be muted based on the determination unit determining that there is utterance.
Based on the determination unit determining that there is utterance, content being played on the earbud may be stopped.
A volume of the earbud may be lowered to a preset level based on the determination unit determining that there is utterance.
The first filter unit, the second filter unit, the first VAD unit, the second VAD unit, and the determination unit may be provided in a digital signal processor (DSP) unit, and the DSP unit may be provided in the earbud.
The microphone and the bone conduction VPU sensor may be provided in the earbud.
The first signal may include a digital signal obtained by passing an analog signal input through the microphone through a first ADC, and the second signal may include a digital signal obtained by passing an analog signal input through the bone conduction VPU sensor through a second ADC.
According to an embodiment, a head mounted display (HMD) for supporting voice activity detection (VAD) includes a display unit configured to provide an image to a user, a wearing unit configured to allow the display unit to be worn on a head of a user, an earbud configured to provide a sound related to the image to the user, a first filter unit configured to filter a first signal input through a microphone, a first VAD unit configured to perform VAD on a signal passing through the first filter unit, a second filter unit configured to filter a second signal input through a bone conduction voice pick up (VPU) sensor, a second VAD unit configured to perform VAD on a signal passing through the second filter unit, and a determination unit configured to compare a detection result of the first VAD unit and a detection result of the second VAD unit to determine whether there is utterance.
The first filter unit, the second filter unit, the first VAD unit, the second VAD unit, and the determination unit may be provided in a DSP unit, and the DSP unit may be provided in either the HMD or the earbud.
The microphone and the bone conduction VPU sensor may be provided in the earbud.
The first VAD unit and the second VAD unit may perform VAD simultaneously.
The detection result of the first VAD unit and the detection result of the second VAD unit may be either detection of utterance or non-detection of utterance, and the determination unit may determine that there is utterance when both the detection result of the first VAD unit and the detection result of the second VAD unit are detection of utterance.
Based on the determination unit determining that there is utterance, content being played on the earbud may be stopped.
The microphone may be muted based on the determination unit determining that there is utterance.
A volume of the earbud may be lowered to a preset level based on the determination unit determining that there is utterance.
According to an embodiment, a method of determining voice activity detection (VAD) includes filtering a first signal input through a microphone, performing VAD on the filtered first signal, filtering a second signal input through a bone conduction voice pick up (VPU) sensor, performing VAD on the filtered second signal, and comparing a VAD detection result related to the first signal and a VAD detection result related to the second signal to determine whether there is utterance.
According to an embodiment, whether there is utterance may be accurately determined by simultaneously using a microphone and a bone conduction voice pick up (VPU) sensor.
User experience may be improved by accurately performing voice activity detection (VAD) to provide, without malfunction, interruption of content being played when utterance is detected, a mute function during a call, and an ambient mode control (ANC) function.
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the principle of the disclosure. In the drawings:
In various examples of the present disclosure, “/” and “,” need to be construed as indicating “and/or”. For example, “A/B” may mean “A and/or B”. Furthermore, “A, B” may mean “A and/or B”. Furthermore, “A/B/C” may mean “at least one of A, B, and/or C”. Furthermore, “A, B, and C” may mean “at least one of A, B and/or C”.
In various examples of the present disclosure, “or” needs to be construed as indicating “and/or”. For example, “A or B” may include “only A”, “only B”, and/or “both A and B”. In other words, “or” needs to be construed as indicating “additionally or alternatively”.
Reference will now be made in detail to the exemplary embodiments of the present disclosure with reference to the accompanying drawings. The detailed description, which will be given below with reference to the accompanying drawings, is intended to explain exemplary embodiments of the present disclosure, rather than to show the only embodiments that may be implemented according to the present disclosure. The following detailed description includes specific details in order to provide a thorough understanding of the present disclosure. However, it will be apparent to those skilled in the art that the present disclosure may be practiced without such specific details.
In some instances, well-known structures and devices are omitted in order to avoid obscuring the concepts of the present disclosure and important functions of the structures and devices are shown in block diagram form.
It should be noted that specific terms disclosed in the present disclosure are proposed for convenience of description and better understanding of the present disclosure, and the use of these specific terms may be changed to other formats within the technical scope or spirit of the present disclosure.
An earbud is generally vulnerable to noisy environments due to the relatively long distance between the mouth and the microphone. In an effort to overcome this, there is a recent trend of installing a bone conduction voice pick up (VPU) sensor, especially in premium products.
The VPU sensor receives the user's voice through bone conduction and thus blocks most external noise, which is very advantageous for voice activity detection (VAD). However, the VPU sensor is sensitive to vibration caused by touching and rubbing of the earbud device and by vigorous movement of the user (e.g., running or stomping), and thus malfunction is easily caused.
To compensate for this shortcoming, according to an embodiment of the present disclosure, a VAD unit/module that receives an input of a VPU sensor and a VAD unit/module that receives an input of an internal microphone may be used simultaneously. Various embodiments of the present disclosure relate to a technology for reducing VAD malfunction by using the fact that the input signals of the VPU sensor and the microphone have similar characteristics for speech but different characteristics for external stimuli, as described in detail below.
Referring to
Here, the first VAD unit and the second VAD unit may perform VAD simultaneously. In other words, the detection results of the first VAD unit and the second VAD unit, which are input to the determination unit, may be detection results determined simultaneously by the first VAD unit and the second VAD unit for the same time period (or within a preset error range). Here, simultaneous may include a preset (allowable) error.
The first signal is a digital signal obtained by passing an analog signal input through the microphone through a first ADC 103, and the second signal is a digital signal obtained by passing an analog signal input through the bone conduction VPU sensor through a second ADC 203. The first filter unit and the second filter unit may each be a high pass filter (HPF). Here, the HPF may remove noise outside the frequency band corresponding to user utterance (e.g., sounds generated by user movement, or sounds generated by contact with the earbud, such as the user rubbing the earbud). For example, a microphone HPF may be set to remove frequencies below 600 Hz, and a VPU HPF may be set to remove frequencies below 100 Hz. However, embodiments of the present disclosure are not necessarily limited to specific frequency values such as 600 Hz and 100 Hz, and other values may be used.
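The high-pass filtering step above can be sketched as follows. This is a minimal illustrative example only, assuming a first-order filter; the disclosure does not specify the filter order or design, and the 600 Hz cutoff is one of the example values mentioned above.

```python
import math

def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order high-pass filter (illustrative sketch; the disclosure
    does not specify a particular filter implementation)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = []
    prev_x = samples[0]
    prev_y = 0.0
    for x in samples:
        # Standard discrete RC high-pass recurrence.
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

# A constant (0 Hz) input, such as a DC offset or slow pressure drift,
# should be rejected, while content above the cutoff passes.
dc = [1.0] * 1000
filtered = high_pass(dc, cutoff_hz=600.0, sample_rate_hz=16000.0)
print(abs(filtered[-1]) < 1e-3)  # steady-state DC is rejected
```

A low cutoff for the VPU path (e.g., 100 Hz, as in the example above) would be obtained by simply passing a different `cutoff_hz`.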
The detection result of the first VAD unit and the detection result of the second VAD unit are either detection of utterance or non-detection of utterance, and the determination unit may determine that there is utterance when both the detection result of the first VAD unit and the detection result of the second VAD unit are detection of utterance.
For example, when the first VAD unit determines the detection result of utterance as true (T), the second VAD unit determines the detection result of utterance as true (T), and the first and second VAD units provide the determination results to the determination unit, both detection results of the two VAD units are detection of utterance, and thus the determination unit may determine whether there is utterance as true (T)/utterance through an AND operation.
In contrast, when either the first VAD unit or the second VAD unit determines the detection result of utterance as false (F) and provides this to the determination unit, not all detection results of utterance are T, and thus the determination unit may determine whether there is utterance as F.
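The combination rule described above (utterance is declared only when both VAD results indicate utterance) can be sketched as a per-frame AND over the two detectors' outputs. This is an illustrative sketch only; frame granularity and representation are assumptions, not specified by the disclosure.

```python
def determine_utterance(mic_vad, vpu_vad):
    """Combine per-frame VAD flags from the microphone path and the
    VPU-sensor path: utterance only when both detectors agree (AND)."""
    return [m and v for m, v in zip(mic_vad, vpu_vad)]

# The mic path may fire on external noise, and the VPU path may fire
# on rubbing/vibration; the AND suppresses both kinds of false alarm.
mic = [True, True, False, True]
vpu = [True, False, False, True]
print(determine_utterance(mic, vpu))  # [True, False, False, True]
```

A practical implementation would also align the two flag streams in time within the preset allowable error mentioned above before combining them.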
This example is shown in
Based on the determination unit determining that there is utterance, content being played on the earbud may be stopped. That is, a function is supported to automatically pause music and turn on an ambient sound listening mode when a user utters while listening to music through earbuds, and to automatically turn off the ambient sound listening mode and resume the music when the user does not say anything for a while. When this function is performed, it is important to accurately determine whether the user utters. If the music is repeatedly stopped because other noises are misrecognized as utterance when the user does not utter, the user has an unpleasant experience and does not trust or use the function.
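The pause-and-resume behavior described above can be sketched as a small state machine driven by per-frame utterance decisions. This is a hedged illustration; the class name, the silence timeout, and the frame-based interface are assumptions for the sketch, not part of the disclosure.

```python
class ContentController:
    """Pauses playback and enables ambient sound mode when utterance is
    detected; resumes playback (and disables ambient mode) after a run
    of `silence_frames` consecutive non-utterance frames."""

    def __init__(self, silence_frames=300):
        self.playing = True
        self.ambient = False
        self.silence = 0
        self.silence_frames = silence_frames

    def on_frame(self, utterance):
        if utterance:
            self.silence = 0
            if self.playing:
                self.playing = False   # pause content
                self.ambient = True    # let outside sound in
        else:
            self.silence += 1
            if not self.playing and self.silence >= self.silence_frames:
                self.playing = True    # resume content
                self.ambient = False

ctrl = ContentController(silence_frames=3)
ctrl.on_frame(True)                 # user speaks
print(ctrl.playing, ctrl.ambient)   # False True
for _ in range(3):
    ctrl.on_frame(False)            # sustained silence
print(ctrl.playing, ctrl.ambient)   # True False
```

The silence timeout is what keeps the music from resuming during natural pauses inside a conversation.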
Regarding the mute function during a call, the microphone may be muted or unmuted based on whether the determination unit determines that there is utterance. This is a function that mutes the microphone when the user does not utter, preventing ambient noise from being transmitted to the other party. As described above, it is also important to accurately determine whether the user utters when this function is performed, which may be achieved through the embodiment described above.
Regarding the ambient mode controller, a volume of the earbud may be lowered to a preset level based on the determination unit determining that there is utterance. This is a function related to an ambient mode control (ANC) function, for example, gradually reducing the volume of content that is currently played when user utterance is detected. This ANC function also degrades the user experience when user utterance is not accurately determined, and thus whether the user utters may be accurately determined according to the embodiment described above.
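The gradual volume reduction toward a preset level can be sketched as a simple ramp. The step size and target level below are illustrative values; the disclosure specifies only that the volume is lowered to a preset level, not how.

```python
def ramp_volume(current, target, step):
    """Lower the volume one step toward a preset target level,
    clamping at the target (illustrative sketch only)."""
    if current - step > target:
        return current - step
    return target

# Ramp from full volume down to a preset level of 0.3 in steps of 0.2.
v = 1.0
levels = []
while v > 0.3:
    v = ramp_volume(v, target=0.3, step=0.2)
    levels.append(round(v, 2))
print(levels)  # [0.8, 0.6, 0.4, 0.3]
```

Calling this once per audio frame while utterance persists yields the "gradually reducing" behavior described above.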
The first filter unit, the second filter unit, the first VAD unit, the second VAD unit, and the determination unit may be provided in a digital signal processor (DSP) unit, and the DSP unit may be provided in the earbud. The microphone and the bone conduction VPU sensor may be provided in the earbud.
According to another embodiment, a head-mounted display (HMD) supporting voice activity detection (VAD) is disclosed. The HMD according to an embodiment may include a display unit that provides an image to a user, a wearing unit that allows the display unit to be worn on the head of the user, an earbud that provides a sound related to the image to the user, a first filter unit that filters a first signal input through a microphone, a first VAD unit that performs VAD on a signal passing through the first filter unit, a second filter unit that filters a second signal input through a bone conduction VPU sensor, a second VAD unit that performs VAD on a signal passing through the second filter unit, and a determination unit that compares the detection result of the first VAD unit and the detection result of the second VAD unit to determine whether there is utterance.
The first filter unit, the second filter unit, the first VAD unit, the second VAD unit, and the determination unit may be provided in a DSP unit as illustrated in
In the above embodiment related to an earbud, each of the units/modules described above needs to be built into the earbud. However, the HMD is relatively bulky and has fewer space constraints, and thus various arrangements other than the built-in arrangement described above may be applied, and combinations of various arrangements of each unit/module are within the scope of the present disclosure.
The detailed description related to VAD of the HMD is replaced with the description of the VAD earbud described above.
A method of determining VAD according to an embodiment may include filtering a first signal input through a microphone (S701), performing VAD on the filtered first signal (S702), filtering a second signal input through a bone conduction VPU sensor (S703), performing VAD on the filtered second signal (S704), and comparing the VAD detection result related to the first signal and the VAD detection result related to the second signal to determine whether there is utterance (S705). The detailed description related to the method of determining whether there is utterance is replaced with the description related to the earbud according to the embodiment described above.
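The method steps S701 to S705 can be sketched end to end as follows. The energy-threshold VAD inside is an assumed placeholder (the disclosure does not specify a VAD algorithm), and the sample rate, frame length, threshold, and cutoff frequencies are illustrative values only.

```python
import math

def high_pass(x, fc, fs):
    """First-order high-pass filter (assumed design, see S701/S703)."""
    rc = 1.0 / (2.0 * math.pi * fc)
    alpha = rc / (rc + 1.0 / fs)
    y, px, py = [], x[0], 0.0
    for s in x:
        py = alpha * (py + s - px)
        px = s
        y.append(py)
    return y

def energy_vad(x, frame, thr):
    """Placeholder per-frame VAD: mean energy above a threshold."""
    return [sum(s * s for s in x[i:i + frame]) / frame > thr
            for i in range(0, len(x) - frame + 1, frame)]

def detect_utterance(mic, vpu, fs=16000, frame=160, thr=0.01):
    mic_f = high_pass(mic, 600.0, fs)              # S701: filter mic signal
    vpu_f = high_pass(vpu, 100.0, fs)              # S703: filter VPU signal
    mic_vad = energy_vad(mic_f, frame, thr)        # S702: VAD on mic path
    vpu_vad = energy_vad(vpu_f, frame, thr)        # S704: VAD on VPU path
    return [m and v for m, v in zip(mic_vad, vpu_vad)]  # S705: compare (AND)

# A 1 kHz tone on both sensors (speech-like, above both cutoffs) is
# detected; pure silence is not.
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(1600)]
silence = [0.0] * 1600
print(all(detect_utterance(tone, tone)))        # True
print(any(detect_utterance(silence, silence)))  # False
```

Inputs that excite only one path, such as low-frequency rubbing on the VPU sensor or distant ambient noise on the microphone, fail the S705 AND and are rejected, which is the malfunction-reduction effect described above.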
Referring to
The communication unit 110 may transmit and receive signals (e.g., media data or control signals) with external devices such as other wireless devices, mobile devices, or media servers. The media data may include video, images, and sound. The controller 120 may perform various operations by controlling components of the XR device 100a. For example, the controller 120 may be configured to control and/or perform procedures such as video/image acquisition, (video/image) encoding, and metadata generation and processing. The memory unit 130 may store data/parameters/programs/codes/commands necessary for driving the XR device 100a/generating an XR object. The input/output unit 140a may obtain control information, data, and the like from the outside and output the generated XR object. The input/output unit 140a may include a camera, a microphone, a user input unit, a display unit, a speaker, and/or a haptic module. The sensor unit 140b may obtain an XR device state, surrounding environment information, user information, and the like. The sensor unit 140b may include a proximity sensor, an illumination sensor, an acceleration sensor, a magnetic sensor, a gyro sensor, an inertial sensor, an RGB sensor, an IR sensor, a fingerprint recognition sensor, an ultrasonic sensor, an optical sensor, a microphone, and/or a radar. The power supply 140c may supply power to the XR device 100a and may include a wired/wireless charging circuit and a battery.
For example, the memory unit 130 of the XR device 100a may include information (e.g., data) necessary for generating an XR object (e.g., AR/VR/MR object). The input/output unit 140a may obtain a command to manipulate the XR device 100a from the user, and the controller 120 may drive the XR device 100a according to the drive command of the user. For example, when the user attempts to watch a movie, news, or the like through the XR device 100a, the controller 120 may transmit content request information to another device (e.g., mobile device 100b) or a media server through the communication unit 110. The communication unit 110 may download/stream content such as movies and news from another device (e.g., mobile device 100b) or a media server to the memory unit 130. The controller 120 may control and/or perform procedures such as video/image acquisition, (video/image) encoding, and metadata generation/processing for the content and generate/output an XR object based on information on a surrounding space or an actual object, obtained through the input/output unit 140a/sensor unit 140b.
The XR device 100a may be wirelessly connected to the mobile device 100b through the communication unit 110, and an operation of the XR device 100a may be controlled by the mobile device 100b. For example, the mobile device 100b may operate as a controller for the XR device 100a. To this end, the XR device 100a may obtain 3D location information of the mobile device 100b and then generate and output an XR object corresponding to the mobile device 100b.
The above description is merely illustrative of the technical idea of the present disclosure. Those of ordinary skill in the art to which the present disclosure pertains will be able to make various modifications and variations without departing from the essential characteristics of the present disclosure.
Therefore, embodiments disclosed in the present disclosure are not intended to limit the technical idea of the present disclosure but to describe it, and the scope of the technical idea of the present disclosure is not limited by such embodiments. The scope of protection of the present disclosure should be interpreted by the claims below, and all technical ideas within a scope equivalent thereto should be construed as being included in the scope of the present disclosure.
The embodiments described above may be applied to various mobile communication systems.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/KR2022/000825 | 1/17/2022 | WO |