The present invention relates to intelligently controlling the audio volume of a robot that can interact with users.
There have been many publications about automatic audio volume control. For example, Johnston describes providing audio compensation within certain frequency bands based on the intensity of background noise. Ding et al. use an ultrasonic ranging device to determine a listener's distance and adjust the audio volume accordingly. The method disclosed herein focuses on an audio volume control method for a robot that is capable of interacting with users through its audio and visual devices. By user herein we mean a person who listens to the robot and may also talk to the robot. A robot that speaks too loudly causes annoyance, while one that speaks too softly is hard to understand. For example, in a crowd the robot may need to speak loudly, whereas in a quiet room the robot can speak softly. In an open space, the appropriate volume depends on the listener's distance: speaking to a person down the hall, the robot may turn up the volume; speaking to a nearby person in the same hall, the robot should be loud enough to be understood but not so loud that other people in the hall are annoyed. The alternatives based on manual audio volume control are less attractive. For example, even given a tool to adjust the robot audio volume manually, users may not be adequately trained and may find the adjustment inconvenient. As another example, while a remote user is videoconferencing through a robot with a user local to the robot, the remote user may not be able to tell whether the robot audio volume is appropriate. In this invention, we present a method that enables automatic audio volume control on the robot based on the local user's environment.
The object of this invention is enabling a robot to intelligently control its audio volume according to the local user's environment.
According to the recommendations from the American National Standards Institute (ANSI) and the Acoustical Society of America (ASA), a speaker's voice should reach a listener at no less than a +15 dB signal to noise ratio for good speech intelligibility. In this invention, when the robot talks to a user, the robot intermittently assesses the user's environment. Specifically, the robot estimates the user's distance from the robot and measures the background noise intensity. The robot increases the audio volume as the background noise intensity increases to maintain the proper signal to noise ratio. Also, audio signals attenuate by 6 dB each time the travelled distance doubles. Therefore, the robot increases its audio volume by 6 dB when the user's distance from the robot is doubled.
There are multiple techniques for a robot to measure a user's distance. A simple one assumes a camera mounted on the robot: assuming a user's head is of a certain size, we can estimate the user's distance from the size of the user's head in an image. The second technique uses a stereo camera on the robot to capture a pair of images of the same user from different angles and applies epipolar geometry calculations. The third technique uses ranging devices such as laser distance meters, sonar distance meters, and radar distance meters.
Background noise generally refers to lower-amplitude noise that persists over a long period, while intermittent noise refers to higher-amplitude noise that lasts only a short time (on the order of seconds). Background noise may undermine the intelligibility of the robot audio output. After estimating the background noise intensity in decibels, the robot may boost its audio volume by the same number of decibels to compensate for the background noise in the user's environment. The robot, equipped with a microphone, constantly captures the audio signals in the user's environment and assesses the background noise intensity.
The robot audio volume is adjusted according to the user's distance and the background noise intensity. For example, in a controlled environment with no background noise, we determine that a typical user hears well and comfortably at d feet away from a robot when the audio output intensity is a dB. Now assume that in the actual deployment the background noise is uniformly b dB in the user's environment and the user is D feet away. The robot audio output intensity is then adjusted to (a+b+6 log2(D/d)) dB. Each robot design can be calibrated to obtain the values of a and d before deployment; the robot then adjusts its audio volume according to measurements of b and D as described.
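The adjustment above can be sketched as a short routine; the function and parameter names are illustrative, assuming calibrated values a and d and measured values b and D:

```python
import math

def target_output_db(a_db, d_ref_ft, b_db, dist_ft):
    """Target audio output intensity (a + b + 6*log2(D/d)) in dB.

    a_db: calibrated comfortable output intensity at d_ref_ft with no noise
    d_ref_ft: calibration distance in feet
    b_db: measured background noise intensity in dB
    dist_ft: estimated user distance in feet
    """
    return a_db + b_db + 6.0 * math.log2(dist_ft / d_ref_ft)
```

For example, with a calibration of 60 dB at 4 feet, 10 dB of background noise, and a user 16 feet away, the target output is 60 + 10 + 6×2 = 82 dB.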
In this invention, we further present a closed-loop audio volume control technique. The technique involves finding out in real time whether the audio volume adjustment is effective. The robot uses its microphone to capture the audio signal in the user's environment while there is audio output from the robot. The acoustic echo signal is therefore captured; i.e., the sound of the audio output from the robot, along with background noise and other sound, enters the microphone of the robot. In a typical teleconferencing application, acoustic echo cancellation is applied. If the robot has to perform acoustic echo cancellation, then before doing so the robot may calculate the signal to noise ratio of the acoustic echo signal of its audio output. The robot may automatically adjust the audio volume so that the signal to noise ratio of the acoustic echo signal is no less than a threshold, say, A dB. Then, for a user D feet away, the robot adjusts the audio volume so that the signal to noise ratio of the acoustic echo signal is (A+6 log2 D) dB.
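As a sketch of the closed-loop behavior (an assumed incremental controller, not the only possible design), the robot can compare the measured echo signal to noise ratio against the distance-adjusted target (A + 6 log2 D) and nudge the volume on each control cycle:

```python
import math

def closed_loop_volume_step(echo_snr_db, dist_ft, a_threshold_db, step_db=1.0):
    """Return a volume increment in dB that nudges the measured echo
    signal to noise ratio toward the target (A + 6*log2(D))."""
    target_db = a_threshold_db + 6.0 * math.log2(dist_ft)
    error_db = target_db - echo_snr_db
    # Take one small step per cycle to avoid abrupt volume jumps.
    if abs(error_db) < step_db:
        return 0.0
    return step_db if error_db > 0 else -step_db
```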
A robot that interacts with multiple users may need to understand the context of audio output delivery further. For example, in a conference setting, the robot should account for the user farthest away. When the robot needs to deliver individual audio output to users at different distances one by one, it needs to quickly adjust its audio volume for each user.
For a semi-autonomous robot that facilitates videoconferencing between local users and remote users, the users determine the context of the audio output delivery and input the context to the robot manually. Alternatively, the robot may assume the user near the center of its field of vision to be the intended recipient of its audio output.
The present invention will be understood more fully from the detailed description that follows and from the accompanying drawings, which however, should not be taken to limit the disclosed subject matter to the specific embodiments shown, but are for explanation and understanding only.
According to the recommendations from the American National Standards Institute (ANSI) and the Acoustical Society of America (ASA), a speaker's voice should reach a listener at no less than a +15 dB signal to noise ratio for good speech intelligibility. A signal to noise ratio between +15 dB and +30 dB provides both good speech intelligibility and aural comfort. Speech intelligibility is affected not only by the background noise intensity but also by the distance the audio output signal must travel to reach the user. In this invention, when the robot talks to a user, the robot intermittently assesses the user's environment. Specifically, the robot estimates the user's distance from the robot and measures the background noise intensity. The robot increases the audio volume as the background noise intensity increases to maintain the proper signal to noise ratio. Also, audio signals attenuate by 6 dB each time the travelled distance doubles. Therefore, the robot increases its audio volume by 6 dB when the user is twice as far away.
One embodiment of the method disclosed is illustrated in
The disadvantage of the embodiment of
There are multiple techniques for a robot to measure user's distance in step 20. A simple one assumes a camera mounted on the robot. The geometry of a single lens camera is illustrated in
do = di×ho÷hi
When the object distance, which is the user's distance that we are interested in, is much larger than twice the focal length f of the lens, di is approximately equal to f. Assume an average user's head size; then ho is considered known. Knowing the camera resolution, we can obtain hi from the number of pixels corresponding to the user's head in the image. The camera resolution is usually expressed in pixels per inch; the unit can be converted into pixels per foot. Dividing the number of pixels by the camera resolution then yields hi in feet. Therefore, the estimated user's distance D is the product of the focal length f and the average head size ho divided by the size of the user's head in the image hi. The camera may have zooming capability. The zooming can be implemented by changing the focal length (usually the combined focal length of a set of lenses) or by changing the image resolution via image processing techniques. As long as the focal length and the image resolution are known, the user's distance estimation technique described is applicable.
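The monocular estimate can be sketched as follows; the function name and the example head size and resolution values are illustrative assumptions:

```python
def estimate_distance_ft(focal_len_ft, head_size_ft, head_pixels, px_per_ft):
    """Estimate user distance D = f * ho / hi from a single image.

    focal_len_ft: lens focal length f, in feet
    head_size_ft: assumed average head size ho, in feet
    head_pixels: extent of the user's head in the image, in pixels
    px_per_ft: image resolution converted to pixels per foot
    """
    hi_ft = head_pixels / px_per_ft  # head size on the image plane, in feet
    return focal_len_ft * head_size_ft / hi_ft
```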
The second technique assumes a stereo camera on the robot. The stereo camera consists of two lenses and is able to capture a pair of images of the same user from different angles, as illustrated in
Ø=tan−1 (S/2f)
α1=tan−1((P1−N1/2)/(N1/2)×tan Ø), where P1 is the pixel location of the object in the left image and N1 is the total number of pixels in the left image.
α2=tan−1((N2/2−P2)/(N2/2)×tan Ø), where P2 is the pixel location of the object in the right image and N2 is the total number of pixels in the right image.
D=(tan(π/2−α1)×tan(π/2−α2)×ΔX)/(tan(π/2−α1)+tan(π/2−α2)), where ΔX is the distance between the lenses.
In the case that an object is located to the left of both lenses:
Ø=tan−1 (S/2f)
α1=tan−1((N1/2−P1)/(N1/2)×tan Ø), where P1 is the pixel location of the object in the left image and N1 is the total number of pixels in the left image.
α2=tan−1((N2/2−P2)/(N2/2)×tan Ø), where P2 is the pixel location of the object in the right image and N2 is the total number of pixels in the right image.
D=(sin(π/2−α1)×sin(π/2−α2)×ΔX)/(sin(α2−α1)), where ΔX is the distance between the lenses.
In an image, the user occupies a region composed of many pixels, since a person has a number of visible body parts. The technique requires identifying the pixels in the pair of images that represent the same part of the user. Applying the formulae described, the distance D of a specific part of the user is obtained. The same calculation can be applied to a number of parts of the user to obtain a number of distance estimates, whose average can be used as the estimated distance of the user.
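The triangulation for an object located between the two lenses can be sketched directly from the formulae above; all parameter names are illustrative:

```python
import math

def stereo_distance(p1, n1, p2, n2, sensor_w, focal_len, baseline):
    """Distance to a point seen between both lenses of a stereo camera.

    p1, p2: pixel locations of the same object part in the left/right image
    n1, n2: total pixel counts of the left/right image
    sensor_w, focal_len: sensor width S and focal length f (same unit)
    baseline: distance between the lenses (sets the unit of the result)
    """
    phi = math.atan((sensor_w / 2.0) / focal_len)                  # half view angle
    a1 = math.atan((p1 - n1 / 2.0) / (n1 / 2.0) * math.tan(phi))   # left-lens angle
    a2 = math.atan((n2 / 2.0 - p2) / (n2 / 2.0) * math.tan(phi))   # right-lens angle
    t1 = math.tan(math.pi / 2.0 - a1)
    t2 = math.tan(math.pi / 2.0 - a2)
    return (t1 * t2 * baseline) / (t1 + t2)
```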
The third technique uses ranging devices such as laser distance meters, sonar distance meters, and radar distance meters. The theory of operations of those devices is well known.
Which distance estimation techniques to use is mostly a cost decision. A robot that interacts with users is usually equipped with at least one camera, and perhaps a stereo camera or a ranging device for autonomous navigation.
Step 30 and step 35 involve measurement of background noise. Background noise generally refers to lower-amplitude noise that persists over a long period, while intermittent noise refers to higher-amplitude noise that lasts only a short time (on the order of seconds). Background noise may undermine the intelligibility of the robot audio output. We assume that the robot cannot remove the background noise in the user's environment local to the robot, although the robot may apply well-known digital signal processing techniques to reduce the background noise intrinsic to its own audio output. The robot, however, may boost its audio volume by the same number of decibels to compensate for the background noise in the user's environment after estimating the background noise intensity in decibels. That technique assumes that the background noise is evenly intense in the space between the user and the robot. The robot, equipped with a microphone, constantly captures the audio signals in the user's environment and calculates the noise intensity or signal to noise ratio via digital audio processing techniques.
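The noise measurement can be sketched as an RMS-to-decibel conversion over a microphone window captured while the robot is silent; the reference level is an assumed calibration constant:

```python
import math

def noise_level_db(samples, ref_rms=1.0):
    """Background noise intensity in dB relative to a calibrated reference RMS.

    samples: raw microphone amplitudes from a window with no robot output.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / ref_rms)
```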
The technique described in
A robot that interacts with multiple users may need to understand the context of audio output delivery further. For example, in a conference setting, the robot should account for the user farthest away because the audio output is meant for all users in the conference. In a reception hall setting, where the robot may need to deliver individual audio output one by one to users at different distances, the robot needs to quickly adjust its audio volume for each user. For a semi-autonomous robot that facilitates videoconferencing between local users and remote users, the users may determine the context of the audio output delivery and input the context to the robot manually. Alternatively, it would be desirable for the robot to assess the context of the audio output delivery via artificial intelligence. For example, the robot may assume the user near the center of its field of vision to be the intended recipient of its audio output.
The embodiments described above are illustrative examples, and the present invention should not be construed as limited to these particular embodiments. Various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.