This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2017-123643, filed Jun. 23, 2017, the entire contents of which are incorporated herein by reference.
The present invention relates to a sound source separation information detecting device capable of separating a signal voice from a noise voice, a robot, a sound source separation information detecting method, and a storage medium therefor.
There have been known robots that have a form imitating a human, an animal, or the like and that are capable of communicating with a human being by means of a conversation or the like. Some of these robots detect sounds generated around them on the basis of outputs from microphones mounted on the robots and, if they determine that the sounds are voices uttered by a target person, turn their faces or bodies toward the direction in which the target person is present and then make moves such as talking or waving to the target person.
To implement such a move of the robot, there is a need for a sound source separation technique that separates only the signal voice uttered by the target person (the signal source) from the sounds generated around the robot by removing unnecessary noise voices (noise sources), which are voices other than the signal voice, in order to detect the direction or position of the signal voice.
Conventionally, there has been known a technique of performing beam forming, which is a type of sound source separation technique, in order to increase a signal-to-noise ratio (S/N ratio) (for example, Japanese Patent Application Laid-Open No. 2005-253071).
According to an aspect of the present invention, there is provided a sound source separation information detecting device, including:
a voice acquisition unit having predetermined directivity to acquire a voice;
a first direction detection unit configured to detect a first direction, which is an arrival direction of a signal voice of a predetermined target, from the voice acquired by the voice acquisition unit;
a second direction detection unit configured to detect a second direction, which is an arrival direction of a noise voice, from the voice acquired by the voice acquisition unit; and
a detection unit configured to detect a sound source separation direction or a sound source separation position, based on the first direction and the second direction.
According to another aspect of the present invention, there is provided a robot, including:
the sound source separation information detecting device;
a moving unit configured to move its own device;
an operating unit configured to operate its own device; and
a control unit configured to control the sound source separation information detecting device, the moving unit, and the operating unit.
According to still another aspect of the present invention, there is provided a sound source separation information detecting method, including the steps of:
detecting a first direction, which is an arrival direction of a signal voice of a predetermined target, from a voice acquired by a voice acquisition unit having predetermined directivity to acquire the voice;
detecting a second direction, which is an arrival direction of a noise voice, from the voice acquired by the voice acquisition unit; and
detecting a sound source separation direction or a sound source separation position, based on the first direction and the second direction.
According to yet another aspect of the present invention, there is provided a storage medium configured to store a program causing a computer of a sound source separation information detecting device to function so as to:
detect a first direction, which is an arrival direction of a signal voice of a predetermined target, from a voice acquired by a voice acquisition unit having predetermined directivity to acquire the voice;
detect a second direction, which is an arrival direction of a noise voice, from the voice acquired by the voice acquisition unit; and
detect a sound source separation direction or a sound source separation position, based on the first direction and the second direction.
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out hereinafter.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention, and together with the general description given above and the detailed description of the embodiments given below, serve to explain principles of the invention.
Modes for carrying out the present invention will be described in detail below with reference to the accompanying drawings.
As illustrated in
The camera 104 is disposed in a lower part of a front surface of the head 101 or in a location of what is called “nose” of a human face. The camera 104 captures an image under the control of a control unit 201 described later.
The microphone array 103 is composed of, for example, 13 microphones. Eight microphones of the 13 microphones are arranged in locations at a height of what is called “forehead” of the human face at regular intervals around a periphery of the head 101. In an upper part of the head 101 above the eight microphones, four microphones are arranged at regular intervals around the head 101. Further, one microphone is arranged at a top of the head 101. The microphone array 103 detects sounds generated around the robot 100.
The loudspeaker 105 is provided lower than the camera 104, that is, in a location of what is called “mouth” of the human face. The loudspeaker 105 outputs various voices under the control of the control unit 201 described later.
The sensor group 106 is provided in locations of what are called “eyes” and “ears” of the human face. The sensor group 106 includes an acceleration sensor, an obstacle detection sensor, and the like, and is used to control a posture of the robot 100 or to secure safety thereof.
The neck joint drive unit 107 is a member which connects the head 101 with the trunk 102. The head 101 is connected to the trunk 102 through the neck joint drive unit 107 indicated by a dashed line. The neck joint drive unit 107 includes a plurality of motors. If the control unit 201 described later drives the plurality of motors, the head 101 of the robot 100 rotates. The neck joint drive unit 107 serves as a face rotation amount acquisition unit, which rotates the head 101 of the robot 100 and acquires a rotation amount thereof.
The undercarriage drive unit 108 serves as a moving unit configured to move the robot 100. Although not particularly illustrated, the undercarriage drive unit 108 includes four wheels provided on the underside of the trunk 102. Two of the four wheels are arranged on a front side of the trunk 102 and the remaining two are arranged on a back side of the trunk 102. As wheels, for example, omni wheels or mecanum wheels are used. The control unit 201 described later causes the wheels of the undercarriage drive unit 108 to rotate so as to move the robot 100.
The storage unit 202 includes a solid-state disk drive, a hard disk drive, a flash memory, and the like and is provided in an inside of the trunk 102. The storage unit 202 stores the control program 205 executed by the control unit 201 and various data including voice data collected by the microphone array 103, image data captured by the camera 104, and the like. The control program 205 stored in the storage unit 202 includes a sound source separation information detection program, a movement program, a dialogue program, and the like, described later.
Operation buttons 203 are provided on a back of the trunk 102 (not illustrated in
A power supply unit 204 is a rechargeable battery built in the trunk 102 and supplies electric power to respective parts of the robot control system 200.
In
An image input unit 304, a face detection unit 305, and a mouth part detection unit 306, which function as image acquisition units, acquire a lips image of the target person, who is a predetermined target, at a timing when the voice input unit 301 acquires the voice. Specifically, the image input unit 304 inputs the image from the camera 104 in
A mouth opening/closing determination unit 307, which functions as a determination unit, determines whether the lips of the target person are opened or closed on the basis of the lips image output from the mouth part detection unit 306.
A sound source arrival direction estimation unit 302 functions as a first direction detection unit when the mouth opening/closing determination unit 307 determines the opening of the lips (a state in which the lips are opened) and then, assuming that the voice input by the voice input unit 301 is a signal voice, estimates a first direction, which is an arrival direction of the signal voice, on the basis of the lips image output from the mouth part detection unit 306 and the signal voice power of the signal voice.
On the other hand, the sound source arrival direction estimation unit 302 functions as a second direction detection unit when the mouth opening/closing determination unit 307 determines the closure of the lips (a state in which the lips are closed) and then, assuming that a voice input by the voice input unit 301 is a noise voice, estimates a second direction, which is an arrival direction of the noise voice, on the basis of the noise voice power of the noise voice.
The sound source arrival direction estimation unit 302, as a processing example in the case of functioning as the second direction detection unit, performs sound source localization of the noise voice (estimates the position of a noise source other than the target person) by processing based on a multiple signal classification (MUSIC) method, which is one of the sound source localization techniques. The details of this processing will be described later.
A sound source separation unit 303 performs arithmetic processing based on a beam forming technique, for example, described in the following Document 1 to perform sound source separation processing in which the signal voice uttered by the target person is emphasized or the noise voice other than the signal voice is suppressed, with the first direction, which is the arrival direction of the signal voice currently obtained by the sound source arrival direction estimation unit 302, or the second direction, which is the arrival direction of the noise voice, as an input.
<Document 1>
Futoshi Asano, “Sound source separation,” [online], received in November 2011, “Chishiki-no-mori (Forest of Knowledge)” issued by IEICE, [searched on Jun. 15, 2017], Internet
Specifically, if the mouth opening/closing determination unit 307 determines the opening of the lips, the sound source separation unit 303 performs a beam steering operation, in which the signal voice is beam-steered (emphasized) in the first direction currently obtained by the sound source arrival direction estimation unit 302 by the aforementioned beam forming arithmetic processing, to acquire the emphasized signal voice and then outputs the emphasized signal voice to a sound volume calculation unit 308.
On the other hand, if the mouth opening/closing determination unit 307 determines the closing of the lips, the sound source separation unit 303 performs a null steering operation, in which the noise voice is null-steered (suppressed) in the second direction currently obtained by the sound source arrival direction estimation unit 302 by the aforementioned beam forming arithmetic processing, to acquire the suppressed noise voice and then outputs the suppressed noise voice to the sound volume calculation unit 308.
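By way of illustration, the following is a minimal frequency-domain sketch of the beam steering and null steering operations described above, not the specific beam forming arithmetic of Document 1. The free-field plane-wave steering vector, the microphone coordinates mic_xy, and all function names are assumptions introduced here for the sketch.

```python
import numpy as np

C = 343.0  # speed of sound [m/s]

def steering_vector(mic_xy, angle_rad, freq_hz):
    """Free-field, far-field steering vector for microphones at 2-D
    positions mic_xy (M x 2) toward a source direction angle_rad."""
    direction = np.array([np.cos(angle_rad), np.sin(angle_rad)])
    delays = mic_xy @ direction / C                 # relative arrival times [s]
    return np.exp(-2j * np.pi * freq_hz * delays)   # shape (M,)

def beam_steer(X_f, mic_xy, signal_angle, freq_hz):
    """Delay-and-sum emphasis of the signal direction.
    X_f: (M,) STFT coefficients of one frequency bin for all microphones."""
    a = steering_vector(mic_xy, signal_angle, freq_hz)
    w = a / len(a)                                  # align phases, then average
    return np.vdot(w, X_f)                          # scalar output for the bin

def null_steer(X_f, mic_xy, signal_angle, noise_angle, freq_hz):
    """Keep the signal direction while placing a spatial null toward
    the noise direction (project the noise direction out of the weights)."""
    a_s = steering_vector(mic_xy, signal_angle, freq_hz)
    a_n = steering_vector(mic_xy, noise_angle, freq_hz)
    a_n = a_n / np.linalg.norm(a_n)
    w = a_s - np.vdot(a_n, a_s) * a_n               # component orthogonal to a_n
    w = w / (np.vdot(w, a_s).real + 1e-12)          # unity gain toward the signal
    return np.vdot(w, X_f)
```

Applied bin by bin to the STFT of the microphone signals, beam_steer would correspond to the emphasizing operation and null_steer to the suppressing operation described above.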
The processing performed by the sound source separation unit 303 may be performed by using physical directivity microphones having predetermined directivity as the microphone array 103.
The sound volume calculation unit 308 calculates the sound volume of the beam-steered (emphasized) signal voice or the null-steered (suppressed) noise voice output from the sound source separation unit 303.
An S/N ratio calculation unit 309 calculates a signal-to-noise ratio (hereinafter, referred to as “S/N ratio”) on the basis of the sound volume of the signal voice and the sound volume of the noise voice calculated by the sound volume calculation unit 308 and then determines whether or not the S/N ratio is greater than a threshold value. The sound source separation unit 303, the sound volume calculation unit 308, and the S/N ratio calculation unit 309 function as detection units which detect a sound source separation direction or a sound source separation position on the basis of the first direction and the second direction.
If the S/N ratio is equal to or lower than the threshold value as a result of the determination in the S/N ratio calculation unit 309, the control unit 201 in
After moving the robot 100, the control unit 201 activates the robot control function in
In
First, the face detection unit 305 in
<Document 2>
Kazuhiro Hotta, “Special Issue: Face Recognition Technique, 1. Research Tendency of Face Recognition,” [online], published on Mar. 28, 2012, The journal of The Institute of Image Information and Television Engineers, Vol. 64, No. 4(2010), pp. 459 to 462, [Searched on Jun. 15, 2017], Internet
Subsequently, the mouth part detection unit 306 in
<Document 3>
littlewing, “Summary of face recognition techniques available in Web camera-2,” [online], published on Apr. 7, 2015, [searched on Jun. 15, 2017], Internet
The mouth part detection processing in step S402 first acquires face part detection results, which are labelled coordinate values, for example. As a format example of the labelled face part detection results, an example described as
<Document 4>
C. sagonas, “Facial point annotations,” [online], [searched on Jun. 15, 2017], Internet
In the mouth part detection processing of step S402, for example, labels 49 to 68 are detected as a mouth part and labels 28 to 36 are detected as a nose part, out of the face part detection results illustrated in
Subsequently, the mouth opening/closing determination unit 307 in
In step S403, the mouth opening/closing determination unit 307, first, calculates a change Δy in the ordinate (the vertical direction of the face) of the lips. At the present moment, a y coordinate amount difference sum y(t) is calculated in a frame F(t) at a certain time by an arithmetic operation of the following expression (1).
y(t)=yy1+yy2 (1)
In the expression (1), yy1 represents the y coordinate amount difference sum between the upper lip (lower part) and the lower lip (upper part) and is calculated by an accumulation operation of the following expressions (2) to (7) according to the relationship in
yy1+=fabs(data.y[61](t)−data.y[67](t)) (2)
yy1+=fabs(data.y[61](t)−data.y[58](t)) (3)
yy1+=fabs(data.y[62](t)−data.y[66](t)) (4)
yy1+=fabs(data.y[62](t)−data.y[57](t)) (5)
yy1+=fabs(data.y[63](t)−data.y[65](t)) (6)
yy1+=fabs(data.y[63](t)−data.y[56](t)) (7)
In expression (1), yy2 represents the y coordinate amount difference sum between the under-nose part and the lower lip (upper part) and is calculated by the arithmetic operation of the following expressions (8) to (12) according to the relationship in
yy2+=fabs(data.y[31](t)−data.y[60](t)) (8)
yy2+=fabs(data.y[32](t)−data.y[61](t)) (9)
yy2+=fabs(data.y[33](t)−data.y[62](t)) (10)
yy2+=fabs(data.y[34](t)−data.y[63](t)) (11)
yy2+=fabs(data.y[34](t)−data.y[64](t)) (12)
In step S403 of
Δy=abs(y(t)−y(t−1)) (13)
The value Δy calculated by the expression (13) represents the moving amount of the lips and increases when the upper lip and the lower lip move in a direction away from or approaching each other. In other words, the mouth opening/closing determination unit 307 operates as a lips moving amount acquisition unit.
In step S403 of
In other words, an x coordinate amount difference sum x(t) is now calculated by the arithmetic operation of the following expression (14) in the frame F(t) at a certain time. In the expression (14), for example, “data.x[61](t)” represents an x coordinate data value of label 61 in
x(t)=data.x[61](t)+data.x[62](t)+data.x[63](t)+data.x[67](t)+data.x[66](t)+data.x[65](t) (14)
Subsequently, the expression (15) described below is used to calculate a difference absolute value Δx between the x coordinate amount difference sum x(t) calculated by the arithmetic operation of the expression (14) for the frame image F(t) at time t and the x coordinate amount difference sum x(t−1) calculated by the same arithmetic operation as that of the expression (14) for the frame image F(t−1) at the time (t−1), which is one frame earlier than the time t.
Δx=abs(x(t)−x(t−1)) (15)
The Δx value calculated by the expression (15) indicates the moving amount of the lips similarly to the value Δy and increases when the lips are moving either to the right or the left. Also in this case, the mouth opening/closing determination unit 307 operates as the lips moving amount acquisition unit.
In step S403 of
Δroll=abs(F(t)roll−F(t−1)roll) (16)
Δyaw=abs(F(t)yaw−F(t−1)yaw) (17)
Δpitch=abs(F(t)pitch−F(t−1)pitch) (18)
Incidentally, for example, F(t)roll is a roll angle value, which is input from the neck joint drive unit 107 in
In step S403 of
Various methods of estimating the rotation angle of the head 101 are known, and a technique other than the above may be employed.
In step S403 of
Specifically, the mouth opening/closing determination unit 307 determines the opening of the lips if the upper lip and the lower lip move in a direction away from or approaching each other, the moving amount of the lips in the horizontal direction is small, and the head 101 of the robot 100 does not rotate so much. The use of not only Δy, but also Δx, Δroll, Δyaw, and Δpitch for the opening/closing determination of the lips enables erroneous determination to be unlikely to occur even in an action of disapproval (shaking the head from side to side) or of inclining the head for thinking.
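For reference, the determination of step S403 can be condensed into a short sketch along the following lines; the landmark pairs follow expressions (2) to (12) and (14), while the threshold values and function names are hypothetical, since the text does not specify them.

```python
from math import fabs

# Hypothetical thresholds; the text does not give the actual values.
Y_TH, X_TH, ROT_TH = 4.0, 3.0, 2.0  # pixels, pixels, degrees

UPPER_LOWER_PAIRS = [(61, 67), (61, 58), (62, 66), (62, 57), (63, 65), (63, 56)]  # expr. (2)-(7)
NOSE_LIP_PAIRS    = [(31, 60), (32, 61), (33, 62), (34, 63), (34, 64)]            # expr. (8)-(12)
X_LABELS          = [61, 62, 63, 67, 66, 65]                                      # expr. (14)

def y_sum(pts):
    """y coordinate amount difference sum y(t) of expression (1)."""
    yy1 = sum(fabs(pts[a][1] - pts[b][1]) for a, b in UPPER_LOWER_PAIRS)
    yy2 = sum(fabs(pts[a][1] - pts[b][1]) for a, b in NOSE_LIP_PAIRS)
    return yy1 + yy2

def x_sum(pts):
    """x coordinate amount sum x(t) of expression (14)."""
    return sum(pts[i][0] for i in X_LABELS)

def lips_opened(pts_t, pts_t1, pose_t, pose_t1):
    """pts_*: dict label -> (x, y); pose_*: (roll, yaw, pitch) of the head 101."""
    dy = fabs(y_sum(pts_t) - y_sum(pts_t1))                   # expression (13)
    dx = fabs(x_sum(pts_t) - x_sum(pts_t1))                   # expression (15)
    drot = max(fabs(a - b) for a, b in zip(pose_t, pose_t1))  # expressions (16)-(18), one threshold
    # Open: lips move vertically, little horizontal motion, little head rotation.
    return dy > Y_TH and dx < X_TH and drot < ROT_TH
```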
Returning to the description of
First, the sound source arrival direction estimation unit 302 in
Subsequently, the sound source separation unit 303 in
Thereafter, the sound volume calculation unit 308 in
On the other hand, if the mouth opening/closing determination unit 307 determines the closing of the lips through the series of processes in step S403, a series of processes of subsequent steps S407 to S409 are performed.
First, the sound source arrival direction estimation unit 302 in
Subsequently, the sound source separation unit 303 in
The sound volume calculation unit 308 in
Thereafter, the S/N ratio calculation unit 309 in
S/N ratio=Spow/Npow (20)
Furthermore, the S/N ratio calculation unit 309 determines whether or not the calculated S/N ratio is greater than a threshold value sn_th according to a determination operation of the following expression (21) (step S410).
S/N ratio>sn_th (21)
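A minimal sketch of the decision of expressions (20) and (21) is given below; the use of mean squared amplitude as the sound volume and the default value of sn_th are assumptions, since expression (19) and the actual threshold are not reproduced here.

```python
import numpy as np

def snr_ok(emphasized_signal, suppressed_noise, sn_th=2.0):
    """Expressions (20) and (21): S/N ratio = Spow / Npow > sn_th.
    Inputs are the beam-steered signal voice and the null-steered noise
    voice output by the sound source separation processing."""
    s_pow = float(np.mean(np.square(emphasized_signal)))  # sound volume of the signal voice
    n_pow = float(np.mean(np.square(suppressed_noise)))   # sound volume of the noise voice
    snr = s_pow / max(n_pow, 1e-12)
    return snr > sn_th, snr
```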
If the determination in step S410 is NO, the control unit 201 in
After the moving of the robot 100, a series of control processes of steps S401 to S409 in
If the determination of step S410 is YES in due course, the control unit 201 in
First, a voice input to the microphone array 103 in
Assuming that N is the number of sound sources, a signal Sn of an n-th sound source is able to be expressed by the following expression (22), where ω is an angular frequency and f is a frame number (the same applies to the following description).
Sn(ω,f)(n=1,2, . . . ,N) (22)
The signal observed by each microphone of the microphone array 103 can be expressed by the following expression (23), where M is the number of microphones.
Xm(ω,f)(m=1,2, . . . ,M) (23)
The sound issued from the sound source travels through air and is observed by the microphones of the microphone array 103. Assuming that the transfer function at that time is Hnm(ω), the signal observed by the microphones of the microphone array 103 can be found by multiplying the expression which expresses the signal of the sound source by the transfer function. A signal Xm(ω, f) observed by an m-th microphone can be expressed by the following expression (24).
The robot 100 has a plurality of microphones as the microphone array 103, and therefore a signal x(ω, f) observed by the entire microphone array 103 can be expressed by the following expression (25).
Similarly, also a signal s(ω, f) of the entire sound source can be expressed by the following expression (26).
Similarly, a transfer function hn(ω) of an n-th sound source can be expressed by the following expression (27).
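Under the usual narrowband observation model implied by the surrounding description, expressions (24) to (27) can be written in a form such as the following; this is a sketch of the standard formulation, and the notation of the original expressions may differ.

```latex
% Standard narrowband forms assumed for expressions (24)--(27):
\begin{align}
X_m(\omega,f) &= \sum_{n=1}^{N} H_{nm}(\omega)\, S_n(\omega,f) && \text{(24)}\\
x(\omega,f) &= \bigl[\, X_1(\omega,f),\, X_2(\omega,f),\, \dots,\, X_M(\omega,f) \,\bigr]^{\mathsf T} && \text{(25)}\\
s(\omega,f) &= \bigl[\, S_1(\omega,f),\, S_2(\omega,f),\, \dots,\, S_N(\omega,f) \,\bigr]^{\mathsf T} && \text{(26)}\\
h_n(\omega) &= \bigl[\, H_{n1}(\omega),\, H_{n2}(\omega),\, \dots,\, H_{nM}(\omega) \,\bigr]^{\mathsf T} && \text{(27)}
\end{align}
```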
All transfer functions are denoted by the following expression (28).
h(ω)=[h1(ω),h2(ω), . . . ,hN(ω)] (28)
If the transfer functions expressed by the expression (28) are applied to the aforementioned expression (24), the observed signal can be expressed by the following expression (29).
x(ω,f)=h(ω)s(ω,f) (29)
The transfer function hn(ω) is independent for each sound source position, and Sn(ω, f) can be considered to be uncorrelated over a certain number of frames (for example, L is assumed to indicate the number of frames). Therefore, x(ω, f) constitutes a hyperplane whose rank equals the number of sound sources N. At this time, the distribution tends to spread in the direction of the transfer function of a sound source whose sound volume, normalized by the distance, is large. Accordingly, decomposition of this space into a signal subspace and a null space will now be discussed.
Referring to
Subsequently, eigenvalue decomposition is performed (step S703). In this process, it is assumed that an eigenvalue λm (ω, f) and an eigenvector em(ω, f) are rearranged in such a way that the eigenvalues are arranged in descending order.
In principle, the transfer function hn(ω) can be restored from a weighted addition of the eigenvectors em(ω, f) (m=1 to N) of the subspace. The restoration, however, is actually difficult, and therefore the sound source localization is achieved by utilizing the fact that the eigenvectors em(ω, f) (m=N+1 to M) constituting the null space are orthogonal to the transfer function hn(ω).
However, since the sound source of the noise voice is likely to move within, for example, a room of a building, its position cannot be known beforehand, and it is therefore difficult to acquire the transfer function of the sound source position in advance. Accordingly, provisional sound source positions are determined and their transfer functions are prepared in advance to perform the sound source localization.
Since the plurality of microphones of the microphone array 103 are arranged on the head 101 of the robot 100, the microphones can be considered to be arranged along a circumference. Assuming that θ1, θ2, θ3, and θ4 indicate the angles between the positive direction of the X axis and the respective lines connecting the center of the circle formed by the microphones (corresponding to the center position of the head 101 of the robot 100) to the provisional sound sources 1 to 4, the respective transfer functions hθ(ω) are calculated in advance.
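A sketch of how the provisional transfer functions hθ(ω) could be prepared is shown below, assuming free-field plane-wave propagation and microphones evenly spaced on a circle around the center of the head; both assumptions and all function names are introduced here for illustration only.

```python
import numpy as np

C = 343.0  # speed of sound [m/s]

def circular_array_positions(num_mics, radius):
    """2-D microphone positions evenly spaced on a circle around the head center."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (M, 2)

def provisional_transfer_functions(mic_xy, source_angles, freqs_hz):
    """h_theta(omega) for each provisional source angle theta and frequency,
    modeled as free-field phase delays (plane-wave approximation).
    Returns an array of shape (K, F, M): K angles, F frequencies, M microphones."""
    directions = np.stack([np.cos(source_angles), np.sin(source_angles)], axis=1)  # (K, 2)
    delays = directions @ mic_xy.T / C                     # (K, M) relative delays [s]
    return np.exp(-2j * np.pi * freqs_hz[None, :, None] * delays[:, None, :])
```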
Although description has been made by giving an example in which four sound sources are used in
Referring to
In the above, the denominator of the expression (31) does not become zero due to noise, errors, the influence of signal leakage between frequency bands in the STFT, or the like. In addition, if the direction of the sound source is close to any one of the predetermined angles θ (θ1, θ2, . . . , θN), in other words, if hn(ω) is close to hθ(ω), the value of the expression (31) becomes extremely large. In the example illustrated in
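In the standard MUSIC formulation suggested by this discussion, expression (31) would have a shape like the following, with the projection of the candidate transfer function onto the null space in the denominator; this is a reconstruction under that assumption, not the original expression.

```latex
% A common form of the MUSIC spatial spectrum assumed for expression (31):
P(\theta,\omega,f) \;=\;
\frac{h_{\theta}^{\mathsf H}(\omega)\, h_{\theta}(\omega)}
     {\displaystyle \sum_{m=N+1}^{M} \bigl|\, h_{\theta}^{\mathsf H}(\omega)\, e_m(\omega,f) \,\bigr|^{2}}
```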
Subsequently, to find the integrated MUSIC power, weighted addition is applied to the MUSIC spectrum for each frequency band by the arithmetic operation expressed by the following expression (32) (step S705).
The weighting coefficient can also be calculated according to the power included in Sn(ω, f): if it is set to be larger as the eigenvalue λm(ω, f) becomes larger, the adverse effect of frequency bands in which Sn(ω, f) has little power can be reduced.
At the end, an appropriate peak (maximum value) is selected from the power spectrum (step S706). Specifically, first, a plurality of peaks is calculated, an appropriate peak is selected out of the peaks, and θ of the selected peak is assumed to be the noise direction angle N_ang of the sound source direction of the noise voice described in step S407 of
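Putting steps S702 to S706 together, a minimal sketch of the MUSIC processing might look as follows; the number of frames L, the weighting of expression (32) by the largest eigenvalue, and the selection of the global maximum as the peak are all assumptions, since the text leaves these choices open.

```python
import numpy as np

def music_direction(X, steering, num_sources=1):
    """Estimate the noise direction from multichannel STFT frames.

    X        : (M, F, L) complex STFT of the M microphones, F frequency bins,
               L frames used for the correlation matrices (step S702).
    steering : (K, F, M) provisional transfer functions h_theta(omega) for K
               candidate angles (e.g., prepared as in the earlier sketch).
    Returns the index k of the selected candidate angle (step S706).
    """
    M, F, L = X.shape
    K = steering.shape[0]
    spectrum = np.zeros((K, F))
    weights = np.zeros(F)

    for f in range(F):
        Xf = X[:, f, :]                                  # (M, L)
        R = Xf @ Xf.conj().T / L                         # correlation matrix (step S702)
        eigval, eigvec = np.linalg.eigh(R)               # eigenvalue decomposition (step S703)
        eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # descending eigenvalue order
        En = eigvec[:, num_sources:]                     # null-space eigenvectors e_m
        for k in range(K):
            h = steering[k, f, :]
            denom = np.sum(np.abs(En.conj().T @ h) ** 2) + 1e-12
            spectrum[k, f] = np.real(h.conj() @ h) / denom   # MUSIC spectrum (step S704)
        weights[f] = eigval[0]                           # larger weight for stronger bins (cf. (32))

    power = spectrum @ (weights / (weights.sum() + 1e-12))   # weighted addition (step S705)
    return int(np.argmax(power))                         # pick the strongest peak (step S706)
```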
Although the above description has been made by assuming a plane for the sound source arrival direction of the noise voice, the above description is also applicable even if a three-dimensional space is assumed.
abs(S_ang−N_ang) (33)
As the algorithm implemented by processing performed in step S410 of
Alternatively, it is also possible to regard the point where the S/N ratio reaches its highest level after exceeding the threshold value as the sound source separation position, instead of the point where the S/N ratio first exceeds the threshold value sn_th.
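The two strategies can be sketched as follows over a hypothetical log of (position, S/N ratio) samples recorded while the robot moves; the function and parameter names are illustrative only.

```python
def pick_separation_position(samples, sn_th, use_best=True):
    """samples: list of (position, snr) pairs recorded while moving.
    If use_best is False, return the first position whose S/N ratio exceeds
    sn_th (the basic behavior of step S410); if True, return the position
    with the highest S/N ratio among those exceeding sn_th."""
    above = [(pos, snr) for pos, snr in samples if snr > sn_th]
    if not above:
        return None                               # no suitable position yet; keep moving
    if not use_best:
        return above[0][0]                        # first point exceeding the threshold
    return max(above, key=lambda ps: ps[1])[0]    # point where the S/N ratio is highest
```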
For example,
In the above operation, preferably the control unit 201 in
If the S/N ratio is equal to or lower than the threshold value sn_th as the result of the determination of step S410 by the S/N ratio calculation unit 309 in the control processing illustrated in the flowchart of
In addition, while uttering words as described above, the control unit 201 may perform control such as giving an instruction to the target person by uttering words such as “a little more” or “stop” until the continuously-acquired noise direction angle N_ang reaches a favorable angle.
For example, if map information in the room of the building is available, the control unit 201 may perform control such as estimating two- or three-dimensional sound source positions of the target person and the noise on the map and moving the robot 100 to the sound source separation position on the basis of the estimation results. The map of the sound source positions may be made by getting as close as possible to the noise source and identifying the position for registration.
On the other hand, if the map of the sound source positions is unavailable, the noise source position may be estimated on the basis of the noise direction acquired while the robot 100 is moving, the position of the robot 100, and the orientation of its body. In this case, the sound source position can be determined if there are two or more observation points. A certain error may be allowed in the estimated direction so that the estimation is performed from more observation points.
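One possible form of that estimate is a least-squares intersection of the observed bearing lines in two dimensions, as sketched below; the formulation and names are assumptions, since the text does not specify the estimation method.

```python
import numpy as np

def estimate_noise_source(observations):
    """observations: list of (robot_xy, body_heading_rad, noise_angle_rad),
    where the global bearing to the noise is body_heading + noise angle (N_ang).
    Returns the 2-D least-squares intersection of the bearing lines; at least
    two observation points with different bearings are required."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for robot_xy, heading, n_ang in observations:
        phi = heading + n_ang
        d = np.array([np.cos(phi), np.sin(phi)])
        P = np.eye(2) - np.outer(d, d)       # projector onto the line's normal space
        A += P
        b += P @ np.array(robot_xy)
    return np.linalg.lstsq(A, b, rcond=None)[0]   # estimated noise source position (x, y)
```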
Furthermore, it is also possible to perform control of giving the instruction to the target person by uttering words like “rotate—degrees further” on the basis of the estimation result of the sound source position of the noise obtained by using the above map information.
If the robot 100 moves while looking the other way or moves around by itself while moving to the sound source separation position in the above embodiment, the target person feels odd. Therefore, it is preferable that the robot 100 move to the sound source separation position in such a way that the target person does not feel odd. For example, preferably the robot 100 moves to the sound source separation position while making eye contact with the target person or looking toward the target person. Moreover, the robot 100 may approach the sound source separation position by moving slightly or only rotating, instead of moving all the way to the sound source separation position at once.
According to the above embodiments, it is possible to detect the optimized sound source separation information (the sound source separation direction or the sound source separation position) which enables the sound source separation in the state where the signal voice is separated from the noise voice most successfully. Thereby, voices other than the voice of the target person can be removed to decrease erroneous voice recognition.
When the control unit 201 in
Number | Date | Country | Kind |
---|---|---|---|
2017-123643 | Jun 2017 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
5771306 | Stork | Jun 1998 | A |
6449593 | Valve | Sep 2002 | B1 |
7415117 | Tashev et al. | Aug 2008 | B2 |
7567678 | Kong et al. | Jul 2009 | B2 |
7680667 | Sonoura et al. | Mar 2010 | B2 |
9263044 | Cassidy | Feb 2016 | B1 |
9591427 | Lyren | Mar 2017 | B1 |
10419924 | Yamaya | Sep 2019 | B2 |
10424320 | Shimada | Sep 2019 | B2 |
20030018475 | Basu | Jan 2003 | A1 |
20030061032 | Gonopolskiy | Mar 2003 | A1 |
20050147258 | Myllyla | Jul 2005 | A1 |
20050234729 | Scholl | Oct 2005 | A1 |
20060143017 | Sonoura et al. | Jun 2006 | A1 |
20070100630 | Manabe | May 2007 | A1 |
20110311060 | Kim | Dec 2011 | A1 |
20130182064 | Muench | Jul 2013 | A1 |
20150063589 | Yu | Mar 2015 | A1 |
20150331490 | Yamada | Nov 2015 | A1 |
20170339488 | Yoshida | Nov 2017 | A1 |
20180009107 | Fujimoto | Jan 2018 | A1 |
20180033447 | Ramprashad | Feb 2018 | A1 |
20180176680 | Knight | Jun 2018 | A1 |
20180285672 | Yamaya | Oct 2018 | A1 |
20180286432 | Shimada | Oct 2018 | A1 |
20180288609 | Yamaya | Oct 2018 | A1 |
20190278294 | Shimada | Sep 2019 | A1 |
20190392840 | Nakagome | Dec 2019 | A1 |
Number | Date | Country |
---|---|---|
2004334218 | Nov 2004 | JP |
2005253071 | Sep 2005 | JP |
2005529421 | Sep 2005 | JP |
2006181651 | Jul 2006 | JP |
2011191423 | Sep 2011 | JP |
2014153663 | Aug 2014 | JP |
2017005356 | Jan 2017 | JP |
Entry |
---|
Japanese Office Action dated Aug. 6, 2019 (and English translation thereof) issued in Japanese Patent Application No. JP 2017-123643. |
Number | Date | Country
---|---|---
20180374494 A1 | Dec 2018 | US |