The example embodiments relate to an attribute identifying device, an attribute identifying method, and a program storage medium.
A voice processing device that estimates attribute information such as the sex or age of a person from a biological signal such as a voice signal obtained from an utterance of a speaker is known.
When estimating the attribute information of a person, the voice processing device of this type may estimate the attribute information as a discrete value or the attribute information as a continuous value.
PTL 1 describes a technique for estimating the age as an attribute of a person from a face image signal. In the age estimation technique described in PTL 1, first, the age is estimated as a discrete value from the face image signal, and the age is estimated as a continuous value. In the age estimation technique described in PTL 1, the estimation results by the above discrete value and continuous value are integrated to calculate a final estimated value.
However, the technique described in PTL 1 has a problem that the accuracy of identifying the attribute of a person is not sufficient.
In the technique described in PTL 1, when the age is estimated as the attribute of a person from the face image signal, a first estimated value, which is a discrete value, and a second estimated value, which is a continuous value, are integrated in accordance with a previously designed rule to calculate the final estimated value. The technique described in PTL 1 obtains the first estimated value and the second estimated value independently of each other. Therefore, the first estimated value and the second estimated value may differ greatly. In this case, both estimated values still appear plausible after the integration, and it is difficult to narrow them down to a single estimated value. As a result, the identification accuracy of the age may be impaired.
The example embodiments have been made in view of the above problem, and an object thereof is to provide an attribute identifying device, an attribute identifying method, and a program storage medium in which the accuracy of attribute identification of a person is further enhanced.
An attribute identifying device according to one aspect of the example embodiments includes a first attribute identifying means identifying, based on a biological signal, first attribute information, which is a range of specific attribute values, from the biological signal, and a second attribute identifying means identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
An attribute identifying method according to one aspect of the example embodiments includes identifying, based on a biological signal, first attribute information, which is a range of specific attribute values, from the biological signal, and identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
A program storage medium according to one aspect of the example embodiments stores a program that causes a computer to execute a process of identifying, based on a biological signal, first attribute information, which is a range of specific attribute values, from the biological signal, and a process of identifying second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
According to the example embodiments, it is possible to provide an attribute identifying device, an attribute identifying method, and a program storage medium in which the accuracy of attribute identification of a person is further enhanced.
Example embodiments will be described below with reference to the drawings. It should be noted that in the example embodiments, components denoted by the same reference signs perform similar operations, and thus, the description thereof may be omitted. Directions of arrows in the drawings are illustrative and do not limit directions of signals between blocks.
Hardware constituting a voice processing device or an attribute identifying device according to a first example embodiment and another example embodiment will be described.
As illustrated in
The storage device 14 stores a program 18. The processor 11 uses the RAM 12 to execute the program 18 associated with the voice processing device. The program 18 may be stored in the ROM 13. The program 18 may be stored on a storage medium 20 and read out by a drive device 17 or transmitted from an external device via a network.
The input/output interface 15 exchanges data with a peripheral device (keyboard, mouse, display device, or the like) 19. The input/output interface 15 may function as a means acquiring or outputting data. The bus 16 connects the components.
It should be noted that there are various modified examples of a method for implementing the voice processing device. For example, each unit of the voice processing device can be implemented as hardware (dedicated circuit). In addition, the voice processing device can be implemented by a combination of a plurality of devices.
The scope of each example embodiment also includes a processing method in which a program for operating the configuration of each example embodiment so as to implement functions of the present example embodiment and the another example embodiment (more specifically, a program that causes a computer to execute processing illustrated in
As the storage medium, for example, a floppy (registered trademark) disk, a hard disk, an optical disk, a magneto-optical disk, a compact disc (CD)-ROM, a magnetic tape, a nonvolatile memory card, or a ROM can be used. In addition, the scope of each example embodiment includes not only a method that executes the processing by using only the program stored on the storage medium, but also a method that executes the processing by operating the program on an operating system (OS) in cooperation with other software or expansion board functions.
The voice section detecting unit 110 receives a voice signal from the outside. The voice signal is a signal representing a voice based on an utterance of a speaker. The acquired signal is not limited to the voice signal, but may be a biological signal generated from a body by a biological phenomenon such as heartbeat, brain wave, pulse, respiration, or sweating.
The voice section detecting unit 110 detects and segments a voice section included in the received voice signal. At this time, the voice section detecting unit 110 may segment the voice signal into voice signals having predetermined lengths or into voice signals having different lengths. For example, the voice section detecting unit 110 may determine that a section of the voice signal in which the volume is lower than a predetermined value continuously for a predetermined period of time is silent, and may determine that sections before and after the section are different voice sections, so as to perform segmentation. The voice section detecting unit 110 outputs the segmented voice signals as segmentation results (processing results of the voice section detecting unit 110) to the speaker feature vector calculating unit 120. Here, the reception of the voice signal means, for example, reception of the voice signal from an external device or another processing device, or delivery of processing results of voice signal processing from another program. The output means, for example, transmission to an external device or another processing device, or delivery of the processing results of the voice section detecting unit 110 to another program.
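The silence-based segmentation described above can be sketched as follows. This is a minimal illustration, not the implementation of the voice section detecting unit 110: the function name `detect_voice_sections`, its parameters, and the use of the per-sample absolute amplitude as the "volume" are assumptions made for this example.

```python
from typing import List, Tuple


def detect_voice_sections(
    samples: List[float],
    volume_threshold: float,
    min_silence_len: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (end exclusive) of voice sections.

    Assumption for this sketch: "volume" is the absolute sample amplitude.
    A run of at least `min_silence_len` consecutive samples below
    `volume_threshold` is treated as silence, and the audio before and
    after that run becomes separate voice sections.
    """
    sections: List[Tuple[int, int]] = []
    start = None    # start index of the voice section currently open
    quiet_run = 0   # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) >= volume_threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_len:
                # Close the section at the first quiet sample of the run.
                sections.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # The signal ended inside a section; drop trailing quiet samples.
        sections.append((start, len(samples) - quiet_run))
    return sections
```

Short quiet runs (shorter than `min_silence_len`) stay inside a section, so the segmented voice signals naturally have different lengths.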
The speaker feature vector calculating unit 120 receives the segmented voice signals from the voice section detecting unit 110. The speaker feature vector calculating unit 120 calculates speaker feature vectors expressing feature vectors of individuality included in the segmented voice signals on the basis of the received segmented voice signals. The speaker feature vector calculating unit 120 outputs the calculated speaker feature vectors (processing results of the speaker feature vector calculating unit 120).
That is, the speaker feature vector calculating unit 120 serves as a speaker feature vector calculating means calculating speaker feature vectors representing the individuality of a speaker on the basis of a voice signal representing a voice, which is a biological signal. Hereinafter, speaker feature vectors calculated for a certain voice signal are referred to as speaker feature vectors of the voice signal.
An example of the speaker feature vectors calculated by the speaker feature vector calculating unit 120 will be described. The speaker feature vector calculating unit 120 calculates a feature vector based on an i-vector representing the individuality of a voice quality of a speaker on the basis of the segmented voice signals received from the voice section detecting unit 110. The speaker feature vector calculating unit 120 may use, for example, a method described in NPL 1 as a method for calculating the feature vector based on the i-vector representing the individuality of the voice quality of the speaker. It should be noted that the speaker feature vectors calculated by the speaker feature vector calculating unit 120 are vectors that can be calculated by a predetermined operation on the segmented voice signals, and may be any feature vectors representing the individuality of the speaker; the i-vector is an example thereof.
Another example of the speaker feature vectors calculated by the speaker feature vector calculating unit 120 will be described. The speaker feature vector calculating unit 120 calculates a feature vector representing frequency analysis results of the voice signal on the basis of the segmented voice signals received from the voice section detecting unit 110. For example, the speaker feature vector calculating unit 120 calculates, as feature vectors representing the frequency analysis results, frequency filter bank feature vectors obtained by fast Fourier transform (FFT) processing and filter bank processing, or mel frequency cepstral coefficient (MFCC) feature vectors obtained by discrete cosine transform processing in addition to the above processing.
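The chain of frequency analysis, filter bank processing, and discrete cosine transform processing can be sketched for a single frame as follows. This is only an illustrative sketch: all function names, the number of filters and coefficients, and the triangular mel-spaced filter shape are assumptions, windowing is omitted, and a naive discrete Fourier transform stands in for the FFT (it yields the same spectrum, only more slowly).

```python
import cmath
import math
from typing import List


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame: List[float]) -> List[float]:
    """Power spectrum via a naive DFT (an FFT computes the same values)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filter_bank(n_filters: int, n_bins: int,
                    sample_rate: float) -> List[List[float]]:
    """Triangular filters spaced evenly on the mel scale (assumed shape)."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_filters + 2 edge points, expressed as fractional bin positions.
    edges = [mel_to_hz(top * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_bins):
            if left < b <= centre:
                row.append((b - left) / (centre - left))
            elif centre < b < right:
                row.append((right - b) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def filter_bank_features(frame: List[float], sample_rate: float,
                         n_filters: int = 8) -> List[float]:
    """Log filter bank energies (spectrum followed by filter bank)."""
    spec = power_spectrum(frame)
    bank = mel_filter_bank(n_filters, len(spec), sample_rate)
    # A small constant avoids log(0) for empty filters.
    return [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
            for row in bank]


def mfcc(frame: List[float], sample_rate: float,
         n_filters: int = 8, n_coeffs: int = 5) -> List[float]:
    """MFCC: DCT-II applied to the log filter bank energies."""
    fb = filter_bank_features(frame, sample_rate, n_filters)
    m = len(fb)
    return [sum(fb[j] * math.cos(math.pi * k * (j + 0.5) / m)
                for j in range(m))
            for k in range(n_coeffs)]
```

A practical system would normally apply a window function to each frame and use an optimized FFT routine; both are left out here to keep the sketch self-contained.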
The first attribute identifying unit 130 receives the speaker feature vectors output by the speaker feature vector calculating unit 120. The first attribute identifying unit 130 estimates (identifies) specific attribute information by using the speaker feature vectors and outputs the attribute information as first attribute information. The specific attribute information may be, for example, information indicating an age group of a speaker. The first attribute identifying unit 130 serves as a first attribute identifying means identifying, on the basis of the biological signal, first attribute information, which is a range of specific attribute values, from the biological signal. The identification includes estimation of an attribute value, classification based on a range of the attribute values, and the like.
An example of a method in which the first attribute identifying unit 130 estimates the first attribute information will be described. The first attribute identifying unit 130 may use, for example, a neural network as an identifier. The first attribute identifying unit 130 may use, as the identifier, a probabilistic model such as a Gaussian mixture distribution, or an identification model such as a linear discriminant analysis or a support vector machine. Here, the identifier of the first attribute identifying unit 130 learns learning data in which speaker feature vectors related to voice signals are associated with classes (details will be described later) including attribute values of a speaker. The learning generates the identifier whose input is the speaker feature vectors and whose output is a class (first attribute information). For example, when the neural network is used as the identifier, the first attribute identifying unit 130 calculates attribute information to be output on the basis of the speaker feature vectors of the input and a weighting coefficient of the neural network.
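As a rough illustration of how an identifier could map speaker feature vectors and weighting coefficients to a class, the following sketch uses a single linear layer followed by a softmax. The function names, the network shape, and the weights are hypothetical; a trained identifier would obtain its weighting coefficients from the learning data described above.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identify_first_attribute(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],  # one row of weights per class
    biases: Sequence[float],
    class_names: Sequence[str],
) -> Tuple[str, List[float]]:
    """Return the most probable class and the per-class probabilities."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs
```

For example, with hand-picked (untrained) weights:

```python
name, probs = identify_first_attribute(
    [1.0, 0.0],
    [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]],
    [0.0, 0.0, 0.0],
    ["C1", "C2", "C3"],
)
```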
The first attribute identifying unit 130 defines the classes on the basis of the range of possible values of the attribute to be estimated. Here, it is assumed, for example, that the possible value of the attribute to be estimated is a natural number from “10” to “60”. At this time, as illustrated in
The second attribute identifying unit 140 receives the speaker feature vectors output by the speaker feature vector calculating unit 120 and the first attribute information output by the first attribute identifying unit 130. The second attribute identifying unit 140 estimates (identifies) specific attribute information (second attribute information) by using the received speaker feature vectors and first attribute information. The specific attribute information may be, for example, information indicating the age of a speaker. The second attribute identifying unit 140 serves as a second attribute identifying means identifying second attribute information, which is the specific attribute information, from the biological signal and the first attribute information.
An example of a method in which the second attribute identifying unit 140 estimates the specific attribute information will be described. The second attribute identifying unit 140 may use, for example, a neural network as an identifier.
Here, the identifier of the second attribute identifying unit 140 learns learning data in which speaker feature vectors related to voice signals, attribute values of a speaker, and classes including the attribute values are associated. The learning generates the identifier whose input is the speaker feature vectors and the first attribute information, which is the output of the first attribute identifying unit 130 to which the speaker feature vectors are input, and whose output is the attribute information (attribute value), which is an estimation result. When the neural network is used as the identifier, the second attribute identifying unit 140 calculates the attribute information to be output on the basis of the input including the speaker feature vectors and the first attribute information and a weighting coefficient of the neural network.
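One possible way to feed both inputs to such an identifier is to concatenate the speaker feature vector with an encoding of the first attribute information. The sketch below assumes a one-hot encoding of the class and a single linear output layer; the function names, the encoding, and the weights are assumptions for illustration, not details given in this description.

```python
from typing import List, Sequence


def one_hot(class_name: str, class_names: Sequence[str]) -> List[float]:
    """Encode a class label as a one-hot vector (assumed encoding)."""
    return [1.0 if name == class_name else 0.0 for name in class_names]


def identify_second_attribute(
    features: Sequence[float],
    first_attribute: str,
    class_names: Sequence[str],
    weights: Sequence[float],  # one weight per element of the joint input
    bias: float,
) -> float:
    """Estimate a continuous attribute value from the joint input."""
    joint_input = list(features) + one_hot(first_attribute, class_names)
    return sum(w * v for w, v in zip(weights, joint_input)) + bias
```

With hypothetical weights `[2.0, 1.0, 15.0, 30.0, 50.0]`, the last three weights act as class-dependent offsets, so the first attribute information shifts the estimate toward the identified range.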
At this time, the second attribute identifying unit 140 calculates the estimation result as a continuous value.
As described above, the second attribute identifying unit 140 can enhance the accuracy of attribute identification by using, as the input, the first attribute information output by the first attribute identifying unit 130. The reason for this is that the second attribute identifying unit 140 estimates the attribute information by using, as prior information, the result estimated by the first attribute identifying unit 130, so that there is a high possibility of outputting a value close to a true value, as compared with estimation only from the speaker feature vectors without the prior information. In particular, when the second attribute identifying unit 140 estimates a continuous value, since the identifier learns to minimize a residual at a learning stage, the estimated value tends to be biased toward the middle of the overall values if the overall performance is to be improved. That is, when the true value is lower than an average value, the estimated value is likely to be estimated higher, and when the true value is higher than the average value, the estimated value is likely to be estimated lower. On the other hand, when the range of the attribute values estimated by the first attribute identifying unit 130 is used as the prior information, the above-described bias can be reduced.
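The bias toward the middle can be reproduced with a small synthetic experiment. The sketch below is an illustration under stated assumptions: the attribute is drawn uniformly from 10 to 60, a single noisy scalar stands in for the speaker feature vectors, and an ordinary least-squares line stands in for the identifier. None of these choices come from this description.

```python
import random
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_error_by_group(n: int = 2000, noise: float = 15.0,
                        seed: int = 0) -> Tuple[float, float]:
    """Mean (estimate - true value) below and above the average true value."""
    rng = random.Random(seed)
    ages = [float(rng.randint(10, 60)) for _ in range(n)]
    feats = [a + rng.gauss(0.0, noise) for a in ages]  # noisy stand-in feature
    slope, intercept = fit_line(feats, ages)
    mean_age = sum(ages) / n
    low = [slope * f + intercept - a
           for f, a in zip(feats, ages) if a < mean_age]
    high = [slope * f + intercept - a
            for f, a in zip(feats, ages) if a > mean_age]
    return sum(low) / len(low), sum(high) / len(high)
```

The first returned value is positive (low true values are overestimated) and the second is negative (high true values are underestimated), which is the tendency described above for estimation without prior information.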
Here, the second attribute identifying unit 140 may calculate the estimation result as a discrete value. In this case, the second attribute identifying unit 140 calculates, as the estimation result by the discrete value, a class whose value range is narrower (more limited) than that of the class defined by the first attribute identifying unit 130. The identifier of the second attribute identifying unit 140 learns learning data in which speaker feature vectors related to input voice signals and classes including attribute values of a speaker are associated. At this time, the second attribute identifying unit 140 uses, for the learning data, classes each defined in a range narrower than the range of the attribute values defined by the first attribute identifying unit 130. In a case of the above-described example, the range of values included in each of the classes C1 to C3 and classes D1 to D3 defined by the first attribute identifying unit 130 is “10”. Therefore, the second attribute identifying unit 140 defines the classes in units of “5”, for example, so that the range is narrower than “10”. The second attribute identifying unit 140 uses the classes defined in this way for the learning data. The learning generates the identifier whose input is the speaker feature vectors and the first attribute information, which is the output of the first attribute identifying unit 130 to which the speaker feature vectors are input, and whose output is the attribute information (class), which is the estimation result.
When the above model is used, the second attribute identifying unit 140 calculates the estimation result as a discrete value.
The second attribute identifying unit 140 may have a multi-stage configuration.
In this case, the second attribute identifying unit 140 calculates a discrete value as a tentative estimated value (tentative attribute information) in the processing unit 141, and uses the tentative estimated value to calculate an estimated value as a continuous value by the processing unit 142.
The processing unit 141 learns learning data in which speaker feature vectors related to voice signals are associated with classes including attribute values of a speaker. The learning generates an identifier whose input is the speaker feature vectors and the output of the first attribute identifying unit 130 to which the speaker feature vectors are input, and whose output is a class (tentative estimated value). At this time, the processing unit 141 calculates the tentative estimated value indicated by a class in a unit of “5”, for example, as described above, by using the speaker feature vectors and the first attribute information.
The processing unit 142 learns learning data in which speaker feature vectors related to voice signals, attribute values of a speaker, and classes including the attribute values of the speaker are associated. The learning generates an identifier whose input is the speaker feature vectors, the output (tentative estimated value) of the processing unit 141 to which the speaker feature vectors are input, and the first attribute information, and whose output is the second attribute information (attribute value), which is the estimation result.
The processing unit 142 calculates the estimated value by a continuous value by using the speaker feature vectors, the first attribute information, and the tentative estimated value, which is the output of the processing unit 141. The processing unit 142 may also use the tentative estimated value calculated by the processing unit 141 to calculate and output, as a discrete value, a class defined in a range narrower than a range of the attribute values defined by the processing unit 141.
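The data flow of the multi-stage configuration can be sketched as follows. The wrapper only shows how the outputs are passed from one stage to the next; the two `toy_unit_*` functions are hypothetical stand-ins for trained identifiers and assume, purely for illustration, that classes are represented as inclusive (low, high) ranges and that the features lie in [0, 1].

```python
from typing import Callable, Sequence, Tuple

Range = Tuple[int, int]  # inclusive (low, high) range of attribute values


def second_attribute_multistage(
    features: Sequence[float],
    first_attribute: Range,
    unit_141: Callable[[Sequence[float], Range], Range],
    unit_142: Callable[[Sequence[float], Range, Range], float],
) -> float:
    """Run the tentative (discrete) stage, then the final (continuous) stage."""
    tentative = unit_141(features, first_attribute)
    return unit_142(features, first_attribute, tentative)


def toy_unit_141(features: Sequence[float], first_attribute: Range) -> Range:
    """Hypothetical stand-in: pick a class in units of 5 inside the range."""
    lo, hi = first_attribute
    n_sub = (hi - lo + 1) // 5
    idx = min(int(features[0] * n_sub), n_sub - 1)
    start = lo + 5 * idx
    return (start, start + 4)


def toy_unit_142(features: Sequence[float], first_attribute: Range,
                 tentative: Range) -> float:
    """Hypothetical stand-in: place a continuous value inside the class."""
    lo, hi = tentative
    return lo + features[1] * (hi - lo)
```

Each stage receives the speaker feature vectors again together with the coarser results, so the final estimate is refined step by step rather than being produced from the features alone.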
As described above, the second attribute identifying unit 140 defines the range of the attribute values that is narrower (finer) than the range of the attribute values defined by the first attribute identifying unit 130, and estimates the class. Alternatively, the second attribute identifying unit 140 estimates the attribute value as a continuous value. Therefore, it can be said that the second attribute identifying unit 140 has a function capable of outputting the true value. Although the voice processing device 100 includes a plurality of attribute identifying units, a single estimated value can be calculated because the second attribute identifying unit 140 calculates a final estimated value. As described above, in the voice processing device 100, the second attribute identifying unit 140 calculates the attribute information by using the first attribute information output by the first attribute identifying unit 130 in addition to the speaker feature vectors, so that the attribute estimation result with high accuracy can be output.
It should be noted that in the above description, the first attribute identifying unit 130 outputs one piece of attribute information, but the first attribute identifying unit 130 may output a plurality of pieces of attribute information.
As described above, the first attribute identifying unit 130 can enhance the accuracy of attribute identification by defining a plurality of classes such that the classes adopt different ways of dividing the range of possible values of the attribute to be estimated. For example, in the case of identifying which of the classes C1 to C3 includes the attribute value, the class C2 including “21” to “40” is focused on. In this case, the identification accuracy is lower at “21” and “40” close to boundaries of the range of values included in C2 than at “30” close to the middle of the range of values included in C2. That is, “21” may be incorrectly assigned to the wrong one of the classes C1 and C2, and “40” may be incorrectly assigned to the wrong one of the classes C2 and C3. Therefore, as described above, the classes D1 to D3 are separately defined as ranges of values in which values close to the boundaries such as “21” and “40” are close to the middle. That is, the first attribute identifying unit 130 divides the attribute values in two or more patterns so that boundary values in ranges of the attribute values are different from each other, and identifies a range of the attribute values in each of the divisions. As a result, values close to boundaries in the classes C1 to C3 can be identified similarly to values close to the middle, so that the identification accuracy can be enhanced.
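The effect of dividing the same value range in two patterns can be illustrated with a short sketch. Only the range “21” to “40” for C2 and the overall range “10” to “60” are taken from the description; the remaining class boundaries, in particular those of D1 to D3, are assumptions chosen so that “21” and “40” fall near the middle of a class.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # inclusive (low, high)

# C2 = 21..40 follows the description; the other boundaries are assumed.
CLASSES_C: Dict[str, Range] = {"C1": (10, 20), "C2": (21, 40), "C3": (41, 60)}
CLASSES_D: Dict[str, Range] = {"D1": (10, 30), "D2": (31, 50), "D3": (51, 60)}


def class_of(value: int, classes: Dict[str, Range]) -> str:
    """Return the name of the class whose range contains the value."""
    for name, (lo, hi) in classes.items():
        if lo <= value <= hi:
            return name
    raise ValueError(f"{value} is outside every class")


def boundary_distance(value: int, classes: Dict[str, Range]) -> int:
    """Distance from the value to the nearest boundary of its class."""
    lo, hi = classes[class_of(value, classes)]
    return min(value - lo, hi - value)
```

Under these assumed boundaries, “21” lies on the boundary of C2 (distance 0) but well inside D1 (distance 9), and “40” lies on the boundary of C2 but well inside D2, so at least one of the two patterns places each value away from a boundary.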
As described above, in the voice processing device 100 according to the present example embodiment, the first attribute identifying unit 130 roughly estimates the attribute value as the first attribute information, and the second attribute identifying unit 140 estimates the attribute value in detail by using the first attribute information. Thus, according to the present example embodiment, the attribute value can be accurately estimated for the voice signal. That is, the voice processing device 100 according to the present example embodiment can enhance the accuracy of attribute identification of a person.
Next, an operation of the voice processing device 100 according to the first example embodiment will be described with reference to a flowchart of
The voice processing device 100 receives one or more voice signals from the outside and provides the voice signals to the voice section detecting unit 110. The voice section detecting unit 110 segments the received voice signals and outputs the segmented voice signals to the speaker feature vector calculating unit 120 (step S101).
The speaker feature vector calculating unit 120 calculates speaker feature vectors for each of the received one or more segmented voice signals (step S102).
The first attribute identifying unit 130 identifies and outputs first attribute information on the basis of the received one or more speaker feature vectors (step S103).
The second attribute identifying unit 140 identifies and outputs second attribute information on the basis of the received one or more speaker feature vectors and first attribute information (step S104). When the reception of the voice signals from the outside is completed, the voice processing device 100 ends the series of processing.
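Steps S101 to S104 can be summarized as a simple processing chain. In the sketch below, the function name `process_voice_signal` and the four callables are hypothetical placeholders for the units 110 to 140; only the order of the steps and the data passed between them follow the description.

```python
from typing import Any, Callable, List, Sequence, Tuple


def process_voice_signal(
    signal: Sequence[float],
    detect_sections: Callable[[Sequence[float]], List[Sequence[float]]],
    calc_features: Callable[[Sequence[float]], Sequence[float]],
    identify_first: Callable[[Sequence[float]], Any],
    identify_second: Callable[[Sequence[float], Any], Any],
) -> List[Tuple[Any, Any]]:
    """Return (first attribute, second attribute) for each voice section."""
    results = []
    for segment in detect_sections(signal):         # step S101
        features = calc_features(segment)           # step S102
        first = identify_first(features)            # step S103
        second = identify_second(features, first)   # step S104
        results.append((first, second))
    return results
```

The second identifier receives both the features and the first attribute information, which is the dependency that distinguishes this flow from obtaining two estimates independently.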
As described above, the voice processing device 100 according to the present example embodiment can improve the accuracy of attribute identification of a person. This is because the voice processing device 100 uses the first attribute information roughly estimated by the first attribute identifying unit 130 to estimate and output the attribute information in more detail by the second attribute identifying unit 140.
In addition, because the voice processing device 100 according to the present example embodiment identifies the attribute of a person in progressively finer detail, the estimated value can be obtained with a certain accuracy regardless of the possible value of the attribute.
The voice processing device 100 according to the first example embodiment is an example of an attribute identifying device that identifies specific attribute information from a voice signal. The voice processing device 100 can be used as an age identifying device when the specific attribute information is the age of a speaker. The attribute information may be information indicating the sex of a speaker, information indicating an age group of a speaker, or information indicating the physique of a speaker.
The voice processing device 100 can be used as an emotion identifying device when the specific attribute information is information indicating the emotion of a speaker during an utterance. The voice processing device 100 can also be used, for example, as a part of a voice retrieval device or a voice display device provided with a mechanism for specifying, from among a plurality of stored voice signals, a voice signal associated with a specific emotion, on the basis of emotion information estimated by use of emotion feature vectors. The emotion information includes, for example, information indicating emotion expressions and information indicating the character of a speaker.
An example embodiment having a minimum configuration will be described.
The first attribute identifying unit 130 identifies, on the basis of a biological signal, first attribute information, which is a range of specific attribute values, from the biological signal. The second attribute identifying unit 140 identifies second attribute information, which is specific attribute information, from the biological signal and the first attribute information.
According to the second example embodiment, by adopting the above configuration, the second attribute identifying unit 140 uses, as an input, the first attribute information output by the first attribute identifying unit 130, and thus it is possible to obtain an effect that the accuracy of attribute identification of a person can be further enhanced.
While the embodiments have been particularly shown and described with reference to example embodiments thereof, the embodiments are not limited to these example embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the example embodiments as defined by the claims. That is, the embodiments are not limited to the above example embodiments, but various modifications are possible, and it goes without saying that the modifications are also included within the scope of the example embodiments.
As described above, the voice processing device and the like according to one aspect of the example embodiments have the effect of enhancing the accuracy of attribute identification of a person, and are useful as a voice processing device and the like and an attribute identifying device.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2018/023594 | 6/21/2018 | WO | 00 |