This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2012-256715, filed on Nov. 22, 2012, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are, for example, related to a signal processing device that controls input and output signals such as speech signals and image signals, a method for processing a signal, a computer-readable recording medium on which a signal processing program is recorded, and a mobile terminal device.
Methods have been disclosed for controlling a speech signal, which is an example of an input signal, such that the speech signal becomes easier to hear. For example, as reported in Katsuyuki Niyada, "Research on Speech Recognition Interface for Elderly People Based on Acoustic Feature Extraction of Elderly Speech", Report of the Grant-in-Aid for Scientific Research, 19560387, because the speech recognition performance of elderly people declines as their hearing deteriorates with age, it generally becomes difficult for elderly people to recognize a speaker's speech in a received speech sound in bidirectional speech communication using mobile terminals or the like when the speed of the speech increases. Japanese Patent No. 3619946, for example, therefore discloses a technique that improves the audibility of speech by detecting speech periods in a received speech sound and extending the speech, and that reduces the delay caused by the extension by shortening non-speech periods.
In the control of speech signals, because the appropriate amount of control (for example, the amount by which the power of a speech signal is amplified) varies with age, a user is expected to input his/her age, which decreases operability. In the case of a mobile terminal used by only a single user, the user might not have to input his/her age frequently. In recent years, however, calls are often made through a speech application installed on a personal computer shared by multiple users. In addition, a fixed telephone installed in a call center or the like is often shared by many users, and the decrease in operability caused by inputting one's age is therefore not negligible.
Under such circumstances, a method for estimating the age on the basis of the speed of speech has been disclosed. In this method, the speed of speech is calculated from an input sound at a time when a speaker has uttered a particular sample sentence or word, and if the speed of speech is lower than a predetermined threshold, the speaker is determined to be an elderly person. The speed of speech in this case is the number of vowels (the number of moras) uttered in one second, and the unit used is moras/s.
In accordance with an aspect of the embodiments, a signal processing device includes a processor and a memory that stores a plurality of instructions which, when executed by the processor, cause the processor to execute: receiving speech of a speaker as a first signal; detecting an expiration period included in the first signal; extracting the number of phonemes included in the expiration period; and controlling a second signal, which is output to the speaker, on the basis of the number of phonemes and the length of the expiration period.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
These and/or other aspects and advantages will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawing of which:
Signal processing devices, a method for processing a signal, and a signal processing program according to embodiments will be described hereinafter. The embodiments do not limit the technique disclosed herein.
The input unit 2 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the input unit 2 may be a function module realized by a computer program executed by the signal processing device 1. The input unit 2 obtains, from the outside, a first signal, which is a near-end signal, generated by speech uttered by a speaker and a second signal, which is a far-end signal, including a received speech sound. The input unit 2 may receive the first signal from, for example, a microphone, which is not illustrated, connected to or provided in the signal processing device 1. The input unit 2 may receive the second signal through, for example, a wired or wireless circuit and decode the second signal using a decoding unit, which is not illustrated, connected to or provided in the signal processing device 1. The input unit 2 outputs the received first signal to the detection unit 3 and the extraction unit 4. In addition, the input unit 2 outputs the received second signal to the FFT unit 7.
The detection unit 3 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the detection unit 3 may be a function module realized by a computer program executed by the signal processing device 1. The detection unit 3 receives the first signal from the input unit 2. The detection unit 3 detects an expiration period included in the first signal. The expiration period is, for example, a period from the beginning of speech after the speaker inspires to the next inspiration (that is, a period between a first inspiration and a second inspiration, or a period for which speech continues). The detection unit 3 may detect, for example, an average signal-to-noise ratio (SNR), which serves as a signal power-to-noise ratio, from a plurality of frames included in the first signal, and detect a period during which the average SNR satisfies a certain condition as the expiration period. The detection unit 3 outputs the detected expiration period to the extraction unit 4 and the calculation unit 5.
Here, a process for detecting an expiration period performed by the detection unit 3 will be described.
In
Here, f denotes a frame number (f is an integer equal to or larger than 0) sequentially given to each frame from the beginning of input of acoustic frames included in the first signal, M denotes the time length of one frame, t denotes time, and c(t) denotes the amplitude (power) of the first signal.
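The per-frame volume computation described above can be sketched as follows. Math. 1 itself is not reproduced in this text, so the sum-of-squares power over the frame's M samples used here is an assumed definition consistent with S(f) being called a "volume" (power).

```python
def frame_volume(c, f, M):
    """Volume S(f) of frame f: sum of squared amplitudes c(t) over the
    frame's M samples. The exact form of Math. 1 is elided in the text,
    so this sum-of-squares power is an assumption."""
    start = f * M
    return sum(sample * sample for sample in c[start:start + M])

# Example: a constant-amplitude first frame, then a silent frame (M = 4)
signal = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
print(frame_volume(signal, 0, 4))  # 1.0
print(frame_volume(signal, 1, 4))  # 0.0
```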
The noise estimation section 12 receives the volume S(f) of each frame from the volume calculation section 11. The noise estimation section 12 estimates noise in each frame and outputs a result of the estimation of noise to the average SNR calculation section 13. In the estimation of noise in each frame made by the noise estimation section 12, for example, a first method for estimating noise or a second method for estimating noise, which will be described hereinafter, may be used.
(First Method for Estimating Noise)
The noise estimation section 12 may estimate the magnitude (power) N(f) of noise in the frame f using the following expression on the basis of the volume S(f) of the frame f, the volume S(f−1) of the previous frame f−1, and the magnitude N(f−1) of noise in the previous frame f−1.
Here, α and β are constants that may be experimentally determined; for example, α may be 0.9 and β may be 2.0. In addition, the initial value N(−1) of the noise power may be experimentally determined. In Math. 2, if the volume S(f) of the frame f does not change from the volume S(f−1) of the previous frame f−1 by the constant β or more, the noise power N(f) of the frame f is updated. On the other hand, if the volume S(f) of the frame f changes from the volume S(f−1) of the previous frame f−1 by the constant β or more, the noise power N(f−1) of the previous frame f−1 is taken as the noise power N(f) of the frame f. The noise power N(f) may be used as the result of the estimation of noise.
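The first method can be sketched as follows. Because the expression Math. 2 is not reproduced in this text, the exponential smoothing used in the update branch is an assumption based on the constant α; the gating on β follows the prose exactly.

```python
ALPHA = 0.9  # smoothing constant α from the text
BETA = 2.0   # update-gate constant β from the text

def update_noise_first(S_f, S_prev, N_prev):
    """First noise-estimation method, reconstructed from the prose:
    update the noise power only while the frame volume stays within β
    of the previous frame's volume; otherwise carry N(f-1) over.
    The smoothed-update form itself is an assumption (Math. 2 is
    elided from the text)."""
    if abs(S_f - S_prev) < BETA:
        # Volume is stable: likely background noise, so update smoothly.
        return ALPHA * N_prev + (1.0 - ALPHA) * S_f
    # Large volume change: likely speech onset, so freeze the estimate.
    return N_prev
```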
(Second Method for Estimating Noise)
The noise estimation section 12 may update the magnitude of noise using the following Math. 3 on the basis of the ratio of the volume S(f) of the frame f to the noise power N(f−1) of the previous frame f−1.
Here, γ is a constant, and may be experimentally determined. For example, γ may be 2.0. In addition, the initial value N(−1) of noise power may be experimentally determined. In Math. 3, if the volume S(f) of the frame f is smaller than the product of the noise power N(f−1) of the previous frame f−1 and the constant γ, the noise power N(f) of the frame f is updated. On the other hand, if the volume S(f) of the frame f is equal to or larger than the product of the noise power N(f−1) of the previous frame f−1 and the constant γ, the noise power N(f−1) of the previous frame f−1 is determined to be the noise power N(f) of the frame f.
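The second method can be sketched in the same way. The gating on γ follows the prose; since Math. 3 is not reproduced in this text, the smoothed-update form used in the update branch is again an assumption.

```python
GAMMA = 2.0  # constant γ from the text

def update_noise_second(S_f, N_prev, alpha=0.9):
    """Second noise-estimation method, reconstructed from the prose:
    update only while S(f) is smaller than γ·N(f-1); otherwise carry
    N(f-1) over. The smoothing coefficient alpha and the update form
    are assumptions (Math. 3 is elided from the text)."""
    if S_f < GAMMA * N_prev:
        # Volume is close to the noise floor: update the estimate.
        return alpha * N_prev + (1.0 - alpha) * S_f
    # Volume well above the noise floor: likely speech, keep N(f-1).
    return N_prev
```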
In
Here, L may be determined to be a value larger than the general length of a geminate consonant; that is, L may be set to the number of frames corresponding to 0.5 ms.
The expiration period determination section 14 receives the average SNR from the average SNR calculation section 13. The expiration period determination section 14 includes a buffer or a cache, which is not illustrated, that holds a flag f_breath, updated as each frame is processed, indicating whether or not the current time is in an expiration period. The expiration period determination section 14 detects a beginning tb of an expiration period using the following Math. 5 and an end te of the expiration period using the following Math. 6 on the basis of the average SNR and the flag f_breath.
tb = f × M (if f_breath = not in expiration period and SNR(f) > THSNR) Math. 5
te = f × M − 1 (if f_breath = in expiration period and SNR(f) < THSNR) Math. 6
Here, THSNR is a threshold (may be referred to as a second threshold) for determining that the frame f processed by the expiration period determination section 14 is not noise, and may be experimentally determined. The expiration period determination section 14 outputs the expiration period detected using Math. 5 and Math. 6 to the extraction unit 4 and the calculation unit 5 through the detection unit 3.
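The determination of Math. 5 and Math. 6 amounts to a two-state scan over the per-frame average SNR, which can be sketched as follows. The frame length M and the threshold value in the example are illustrative assumptions.

```python
def detect_expiration_periods(avg_snr, M, th_snr):
    """Expiration-period detection following Math. 5 and Math. 6:
    a period opens at tb = f*M when the average SNR rises above THSNR
    while no period is active (Math. 5), and closes at te = f*M - 1
    when the SNR falls back below THSNR (Math. 6).
    avg_snr is the sequence of per-frame average SNR values."""
    periods = []
    in_breath = False  # the flag f_breath held by the determination section
    tb = 0
    for f, snr in enumerate(avg_snr):
        if not in_breath and snr > th_snr:
            tb = f * M                        # Math. 5: period begins
            in_breath = True
        elif in_breath and snr < th_snr:
            periods.append((tb, f * M - 1))   # Math. 6: period ends
            in_breath = False
    return periods

# Example with M = 160 samples per frame and THSNR = 6 (assumed values)
print(detect_expiration_periods([0, 10, 12, 11, 2, 0], 160, 6))
# [(160, 639)]
```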
In
The calculation unit 5 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the calculation unit 5 may be a function module realized by a computer program executed by the signal processing device 1. The calculation unit 5 receives the expiration period from the detection unit 3 and the number of phonemes included in the expiration period from the extraction unit 4. The calculation unit 5 calculates the speed of speech in the expiration period by dividing the number of phonemes included in the expiration period by the length of the expiration period. In the first embodiment, a value obtained by dividing the number of phonemes included in the expiration period by the length of the expiration period is defined as the speed of speech in the expiration period for convenience of description. The calculation unit 5 outputs the calculated speed of speech in the expiration period to the estimation unit 6.
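The speed-of-speech computation performed by the calculation unit 5 is a single division, sketched below. Expressing the expiration period in samples and converting with a sample rate (8 kHz is an assumed value) yields phonemes per second.

```python
def speech_speed(num_phonemes, tb, te, sample_rate=8000):
    """Speed of speech in an expiration period, as defined in the text:
    the number of phonemes divided by the length of the period.
    (tb, te) delimit the period in samples; the sample rate (assumed
    8 kHz here) converts the length to seconds."""
    length_sec = (te - tb + 1) / sample_rate
    return num_phonemes / length_sec

# 12 phonemes uttered over a 2-second expiration period
print(speech_speed(12, 0, 15999))  # 6.0 phonemes per second
```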
Here, the speed of speech in the expiration period in the first embodiment and the technical significance of calculating it from the number of phonemes extracted in the expiration period will be described. In order to examine factors that change the speed of speech in bidirectional speech communication, in which the content of speech continuously changes, the inventors focused on the fact that speakers speak a unified set of contents, such as a sentence, in one expiration period, as disclosed in C. W. Wightman, "Automatic Labeling of Prosodic Patterns", IEEE, 1994. The inventors then conducted a demonstration experiment under the assumption that speakers unconsciously adjust their speed of speech in order to finish speaking a unified set of contents in one expiration period.
Now, a method for determining the first threshold will be described. The first threshold may be, for example, determined before a speaker uses the signal processing device 1 and saved in a cache or a memory, which is not illustrated, of the estimation unit 6 illustrated in
oy(x)={o(x)+y(x)}/2 Math. 7
Next, an estimation equation f(x) for estimating oy from x is determined using the method of least squares as follows.
f(x)=ax+b Math. 8
Here, a and b are values with which an error err represented by the following expression becomes smallest.
err = Σ (x = 1 to M) {oy(x) − f(x)}² Math. 9
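The least-squares fit of Math. 8 that minimizes the error of Math. 9 has the standard closed-form solution via the normal equations, sketched below.

```python
def fit_line(xs, oys):
    """Least-squares fit of f(x) = a*x + b (Math. 8) to the averaged
    observations oy(x), minimizing err = sum of {oy(x) - f(x)}^2
    (Math. 9), via the closed-form normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(oys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, oys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points lying exactly on oy = 2x + 1 recover a = 2, b = 1
a, b = fit_line([1, 2, 3], [3, 5, 7])
print(a, b)  # 2.0 1.0
```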
Here, the first threshold, which changes in accordance with the number of moras included in the expiration period, may be calculated by substituting the number of moras extracted by the extraction unit 4 into Math. 8, or a table representing the relationship between the number of moras in the expiration period and the first threshold may be saved, for example, in the memory or the cache, which is not illustrated, of the estimation unit 6 illustrated in
In
(First Method for Estimating Age)
θ=ax+b Math. 10
If the speed of speech in the expiration period calculated by the calculation unit 5 is denoted by s, an estimated age g may be obtained as follows using Math. 10.
(Second Method for Estimating Age)
If the speed of speech in the expiration period calculated by the calculation unit 5 is denoted by s, the estimated age g may be complementarily obtained as follows using Math. 10.
Here, th is an arbitrary threshold, and may be experimentally determined.
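Because the equations that follow Math. 10 are not reproduced in this text, the sketch below shows only one plausible reading of the estimation: the speed of speech s is compared against the mora-count-dependent threshold θ = a·x + b (Math. 10), with the margin th of the second method, and the speaker is classified as elderly when s falls below it (consistent with the observation that the speed of speech of elderly speakers is lower). The coefficients in the example are illustrative assumptions.

```python
def estimate_elderly(s, x, a, b, th=0.0):
    """Hedged sketch of the age estimation: the equations following
    Math. 10 are elided from the text, so this only illustrates the
    comparison the surrounding prose implies. The threshold
    theta = a*x + b (Math. 10) depends on the mora count x of the
    expiration period; a speaker whose speed of speech s falls below
    theta by more than the margin th is classified as elderly."""
    theta = a * x + b  # first threshold for this mora count (Math. 10)
    return s < theta - th

# Illustrative coefficients only (a, b would come from the fit of Math. 8)
print(estimate_elderly(s=5.0, x=10, a=0.2, b=4.0))  # True: 5.0 < 6.0
```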
In
The control unit 8 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the control unit 8 may be a function module realized by a computer program executed by the signal processing device 1. The control unit 8 receives the result of the estimation of the age according to the speed of speech of the speaker from the estimation unit 6 and the frequency spectrum from the FFT unit 7. The control unit 8 controls the frequency spectrum on the basis of the result of the estimation of the age according to the speed of speech of the speaker.
R′(f)=R(f)+G(f) Math. 13
Here, R′(f) denotes the corrected power of the frequency spectrum, R(f) denotes the power of the frequency spectrum before the correction, and G(f) denotes the decrease in hearing according to the estimated age determined by
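The power correction of Math. 13 is a per-bin addition of a gain to the frequency spectrum, sketched below. The specific gain values, which would be derived from the hearing loss associated with the estimated age, are illustrative assumptions.

```python
def correct_spectrum(R, G):
    """Power correction of the far-end spectrum per Math. 13:
    R'(f) = R(f) + G(f), where G(f) compensates, per frequency bin,
    the decrease in hearing associated with the estimated age
    (both quantities in dB here)."""
    return [r + g for r, g in zip(R, G)]

# Boost higher bins more, as age-related hearing loss is worst there
# (the gain values below are assumed, not taken from the text)
R = [60.0, 55.0, 50.0, 40.0]   # spectrum power before correction (dB)
G = [0.0, 2.0, 5.0, 10.0]      # age-dependent gain per bin (dB)
print(correct_spectrum(R, G))  # [60.0, 57.0, 55.0, 50.0]
```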
Alternatively, the control unit 8 may convert the frequency spectrum received from the FFT unit 7 into the speed of speech according to the estimated age (that is, convert the frequency). In the conversion of the speed of speech, for example, a method disclosed in Japanese Patent No. 3619946 may be adopted. Alternatively, the control unit 8 may combine the above-described correction of power and conversion of the speed of speech.
In
The detection unit 3 receives the first signal from the input unit 2 and detects an expiration period included in the first signal (step S902). The detection unit 3 outputs the detected expiration period to the extraction unit 4 and the calculation unit 5.
The extraction unit 4 receives the first signal from the input unit 2 and the expiration period from the detection unit 3. The extraction unit 4 extracts the number of phonemes of the first signal included in the expiration period (step S903). The extraction unit 4 outputs the number of phonemes included in the expiration period to the calculation unit 5 and the estimation unit 6.
The calculation unit 5 receives the expiration period from the detection unit 3 and the number of phonemes included in the expiration period from the extraction unit 4. The calculation unit 5 calculates the speed of speech in the expiration period by dividing the number of phonemes included in the expiration period by the length of the expiration period (step S904). The calculation unit 5 outputs the calculated speed of speech in the expiration period to the estimation unit 6.
The estimation unit 6 receives the speed of speech in the expiration period from the calculation unit 5 and the number of phonemes included in the expiration period from the extraction unit 4. The estimation unit 6 estimates an age according to the speed of speech of the speaker on the basis of the speed of speech in the expiration period, the number of phonemes included in the expiration period, and the first threshold according to the number of phonemes included in the expiration period (step S905). The estimation unit 6 outputs a result of the estimation of the age according to the speed of speech of the speaker to the control unit 8.
The control unit 8 receives the result of the estimation of the age according to the speed of speech of the speaker from the estimation unit 6 and the frequency spectrum calculated by the FFT unit 7. The control unit 8 controls the frequency spectrum of the second signal on the basis of the result of the estimation of the age according to the speed of speech of the speaker (step S906). The control unit 8 outputs the controlled second signal to the outside through the IFFT unit 9.
The input unit 2 determines whether or not the second signal is still being received (step S907). If the input unit 2 determines that the second signal is no longer being received (NO in step S907), the signal processing device 1 ends the signal processing illustrated in
According to the signal processing device according to the first embodiment, the age according to the speed of speech of the speaker may be accurately estimated. Furthermore, according to the signal processing device according to the first embodiment, since the far-end signal is controlled in accordance with the estimated age, the audibility for the user who is listening to the speech improves. Although the first embodiment estimates the age of the speaker who generates the first signal, which is the near-end signal, the estimation may instead be performed on the speech receiving side, on which the second signal, which is the far-end signal, is generated; the second signal may then be controlled on the basis of the age estimated from the speed of speech of the speaker who generated the first signal.
The input unit 2 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the input unit 2 may be a function module realized by a computer program executed by the signal processing device 20. The input unit 2 obtains, from the outside, a first signal, which is a near-end signal, generated by speech uttered by a speaker. The input unit 2 may receive the first signal from, for example, a microphone, which is not illustrated, connected to or provided in the signal processing device 20. The input unit 2 outputs the input first signal to the detection unit 3, the extraction unit 4, and the response unit 15.
The response unit 15 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the response unit 15 may be a function module realized by a computer program executed by the signal processing device 20. The response unit 15 includes a memory or a storage section such as a hard disk drive (HDD) or a solid-state drive (SSD), which is not illustrated. The memory or the storage section stores in advance a plurality of pieces of image information or speech information. Alternatively, the response unit 15 may receive the image information or the speech information from the outside. The response unit 15 receives the first signal from the input unit 2, selects a certain piece of image information or speech information associated with the first signal using a known speech recognition technique, and outputs the selected piece of image information or speech information to the control unit 8 as the second signal.
The control unit 8 is, for example, a hardware circuit adopting wired logic connection. Alternatively, the control unit 8 may be a function module realized by a computer program executed by the signal processing device 20. The control unit 8 receives the result of the estimation of the age according to the speed of speech of the speaker from the estimation unit 6 and the second signal from the response unit 15. The control unit 8 controls the second signal on the basis of the result of the estimation of the age according to the speed of speech of the speaker. The control unit 8 outputs the controlled second signal to the outside as an output signal. The control unit 8 may output the output signal to, for example, a speaker, which is not illustrated, connected to or provided in the signal processing device 20.
When the second signal includes speech information, the control unit 8 performs the control of power using Math. 13 or the control of the speed of speech disclosed in Japanese Patent No. 3619946. Alternatively, the control unit 8 may combine the correction of power of the second signal and the conversion of the speed of speech.
When the second signal includes image information, the control unit 8 controls pixels or regions of the image information. Here, the pixels in the second embodiment indicate information including at least luminance, tone, contrast, noise, sharpness, and resolution or information to be subjected to a filtering process. In addition, the regions in the second embodiment indicate information to be subjected to a geometric conversion process including at least the size and the shape of an image.
The control unit 8 may perform, for example, a process for enhancing edges included in image information disclosed in Japanese Laid-open Patent Publication No. 2011-134203 in accordance with the estimated age. In addition, the control unit 8 may perform a process for enlarging regions included in image information in accordance with the estimated age using the following expression.
I′(x,y)=I(x/(g×a),y/(g×a)) Math. 14
Here, I′(x, y) denotes corrected image information regarding a pixel x, y, I(x, y) denotes image information regarding the pixel x, y before the correction, and g×a denotes a parameter of decrease in eyesight due to aging.
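The enlargement of Math. 14 maps each output pixel back to a source pixel at coordinates scaled down by g×a. The nearest-neighbor sampling below is an assumption, as the text does not specify the interpolation.

```python
def enlarge(image, scale):
    """Region enlargement per Math. 14: I'(x, y) = I(x/(g*a), y/(g*a)),
    where scale is the factor g*a tied to the age-related decrease in
    eyesight. Nearest-neighbor sampling is an assumption; the text does
    not state the interpolation method. image is a 2-D list of pixels."""
    h, w = len(image), len(image[0])
    out_h, out_w = int(h * scale), int(w * scale)
    return [[image[min(int(y / scale), h - 1)][min(int(x / scale), w - 1)]
             for x in range(out_w)]
            for y in range(out_h)]

# 2x enlargement of a 2x2 image
print(enlarge([[1, 2], [3, 4]], 2.0))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```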
According to the signal processing device according to the second embodiment, the age according to the speed of speech of the speaker may be accurately estimated. Furthermore, according to the signal processing device according to the second embodiment, since the image information or the speech information is controlled in accordance with the estimated age, the visibility and the audibility for the user improve.
The control unit 21 is a central processing unit (CPU) that controls the other components and that calculates and processes data in the computer. In addition, the control unit 21 is an arithmetic device that executes programs stored in the main storage unit 22 and the auxiliary storage unit 23, and receives data from the input unit 27 and a storage device, calculates and processes the data, and outputs the data to the display unit 28, the storage device, or the like.
The main storage unit 22 is a read-only memory (ROM), a random-access memory (RAM), or the like, and is a storage device that stores or temporarily saves programs and data such as an operating system (OS), which is basic software, and application software to be executed by the control unit 21.
The auxiliary storage unit 23 is an HDD or the like and is a storage device that stores data regarding application software and the like.
The drive device 24 reads a program from a recording medium 25, that is, for example, a flexible disk, and installs the program in the auxiliary storage unit 23.
A certain program is stored in the recording medium 25, and the certain program stored in the recording medium 25 is installed in the signal processing device 1 through the drive device 24. The installed certain program may be executed by the signal processing device 1.
The network I/F unit 26 is an interface between the signal processing device 1 and a peripheral device that has a communication function and that is connected through a network, such as a local area network (LAN) or a wide area network (WAN), constructed by a data transmission path such as a wired and/or wireless line.
The input unit 27 includes a cursor key, a keyboard including numeric input keys and various function keys, and a mouse or a touchpad for selecting a key in a screen displayed on the display unit 28. The input unit 27 is a user interface for enabling the user to issue an operation instruction to the control unit 21 and input data.
The display unit 28 includes a cathode ray tube (CRT) or a liquid crystal display (LCD), and performs display according to display data input from the control unit 21.
The above-described method for processing a signal may be realized as a program to be executed by the computer. By installing the program from a server or the like and causing the computer to execute the program, the above-described method for processing a signal may be realized.
Alternatively, the above-described signal processing may be realized by recording the program on the recording medium 25 and causing the computer or a mobile terminal (device) (for example, mobile telephone) to read the recording medium 25 on which the program is recorded. The recording medium 25 may be one of various types of recording media including a recording medium that optically, electrically, or magnetically records information, such as a compact disc read-only memory (CD-ROM), a flexible disk, or a magneto-optical disk, and a semiconductor memory that electrically records information, such as a ROM or a flash memory.
The antenna 31 transmits a wireless signal amplified by a transmission amplifier and receives a wireless signal from a base station. The radio unit 32 performs digital-to-analog conversion on a transmission signal spread by the baseband processing unit 33, converts the transmission signal into a high-frequency signal through quadrature modulation, and amplifies the signal using a power amplifier. The radio unit 32 amplifies a received wireless signal, performs analog-to-digital conversion on the signal, and transmits the signal to the baseband processing unit 33.
The baseband processing unit 33 performs baseband processing including addition of an error correction code to data to be transmitted, data modulation, spreading modulation, inverse spreading of a received signal, a determination as to a reception environment, a determination as to a threshold for each channel signal, and error correction decoding.
The control unit 21 performs wireless control such as transmission and reception of control signals. In addition, the control unit 21 executes a signal processing program stored in the auxiliary storage unit 23 or the like in order to, for example, perform the signal processing according to the first embodiment.
The main storage unit 22 is a ROM, a RAM, or the like, and is a storage device that stores or temporarily saves programs and data such as an OS, which is basic software, and application software to be executed by the control unit 21.
The auxiliary storage unit 23 is an HDD, an SSD, or the like, and is a storage device that stores data regarding application software and the like.
The terminal interface unit 34 performs data adapter processing and interface processing between a handset and an external data terminal.
The microphone 35 receives ambient sound including speech uttered by a speaker, and outputs the ambient sound to the control unit 21 as a microphone signal. The speaker 36 outputs the signal output from the main storage unit 22 as an output signal.
The components of each device illustrated above do not have to be physically configured as illustrated. That is, specific modes of distributing or integrating each device are not limited to those illustrated, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units in accordance with various loads, use conditions, and the like. The various processes described in the above embodiments may be realized by executing prepared programs on a computer such as a personal computer or a workstation.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2012-256715 | Nov 2012 | JP | national |