The present invention relates to a speech imagery recognition device, a wearing fixture, a speech imagery recognition method, and a program.
A speech input device that is in practical use receives speech sound waves from a microphone or bone conduction vibrations from a vibration pickup to recognize speech information from the signals obtained.
Recently, by storing and using on a network probability information about phoneme sequences (acoustic models) and word sequences (language models) learned from huge amounts of voice and language data, high-speed, high-performance speech recognition has been achieved. Meanwhile, because speaking can annoy others and leak information, and because the number of patients with amyotrophic lateral sclerosis (ALS) who have difficulty speaking is increasing, there is a demand in the field of brain-computer interfaces (BCI) for language recognition based on speech imagery, without speech production.
Speech recognition of spoken words from speech imagery signals has lately been attempted by monitoring 64- to 128-channel electrocorticographic (ECoG) recordings from the cerebral cortex (see Non-Patent Document 1). However, it is not realistic to apply such a method, which involves craniotomy, to anyone other than critically ill patients. Besides, although a technology that uses electrodes placed on the scalp to record an electroencephalogram (EEG) would make an invaluable contribution to society if put to practical use, attempts to find meaningful speech signals in the noise have not been successful so far.
In recent years, research has progressed on analyzing the brain during speech production with high-resolution devices such as PET and fMRI and on monitoring ECoG while a patient speaks during craniotomy, and it is becoming clearer which parts of the brain process speech. According to the results, after concept preparation in the left middle temporal gyrus (MTG), planning for speech takes place in the left superior temporal gyrus (STG) (see Non-Patent Document 2). This is followed by syllabication in the left inferior frontal gyrus (IFG; Broca's area), and articulation occurs in the left precentral gyrus (PG; motor area) while speech is produced (see Non-Patent Document 3). Based on these research findings, it is expected that decoding of silent or imagined speech will become possible if the linguistic representation that reaches Broca's area can be captured.
In addition, there has been proposed a technology in which brain waves are detected to extract signals related to a motor command from the brain waves (see Patent Document 1).
The biggest problem in speech recognition from EEG signals is that, since it is unclear in what format the linguistic representation is expressed, no specific extraction method can be identified. Furthermore, without a method for converting the linguistic representation into phoneme units, efficient speech processing is hardly feasible because larger units such as syllables are too numerous: there are said to be thousands of syllables, long and short, whereas there are only about 24 phonemes in Japanese and 44 in English (English vowels are classified into tense and lax vowels, while Japanese vowels generally are not).
The present invention has been made in view of the above problems. An object of the present invention is to provide a speech imagery recognition device, a wearing fixture, a speech imagery recognition method, and a program that enable speech recognition using EEG signals.
To achieve the object mentioned above, the present invention is mainly characterized in that, in order to recognize speech from EEG signals during speech imagery, line spectral components are extracted as a linguistic representation by a line spectral component extractor, and these components are passed through a phoneme-feature vector time sequence converter that uses phoneme-specific convolution operation or the like to thereby obtain a phoneme-feature vector time sequence.
According to the first aspect of the present invention, there is provided a speech imagery recognition device that recognizes speech from EEG signals during speech imagery. The speech imagery recognition device includes: an analysis processor that analyzes discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes and outputs a spectral time sequence; and an extractor that outputs a phoneme-feature vector time sequence based on the spectral time sequence.
According to the second aspect of the present invention, there is provided a wearing fixture for a speech imagery recognition device that recognizes speech from EEG signals during speech imagery. The wearing fixture includes a plurality of electrodes configured to be placed over Broca's area and an output unit that outputs signals from the electrodes. The speech imagery recognition device is configured to: analyze discrete signals, which are obtained from EEG signals output from the output unit, for each of the electrodes to output a spectral time sequence; and extract and output a phoneme-feature vector time sequence based on the spectral time sequence.
According to the third aspect of the present invention, there is provided a speech imagery recognition method for recognizing speech from EEG signals during speech imagery. The speech imagery recognition method includes: analyzing discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes to output a spectral time sequence; and extracting and outputting a phoneme-feature vector time sequence based on the spectral time sequence.
According to the fourth aspect of the present invention, there is provided a program that causes a computer to perform a speech imagery recognition process of recognizing speech from EEG signals during speech imagery. The program causes the computer to: analyze discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes to output a spectral component as a linguistic representation; and extract phoneme features based on the spectral component for each of the electrodes.
According to one aspect of the present invention, it is possible to provide a speech imagery recognition device, a wearing fixture, a speech imagery recognition method, and a program that enable speech recognition using EEG signals.
In the following, exemplary embodiments of a speech imagery recognition device according to the present invention will be described with reference to the accompanying drawings. Note that the drawings are used to illustrate the technical features of the invention, and are not intended to limit the configuration of the device as well as various processing procedures and the like to those aspects illustrated therein unless otherwise specifically mentioned. Incidentally, like parts are designated by like reference numerals or characters throughout the description of the embodiments.
The speech imagery recognition device 1 includes an EEG input unit 2, a preprocessor 3, an analysis processor 4, a linguistic feature extractor 5, a word/sentence recognizer 6, and a post-processing/output unit 7. The EEG input unit 2 is configured to convert EEG signals received from a plurality of electrodes placed on the scalp (not illustrated) into discrete signals. The preprocessor 3 is configured to remove noise from the discrete signals for each electrode. The analysis processor 4 is configured to analyze the discrete signals for each electrode and output a spectral time sequence. The linguistic feature extractor 5 is configured to extract and output a phoneme-feature vector time sequence based on the spectral time sequence of all the electrodes. The word/sentence recognizer 6 is configured to recognize words and sentences that constitute a spoken language from the phoneme-feature vector time sequence. The post-processing/output unit 7 is configured to display speech information or output the information in audio.
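As a non-limiting illustration of the data flow just described, the following minimal sketch (in Python, with assumed class and function names that do not appear in the embodiments) shows how the units 2 to 7 can be chained as successive processing stages.

from typing import Callable, List

# A minimal structural sketch of the pipeline described above. The class and
# method names are assumptions made for illustration, not the actual implementation.
class SpeechImageryRecognizer:
    def __init__(self, stages: List[Callable], output: Callable):
        # stages: EEG input -> preprocessing -> analysis -> feature extraction -> recognition
        self.stages = stages
        self.output = output

    def run(self, raw_eeg):
        data = raw_eeg                     # EEG signals from the scalp electrodes
        for stage in self.stages:          # each stage corresponds to units 2-6
            data = stage(data)
        self.output(data)                  # display or audio output (unit 7)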
The EEG input unit 2 converts analog signals x(q, t) output from the EEG electrodes into discrete signals through A/D conversion or the like, and corrects the bias of the individual electrodes by using the average value of discrete signals of all the electrodes or the like. At the same time, the EEG input unit 2 removes unnecessary frequency components of 70 Hz or below by a low-frequency removal filter (high-pass filter) and unnecessary frequency components of 180 Hz or above by a high-frequency removal filter (low-pass filter) from discrete signals for each electrode, thereby outputting a signal x1(q, n).
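A minimal sketch of this input stage is given below, assuming a 1 kHz sampling rate and a fourth-order Butterworth filter; only the 70-180 Hz band limits and the bias correction by the all-electrode average come from the description above, and the remaining choices are illustrative.

import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 1000.0                                   # assumed sampling frequency [Hz]

def eeg_input_stage(x):                       # x: (n_electrodes, n_samples) discrete signals
    x = x - x.mean(axis=0, keepdims=True)     # bias correction using the all-electrode average
    sos = butter(4, [70.0, 180.0], btype="bandpass", fs=FS, output="sos")
    return sosfiltfilt(sos, x, axis=1)        # x1(q, n)

x1 = eeg_input_stage(np.random.randn(9, 4000))  # dummy 9-electrode recording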
The preprocessor 3 removes noise that has passed through the filters for each electrode. One example of this process will be described below. First, the discrete signal x1(q, n) (q: electrode number, n: time) of each electrode, which has undergone a series of processes in the EEG input unit 2, is multiplied by a certain time window, and then it is mapped from the time domain to the frequency domain using the Fast Fourier Transform (FFT). Thereafter, an amplitude spectral time sequence X1(q, f, n′) (f: frequency, n′: time frame number after windowing) is obtained from complex number components in the frequency domain as follows:
[Formula 1]
FFT: x1(q, n) → Re{X1(q, f, n′)} + j·Im{X1(q, f, n′)}   (1)
[Formula 2]
X1(q, f, n′) = [Re{X1(q, f, n′)}² + Im{X1(q, f, n′)}²]^(1/2)   (2)
where j represents the imaginary unit, and Re{ } and Im{ } represent the real part and the imaginary part, respectively. In noise subtraction, an average noise amplitude spectrum Nav(q, f, n′) is obtained from the spectrum N(q, f, n′) of an EEG signal recorded prior to speech imagery.
Here, the average noise spectrum is calculated from 8 frames before and after time n′; however, the number of frames may be set as appropriate depending on the system. In the setting of time n′, there are generally the following two ways:
(a) The user performs speech imagery in response to a prompt signal (a signal that indicates the start of imagery) provided by a speech imagery recognition application system.
(b) The user performs speech imagery after providing the application system with a predetermined call (wake-up word) such as “Yamada-san”.
In both cases, N(q, f, n′) is calculated from EEG signals recorded in the section before or after the speech imagery.
Then, for each electrode q, Nav(q, f, n′) is subtracted from the speech imagery signal spectrum X1(q, f, n′) as represented by the following formula:
[Formula 4]
X2(q, f, n′) = X1(q, f, n′) − Nav(q, f, n′)   (4)
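The following sketch illustrates formulas (1), (2), and (4) for a single electrode. The frame length, frame shift, and window are assumptions, and flooring negative values at zero after the subtraction is an added safeguard not stated above.

import numpy as np

FRAME, SHIFT = 256, 64

def amplitude_spectrogram(x1):
    win = np.hanning(FRAME)
    frames = [x1[i:i + FRAME] * win for i in range(0, len(x1) - FRAME, SHIFT)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T      # X1(f, n')

def subtract_noise(X1, N, half_width=8):
    # N: amplitude spectrogram of EEG recorded before (or after) speech imagery
    Nav = np.stack([N[:, max(0, n - half_width):n + half_width + 1].mean(axis=1)
                    for n in range(X1.shape[1])], axis=1)        # average of +/-8 frames
    return np.maximum(X1 - Nav, 0.0)                             # X2(f, n')

speech = np.random.randn(4000)     # x1(q, n) for one electrode during imagery
noise = np.random.randn(4000)      # EEG recorded prior to imagery
X2 = subtract_noise(amplitude_spectrogram(speech), amplitude_spectrogram(noise))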
It should be noted that it is effective to perform the process of extracting a small number of independent information sources from signals of the 9 electrodes after noise removal, i.e., independent component analysis (ICA) (Non-Patent Document 4). This process can remove unnecessary components that cannot be removed by the filtering process and can also select a small number of effective information sources from discrete signals of the 9 electrodes. However, ICA has a drawback of so-called permutation in which the order of independent components varies in the result of each analysis. How this drawback is eliminated to incorporate ICA into this embodiment will be explained later.
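For illustration only, such an ICA step could be sketched as follows using FastICA from scikit-learn; the number of components is an assumption, and, as noted above, the order of the resulting components varies between runs (the permutation problem).

import numpy as np
from sklearn.decomposition import FastICA

x2 = np.random.randn(9, 4000)                          # 9 electrodes x samples after noise removal
ica = FastICA(n_components=3, random_state=0)          # assumed number of information sources
sources = ica.fit_transform(x2.T).T                    # (3, n_samples) independent components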
While the analysis processor 4 may use the spectral time sequence X2(q, f, n′) of the speech imagery signal after noise removal (and after extraction of q independent components) obtained by the preprocessor 3, in the example described below, linear predictive analysis (LPA) is used as an analysis method that better brings out the effect of the present invention. The analysis processor 4 can use a spectrum or a line spectrum.
Linear predictive coding (LPC) is currently a global standard method for speech communication. There are two information sources in speech: pulse waves at a constant frequency produced by the vocal cords and random waves produced by narrowing the vocal tract. For this reason, a complex process is required where sound sources are stored separately as a codebook, all the sound sources in the codebook are passed through the linear prediction coefficient of speech (which is responsible for the transfer function of the vocal tract), and then the synthesized speech is compared with the original speech.
On the other hand, the only source of information in brain waves is considered to be random waves, as illustrated in the accompanying drawings.
In the convolutional integration, the EEG spectrum can be expressed as X(f) = W(f)S(f) = S(f) (note: W(f) = 1), where S(f) represents the transfer (frequency) function of the impulse response s(n), which carries the spoken-language information in the frequency domain. The function S(f) can be obtained from the Fourier transform of the linear prediction coefficients {αm} as represented by the following formula:
[Formula 5]
X(f) = S(f) = F[s(n)] = F[α0δ(n) + α1δ(n−1) + α2δ(n−2) + … + αpδ(n−p)]   (5)
where δ(n−p) is a unit impulse function representing the sample at time n = p, and F[ ] is the Fourier transform. In linear predictive analysis (LPA) for EEG signals, as illustrated in the accompanying drawings, the spectral component is obtained by the following formula:
[Formula 6]
H(f) = σ/X(f) = σ/F[α0δ(n) + α1δ(n−1) + α2δ(n−2) + … + αpδ(n−p)]   (6)
where σ is an amplitude bias value. This method of performing accurate analysis through a synthesis process is called "Analysis-by-Synthesis (AbS)" and is also effective in EEG analysis. In the Fourier transform F[ ] of the above formula, zeros are appended to the linear prediction coefficients α0 (= 1.0), α1, …, αp (a technique called zero padding), which enables a Fourier transform of any number of points, for example 128 points, 256 points, and so on. With the zero padding, the frequency resolution can be adjusted arbitrarily to 64 points, 128 points, and so on, to obtain a spectral component A(q, f, n′).
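The following sketch illustrates formulas (5) and (6) for one analysis frame; the prediction order p = 12 and the 256-point zero-padded Fourier transform are assumed values, and the coefficients are estimated from the autocorrelation (Yule-Walker) equations.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpa_spectrum(frame, p=12, nfft=256):
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation r(0), r(1), ...
    w = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])                 # predictor weights
    alpha = np.concatenate(([1.0], -w))                            # alpha_0 = 1.0, ..., alpha_p
    sigma = np.sqrt(max(r[0] - w @ r[1:p + 1], 1e-12))             # residual amplitude (bias sigma)
    return sigma / np.abs(np.fft.rfft(alpha, n=nfft))              # H(f) via zero padding

frame = np.random.randn(256) * np.hanning(256)        # one windowed EEG frame
A = lpa_spectrum(frame)                               # spectral component A(f) for this frame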
Through the LPA analysis, the spectrum of EEG during speech imagery is represented with a small number of spectral peaks. This suggests that in the brain (especially in Broca's area where linguistic information of speech imagery is produced), the linguistic representation is composed of short sine waves (tone burst), in other words, the linguistic representation is represented by a unique line spectrum.
The linguistic feature extractor 5 extracts line spectral components as “linguistic representation” from spectra with a spread and outputs a phoneme-likelihood vector time sequence, which is a linguistic feature, through a phoneme-specific convolution operator.
The processing procedure will be described below with reference to the accompanying flowchart.
For data within a certain time width (several frames before and after time n′) and frequency width (adjacent frequencies f−1, f, f+1), the median of all the values is taken as a representative value. This process can absorb frequency fluctuations because it removes values that deviate from the median. The output of the nonlinear filter is generally smoothed by a Gaussian window or the like.
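A minimal sketch of this nonlinear filtering is given below; the neighborhood size (three frequency bins by five time frames) and the Gaussian widths are assumptions.

import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def nonlinear_smooth(S):                                  # S: spectrum (freq x frames)
    S_med = median_filter(S, size=(3, 5))                 # median over f-1..f+1 and nearby frames
    return gaussian_filter(S_med, sigma=(0.5, 1.0))       # light Gaussian smoothing

S_smooth = nonlinear_smooth(np.abs(np.random.randn(129, 100)))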
Next, the process of extracting a line spectrum (step S3) will be described. In this process, components derived from peaks appearing on the frequency axis are extracted as a line spectrum for each time frame (8 msec). Specifically, only the frequencies that satisfy the following conditions are defined as sinusoidal frequency components with the original amplitude, i.e., line spectrum components (a sketch of this extraction is given after the list):
(i) a frequency at which the spectrum takes a local maximum on the frequency axis, detected where the first derivative Δf = 0;
(ii) a frequency at an inflection point, detected where the second derivative ΔΔf = 0:
if Δf > 0, the frequency at which the value of ΔΔf changes from positive to negative;
if Δf < 0, the frequency at which the value of ΔΔf changes from negative to positive.
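The sketch below applies conditions (i) and (ii) to a smoothed spectrum using simple finite differences; it is an illustration of the extraction logic, not the exact implementation of the embodiment.

import numpy as np

def extract_line_spectrum(S):                             # S: smoothed spectrum (freq x frames)
    L = np.zeros_like(S)
    d1 = np.gradient(S, axis=0)                           # first derivative along frequency
    d2 = np.gradient(d1, axis=0)                          # second derivative along frequency
    for n in range(S.shape[1]):
        for f in range(1, S.shape[0] - 1):
            is_peak = d1[f - 1, n] > 0 > d1[f + 1, n]                           # condition (i)
            is_inflection = (d1[f, n] > 0 and d2[f - 1, n] > 0 > d2[f + 1, n]) or \
                            (d1[f, n] < 0 and d2[f - 1, n] < 0 < d2[f + 1, n])  # condition (ii)
            if is_peak or is_inflection:
                L[f, n] = S[f, n]                         # keep original amplitude, zero elsewhere
    return L

line_spec = extract_line_spectrum(np.abs(np.random.randn(129, 100)))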
The linguistic feature extractor 5 is aimed at extracting phoneme features in the end. Specifically, it is aimed at extracting phoneme components, which are the smallest unit of speech information, in the form of a phoneme feature vector from line spectral components of each electrode. Speech information in EEG signals has the so-called tensor structure that spans three axes: line spectrum (frequency information), electrodes (spatial information), and frames (temporal information).
The procedure for obtaining the phoneme-specific convolution operator will be described below with reference to the accompanying flowchart.
Next, principal component analysis is performed for each syllable (step S12), and the eigenvectors obtained for each syllable are grouped by the phonemes they contain, for example as follows: the phoneme /s/: {ψ/sa/(m), ψ/shi/(m), ψ/su/(m), ψ/se/(m), ψ/so/(m)}; the phoneme /a/: {ψ/a/(m), ψ/ka/(m), ψ/sa/(m), ψ/ta/(m), ψ/na/(m), . . . }. Then, an autocorrelation matrix is calculated from the eigenvectors of each phoneme group and integrated into a phoneme-specific autocorrelation matrix Rs, Ra, . . . (step S13). From the phoneme-specific autocorrelation matrix, subspaces (eigenvectors) φ/s/(m), φ/a/(m) can be obtained for the respective phonemes.
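A hedged sketch of steps S12 and S13 is given below, with assumed data shapes: each syllable sample is treated as a vectorized line-spectral pattern, and the phoneme grouping is represented simply as a list of eigenvector sets.

import numpy as np

def syllable_eigenvectors(samples, n_vec=3):
    # samples: (n_samples, dim) line-spectral patterns of one syllable
    R = samples.T @ samples / len(samples)                 # autocorrelation matrix
    vals, vecs = np.linalg.eigh(R)
    return vecs[:, np.argsort(vals)[::-1][:n_vec]].T       # psi: top eigenvectors

def phoneme_subspace(psi_groups, n_vec=5):
    # psi_groups: eigenvector sets from every syllable containing the phoneme
    psi = np.vstack(psi_groups)
    R_k = psi.T @ psi                                      # phoneme-specific matrix (Rs, Ra, ...)
    vals, vecs = np.linalg.eigh(R_k)
    return vecs[:, np.argsort(vals)[::-1][:n_vec]].T       # phi: phoneme subspace

dim = 64 * 8                                               # assumed freq bins x frames per pattern
sa, shi = np.random.randn(50, dim), np.random.randn(50, dim)   # dummy /sa/, /shi/ data
phi_s = phoneme_subspace([syllable_eigenvectors(sa), syllable_eigenvectors(shi)])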
After that, by using the eigenvectors obtained for each phoneme k as a "phoneme-specific convolution operator", the phoneme similarity (likelihood) L(k) is calculated for an unknown line spectral time sequence of the 9 electrodes (or of a few components after ICA) (step S4, step S14, step S15).
[Formula 7]
L(k) = Maxq <X(q, f, n′), φ(f, n′)>²,  k = 1, 2, . . . , K   (7)
where Max means to take the maximum value for q electrodes or q ICA components, and < > represents an inner product operation. Note that X(q, f, n′) and φ(f, n′) are each normalized by a norm in advance.
A phoneme feature vector is defined as a vector composed of a series of K likelihoods L(k) for phonemes k = 1, 2, . . . , K. In formula (7), the eigenvector φ(f, n′) of each phoneme is used to construct the phoneme-specific convolution operator, and a scalar value L(k) is obtained as the likelihood of each phoneme k. As the time n′ of the input X(f, n′) advances, the vector of K scalar values is output from the linguistic feature extractor 5 as phoneme-likelihood vector time-sequence data (step S5, step S16).
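The following sketch evaluates formula (7) frame by frame, using the principal eigenvector of each phoneme; the pattern dimensions, the number of phonemes K = 24, and the random data are placeholders.

import numpy as np

def phoneme_likelihoods(X_patterns, phis):
    # X_patterns: (n_electrodes, dim) line-spectral pattern around one frame n'
    # phis: (K, dim) one norm-normalized eigenvector per phoneme
    X = X_patterns / np.linalg.norm(X_patterns, axis=1, keepdims=True)
    return ((X @ phis.T) ** 2).max(axis=0)                 # L(k): max over q of squared inner product

K, dim = 24, 64 * 8
phis = np.random.randn(K, dim)
phis /= np.linalg.norm(phis, axis=1, keepdims=True)
frames = [np.random.randn(9, dim) for _ in range(50)]      # unknown input, 50 frames
likelihood_seq = np.stack([phoneme_likelihoods(f, phis) for f in frames])   # (50, K) time sequence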
Since it is difficult to collect a large amount of speech imagery data at present, the problem is solved through a phoneme-specific convolution operator in the example described herein. However, as brain databases related to speech imagery become more complete in the future, it will be possible to use a deep convolutional network (DCN), which has been widely used in fields such as image processing in recent years, instead of the phoneme-specific convolution operator.
The word/sentence recognizer 6 recognizes a word/sentence from the time-sequence data of the phoneme feature vector (more specifically, phoneme-likelihood vector time-sequence data). Several methods can be applied to word/sentence recognition, such as a method that uses a hidden Markov model (HMM) with triphones (sequences of three consecutive phonemes), which has been put to practical use in the field of speech recognition, and a method that uses a deep neural network (LSTM, etc.). In addition, linguistic information (probabilities of word sequences), an advantage of current speech recognition, can be used as well. Furthermore, since a shift along the time axis is a concern in speech imagery, the use of "spotting", in which current robust speech recognition systems continuously search for words and sentences in the time direction, is also effective in improving quality for speech imagery.
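As a simple illustration of such time-direction spotting (and not of the HMM or LSTM methods themselves), the sketch below aligns hypothetical word templates, given as sequences of phoneme indices, against a phoneme-likelihood time sequence by dynamic programming and reports the best-scoring word.

import numpy as np

def spot_score(L, template):
    # L: (T, K) phoneme-likelihood sequence; template: list of phoneme indices for one word
    T, _ = L.shape
    I = len(template)
    dp = np.full((T, I), -np.inf)                          # best cumulative score ending at (t, i)
    for t in range(T):
        start = L[t, template[0]]
        dp[t, 0] = max(start, dp[t - 1, 0] + start) if t > 0 else start   # a match may start anywhere
        for i in range(1, I):
            prev = max(dp[t - 1, i - 1], dp[t - 1, i]) if t > 0 else -np.inf
            if prev > -np.inf:
                dp[t, i] = prev + L[t, template[i]]        # advance or stay on the same phoneme
    return dp[:, I - 1].max()                              # best (unnormalized) match score

L = np.random.rand(120, 24)                                # dummy likelihood sequence
vocab = {"hai": [5, 0], "iie": [1, 1, 3]}                  # hypothetical phoneme indices
print(max(vocab, key=lambda w: spot_score(L, vocab[w])))   # spotted word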
The post-processing/output unit 7 receives a word (sequence) of the recognition result and produces the required display and audio output. Here, the post-processing/output unit 7 may have a function of providing the user with feedback, based on the result of speech imagery recognition of a predetermined word/sentence, on whether the multi-electrode EEG sensor is in the correct position. The feedback may be given through the screen of a terminal such as a smartphone or by voice instruction, thereby helping the user move the EEG sensor to the proper position.
The post-processing/output unit 7 displays a screen that helps the user adjust the optimal position of the electrodes while performing speech imagery.
As illustrated in the accompanying drawings, the speech imagery recognition device 1 can be implemented as a system that includes a wearing fixture 11, a mobile terminal 12, and a server 13.
While the speech imagery recognition device 1 has been described as including the EEG input unit 2, the preprocessor 3, the analysis processor 4, the linguistic feature extractor 5, the word/sentence recognizer 6, and the post-processing/output unit 7 as illustrated in the accompanying drawings, part or all of these functions may be distributed among the wearing fixture 11, the mobile terminal 12, and the server 13.
The processor 23 of the wearing fixture 11, the mobile terminal 12, and the server 13 each comprise, for example, a computer that includes a central processing unit (CPU), a memory, a read-only memory (ROM), a hard disk, and the like. The mobile terminal 12 can perform part or all of the processing of the speech imagery recognition device 1 illustrated in the accompanying drawings.
A speech imagery recognition method of recognizing speech from EEG signals during speech imagery is performed by the wearing fixture 11, the mobile terminal 12, and/or the server 13, which can perform the method independently or in collaboration with one another; for example, the method can be performed by the mobile terminal 12 and the server 13 together.
A program that causes a computer to perform a speech imagery recognition process of recognizing speech from EEG signals during speech imagery may be downloaded or stored in the hard disk or the like. The program causes the computer to perform the analysis process of analyzing discrete signals of EEG signals received from a plurality of electrodes for each electrode and outputting a spectral time sequence, and the extraction process of extracting a phoneme-feature vector time sequence based on spectral components of each electrode.
As described above, according to the embodiments, line spectral components are directly extracted as a linguistic representation from EEG signals during speech imagery, and eigenvectors are obtained for each phoneme from the spectral components. The linguistic representation of an unknown input is thereby converted into a vector of a series of phoneme features (phoneme likelihoods) by using the eigenvectors as a convolution operator (see Formula 7).
In the following, additional notes will be provided with respect to the above embodiments.
(Additional note 1)
A speech imagery recognition method for recognizing speech from electroencephalogram (EEG) signals during speech imagery, the method comprising:
an analysis process of analyzing discrete signals, which are obtained from EEG signals received from a plurality of electrodes, for each of the electrodes and outputting a spectral time sequence; and
an extraction process of outputting a phoneme-feature vector time sequence based on the spectral time sequence.
(Additional note 2)
The speech imagery recognition method as set forth in additional note 1, further comprising converting the EEG signals received from the electrodes to the discrete signals.
(Additional note 3)
The speech imagery recognition method as set forth in additional note 1 or 2, further comprising preprocessing of subtracting an average noise amplitude spectrum from a spectrum of a speech imagery signal obtained by converting the discrete signals to a frequency domain to remove noise from the EEG signals.
(Additional note 4)
The speech imagery recognition method as set forth in additional note 3, wherein the preprocessing includes performing an independent component analysis that extracts a small number of independent information sources from each electrode signal after noise removal.
(Additional note 5)
The speech imagery recognition method as set forth in any one of additional notes 1 to 4, further comprising recognizing speech based on the phoneme-feature vector time sequence.
(Additional note 6)
The speech imagery recognition method as set forth in any one of additional notes 1 to 5, further comprising outputting the speech recognized.
(Additional note 7)
The speech imagery recognition method as set forth in additional note 6, further comprising displaying a screen that helps a user adjust the optimal position of the electrodes while performing speech imagery.
(Additional note 8)
The speech imagery recognition method as set forth in any one of additional notes 1 to 7, wherein the analysis process includes extracting the spectral time sequence using a linear predictive analysis.
(Additional note 9)
The speech imagery recognition method as set forth in any one of additional notes 1 to 8, wherein the analysis process includes a process of absorbing a frequency fluctuation based on the discrete signals.
(Additional note 10)
The speech imagery recognition method as set forth in any one of additional notes 1 to 9, wherein the analysis process includes extracting a frequency derived from a peak on a frequency axis as a line spectrum component for each time frame.
(Additional note 11)
The speech imagery recognition method as set forth in any one of additional notes 1 to 10, wherein the extraction process includes outputting a phoneme-likelihood vector time sequence, which is a linguistic feature, through a predetermined convolution operator.
(Additional note 12)
The speech imagery recognition method as set forth in any one of additional notes 1 to 11, implemented by either or both of a mobile terminal and a server.
(Additional note 13)
The speech imagery recognition method as set forth in any one of additional notes 1 to 12, further comprising outputting signals from a plurality of electrodes of a wearing fixture, which are placed over Broca's area.
With a speech imagery recognition device, a wearing fixture, a method, and a program according to the embodiments of the present invention, line spectral components as a linguistic representation can be directly extracted from EEG signals during speech imagery and converted into phoneme features. Thus, a brain-computer interface (BCI) can be incorporated into the current framework of speech recognition.
This application is a National Stage application of PCT/JP2020/020342, filed on May 22, 2020. This application claims priority to Japanese Application No. 2019-097202, filed on May 23, 2019, the entire contents of which are incorporated herein by reference.