The present invention relates to technology for estimating an auditory attention state.
Humans selectively pay attention to specific input stimuli for senses such as vision, hearing, and touch, thereby consciously or unconsciously selecting the information to be perceived. In NPL 1, constriction and dilation of the pupil diameter (pupil vibration or pupil frequency tagging (PFT)) induced by turning a light source ON and OFF are used to estimate a destination to which a user pays visual attention.
However, a relationship between a destination to which a user pays auditory attention and a change in pupil diameter is not known, and a method of estimating a destination to which a user pays auditory attention on the basis of a change in pupil diameter is not known.
The present invention provides a method of estimating a destination to which a user pays auditory attention on the basis of a change in pupil diameter.
A feature quantity based on the strength of a correlation between each of a plurality of different visual stimulus patterns corresponding to a plurality of different sound sources and a pupil diameter change amount of a user is obtained, and a destination to which the user pays auditory attention for a sound from the sound source is estimated using the feature quantity.
According to the present invention, it is possible to estimate a destination to which a user pays auditory attention on the basis of a change in pupil diameter.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, a principle will be described. The present invention is based on a new natural law (physiological law) that a human exhibits a pupil reaction even when the human only pays auditory attention to a sound from a sound source without paying attention to a visual stimulus corresponding to the sound source. First, experimental results leading to this discovery will be shown.
As illustrated in
Sound source 140-1: “Then, a bear goes to white 3,” “Then, a tiger goes to pink 8,” “Then, a cat goes to red 9,” “Then, a dog goes to black 6.”
Sound source 140-2: “Then, a deer goes to rainbow 2” four times
Sound source 140-3: “Then, a tiger goes to blue 9” four times
Sound source 140-4: “Then, a pig goes to pink 7” four times
The subject 100 sequentially executed a task of paying auditory attention to the sounds emitted from the sound sources 140-1 to 140-4, and the pupil diameter of the subject 100 executing the task was measured by a pupil diameter acquisition apparatus 150 (an eye tracker in this experiment). The execution of the task and the measurement of the pupil diameter were performed a plurality of times for a plurality of subjects 100, and the pupil diameter when each subject 100 paid auditory attention to the sound emitted from each sound source 140-i was measured.
As illustrated in
As described above, humans exhibit pupil responses according to the visual stimulus pattern corresponding to the sound source, even when the humans only pay auditory attention to the sound emitted from the sound source. In each embodiment, this natural law is used as follows: a feature quantity based on the strength of a correlation between each of a plurality of different visual stimulus patterns corresponding to a plurality of different sound sources and the pupil diameter change amount of the user is obtained, and the destination to which the user pays auditory attention for the sound from the sound source is estimated using the feature quantity.
Next, a first embodiment will be described.
As illustrated in
As illustrated in
As illustrated in
<Sound Source Apparatus 14-n>
The sound source apparatus 14-n (where n=1, . . . , N) of the present embodiment is a sound source that emits sound SO(Info-n). An example of the sound source apparatus 14-n is a speaker or the like, but this does not limit the present invention. Any apparatus may be used as the sound source apparatus 14-n as long as a sound source can be disposed at a desired spatial position. The sound source apparatuses 14-1, . . . , 14-N are different from each other, and the sound source apparatuses 14-1, . . . , 14-N dispose sound sources at different positions. For example, directions of the sound source apparatuses 14-1, . . . , 14-N from the user 10 or directions of sound sources disposed by sound source apparatuses 14-1, . . . , 14-N differ from each other. The sound source apparatuses 14-1, . . . , 14-N may emit sounds SO(Info-1), . . . , SO(Info-N) at the same time, or some of the sound source apparatuses 14-1, . . . , 14-N may emit sound SO(Info-n) at different timings than the other sound source apparatuses. However, it is preferable for sounds emitted simultaneously from different sound source apparatuses to differ from each other. The sounds SO(Info-1), . . . , SO(Info-N) emitted from the sound source apparatuses 14-1, . . . , 14-N may be vocal sounds, music, environmental sounds, ringing sounds, alarm sounds, or the like.
<Visual Stimulus Generation Apparatus 13-n>
The visual stimulus generation apparatus 13-n (where n=1, . . . , N) is an apparatus that presents (displays) a visual stimulus pattern VS(Sig-n) corresponding to the sound source apparatus 14-n. The visual stimulus generation apparatus 13-n may be disposed or configured in any manner as long as the user 10 can perceive the correspondence relationship between the sound source apparatus 14-n and the visual stimulus generation apparatus 13-n. For example, the visual stimulus generation apparatus 13-n may be disposed near the sound source apparatus 14-n, may be disposed in contact with the sound source apparatus 14-n, may be fixed to the sound source apparatus 14-n, or may be configured integrally with the sound source apparatus 14-n. Each of the visual stimulus patterns VS(Sig-n) of the present embodiment is a periodically time-varying visual stimulus pattern. For example, the visual stimulus pattern VS(Sig-n) of the present embodiment may be a pattern whose luminance (brightness) varies periodically over time, may be a pattern that blinks (ON/OFF) periodically and repeatedly, may be a pattern whose color varies periodically over time, may be a pattern whose design varies periodically over time, or may be a pattern whose shape varies periodically over time. That is, the visual stimulus generation apparatus 13-n may present periodically time-varying luminance (brightness), light that blinks periodically and repeatedly, a periodically time-varying color, a periodically time-varying design, or a periodically time-varying shape. Any apparatus may be used as the visual stimulus generation apparatus 13-n as long as the apparatus can visually present such a visual stimulus pattern VS(Sig-n). For example, the visual stimulus generation apparatus 13-n may be an LED light source, a laser light generator, a display, or a projector. Further, the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) presented by the N visual stimulus generation apparatuses 13-1, . . . , 13-N differ from each other. In the present embodiment, the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) presented by the visual stimulus generation apparatuses 13-1, . . . , 13-N have different frequency distributions. For example, the peak frequencies of the signals (frequency domain visual stimulus signals) obtained by transforming (for example, through a Fourier transform) the time-series signals indicating the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) presented by the N visual stimulus generation apparatuses 13-1, . . . , 13-N (for example, time-series signals of luminance, color, design, shape, or the like) into the frequency domain differ from each other.
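As a concrete illustration of such mutually different periodic patterns, the following sketch generates N blinking (ON/OFF) luminance patterns with distinct peak frequencies and checks the peak of each frequency domain visual stimulus signal. The sampling rate, presentation length, and blinking frequencies are assumptions chosen for illustration, not values fixed by the present embodiment.

```python
import numpy as np

fs = 60.0                              # presentation update rate [Hz] (assumed)
duration = 10.0                        # presentation length [s] (assumed)
t = np.arange(0.0, duration, 1.0 / fs)
blink_freqs = [0.8, 1.0, 1.25, 1.5]    # one distinct blinking frequency per sound source (assumed)

# Each visual stimulus pattern VS(Sig-n) is modeled as a luminance time series
# that repeats ON (1) and OFF (0) periodically at its own frequency.
patterns = [0.5 * (1.0 + np.sign(np.sin(2.0 * np.pi * f * t))) for f in blink_freqs]

# Peak frequency of each frequency domain visual stimulus signal (Fourier transform).
freqs = np.fft.rfftfreq(len(t), d=1.0 / fs)
for f, p in zip(blink_freqs, patterns):
    spectrum = np.abs(np.fft.rfft(p - p.mean()))
    print(f"nominal {f} Hz -> spectral peak at {freqs[np.argmax(spectrum)]:.2f} Hz")
```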
The pupil diameter acquisition apparatus 15 of the present embodiment is an apparatus that measures the pupil diameter Pub of the user 10. For example, the pupil diameter acquisition apparatus 15 is a camera that photographs a movement of eyes of the user 10, and an apparatus that acquires and outputs a pupil diameter Pub of the user 10 from the image captured by the camera. An example of the pupil diameter acquisition apparatus 15 is a commercially available eye tracker or the like.
As illustrated in
The training data T is input to the input unit 111 of the learning apparatus 11 (
The training sound may be any sound such as a vocal sound, music, environmental sound, ringing sound, and alarm sound. A specific example of the training sound is the same as the sound emitted from the sound source apparatus 14-n described above.
Specific examples of the N training sound sources are the same as those of the sound source apparatuses 14-1, . . . , 14-N described above. However, the N training sound sources may be any N apparatuses that emit training sounds, and apparatuses different from the sound source apparatuses 14-1, . . . , 14-N may be used as the N training sound sources.
The N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are the same as the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) described above, and are periodically time-varying patterns of visual stimuli. For example, the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are the same as the patterns of visual stimuli presented by the visual stimulus generation apparatuses 13-1, . . . , 13-N described above, and specific examples of the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are the same as the specific examples of the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) described above.
The training pupil diameter change amount TVPub is a pupil diameter change amount of a training user, to whom the training sounds and the training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are presented, when the training user pays auditory attention to any training sound TSOn (where n∈{1, . . . , N}). A time when the training user pays auditory attention to any training sound TSOn is, for example, a time when the training user is executing the above-described task. Although the training pupil diameter change amount TVPub is obtained, for example, on the basis of the pupil diameter TPub of the training user measured by the pupil diameter acquisition apparatus 15, the training pupil diameter change amount TVPub may be obtained on the basis of the pupil diameter TPub of the training user measured by another apparatus. Hereinafter, a method of obtaining the training pupil diameter change amount from the pupil diameter TPub will be exemplified.
(1.1) First, preprocessing is performed on the time-series data of the pupil diameter TPub to obtain time-series data of a pupil diameter TPub′ after preprocessing. As the preprocessing, for example, linear interpolation, quadratic spline interpolation, or the like can be used to interpolate portions missing due to blinking or the like of the training user from the time-series data of the pupil diameter TPub. Further, a low-pass filter according to the blinking frequencies of the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) (for example, a low-pass filter that passes a band including all the blinking frequencies) may be applied to the time-series data of the pupil diameter TPub after interpolation to perform noise reduction.
(1.2) Next, an average value of the pupil diameter TPub′ before the training user pays auditory attention is subtracted from the time-series data of the pupil diameter TPub′ when the training user pays auditory attention to any training sound TSOn, and the result is standardized as a z value to obtain time-series data of the training pupil diameter change amount TVPub.
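The following is a minimal sketch of steps (1.1) and (1.2), assuming the pupil diameter is sampled uniformly and that blink gaps are marked as NaN. The sampling rate, filter order, cutoff, and the reading of the z value as a baseline-referenced z-score are illustrative assumptions, not part of the specification.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def pupil_change_amount(tpub, fs, cutoff_hz, onset):
    """tpub: pupil diameter series with NaN blink gaps, fs: sampling rate [Hz],
    cutoff_hz: low-pass cutoff covering all blinking frequencies (assumed),
    onset: sample index at which auditory attention starts."""
    x = np.array(tpub, dtype=float)
    idx = np.arange(len(x))
    gaps = np.isnan(x)
    # (1.1) interpolate blink gaps (linear interpolation here), then low-pass
    # filter the interpolated series for noise reduction
    x[gaps] = np.interp(idx[gaps], idx[~gaps], x[~gaps])
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")
    x = filtfilt(b, a, x)
    # (1.2) subtract the pre-attention mean and standardize as a z value
    # (interpreted here as a z-score relative to the pre-attention baseline)
    baseline = x[:onset]
    return (x[onset:] - baseline.mean()) / baseline.std()
```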
The training feature quantity Tfj may be any quantity as long as the quantity is based on the strength of the correlation between each of the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) and the training pupil diameter change amount TVPub. Hereinafter, the training feature quantity Tfj is exemplified.
(2.1) A training feature quantity Tfj indicating the magnitude of the training frequency domain pupil diameter change amount signal TFVPub (for example, a peak value of the magnitude of TFVPub) at and/or near the peak frequency of each of the N training frequency domain visual stimulus signals TFVS(Sig-1), . . . , TFVS(Sig-N) may be used. Here, each of the N training frequency domain visual stimulus signals TFVS(Sig-1), . . . , TFVS(Sig-N) is a signal obtained by transforming (for example, through a Fourier transform) the time-series signal indicating each of the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) into the frequency domain. Further, the training frequency domain pupil diameter change amount signal TFVPub is a signal obtained by transforming the time-series signal indicating the training pupil diameter change amount TVPub into the frequency domain. Further, the "magnitude of α" may be an absolute value of the amplitude of α, may be the power of α (a square of the amplitude of α), or may be a value that increases monotonically with respect to the absolute value of α. For example, in the case of the example of
Here, the greater the magnitude of the training frequency domain pupil diameter change amount signal TFVPub (for example, the peak value of the magnitude of TFVPub) at and/or near a peak frequency, the stronger the correlation between the training visual stimulus pattern TVS(Sig-n) corresponding to that peak frequency and the training pupil diameter change amount TVPub.
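As an illustration of (2.1), the following sketch computes, for each training visual stimulus pattern, the magnitude of the frequency-domain pupil diameter change amount signal at or near that pattern's peak frequency. The function name, the list of peak frequencies, and the search bandwidth are assumptions introduced for illustration.

```python
import numpy as np

def fft_magnitude_features(tvpub, fs, peak_freqs, bandwidth=0.1):
    """tvpub: time series of the training pupil diameter change amount TVPub,
    fs: sampling rate [Hz], peak_freqs: peak frequency of each TFVS(Sig-n) [Hz]."""
    spectrum = np.abs(np.fft.rfft(tvpub - np.mean(tvpub)))   # magnitude of TFVPub
    freqs = np.fft.rfftfreq(len(tvpub), d=1.0 / fs)
    feats = []
    for f in peak_freqs:
        near = np.abs(freqs - f) <= bandwidth
        if not near.any():
            near = np.argmin(np.abs(freqs - f))   # fall back to the closest FFT bin
        feats.append(np.max(spectrum[near]))       # magnitude at/near the stimulus peak frequency
    return np.array(feats)                         # one value per training visual stimulus pattern
```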
(2.2) In addition to or instead of the magnitude of the training frequency domain pupil diameter change amount signal TFVPub at and/or near the peak frequency described above, other information may be included in the training feature quantity Tfj. For example, (1) a magnitude of the training frequency domain pupil diameter change amount signal TFVPub (for example, a peak value of the magnitude) at or near a multiple of the peak frequency of each of the N training frequency domain visual stimulus signals TFVS(Sig-1), . . . , TFVS(Sig-N), (2) a degree of synchronization between a phase change of each of the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) and a phase change of the training pupil diameter change amount TVPub, and (3) maximum values CCFmax(TSS(Sig-1), TPS), . . . , CCFmax(TSS(Sig-N), TPS) of a cross-correlation function between the series TSS(Sig-1), . . . , TSS(Sig-N) corresponding to the time-series signals indicating the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) and the series TPS corresponding to the time-series signal indicating the training pupil diameter change amount TVPub, and the like may be included in the training feature quantity Tfj. Here, the "series corresponding to the time-series signal" may be, for example, the time-series signal itself, may be a series obtained by transforming the time-series signal into the frequency domain, or may be a series of function values of the time-series signal. Further, the maximum value of the cross-correlation function between a series α1 and a series α2 means the maximum among the cross-correlation function values between the series α1 and the series α2 with respect to a variable delay amount τ.
Here, the greater the magnitude of the training frequency domain pupil diameter change amount signal TFVPub (for example, the peak value of the magnitude of TFVPub) at and/or near a multiple (for example, double) of the peak frequency, the stronger the correlation between the training visual stimulus pattern TVS(Sig-n) corresponding to that peak frequency and the training pupil diameter change amount TVPub. Further, the training visual stimulus pattern TVS(Sig-n) whose phase change has a higher degree of synchronization with the phase change of the training pupil diameter change amount TVPub has a stronger correlation with the training pupil diameter change amount TVPub. Further, the training visual stimulus pattern TVS(Sig-n) having a greater maximum value CCFmax(TSS(Sig-n), TPS) of the cross-correlation function has a stronger correlation with the training pupil diameter change amount TVPub.
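A minimal sketch of two of the quantities named in (2.2) follows: the maximum CCFmax of a normalized cross-correlation function over the delay τ, and a degree of phase synchronization computed here as a phase-locking value via the Hilbert transform (one possible choice; the text does not fix a specific synchronization measure). Both functions assume the two series share the same sampling rate.

```python
import numpy as np
from scipy.signal import hilbert

def ccf_max(tss_n, tps):
    """Maximum of the normalized cross-correlation function between a series
    TSS(Sig-n) and the series TPS over all delay amounts tau."""
    s = np.asarray(tss_n, dtype=float)
    p = np.asarray(tps, dtype=float)
    s = (s - s.mean()) / s.std()
    p = (p - p.mean()) / p.std()
    return float(np.max(np.correlate(p, s, mode="full")) / len(s))

def phase_synchronization(tss_n, tps):
    """Phase-locking value between the phase changes of TSS(Sig-n) and TPS
    (1.0 = perfectly synchronized, 0.0 = no consistent phase relation);
    assumes both series have the same length."""
    dphi = (np.angle(hilbert(np.asarray(tss_n, dtype=float)))
            - np.angle(hilbert(np.asarray(tps, dtype=float))))
    return float(np.abs(np.mean(np.exp(1j * dphi))))
```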
As described above, the training pupil diameter change amount TVPub is the pupil diameter change amount of the training user when the training user pays auditory attention to the training sound TSOn, and the correct answer information Taj is information indicating the training sound TSOn. That is, the correct answer information Taj indicates the training sound TSOn corresponding to the training pupil diameter change amount TVPub. For example, the correct answer information Taj may be information (for example, an index) indicating a sound source (for example, an apparatus that emits the training sound TSOn), may be information (for example, a blinking frequency) indicating the training visual stimulus pattern TVS(Sig-n) corresponding to the training sound TSOn, or may be information indicating the training sound TSOn. As described above, since the training feature quantity Tfj corresponds to the training pupil diameter change amount TVPub, the correct answer information Taj is associated with the training feature quantity Tfj (step S111).
The learning unit 113 obtains the estimation model M(θ) by learning processing (machine learning) using the training data T read from the storage unit 112, and outputs a model parameter θ for specifying the estimation model M(θ). The estimation model M(θ) is a model for receiving the feature quantity f based on the strength of a correlation between each of the N (multiple) different visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) corresponding to N (multiple) different sound sources and the pupil diameter change amount VPub of the user, and estimating the destination to which the user pays auditory attention for the sound from the sound source. The configuration of the feature quantity f is the same as the configuration of the training feature quantity Tfj described above except that the training sounds are replaced with sounds, the training sound sources are replaced with sound sources, the training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are replaced with the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N), the training user is replaced with the user, and the training pupil diameter change amount TVPub of the training user is replaced with the pupil diameter change amount VPub of the user.
The estimation model M(θ) is configured to estimate the sound emitted from the sound source corresponding to the visual stimulus pattern VS(Sig-n), which has a high correlation with the pupil diameter change amount VPub, as the destination to which the user pays auditory attention. The “destination to which the user pays auditory attention” estimated using such an estimation model M(θ) is, for example, at least one of the following (3.1), (3.2), and (3.3).
(3.1) A sound emitted from a sound source corresponding to a visual stimulus pattern VS(Sig-n) having a higher correlation with the pupil diameter change amount VPub is more frequently (with a higher probability) estimated to be the destination to which the user pays auditory attention.
(3.2) When the visual stimulus pattern VS(Sig-n) (first visual stimulus pattern) included in the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) corresponds to a sound source AS1 (first sound source), and the strength of the correlation between the visual stimulus pattern VS(Sig-n) and the pupil diameter change amount VPub of the user is equal to or greater than a predetermined value, the destination to which the user pays auditory attention is estimated to be the sound source AS1 or near the sound source AS1.
(3.3) When the N (multiple) target sound sources include a sound source AS1 (first sound source) and a sound source AS2 (second sound source), the visual stimulus pattern VS(Sig-n1) (first visual stimulus pattern) included in the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) corresponds to the sound source AS1, the visual stimulus pattern VS(Sig-n2) (second visual stimulus pattern) corresponds to the sound source AS2, and the strength of the correlation between the visual stimulus pattern VS(Sig-n1) and the pupil diameter change amount VPub is greater than the strength of the correlation between the visual stimulus pattern VS(Sig-n2) and the pupil diameter change amount VPub, the destination to which the user pays auditory attention is estimated to be the sound source AS1 or the vicinity of the sound source AS1. Here, n1, n2 ∈ {1, . . . , N}.
The "destination to which the user pays auditory attention" estimated using the estimation model M(θ) may be information (for example, an index) indicating a sound source (for example, an apparatus that emits sound), may be information (for example, a blinking frequency) indicating the visual stimulus pattern VS(Sig-n) corresponding to the sound SOn, may be information indicating the sound SOn, or may be information indicating directions thereof. Further, one "destination to which the user pays auditory attention" may be estimated by the estimation model M(θ), a plurality of "destinations to which the user pays auditory attention" may be estimated, or a probability of the "destination to which the user pays auditory attention" may be estimated.
The estimation model M(θ) may be based on any scheme. For example, an estimation model M(θ) based on a k-nearest neighbor algorithm (k-NN), a support vector machine (SVM), deep learning, a hidden Markov model, or the like can be used. As a specific method of the learning processing, a known method according to the scheme of the estimation model M(θ) may be used. In general, an initial value of a provisional model parameter θ′ is set first, and then processing of updating the provisional model parameter θ′ is repeated so that an error between a result obtained by applying the training feature quantity Tfj to the estimation model M(θ′) and the correct answer information Taj becomes small, and the provisional model parameter θ′ at a point in time when a predetermined termination condition is satisfied is set as the model parameter θ.
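For illustration, the following sketch trains one of the schemes named above (an SVM) on training feature quantities and correct-answer labels. The array shapes and the random placeholder data are assumptions, not experimental values, and scikit-learn is only one possible implementation choice.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training data: J trials, each with a D-dimensional training feature
# quantity Tf_j and correct-answer information Ta_j (index of the attended source).
rng = np.random.default_rng(0)
Tf = rng.random((200, 4))            # placeholder training feature quantities (J=200, D=4)
Ta = rng.integers(0, 4, 200)         # placeholder correct-answer labels (N=4 sound sources)

# Learning processing: fitting determines the model parameter theta of M(theta).
model = SVC(kernel="rbf", probability=True)
model.fit(Tf, Ta)
```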
The model parameter θ output from the learning unit 113 is sent to the output unit 114, and the output unit 114 sends the model parameter θ to the auditory attention state estimation apparatus 12 (step S113).
The model parameter θ sent from the learning apparatus 11 is input to the input unit 121 of the auditory attention state estimation apparatus 12 (
The auditory information control unit 124 reads the output information Info-1, . . . , Info-N from the storage unit 122 and sends the output information Info-1, . . . , Info-N to the sound source apparatuses 14-1, . . . , 14-N, respectively. Each sound source apparatus 14-n presents (outputs) the sound SO(Info-n) based on the sent output information Info-n (step S124).
The visual stimulus control unit 123 reads the output information Sig-1, . . . , Sig-N from the storage unit 122 and sends the output information Sig-1, . . . , Sig-N to the visual stimulus generation apparatuses 13-1, . . . , 13-N. Each visual stimulus generation apparatus 13-n (n=1, . . . , N) presents (outputs) the visual stimulus pattern VS(Sig-n) based on the sent output information Sig-n (step S123).
The user 10 to whom the sounds SO(Info-1), . . . , SO(Info-N) and the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) have been presented pays auditory attention to any sound SO(Info-n). For example, the user 10 pays auditory attention to any sound SO(Info-n) by executing the above-described task. The pupil diameter acquisition apparatus 15 measures the pupil diameter Pub of the user 10 and sends time-series data of the pupil diameter Pub to the feature quantity extraction unit 125.
The feature quantity extraction unit 125 further extracts the output information Sig-1, . . . , Sig-N corresponding to the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) from the storage unit 122. The feature quantity extraction unit 125 uses the time-series data of the pupil diameter Pub and the output information Sig-1, . . . , Sig-N to obtain a feature quantity f based on the strength of the correlation between each of the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and the pupil diameter change amount VPub of the user 10, and outputs the feature quantity f. The configuration of the feature quantity f is the same as that of the training feature quantity Tfj described above except that the training sounds are replaced with sounds, the training sound sources are replaced with sound sources, the training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are replaced with the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N), the training user is replaced with the user 10, and the training pupil diameter change amount TVPub of the training user is replaced with the pupil diameter change amount VPub of the user 10. For example, the feature quantity extraction unit 125 obtains the feature quantity f as follows.
(4.1) The feature quantity extraction unit 125 performs preprocessing on the time-series data of the pupil diameter Pub to obtain time-series data of the pupil diameter Pub′ after preprocessing. Examples of the preprocessing include processing for interpolating portions missing due to, for example, blinking of the user 10 from the time-series data of the pupil diameter Pub using linear interpolation, quadratic spline interpolation, or the like. Further, a low-pass filter according to the blinking frequencies of the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) (for example, a low-pass filter that passes a band including all the blinking frequencies) may be applied to the time-series data of the pupil diameter Pub after interpolation to perform noise reduction.
(4.2) The feature quantity extraction unit 125 subtracts an average value of the pupil diameter Pub′ before the user 10 pays auditory attention from the time-series data of the pupil diameter Pub′ when the user 10 pays auditory attention to any sound SO(Info-n), and standardizes the result as a z value to obtain time-series data of the pupil diameter change amount VPub.
(4.3) The feature quantity extraction unit 125 obtains the feature quantity f on the basis of the time-series data of the pupil diameter change amount VPub and the N visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N). The feature quantity f is based on the strength of the correlation between each of the N visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and the pupil diameter change amount VPub. Hereinafter, the feature quantity f is exemplified.
(4.3.1) A feature quantity f indicating the magnitude of the frequency domain pupil diameter change amount signal FVPub (for example, a peak value of the magnitude of FVPub) at and/or near the peak frequency of each of the plurality of frequency domain visual stimulus signals FVS(Sig-1), . . . , FVS(Sig-N) may be used. Here, each of the plurality of frequency domain visual stimulus signals FVS(Sig-1), . . . , FVS(Sig-N) is a signal obtained by transforming the time-series signal indicating each of the plurality of visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) into the frequency domain. Further, the frequency domain pupil diameter change amount signal FVPub is a signal obtained by transforming the time-series signal indicating the pupil diameter change amount VPub into the frequency domain.
(4.3.2) In addition to or instead of the magnitude of the frequency domain pupil diameter change amount signal FVPub at and/or near the peak frequency described above, other information may be included in the feature quantity f. For example, (1) a magnitude of the frequency domain pupil diameter change amount signal FVPub (for example, a peak value of the magnitude) at or near a multiple of the peak frequency of each of the N frequency domain visual stimulus signals FVS(Sig-1), . . . , FVS(Sig-N), (2) a degree of synchronization between a phase change of each of the N visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and a phase change of the pupil diameter change amount VPub, and (3) maximum values CCFmax(SS(Sig-1), PS), . . . , CCFmax(SS(Sig-N), PS) of a cross-correlation function between the series SS(Sig-1), . . . , SS(Sig-N) corresponding to the time-series signals indicating the N visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and the series PS corresponding to the time-series signal indicating the pupil diameter change amount VPub, and the like may be included in the feature quantity f.
Here, the greater the magnitude of the frequency domain pupil diameter change amount signal FVPub (for example, the peak value of the magnitude of FVPub) at and/or near a multiple (for example, double) of the peak frequency, the stronger the correlation between the visual stimulus pattern VS(Sig-n) corresponding to that peak frequency and the pupil diameter change amount VPub. Further, the visual stimulus pattern VS(Sig-n) whose phase change has a higher degree of synchronization with the phase change of the pupil diameter change amount VPub has a stronger correlation with the pupil diameter change amount VPub. Further, the visual stimulus pattern VS(Sig-n) having a greater maximum value CCFmax(SS(Sig-n), PS) of the cross-correlation function has a stronger correlation with the pupil diameter change amount VPub. The feature quantity f is sent to the estimation unit 126 (step S125).
The estimation unit 126 reads the model parameter θ from the storage unit 122. The estimation unit 126 uses the feature quantity f to obtain an estimation result E=M(θ; f) of the destination to which the user 10 pays auditory attention for the sound, on the basis of the estimation model M(θ) specified by the model parameter θ, and outputs the estimation result. That is, the estimation unit 126 applies the feature quantity f to the estimation model M(θ) specified by the model parameter θ to obtain the estimation result E corresponding to the feature quantity f, and outputs the estimation result E. As described above, the estimation result E may be information (for example, an index) indicating a sound source (for example, the sound source apparatus 14-n), may be information (for example, a blinking frequency) indicating the visual stimulus pattern VS(Sig-n) corresponding to the sound SOn, may be information indicating the sound SOn, or may be information indicating directions thereof. Further, the estimation result E may represent one "destination to which the user pays auditory attention", may represent a plurality of "destinations to which the user pays auditory attention", or may represent a probability of the "destination to which the user pays auditory attention" (step S126).
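Continuing the placeholder SVM sketch given for the learning processing above, applying a feature quantity f to the fitted model corresponds to obtaining the estimation result E=M(θ; f). The variables `model` and `rng`, the feature dimension, and the probability output are carried over from that sketch and remain illustrative assumptions.

```python
f = rng.random(4)                                   # placeholder feature quantity f for one trial
E = int(model.predict(f.reshape(1, -1))[0])         # one estimated attended sound source (index)
probs = model.predict_proba(f.reshape(1, -1))[0]    # or a probability per sound source
print(f"estimation result E: sound source {E}, probabilities {probs}")
```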
The visual stimulus pattern VS(Sig-n) of the first embodiment was a periodically time-varying visual stimulus pattern. In the present embodiment, the visual stimulus pattern VS(Sig-n) may be an aperiodically time-varying stimulus pattern. Hereinafter, differences from the first embodiment will be mainly described, and matters common to the first embodiment are denoted by the same reference signs, and description thereof will be omitted or simplified.
As illustrated in
As illustrated in
As illustrated in
Training data T={(Tf1, Ta1), . . . , (Tfj, Taj)} is input to the input unit 111 of the learning apparatus 21 (
Although there is no limitation on the training feature quantity Tfj, for example, the maximum values CCFmax(TSS(Sig-1), TPS), . . . , CCFmax(TSS(Sig-N), TPS) of the cross-correlation function between the series TSS(Sig-1), . . . , TSS(Sig-N) corresponding to the time-series signals indicating the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) and the series TPS corresponding to the time-series signal indicating the training pupil diameter change amount TVPub, or the like may be included in the training feature quantity Tfj. In addition to this, or instead of this, a degree of synchronization between a phase change of each of the N training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) and a phase change of the training pupil diameter change amount TVPub, or the like may be included in the training feature quantity Tfj (step S211).
The learning unit 213 obtains the estimation model M(θ) by learning processing (machine learning) using the training data T read from the storage unit 112, and outputs a model parameter θ for specifying the estimation model M(θ). The processing of the learning unit 213 differs from that of the first embodiment only in the training data T. The output unit 114 sends the model parameter θ to the auditory attention state estimation apparatus 22 (step S213).
The model parameter θ sent from the learning apparatus 21 is input to the input unit 121 of the auditory attention state estimation apparatus 22 (
The processing of the auditory information control unit 124 is the same as in the first embodiment (step S124).
The visual stimulus control unit 223 reads the output information Sig-1, . . . , Sig-N from the storage unit 122 and sends the output information Sig-1, . . . , Sig-N to the visual stimulus generation apparatuses 13-1, . . . , 13-N. Each visual stimulus generation apparatus 13-n (n=1, . . . , N) presents the visual stimulus pattern VS(Sig-n) based on the sent output information Sig-n. However, each of the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) is an aperiodically time-varying stimulus pattern (step S223).
The user 10 to which the sounds SO(Info-1), . . . , SO(Info-N) and the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) are presented pays auditory attention to any sound SO(Info-n). The pupil diameter acquisition apparatus 15 measures the pupil diameter Pub of the user 10 and sends time-series data of the pupil diameter Pub to the feature quantity extraction unit 225.
The feature quantity extraction unit 225 uses the time-series data of the pupil diameter Pub and the output information Sig-1, . . . , Sig-N extracted from the storage unit 122 to obtain a feature quantity f based on the strength of the correlation between each of the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and the pupil diameter change amount VPub of the user 10, and outputs the feature quantity f. The configuration of the feature quantity f is the same as that of the training feature quantity Tfj described above except that the training sounds are replaced with sounds, the training sound sources are replaced with sound sources, the training visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are replaced with the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N), the training user is replaced with the user 10, and the training pupil diameter change amount TVPub of the training user is replaced with the pupil diameter change amount VPub of the user 10. The difference from the first embodiment is that the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) of the present embodiment are mutually different aperiodically time-varying stimulus patterns, and the feature quantity f of the present embodiment is based on the strength of the correlation between each of such visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and the pupil diameter change amount VPub. For example, the feature quantity extraction unit 225 executes the processing (4.1) and (4.2) described in the first embodiment and, in (4.3), obtains the feature quantity f including, for example, the maximum values CCFmax(SS(Sig-1), PS), . . . , CCFmax(SS(Sig-N), PS) of a cross-correlation function between the series SS(Sig-1), . . . , SS(Sig-N) corresponding to the time-series signals indicating the N visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and the series PS corresponding to the time-series signal indicating the pupil diameter change amount VPub. In addition to this, or instead of this, a degree of synchronization between a phase change of each of the N visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) and a phase change of the pupil diameter change amount VPub, or the like may be included in the feature quantity f. The feature quantity f is sent to the estimation unit 226 (step S225).
The estimation unit 226 reads the model parameter θ from the storage unit 122. The estimation unit 226 uses the feature quantity f to obtain an estimation result E=M(θ; f) of the destination to which the user 10 pays auditory attention for the sound, on the basis of the estimation model M(θ) specified by the model parameter θ, and outputs the estimation result (step S226).
In the first and second embodiments, the sound source apparatus 14-n is a sound source that emits the n-th sound SO(Info-n). However, the n-th (n=1, . . . , N) sound SO(Info-n) may instead be localized at a different position γn in a space by the plurality of sound source apparatuses 14-1, . . . , 14-N. In this case, each position γn in the space becomes a sound source that emits the n-th sound SO(Info-n), and the visual stimulus generation apparatus 13-n is disposed at a position corresponding to the position γn that is the sound source. For example, the visual stimulus generation apparatus 13-n is disposed at the position γn or near the position γn.
In the first and second embodiments and the modification examples thereof, the training visual stimulus patterns and the visual stimulus patterns are presented by a dedicated apparatus for presenting visual stimulus patterns, such as the visual stimulus generation apparatus (hereinafter referred to as a "visual stimulus dedicated apparatus"). However, temporal changes in images of apparatuses other than the visual stimulus dedicated apparatus, landscapes, machines, plants, animals, or the like may be used as the training visual stimulus pattern and the visual stimulus pattern. In this case, the training visual stimulus pattern is generated from a video obtained by capturing such an image with a camera or the like. Further, the visual stimulus pattern at the time of estimating the auditory attention state is a pattern that the user 10 visually perceives directly from the apparatuses other than the visual stimulus dedicated apparatus, landscapes, machines, plants, animals, and the like. In the case of this example, the visual stimulus control units 123 and 223 and the visual stimulus generation apparatuses 13-1, . . . , 13-N can be omitted.
Further, in the first and second embodiments and the modification examples thereof, the training sounds and the sounds are presented by a dedicated apparatus such as the sound source apparatus (hereinafter referred to as a "sound presentation dedicated apparatus"). However, sounds emitted from apparatuses other than the sound presentation dedicated apparatus, landscapes, machines, plants, animals, and the like may be used. In this case, the training sounds can be generated from an audio signal obtained by recording such sounds with a microphone or the like. Further, the sounds presented to the user 10 at the time of estimating the auditory attention state are the sounds that the user 10 perceives auditorily directly from the apparatuses other than the sound presentation dedicated apparatus, landscapes, machines, plants, animals, and the like. In the case of this example, the auditory information control unit 124 and the sound source apparatuses 14-1, . . . , 14-N can be omitted.
Others are as described in the first and second embodiments.
The learning apparatuses 11 and 21 and the auditory attention state estimation apparatuses 12 and 22 in the respective embodiments are, for example, apparatuses configured by a general-purpose or dedicated computer, which includes a processor (hardware processor) such as a central processing unit (CPU) and a memory such as a random-access memory (RAM) or a read-only memory (ROM), executing a predetermined program.
That is, the learning apparatuses 11 and 21 and the auditory attention state estimation apparatuses 12 and 22 in the respective embodiments include, for example, processing circuitry configured to implement the respective units included in the apparatuses. This computer may include one processor and one memory, or may include a plurality of processors and memories. This program may be installed in the computer or may be recorded in a ROM or the like in advance. Further, some or all of the processing units may be configured by using an electronic circuit that realizes a processing function alone, instead of an electronic circuit (circuitry), such as a CPU, that realizes a functional configuration by reading a program. Further, an electronic circuit constituting one apparatus may include a plurality of CPUs.
The above-described program can be recorded on a computer-readable recording medium. An example of the computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium are a magnetic recording apparatus, an optical disc, a photomagnetic recording medium, and a semiconductor memory.
Distribution of this program is performed, for example, by selling, transferring, or renting a portable recording medium such as a DVD or CD-ROM on which the program has been recorded. Further, this program may be distributed by being stored in a storage apparatus of a server computer and transferred from the server computer to another computer via a network. As described above, the computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in a storage apparatus of the computer. When the computer executes the processing, the computer reads the program stored in the storage apparatus of the computer and executes processing according to the read program. Further, as another embodiment of the program, the computer may directly read the program from the portable recording medium and execute processing according to the program, and furthermore, processing according to a received program may be sequentially executed each time the program is transferred from the server computer to the computer. Further, a configuration may be adopted in which the above-described processing is executed by a so-called application service provider (ASP) type service that realizes a processing function only through an execution instruction and result acquisition, without transferring the program from the server computer to the computer. It is assumed that the program in the present embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties defining the processing of the computer).
In each embodiment, although the present apparatus is configured by a predetermined program being executed on the computer, at least a part of the processing content thereof may be realized by hardware.
The present disclosure is not limited to the above-described embodiments. For example, in each of the embodiments, the estimation model is obtained through learning, and the destination to which the user pays auditory attention is estimated using the estimation model. However, the destination to which the user pays auditory attention may be estimated using any method as long as the method uses a feature quantity based on the strength of a correlation between each of a plurality of different visual stimulus patterns corresponding to a plurality of different sound sources and a pupil diameter change amount of a user. For example, a threshold value may be determined from feature quantities sampled in the past, and the destination to which the user pays auditory attention may be estimated by comparing a newly obtained feature quantity with the threshold value.
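The following is a minimal sketch of such threshold-based estimation, assuming one scalar correlation-strength feature per sound source. The threshold value and the function name are illustrative; in practice, the threshold would be determined from previously sampled feature quantities.

```python
import numpy as np

def estimate_by_threshold(features, threshold):
    """features[n]: strength of the correlation between VS(Sig-n) and the pupil
    diameter change amount VPub; returns the index of the estimated attended
    sound source, or None if no correlation reaches the threshold."""
    features = np.asarray(features, dtype=float)
    if features.max() < threshold:
        return None                     # no sound source is judged to be attended
    return int(np.argmax(features))     # strongest-correlated source at or above the threshold

# Example: with an assumed threshold of 0.5, the second source (index 1) is estimated.
print(estimate_by_threshold([0.2, 0.8, 0.4, 0.3], threshold=0.5))
```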
For example, the various types of processing described above may be executed not only in chronological order as described, but also in parallel or individually depending on the processing capacity of the apparatus that executes the processing or as necessary. In addition, it is obvious that changes can be made appropriately without departing from the spirit of the present disclosure.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/028988 | 8/4/2021 | WO |