The present application claims priority from Japanese patent application JP2008-037534 filed on Feb. 19, 2008, the content of which is hereby incorporated by reference into this application.
The present invention relates to a pointing device for a user to designate a spot or point on a screen of a display device of a computer, more specifically to a pointing device technique using acoustic information.
In general, a mouse is often used as a pointing device to manipulate objects on a computer screen. The movement of the mouse is linked to the movement of a cursor on the computer screen, so a user can select a desired point on the screen by moving the cursor onto the point and clicking a mouse button there.
In addition, pointing devices using a touch panel are already part of everyday products and are widely used worldwide. In a touch panel, each point on the display is provided with a detector that senses the pressure of a touch by a user against the screen, and the detectors determine which points are pressed.
Some pointing devices use acoustic information. For example, there is a device using a special pen to produce ultrasound when pressed against the screen (e.g., see JPA Laid-Open Publication No. 2002-351605).
Some devices generate ultrasonic waves as well as light, and detect a pointed position based on the difference between the arrival times of the ultrasonic wave at the sound receiving element and of the light at the light receiving element (e.g., see JPA Laid-Open Publication No. 2002-132436).
Some devices detect a pointed position based on the vibration that is generated when a fingertip of a user touches the screen of the display and that is detected by vibration detectors provided on the display (e.g., see JPA Laid-Open Publication No. 2002-351614).
The pointing device using a mouse to manipulate objects on a computer screen is not always convenient because a desk or a similar surface is needed to put the mouse on. Meanwhile, the touch panel does not require such auxiliary equipment. However, the touch panel requires a special display, every element on the display has to be fitted with a pressing pressure detector, and a touch must be made very close to the display.
According to the techniques disclosed in JPA Laid-Open Publication No. 2002-351605 and JPA Laid-Open Publication No. 2002-132436, a user needs to use a special pen or a coordinate input device. Also, according to the technique disclosed in JPA Laid-Open Publication No. 2002-351614, the vibrations generated when a user touches the screen are detected to find the pointed position, so the user still has to touch the screen directly.
In view of the foregoing problems, an object of the present invention is to provide an acoustic pointing device that enables pointing manipulation by a user based on acoustic information even from a remote place, without necessarily using auxiliary equipment on a desk for the manipulation of objects on a computer screen, a pointing method of a sound source position, and a computer system using the acoustic pointing device.
In accordance with an aspect of the present invention, there is provided an acoustic pointing device for detecting a sound source position of a sound to be detected and converting the sound source position into one point on a screen of a display device, including a microphone array that retains plural microphone elements; an A/D converter that converts analog sound pressure data obtained by the microphone array into digital sound pressure data; a direction of arrival estimation unit that executes estimation of a sound source direction of the sound to be detected based on a correlation of the sound between the microphone elements obtained from the digital sound pressure data; an output signal calculation unit that estimates a noise level in the digital sound pressure data and computes a signal component of the sound based on the noise level and the digital sound pressure data to output the signal component as an output signal; an integration unit that integrates the sound source direction with the output signal to specify the sound source position; and a control unit that converts the specified sound source position into one point on the screen of the display device.
In the acoustic pointing device according to the present invention, the microphone array is constituted of plural sub microphone arrays, wherein the device further includes a triangulation unit that integrates, by triangulation, the sound source directions estimated from each of the sub microphone arrays by the direction of arrival estimation unit to obtain the sound source direction and compute a distance to the sound source position, and a direction decision unit that decides whether the sound source direction and the distance are within a predetermined area, wherein the integration unit integrates the output signal with the sound source direction and the distance within the area to specify the sound source position, and wherein the control unit converts the specified sound source position into one point on the screen of the display device.
Moreover, in the acoustic pointing device according to another aspect of the present invention, the microphone array is constituted of plural sub microphone arrays, wherein the device further includes a converter that converts the digital sound pressure data into a signal in a time-frequency area, a triangulation unit that integrates, by triangulation, the sound source directions that are estimated from each of the sub microphone arrays by the direction of arrival estimation unit using the signal to obtain the sound source direction and compute a distance to the sound source position, and a direction decision unit that decides whether the sound source direction and the distance are within a predetermined area, wherein the integration unit integrates the output signal with the sound source direction and the distance within the area to specify the sound source position, and the control unit converts the specified sound source position into one point on the screen of the display device.
Furthermore, in the acoustic pointing device according to another aspect of the present invention, the microphone array is constituted of plural sub microphone arrays, the device further includes a converter that converts the digital sound pressure data into a signal in a time-frequency area, a triangulation unit that integrates, by triangulation, the sound source directions that are estimated from each of the sub microphone arrays by the direction of arrival estimation unit using the signal to obtain the sound source direction and compute a distance to the sound source position, a direction decision unit that decides whether the sound source direction and the distance are within a predetermined area, an output signal decision unit that decides whether the output signal from the output signal calculation unit is equal to or greater than a predetermined threshold, a database of sound source frequencies that prestores frequency characteristics of the sound to be detected, and a database of screen conversion that stores a conversion table capable of specifying the one point on the screen from the sound source position, wherein the integration unit performs weighting by the frequency characteristics upon the output signal which is equal to or greater than the threshold and integrates the sound source direction and the distance within the area to specify the sound source position, and wherein the control unit converts the specified sound source position into one point on the screen using information in the database of screen conversion.
Still another aspect of the present invention provides a pointing method of a sound source position for use with the acoustic pointing device, and a computer system mounted with the acoustic pointing device.
In the manipulation of objects on a computer screen, an acoustic pointing device in accordance with the present invention enables pointing manipulation by a user based on acoustic information even from a remote place, without necessarily using auxiliary equipment on a desk.
Also, it is possible to provide a pointing method of a sound source position for use with the acoustic pointing device.
Furthermore, it is possible to provide a computer system mounted with the acoustic pointing device.
Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings.
In addition, the acoustic pointing device includes a database (hereinafter it will be referred to as a “DB”) 214 of sound source frequencies, which stores in advance frequency characteristics of target sounds; and a DB 213 of screen conversion which matches the coordinates of a sound source with a specific point on the display screen.
In the case where only time signals are used for the digital sound pressure data, it is possible to specify the position of a sound source without the need of the STFT unit 202, the power decision unit 210, the SNR decision unit 208 and the DB 214 of sound source frequencies.
The following will now explain in detail each constituent unit shown in the figure.
Multi-channel digital sound pressure data converted by the A/D converter 102 are accumulated, in a specific amount for each channel, in the data buffering unit 201. Generally, processing in the time-frequency area is not carried out every time a single sample is obtained; it is carried out collectively after plural samples have been obtained. That is, no processing is executed until a specific amount of digital sound pressure data has accumulated.
The data buffering unit 201 has the function of accumulating such a specific amount of digital sound pressure data. The digital sound pressure data obtained from each microphone is distinguished by an index (i), starting from 0, according to the microphone. For an integer n, the digital sound pressure data of the i-th microphone sampled at the n-th time is denoted as xi(n).
The STFT (Short Term Fourier Transform) unit 202 converts the digital sound pressure data from each microphone into time-frequency signals by applying the following (Formula 1).

[Formula 1]
Xi(f, τ) = Σ (n = 0, …, N−1) w(n) xi(τS + n) e^(−2πjfn/N)
where j is the imaginary unit defined by the following (Formula 2).

[Formula 2]
j = √(−1)
Xi(f, τ) is the f-th frequency component of the i-th microphone. ‘f’ ranges from 0 to N/2. N is the data length of the digital sound pressure data converted into one time-frequency signal, typically called the frame size. S is usually called the frame shift and indicates the shift amount of the digital sound pressure data between successive conversions into a time-frequency signal. The data buffering unit 201 continuously accumulates digital sound pressure data until S new samples have been acquired for each microphone; once they are acquired, the STFT unit 202 converts the data into a time-frequency signal.
‘τ’ is a frame index which corresponds to the number of times digital sound pressure data has been converted into a time-frequency signal. ‘τ’ starts from 0. ‘w(n)’ is a window function; typical examples include the Blackman, Hanning, and Hamming windows. By the use of a window function, good time-frequency resolution can be achieved.
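As a concrete illustration, the conversion of Formula 1 can be sketched as follows (a minimal sketch in Python with NumPy; the frame size N=512, frame shift S=256, and the choice of a Hanning window are assumptions for illustration, not values fixed by this description):

```python
import numpy as np

def stft(x, N=512, S=256):
    """Convert one microphone's time signal x into time-frequency
    components Xi(f, tau) as in Formula 1, using a Hanning window."""
    w = np.hanning(N)                         # window function w(n)
    n_frames = (len(x) - N) // S + 1
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for tau in range(n_frames):               # tau: frame index
        frame = x[tau * S : tau * S + N] * w
        X[tau] = np.fft.rfft(frame)           # f ranges from 0 to N/2
    return X
```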
Digital sound pressure data that is converted into a time-frequency signal is transferred to a direction of arrival estimation unit 203.
The direction of arrival estimation unit 203 divides the microphone array constituted by the microphones into plural sub microphone arrays, and estimates a sound source direction for each sub microphone array in its individual coordinate system. Suppose that one microphone array is divided into R sub microphone arrays. Then, the M microphones that constitute the microphone array are each allocated to at least one of the R sub microphone arrays. A microphone can be allocated to two or more sub microphone arrays, in which case plural sub microphone arrays share the same microphones.
When two microphones of a sub microphone array are aligned in parallel on the surface of a desk, the angle (θ) is estimated as an azimuth angle in the horizontal direction. Meanwhile, when two microphones of a sub microphone array are aligned perpendicularly to the surface of a desk, the angle (θ) is estimated as an elevation angle in the vertical direction. In this manner, azimuth and elevation angles are estimated.
Suppose that each sub microphone array has at least two microphones. For a sub microphone array consisting of two microphones, the angle (θ) can then be estimated by applying the following (Formula 3).

[Formula 3]
θ(f, τ) = arcsin(cρ/(2πFd))
Here, ρ is the phase difference, at frame (τ) and frequency index (f), between the input signals of the two microphones. F is the frequency of the frequency index (f), i.e., F=(f+0.5)/N×Fs/2. Fs is the sampling rate of the A/D converter 102. d is the physical spacing (m) between the two microphones. c is the speed of sound (m/s). Strictly, the speed of sound varies with the temperature and density of the medium, but 340 m/s is commonly used as the speed of sound.
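The estimation of Formula 3 for one two-microphone pair can be sketched as follows (a minimal sketch under a far-field plane-wave assumption; the function name and arguments are illustrative):

```python
import numpy as np

def pair_angle(X1, X2, f, N, Fs, d, c=340.0):
    """Estimate theta from the phase difference of one time-frequency
    component observed at two microphones (cf. Formula 3)."""
    rho = np.angle(X2 * np.conj(X1))          # phase difference, in (-pi, pi]
    F = (f + 0.5) / N * Fs / 2                # frequency of index f, as in the text
    s = np.clip(c * rho / (2 * np.pi * F * d), -1.0, 1.0)
    return np.arcsin(s)                       # far-field plane-wave model
```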
The internal process of the direction of arrival estimation unit 203 is the same for every time-frequency component, so the suffix (f, τ) will be omitted in the description that follows. If a sub microphone array has three or more microphones aligned on the same line, the direction can be computed very accurately by the SPIRE algorithm for linear alignment. More details on the SPIRE algorithm are described in M. Togami, T. Sumiyoshi, and A. Amano, “Stepwise phase difference restoration method for sound source localization using multiple microphone pairs”, ICASSP 2007, vol. I, pp. 117-120, 2007.
In the SPIRE algorithm, multiple microphone pairs with different spacings between neighboring microphones (hereinafter referred to as “microphone spaces”) are used, so it is desirable to arrange the microphones that constitute a sub microphone array at microphone spaces that differ from each other. The microphone pairs are sorted in increasing order of microphone space. With p as an index specifying one microphone pair, the pair with the smallest microphone space is p=1, and the pair with the largest microphone space is p=P. The following process is executed sequentially from p=1 to p=P. First, an integer np that satisfies the following condition (Formula 4) is obtained.

[Formula 4]
−π ≤ ρp + 2πnp − (dp/dp-1)ρ̂p-1 < π
Since the term at the center of the inequality signs varies over a range of 2π, exactly one solution exists. Then, the following (Formula 5) is executed.
[Formula 5]
ρ̂p = ρp + 2πnp
Before executing the above process for p=1, the following (Formula 6) is given as an initial value.
[Formula 6]
ρ̂0 = 0
Note that dp is the microphone space of the p-th microphone pair. The above process is executed up to p=P, and then the sound source direction is estimated by the following (Formula 7).

[Formula 7]
θ = arcsin(cρ̂P/(2πFdP))
The accuracy of sound source direction estimation is known to increase with a larger microphone space. However, if the microphone space is longer than half the wavelength of the signal used for direction estimation, a single direction cannot be determined from the phase difference between the microphones; two or more directions yield the same phase difference (spatial aliasing). The SPIRE algorithm has a mechanism that uses the estimate from a smaller microphone space to select, out of the two or more candidate directions produced by a larger microphone space, the one closest to the true sound source direction. Therefore, the SPIRE algorithm is advantageous in that a sound source direction can be estimated with high precision even with a microphone space large enough to cause spatial aliasing. If the microphone pairs are aligned non-linearly, the SPIRE algorithm for non-linear alignment makes it possible to compute an azimuth angle and, in some configurations, an elevation angle as well.
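The stepwise restoration of Formulas 4 to 6 can be sketched as follows (a minimal sketch assuming the pairs are already sorted by increasing microphone space; not the exact patented implementation):

```python
import numpy as np

def spire_restore(rho, d):
    """Stepwise phase difference restoration over microphone pairs
    sorted by increasing microphone space (cf. Formulas 4 to 6).
    rho[p]: observed phase difference of pair p, in (-pi, pi]
    d[p]:   microphone space of pair p"""
    rho_hat = 0.0                             # Formula 6: initial value
    d_prev = d[0]
    for p in range(len(rho)):
        pred = rho_hat * d[p] / d_prev        # prediction from smaller spacing
        # Formula 4: the unique integer n_p keeping the restored phase
        # within pi of the prediction
        n_p = round((pred - rho[p]) / (2 * np.pi))
        rho_hat = rho[p] + 2 * np.pi * n_p    # Formula 5
        d_prev = d[p]
    return rho_hat                            # used in Formula 7 via arcsin
```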
Meanwhile, if the digital sound pressure data is not converted into a time-frequency signal, i.e., only time-area data is available, the SPIRE algorithm cannot be used. For such time-area-only data, the GCC-PHAT (Generalized Cross Correlation PHAse Transform) method is used for direction estimation.
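For reference, a time-domain GCC-PHAT delay estimate can be sketched as follows (a minimal sketch; converting the estimated delay into an angle then follows the same geometry as Formula 3):

```python
import numpy as np

def gcc_phat_delay(x1, x2, Fs):
    """Estimate the time difference of arrival between two microphones
    by GCC-PHAT on time-area data."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12                    # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n)                   # generalized cross correlation
    shift = int(np.argmax(np.abs(cc)))
    if shift > n // 2:
        shift -= n                            # negative delays wrap around
    return shift / Fs                         # delay in seconds
```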
The noise estimation unit 204 estimates the background noise level of the output signal from the STFT unit 202. For the estimation of the noise level, MCRA (Minima Controlled Recursive Averaging) may be used. The MCRA noise estimation process is based on the minimum statistics method, which takes the minimum power among many frames as the estimate of the noise power per frequency. In general, a voice or a beating sound on a desk often has a transiently large power per frequency, yet hardly maintains that large power for a long period of time. Therefore, the component that takes the minimum power among many frames can be approximated as a component containing only noise, and the noise power can be estimated with high precision even in a voice utterance section. The estimated noise power per frequency of the i-th microphone is denoted as Ni(f, τ); the index for a microphone is denoted as ‘i’, and a noise power is estimated for every microphone. Because the noise power is updated per frame, it varies with τ. The noise estimation unit 204 outputs the estimated noise power Ni(f, τ) per frequency for each microphone.
When only time-area data is concerned, noise, compared with a transient sound, has a low power but tends to persist for a longer period of time, which likewise makes it possible to estimate the noise power.
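The minimum statistics idea underlying MCRA can be sketched as follows (a simplified sketch only; the actual MCRA method adds further recursive averaging and correction, and the window length and smoothing constant here are assumptions):

```python
import numpy as np

def min_stats_noise(P, win=100, alpha=0.9):
    """Noise power per frequency as the minimum of the recursively
    smoothed input power over `win` frames.
    P: array of shape (frames, freqs) holding |Xi(f, tau)|**2."""
    S = np.empty_like(P)
    S[0] = P[0]
    for t in range(1, len(P)):                # recursive averaging in time
        S[t] = alpha * S[t - 1] + (1 - alpha) * P[t]
    N = np.empty_like(P)
    for t in range(len(P)):                   # minimum over a sliding window
        N[t] = S[max(0, t - win + 1) : t + 1].min(axis=0)
    return N
```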
The SNR estimation unit 205 estimates an SNR (Signal to Noise Ratio) by the following (Formula 8), given the estimated noise power and the input signal Xi(f, τ) of the microphone array.

[Formula 8]
SNRi(f, τ) = |Xi(f, τ)|²/Ni(f, τ)
SNRi(f, τ) is the SNR at frame (τ) and frequency index (f) for microphone index (i). The SNR estimation unit 205 outputs the estimated SNR. The SNR estimation unit 205 may also smooth the input power in the time direction; in so doing, stable SNR estimation that is robust against noise can be achieved.
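Assuming Formula 8 is the ratio of input power to estimated noise power, the estimation with time-direction smoothing can be sketched as follows (the smoothing constant is an assumption):

```python
import numpy as np

def estimate_snr(P, N, alpha=0.8):
    """Per-time-frequency SNR as input power over estimated noise power,
    with the input power smoothed in the time direction for stability.
    P, N: (frames, freqs) arrays of input power and noise power."""
    Ps = np.copy(P)
    for t in range(1, len(P)):
        Ps[t] = alpha * Ps[t - 1] + (1 - alpha) * P[t]
    return Ps / (N + 1e-12)                   # avoid division by zero
```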
The triangulation unit 206 integrates the sound source directions, each obtained from a sub microphone array, so as to measure the azimuth angle, the elevation angle, and the distance to the sound source. The sound source direction obtained from the i-th sub microphone array, expressed in the coordinate system of that sub microphone array, is denoted as follows:
[Formula 9]
θi(f,τ)
For instance, the sound source position is obtained as the crossing (cross-over) of the sound source directions estimated from the respective sub microphone arrays, as shown in the figure.
Normally, the sound source directions have more than one cross-over. In that case, a cross-over of two sound source directions is obtained for every combination of sub microphone arrays, and the average of those crossings is outputted as the position of the sound source. Averaging improves robustness against non-uniformity of the crossing positions.
In some cases, two sound source directions may not cross at all. In this case, the solution obtained from a combination of sub microphone arrays with no crossing may be excluded from the estimation of the sound source position in that time-frequency component, or the estimation of the sound source position for the relevant time-frequency component may not be executed at all. Having no cross-over implies that another sound source exists besides the observation target, so the phase difference information contains noise. Because a sound source position estimated in such a time-frequency component is not used, the position of the sound source can be estimated with higher precision.
Moreover, if a sub microphone array is aligned linearly, it is not always possible to estimate both the azimuth and elevation angles; only the angle between the array direction of the sub microphone array and the sound source can be estimated. In this case, the sound source exists on the plane determined by the estimated angle between the array direction of the sub microphone array and the sound source. A cross-over of such planes, obtained from the respective sub microphone arrays, is then outputted as the sound source position or the sound source direction. If all the sub microphone arrays are aligned linearly, the average of the cross-overs of the planes obtained for all combinations of sub microphone arrays is outputted as the position of the sound source. Averaging somewhat improves robustness against non-uniformity of the cross-over positions.
Meanwhile, if some sub microphone arrays are aligned linearly and others non-linearly, one linearly aligned sub microphone array and one non-linearly aligned sub microphone array are combined to obtain an estimate of the sound source position. In such a combination of linear and non-linear alignments, the minimum number of sub microphone arrays for which one cross-over can be determined is treated as one unit, and the average of the cross-overs obtained over all combinations of sub microphone arrays is outputted as the final estimate of the sound source position.
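The crossing-and-averaging step can be sketched in two dimensions as follows (a minimal sketch; the sub-array origins and bearing angles are assumed to be expressed in one common coordinate system):

```python
import numpy as np

def triangulate(origins, thetas):
    """Average the pairwise crossings of 2-D bearing lines, one line per
    sub microphone array."""
    crossings = []
    for a in range(len(origins)):
        for b in range(a + 1, len(origins)):
            da = np.array([np.cos(thetas[a]), np.sin(thetas[a])])
            db = np.array([np.cos(thetas[b]), np.sin(thetas[b])])
            A = np.column_stack([da, -db])
            if abs(np.linalg.det(A)) < 1e-9:
                continue                      # parallel bearings: no crossing
            rhs = np.asarray(origins[b], float) - np.asarray(origins[a], float)
            t, _ = np.linalg.solve(A, rhs)
            crossings.append(np.asarray(origins[a], float) + t * da)
    return np.mean(crossings, axis=0) if crossings else None
```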
The direction decision unit 207 decides whether the sound source position obtained by the triangulation unit 206 is on the desk or within a predetermined beating area. Two conditions are checked: whether the absolute value of the height of the sound source above the desk, calculated from the sound source position information obtained by the triangulation unit 206, is not larger than a predetermined threshold; and whether the planar coordinates of the sound source, calculated from the same information, are within the beating area. If both conditions are satisfied, the direction decision unit 207 outputs the sound source direction and the distance to the sound source as the sound source position information; it may also output them as an azimuth angle and an elevation angle. When the two conditions are met at the same time, the direction decision unit 207 outputs a positive decision result, while it outputs a negative decision result if the conditions are not met at the same time. The integration unit 211 (to be described) integrates a positive decision result with the sound source direction and distance outputted from the triangulation unit 206. The definition of the beating area will be explained later on.
The SNR decision unit 208 outputs the time-frequency components for which the SNR estimate per time-frequency outputted from the SNR estimation unit 205 is equal to or greater than a predetermined threshold. Given the SNR per time-frequency outputted from the SNR estimation unit 205, the power calculation unit 209 calculates the signal power Ps by applying the following (Formula 10), where Px is the power of the input signal.
The power decision unit 210 outputs the time-frequency components for which the signal power per time-frequency outputted from the power calculation unit 209 is equal to or greater than a predetermined threshold. For each time-frequency component specified by both the power decision unit 210 and the SNR decision unit 208 at the same time, the integration unit 211 weights the power outputted from the power calculation unit 209 with the per-frequency weight kept in the DB 214 of sound source frequencies. That is to say, if the frequency characteristics of the target sound (e.g., a beating sound on the desk) can be measured in advance, the frequency characteristics are stored in the DB 214 of sound source frequencies, and by weighting the power with the stored frequency characteristics, the position estimation can be executed with higher precision.
A zero weight is given to any time-frequency component that is not specified by both the power decision unit 210 and the SNR decision unit 208. A zero weight is likewise given to any time-frequency component that the direction decision unit 207 decided to be outside the beating area.
In this embodiment, the output signal decision module refers to the SNR decision unit 208 and the power decision unit 210.
Suppose that the beating area is cut into a grid of several centimeters on each side, and that the sound source position estimated for a given time-frequency component falls within the i-th grid cell. The weighted power of that component is then added to the power Pi of the grid cell. This power addition process is performed for every time-frequency component. The grid cell with the maximum power after the addition process is then outputted as the final position of the sound source. The size and the number of the grid cells are predefined.
The duration of the power addition process can also be predefined, or the addition process may be carried out only for a time zone decided to be a voice section by VAD (Voice Activity Detection). Making the duration of the addition process short reduces the reaction time from when a beating sound is made until the position of the sound source is decided. However, a shorter reaction time makes the decision weaker against noise.
On the other hand, if the duration of the addition process is made long, the reaction time from when a beating sound is made until the position of the sound source is decided also increases, yet robustness against noise is enhanced. Thus, the duration of the addition process should be set in consideration of this trade-off. A beating sound usually lasts about 100 ms, so the addition process should preferably last about the same amount of time. If the maximum grid power is smaller than a predetermined threshold, it is decided that no beating sound was made and the result is discarded. Meanwhile, if the maximum grid power is equal to or greater than the threshold, the corresponding sound source position is outputted and the process in the integration unit 211 is terminated.
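The grid addition and maximum detection can be sketched as follows (a minimal sketch; the grid geometry, weight table, and threshold are assumptions):

```python
import numpy as np

def detect_grid(cells, powers, weights, grid_shape, threshold):
    """Accumulate frequency-weighted power into grid cells and pick the
    cell with the maximum total power.
    cells:   (gx, gy) cell index per time-frequency component
    powers:  signal power per component
    weights: per-frequency weight from the DB of sound source frequencies"""
    grid = np.zeros(grid_shape)
    for (gx, gy), p, w in zip(cells, powers, weights):
        grid[gx, gy] += w * p                 # power addition process
    if grid.max() < threshold:
        return None                           # decided: no beating sound
    return np.unravel_index(np.argmax(grid), grid_shape)
```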
The control unit 212 converts the coordinates of a sound source position of a beating sound having been outputted from the integration unit 211 into a particular point on a screen, based on the information from the DB 213 of screen conversion.
The DB 213 of screen conversion retains a table for converting the input coordinates of a sound source position into a particular position on the screen. Any conversion method (e.g., linear conversion by a 2×2 matrix) is acceptable as long as the sound source position of a beating sound can be converted into a point on the screen. For instance, the height information obtained from the sound source position estimation may be disregarded, and the PC may be controlled as if the screen point obtained by matching the planar position information of the sound source with a point on the screen had been clicked or dragged. The height information can also be interpreted in different ways. For instance, if the sound is produced above a given height, it may be regarded as a double click of the point on the screen, whereas if the sound is produced below that height, it may be regarded as a single click. In so doing, user manipulation can become more diverse.
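The conversion and the height interpretation can be sketched as follows (a minimal sketch; the matrix A, offset b, and height threshold are illustrative assumptions, and as noted above any conversion method is acceptable):

```python
import numpy as np

def to_screen(xy, A, b, z, z_level=0.05):
    """Convert planar sound source coordinates into a screen point by a
    linear conversion, and interpret the height z as the click type."""
    px, py = A @ np.asarray(xy, float) + b    # 2x2 matrix A plus offset b
    action = "double_click" if z > z_level else "click"
    return (int(px), int(py)), action
```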
After the system starts, step 501 for a stopping decision decides whether the user is ending the program, for example by shutting down the computer or by pressing the end button of the beaten position detection program on the desk.
If a stopping decision is made in step 501, the program ends and the process is terminated. Otherwise, the process goes to step 502 for digital conversion, where analog sound pressure data read out of the microphone array is converted into digital sound pressure data by the A/D converter, and the converted digital sound pressure data is read into the computer. Digital conversion can be done sample by sample, or plural samples matching the minimum processing length of a beating sound on the desk can be read into the computer at once. In step 503 for time-frequency conversion, the read-in digital data is decomposed into time-frequency components by the STFT. With the use of the STFT, a sound source direction can be estimated per frequency component.
Under the environment in which the desk beating sound program is used, human voice often exists as noise in addition to the desk beating sound. Human voice is a sparse signal in the time-frequency area, and its power is known to spread over parts of particular frequency bands. Therefore, by estimating the sound source direction in the time-frequency area, it becomes easier to reject the frequency components over which human voice spreads, and the beating sound can be detected with improved precision.
In step 505 for a decision of rejection, it is decided whether the detected beating sound is really a beating sound within the beating area of the desk. If it is not within the beating area, the process returns to the stopping decision of step 501. If it is within the beating area, the mapping between each point in the beating area and a point on the screen, which is defined in advance, is used in step 506 for a decision of the holding down position to discern the button holding down position and thereby specify one point on the screen from the information on the beaten position. In step 507 for a decision of button existence, it is decided whether a button exists at the position in the beating area. If no such button exists, the process returns to step 501 for the stopping decision. If the button exists, a button action is executed in step 508 in the same manner as clicking the button on the screen with a mouse or other pointing device.
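The overall flow of steps 501 to 508 can be sketched as follows (a minimal sketch around an assumed `system` object whose methods stand in for the units described above; all method names are hypothetical):

```python
def beaten_position_loop(system):
    """Control flow of steps 501 to 508."""
    while not system.stop_requested():                 # step 501
        x = system.ad_convert()                        # step 502
        X = system.stft(x)                             # step 503
        pos = system.detect_beating_position(X)        # direction estimation and integration
        if pos is None or not system.in_beating_area(pos):
            continue                                   # step 505: decision of rejection
        point = system.map_to_screen(pos)              # step 506: holding down position
        button = system.button_at(point)               # step 507: decision of button existence
        if button is not None:
            button.click()                             # step 508: button action
```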
In step 602 for comparison with the noise power, the power decision unit 210 decides whether the power of the beating sound is greater than the noise power estimated by the MCRA method. The MCRA method estimates the power of the background noise from a mixture of voice and background noise, and is based on minimum statistics. Minimum statistics regards the minimum power within several frames as the power of the background noise, on the assumption that voice has only a transiently large volume. Note that the background noise power estimated by minimum statistics tends to be smaller than the actual background noise power; the MCRA method smoothes the estimate in the time direction for correction and computes a value close to the actual background noise power. Since a beating sound, although not a voice, also has a transiently large power and thus the same statistical nature as voice, a background noise power estimation method such as the MCRA method can be applied.
If the power of the beating sound is greater than the noise power, the SNR between the background noise power and the beating sound power is next calculated. In step 603 for an SNR decision, the SNR decision unit 208 decides whether the calculated SNR is equal to or greater than a predetermined threshold, and if so, it decides that the time-frequency component is a beating sound component.
The integration unit 211 divides the beating area into a grid in advance. A time-frequency component decided to be a beating sound component is allocated to the grid cell corresponding to the estimated azimuth and elevation angles of the component. At the time of allocation, a frequency-dependent weight is applied to the power of the beating sound component added to the grid cell. This process is carried out over a predetermined frequency band and for a predetermined duration. In step 604 for grid detection, the grid cell with the maximum power is detected, and the azimuth and elevation angles of that cell are outputted as the azimuth and elevation angles of the beating sound, thereby specifying the sound source. If the power of the cell with the maximum power is below a predetermined threshold, it is decided that no beating sound exists.
The process sequence for the direction decision unit 207, the power decision unit 210, and the SNR decision unit 208 is not limited to the order shown in the figure.
The integration unit 211 decides whether the azimuth and elevation angles are within a beating area and regards a sound as a beating sound only if the angles are within the beating area. By making such a decision, it becomes possible to reject part of the time-frequency area where the voice components are widespread.
The integration unit 211 operates to output the grid cell with the maximum power. To do so, it obtains, for each sub microphone array, the direction along which the power is a maximum, integrates these maximum directions, and estimates the sound source direction of the beating sound by triangulation.