The present application claims priority from Japanese application JP 2005-108145 filed on Apr. 5, 2005, the content of which is hereby incorporated by reference into this application.
The present invention relates to a method and a device for providing information that matches the tastes of users, mainly in the form of images, in public or private spaces, and to a method and a device for providing general information such as advertisements in the same way.
The most common means for providing image information in public spaces such as railway stations, airports, department stores, museums, or amusement parks either maintain a unilateral flow of information without regard to the will of users or require the users to choose explicitly the information they want by operating a button.
There have, however, been attempts to acquire automatically the subjects of interest or the attributes of the users and to change the information provided accordingly. For example, Patent Document 1 (Japanese Patent Application Laid-Open No. 2004-280673) discloses a method of capturing images of users with a camera and estimating their degree of interest by detecting the direction of their attention.
When providing the general public or individuals with information mainly in the form of images, if it is possible to detect whether the users who are in a position to view the image are actually watching it, convenience for the users can be enhanced by providing more detailed information on the subject matter being displayed at that time. It also becomes possible to feed the tastes of the users back into the marketing of the information provider. In the past, the approach of accepting the users' explicit choices by installing a selecting device such as a button on the information providing device has been used. However, this method is ineffective for users whose interest is not strong enough for them to take the trouble of pressing a button, and many users are not even aware that the information system can be operated by pressing a button. Thus, if it is possible to detect automatically whether the users are watching the image and to change the displayed image automatically according to the result, it becomes possible to respond to the tastes of a far wider range of users.
The voice data obtained by the voice inputting unit are compared with the image data currently being provided and with information attached to the image data, and the degree of attention paid by the subjects is estimated from the degree of similarity. The degree of attention can be estimated by detecting the agreement of scene boundaries between the voice data and the image data, the similarity of sound frequency patterns, the presence in the voice of key words representing the contents of the image, and other similar phenomena. In addition, the language used by the subjects is estimated by a language identifying device, and by using that language for the information provided, image information optimized from the acquired voice information can be presented in a form that the users are likely to accept easily.
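As a minimal sketch of how the three cues named above might be fused into a single attention estimate, the following illustration may help; the function names, the equal weighting, and the keyword cap are hypothetical assumptions, not the embodiment's actual implementation.

```python
# Minimal sketch of combining the attention cues described above.
# All names and weights are illustrative assumptions.

def estimate_attention(scene_boundary_agreement: float,
                       spectrum_similarity: float,
                       keyword_hits: int,
                       max_keywords: int = 5) -> float:
    """Fuse three cues into a single attention score in [0, 1].

    scene_boundary_agreement: fraction of voice-scene boundaries that
        coincide with image-scene boundaries (0..1).
    spectrum_similarity: similarity of the sound frequency patterns (0..1).
    keyword_hits: number of image-content key words detected in the voice.
    """
    keyword_score = min(keyword_hits, max_keywords) / max_keywords
    # Equal weighting is an arbitrary choice for illustration.
    return (scene_boundary_agreement + spectrum_similarity + keyword_score) / 3.0

# Strong boundary agreement, moderate spectral similarity, two key words
# detected -> a fairly high estimated degree of attention.
print(estimate_attention(0.9, 0.5, 2))  # ~0.6
```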
The present invention makes it possible to provide information that will attract the interest of a larger number of users. And because more details about the tastes of the users can be learned, it becomes possible to collect information for aligning sales programs and the like with those tastes.
An embodiment of the present invention will be described in detail below with reference to the drawings.
The subjects' attribute analyzing unit estimates the language used, sex, spatial position, and other attributes of the users. Meanwhile, the voice-image correlation analyzing unit compares the voice data sent from the voice inputting unit with the image data sent from the image outputting unit described later to determine the correlation between them. If any information is sent from the image inputting unit, the precision of estimating the correlation can be raised by using that information in a method described later. If the voice-image correlation analyzing unit finds the correlation to be high, the users are highly likely to be talking about a subject related to the contents of the output image, and it can therefore be assumed that they are interested in the current image. If, on the contrary, the correlation is low, the users may not be watching the image, or may be watching it without interest and talking about something unrelated to it.
The results of the analyses by the subjects' attribute analyzing unit and the voice-image correlation analyzing unit are sent to the output image selecting unit 114. Here, the next image to be output is determined based on the analysis results of the preceding stage. For example, if the voice-image correlation analyzing unit finds that the image and voice are strongly correlated, the users are considered to be interested in the contents of the current image, and more detailed information relating to those contents will therefore be provided. If, on the contrary, the correlation is weak, the flow of summary-type information will be continued, or the subject of the image will be changed. And if the language reported by the subjects' attribute analyzing unit differs from the language used in the sub-title of the image currently displayed, the sub-title language will be changed to the language used by the users. Based on the result of this selection, the image outputting unit 116 generates the next image and displays it on the displaying device. The same output image data 118 as displayed are also sent to the voice-image correlation analyzing unit to be used in the following operation.
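The selection rule just described can be sketched as follows; the threshold value, the mode labels, and the language codes are illustrative assumptions standing in for whatever the output image selecting unit 114 actually uses.

```python
# Hypothetical sketch of the selection logic of the output image
# selecting unit 114. Thresholds and labels are assumptions.

def select_next_image(correlation: float, detected_language: str,
                      subtitle_language: str, threshold: float = 0.5) -> dict:
    """Decide the next output mode and its sub-title language."""
    decision = {}
    if correlation >= threshold:
        # Users appear interested: drill down into the current topic.
        decision["mode"] = "detailed"
    else:
        # Weak correlation: keep the summary flow or switch topics.
        decision["mode"] = "summary_or_new_topic"
    # Align the sub-title with the language the users were heard speaking.
    decision["subtitle_language"] = detected_language
    return decision

print(select_next_image(0.8, "en", "ja"))
# {'mode': 'detailed', 'subtitle_language': 'en'}
```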
The analysis results of the subjects' attribute analyzing unit and the voice-image correlation analyzing unit are at the same time sent to the attention information arranging unit 110. Here, statistical information on the attributes of, and the degree of attention paid by, the users who have seen the displayed image is compiled. The statistical information obtained is provided through the communicating unit 112 to the source of distribution of the image and is used in planning future image distribution programs.
The computing device analyzes the attributes of the subjects, analyzes the correlation between voice and image, compiles the attention information, selects the output images, and performs other similar operations by executing the respective prescribed programs.
The word spotting module 316 compares the key word information 308 sent along with the output image data 118 against the voice data and judges whether the voice data contain any of the key words.
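The following sketch illustrates this step under the simplifying assumption that the voice data have already been passed through a speech recognizer yielding a text transcript; the key word list and transcript are invented for illustration.

```python
# Sketch of the word-spotting step, assuming a recognizer has already
# produced a transcript of the voice data. Data are illustrative.

def spot_keywords(transcript: str, keywords: list[str]) -> list[str]:
    """Return the image-content key words found in the spoken transcript."""
    words = transcript.lower().split()
    return [kw for kw in keywords if kw.lower() in words]

# Key words accompanying the output image (cf. key word information 308).
image_keywords = ["aquarium", "dolphin", "ticket"]
hits = spot_keywords("look at that dolphin over there", image_keywords)
print(hits)  # ['dolphin']
```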
The scene-splitting module 318 splits the voice data into scenes based on information such as amplitude and spectrum. The simplest method is to judge that a scene has ended when the amplitude has remained below a fixed value for longer than a fixed length of time. A more sophisticated approach applies results from the field called "auditory scene analysis." Scene splitting based on auditory scene analysis is described in detail in Bregman, "Auditory Scene Analysis: Perceptual Organization of Sound" (MIT Press, 1994, ISBN 0-262-52195-4) (Non-patent Document 1) and similar literature.
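The simplest amplitude-based rule described above can be sketched as follows; the per-frame representation, the threshold, and the duration value are illustrative assumptions rather than the embodiment's actual parameters.

```python
# Sketch of the simplest scene-splitting rule: a scene ends when the
# amplitude stays below a threshold for a fixed number of frames.

def split_scenes(amplitudes, threshold, min_silent_frames):
    """Split a per-frame amplitude sequence into (start, end) frame ranges."""
    scenes, start, silent = [], 0, 0
    for i, a in enumerate(amplitudes):
        if a < threshold:
            silent += 1
            if silent == min_silent_frames and start is not None:
                end = i - min_silent_frames + 1   # frame where the silence began
                if end > start:
                    scenes.append((start, end))
                start = None                      # wait for the next loud frame
        else:
            if start is None:
                start = i                         # a new scene begins here
            silent = 0
    if start is not None:
        scenes.append((start, len(amplitudes)))
    return scenes

frames = [0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.6]
print(split_scenes(frames, threshold=0.5, min_silent_frames=3))
# [(0, 2), (5, 7)]
```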
On the other hand, the output image data 118 sent from the image outputting unit 116 are similarly split into scenes. Generally, the images output by the image outputting unit have been created in advance with considerable time and effort, so information on the scene boundaries can be supplied with them. In such a case, the scenes can be split simply by reading this information. If for some reason the scenes are not split in advance, they can be split automatically. As methods for automatically splitting images recorded on video tapes and the like into scenes, those described in Ueda et al., "IMPACT: An Interactive Natural-Motion-Picture Dedicated Multimedia Authoring System" (CHI '91, ACM, pp. 343-350, 1991) (Non-patent Document 2) and similar literature can be used. And if the image data 302 are available, the images can be split into scenes by applying similar methods to these data.
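A common stand-in for such automatic splitting is cut detection by inter-frame difference; the following sketch uses that technique for illustration only and is not the IMPACT method cited above. The threshold and toy frames are assumptions.

```python
# Hypothetical sketch of automatic scene splitting for image data with no
# pre-annotated boundaries, using a simple frame-difference measure.

import numpy as np

def detect_cuts(frames: np.ndarray, cut_threshold: float) -> list[int]:
    """frames: array of shape (n_frames, h, w); returns cut positions."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    return [i + 1 for i, d in enumerate(diffs) if d > cut_threshold]

# Toy example: three dark frames followed by three bright frames.
frames = np.concatenate([np.zeros((3, 4, 4)), np.ones((3, 4, 4)) * 255])
print(detect_cuts(frames, cut_threshold=100.0))  # [3]
```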
Based on the scene-splitting results thus obtained for the image data, the voice data, and the output image data, the correspondence between them is examined by the scene collating module 322. The method of examining this correspondence will be described in detail later. The voice data 304 are also sent to a frequency analyzing module 320, where various voice parameters are extracted. The parameters include, for example, the power of the whole voice, the power limited to a specific frequency band, and the fundamental frequency. Corresponding data are assigned in advance to the output image data, and the two are compared by the frequency collating module 324 to estimate the correlation. The results obtained by the attention direction estimating module 314, the word spotting module 316, the scene collating module 322, and the frequency collating module 324 are sent to the correlation judging module 326, which consolidates them and renders the final judgment.
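The three voice parameters named above can be extracted roughly as follows; the sampling rate, band limits, and autocorrelation-based fundamental-frequency estimate are illustrative assumptions.

```python
# Sketch of the frequency-analysis step: overall power, band-limited
# power, and a crude fundamental-frequency (F0) estimate for one frame.

import numpy as np

def voice_parameters(frame: np.ndarray, sr: int = 16000,
                     band=(300.0, 3400.0)) -> dict:
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total_power = spec.sum()
    band_power = spec[(freqs >= band[0]) & (freqs <= band[1])].sum()
    # Crude F0: autocorrelation peak within a plausible pitch range.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sr // 400, sr // 80          # search lags for 80..400 Hz
    f0 = sr / (lo + np.argmax(ac[lo:hi]))
    return {"total_power": total_power, "band_power": band_power, "f0": f0}

t = np.arange(0, 0.032, 1 / 16000)
print(voice_parameters(np.sin(2 * np.pi * 120 * t)))  # f0 close to 120 Hz
```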
The spatial attribute analysis is conducted on the inputs from a plurality of microphones by two modules, the amplitude detecting module 910 and the phase difference detecting module 912, and the position judging module 914 estimates the position of the users from their results. In doing so, reference is made to the equipment arrangement information DB 916, which records the actual positional relationship in which equipment such as the microphones is arranged. The simplest method of judging position is, for example, to ignore the phase-difference result, choose the microphone showing the maximum amplitude from the amplitude detection results, and look up the position of that microphone in the equipment arrangement information DB. A more precise method is to estimate the distance between each microphone and the sound source from the amplitude detection results, using the principle that the energy of sound is inversely proportional to the square of the distance from the source. It is also possible to estimate the direction of the sound source by detecting the phase difference of the sound arriving at two microphones and comparing it with the wavelength of the sound. Although the values obtained by these methods are not necessarily precise owing to the effects of noise, reliability can be raised by combining a plurality of estimated results. Algorithms for estimating the position of a sound source using a plurality of microphones are described in detail in such documents as Kobayashi et al., "Estimation of the position of a plurality of speakers by the free arrangement of a plurality of microphones" (Transactions of the Institute of Electronics, Information and Communication Engineers A, Vol. J82-A, No. 2, pp. 193-200, 1999) (Non-patent Document 3). Incidentally, when the image data 302 are available, the position of the users can also be determined directly from them at the same time.
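The simplest loudest-microphone method and the inverse-square refinement can be sketched together as follows; the microphone coordinates, energy readings, and calibration constant are invented stand-ins for the contents of the equipment arrangement information DB 916.

```python
# Sketch of the position-judging step: pick the loudest microphone, look
# up its position, and estimate rough distances via the inverse-square
# law. All values are hypothetical.

mic_positions = {"mic_A": (0.0, 0.0), "mic_B": (5.0, 0.0), "mic_C": (0.0, 5.0)}

def locate_user(energies, calibration=1.0):
    """energies: microphone id -> measured sound energy."""
    loudest = max(energies, key=energies.get)
    # Inverse-square law: energy ~ 1/d**2, hence d ~ sqrt(calibration/energy).
    distances = {m: (calibration / e) ** 0.5 for m, e in energies.items()}
    return mic_positions[loudest], distances

pos, dists = locate_user({"mic_A": 0.04, "mic_B": 0.25, "mic_C": 0.01})
print(pos)    # (5.0, 0.0) -- the user is nearest mic_B
print(dists)  # rough per-microphone distance estimates
```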
The personal attribute analysis, on the other hand, acquires information belonging to each individual user by analyzing the features of the voice. Examples of such information are the language used, sex, and age. These analyses can be executed by comparing the input voice with the previously created language-based models 924, sex-based models 926, and age-based models 928 in the language identification module 918, the sex identification module 920, and the age identification module 922, computing the degree of similarity to each model, and choosing the category with the highest similarity. Precision can be raised by simultaneously estimating the phonemic pattern contained in the voice at the time of comparison. In other words, when recognizing the voice with the commonly used Hidden Markov Model, a plurality of sound models, such as a Japanese sound model and an English sound model, a masculine sound model and a feminine sound model, and sound models for persons in their teens, twenties, and thirties, are used in parallel, and the category of language, sex, or age corresponding to the model that yields the highest reliability score for the recognition result is selected. To achieve a high degree of precision in language identification, the method must be refined further. Algorithms for language identification are described in detail in such literature as Zissman, "Comparison of four approaches to automatic language identification of telephone speech" (IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, pp. 31-44, 1996) (Non-patent Document 4).
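The choose-the-best-scoring-category pattern common to all three identification modules can be sketched as follows; the distance-based score is a toy stand-in for an HMM likelihood, and the feature vectors and model means are invented for illustration.

```python
# Sketch of attribute identification: score the input voice against
# pre-built category models and pick the best-scoring category. The
# negative distance stands in for an HMM reliability score.

import numpy as np

def identify(feature: np.ndarray, models: dict) -> str:
    """models: category name -> mean feature vector (toy 'model')."""
    scores = {cat: -np.linalg.norm(feature - mean)
              for cat, mean in models.items()}
    return max(scores, key=scores.get)

language_models = {"japanese": np.array([1.0, 0.2]),
                   "english": np.array([0.3, 0.9])}
print(identify(np.array([0.4, 0.8]), language_models))  # 'english'
```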
The operation of the output image selecting unit 114 will now be described in detail. Here, the method of presenting images that provides information to the users most efficiently is selected based on the results obtained by the subjects' attribute analyzing unit and the voice-image correlation analyzing unit. As a first example, when the language used by the users has been identified, the language of the textual information included in the image is changed to that language. When voice is output in addition to the image, a sub-title in the users' language can be added if the language of the output voice differs from the language used by the users. Next, when the users' voice and the image are found to be strongly correlated, the users are considered to be interested in the current image, and more detailed information relating to the matters shown in it is provided. When, on the contrary, the users are not interested in the current image, the provision of only summary-type information is continued, or images on other topics are provided. If the sex and age of the users can be estimated to some extent when selecting another topic, it becomes possible to provide information that is highly likely to attract the interest of the corresponding class of users.
It is possible not only to select a single image displayed on the whole screen in this way but also to divide a large display and use its regions efficiently.
In order to control the displayed image based on the degree of attention of the users, it suffices to store, in a storage device accessible from the output image selecting unit 114, the information and image data to be displayed additionally (or displayed by transforming the default image) in association with the default output image. Likewise, in order to control the displayed image in response to the users' attributes, it suffices to store the information and image data to be displayed additionally (or displayed by transforming the default image) in the storage device in association with each attribute.
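One possible shape for such stored associations is sketched below; the keys, attribute tuples, and file names are all hypothetical and merely illustrate content keyed by both attention level and user attribute.

```python
# Sketch of the stored association: for each default image, additional
# content keyed by attention level and by user attribute. All values
# are hypothetical.

display_db = {
    "default_ad_001": {
        "high_attention": "ad_001_detailed.png",  # drill-down content
        "low_attention": "ad_001_summary.png",    # stay with the summary
        "attributes": {
            ("en", "female", "20s"): "ad_001_en_f20.png",
            ("ja", "male", "30s"): "ad_001_ja_m30.png",
        },
    },
}

def lookup(default_image, attention_high, attribute_key):
    entry = display_db[default_image]
    if attribute_key in entry["attributes"]:
        return entry["attributes"][attribute_key]
    return entry["high_attention" if attention_high else "low_attention"]

print(lookup("default_ad_001", True, ("en", "female", "20s")))
```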
Since it must be expected that the voice-image correlation analyzing unit and the subjects' attribute analyzing unit will always produce wrong results at a certain rate, it is desirable to provide a function that prevents the users from receiving a bad impression in such cases.
, “English” and
. And such a button is often realized as a button on the screen having a touch panel function. Therefore, in such a case, when a language different from the currently set language is detected by the identification of language, the displayed language will be changed and at the same time the size of the language selection button will be enlarged for displaying the same. In this way, the user will easily realize that the language has been automatically changed and that, if he or she is not happy with the change, the language can be changed again by operating the button. Thus, even if the user is unhappy with the automatically changed language, he or she can quickly revert to the desired language. Incidentally, as in the case of the example of
The functions of the attention information arranging unit 110 and the communication unit 112 will now be described in detail. The implementation of the present invention makes it possible to learn which users showed interest in which parts of the displayed image. This information can be obtained by comparing the outputs of the subjects' attribute analyzing unit and the voice-image correlation analyzing unit, and it is very useful for the provider of the image. For example, when an advertisement image is displayed for the purpose of selling a product, it is possible to find out whether the users are interested in it and to have that fact reflected in future product development. And since the value of the display as an advertising medium can be expressed numerically in detail, the result can be reflected in the price of advertisement. To use the present system for such purposes, the attention information arranging unit extracts the information on which parts of the image attracted interest and from how many users, removes useless information, compiles the remainder, and sends the result to the management department through the communication unit.
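As a sketch of this compilation step, the following tallies, per image segment, how many users of each attribute class paid attention before the summary is sent onward; the record fields and sample observations are invented for illustration.

```python
# Sketch of the attention-information arranging step: per-segment tallies
# of attentive users by attribute class. Data are hypothetical.

from collections import Counter

# (image_segment, language, attended?) tuples from the analyzing units.
observations = [
    ("product_intro", "en", True),
    ("product_intro", "ja", True),
    ("price_info", "en", False),
    ("product_intro", "en", True),
]

summary = Counter((seg, lang) for seg, lang, attended in observations if attended)
for (segment, language), count in summary.items():
    print(f"{segment}: {count} attentive user(s) speaking {language}")
```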
The present invention can be used in devices for efficiently providing guidance information in public spaces and the like. It can also be used to improve the efficiency of providing advertisement information through images.
Number | Date | Country | Kind
---|---|---|---
2005-108145 | Apr. 5, 2005 | JP | national