The present invention relates to a speech recognition apparatus for recognizing input speech and its method, and a program.
Conventionally, a speech recognition technique has typically been implemented by creating a program. In recent years, however, speech recognition has also been implemented by using a hypertext document such as VoiceXML. As described in Japanese Patent Applications Laid-Open No. 2001-166915 and No. 10-154063, VoiceXML basically uses only speech as its input and output means (strictly speaking, DTMF or the like is also used). However, it has also been contrived to use a hypertext document not only for speech inputting and outputting but also for describing a UI that uses a GUI.
In such a scheme, a markup language such as HTML is used for description of GUI, and in addition, some tags corresponding to the speech input and speech output are added in order to make possible speech inputting and outputting.
On the other hand, in the so-called multi-modal user interface using GUI with the speech inputting and outputting, it becomes necessary to describe how modalities, such as speech inputting using speech recognition, speech outputting using speech synthesis, inputting from the user using GUI, and presentation of information using graphics, are linked. For example, in Japanese Patent Application Laid-Open No. 2001-042890, there is disclosed a method in which a button is associated with an input column and a speech input and depression of the button causes the associated input column to be selected and a speech recognition result to be input into the column.
In an apparatus according to Japanese Patent Application Laid-Open No. 2001-042890, selecting any one item with a button causes speech to be input only into the input column associated with that item. Speech recognition, however, has the feature that not only words but also free speech such as a sentence can be input. For example, if the single utterance “From Tokyo to Osaka, one adult” is made in a ticket sales system using the multi-modal user interface, then four pieces of information in that utterance, i.e., a departure station, a destination station, a kind of ticket, and the number of tickets, can be input in a lump.
Furthermore, it is also possible to utter these pieces of information separately and input them one by one. When such continuous input is to be associated with input columns of a GUI, an association having a degree of freedom becomes necessary. For example, one utterance must not be limited to one input column but must be able to fill a plurality of input columns simultaneously. The above described proposal cannot cope with such an input method.
The present invention has been made in order to solve the above described problem. An object of the present invention is to provide a speech recognition apparatus capable of implementing speech inputting having a degree of freedom and its method, and a program.
A speech recognition apparatus according to the present invention achieving the object has the following configuration, including:
Hereafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
First Embodiment
The speech recognition system can conduct data communication via a network such as a public line, a radio LAN, or the like, and includes standard components (such as a CPU, a RAM, a ROM, a hard disk, an external storage device, a network interface, a display, a keyboard, and a mouse), which are mounted on a general purpose computer and a mobile terminal. Furthermore, various functions implemented in the speech recognition system described hereafter may be implemented by a program stored in the ROM or the external storage device in the system and executed by the CPU, or may be implemented by dedicated hardware.
First at step S100, document data 100 is read by using a document reading section 101. Document data is a hypertext document formed of descriptions of a description language such as the markup language. The document data contains descriptions that represent a GUI design, operation of speech recognition and synthesis, a location of a speech recognition grammar (storage location), and text data of a display subject/speech output subject.
Subsequently, at step S101, the read document data 100 is analyzed by using a document analysis section 102. At this step, the markup language in the document data 100 is analyzed to determine what structure the document data 100 has.
An example of the document data 100 to be analyzed is shown in
“Input” tags 402 and 403 shown in
At step S102, a control section 109 derives correspondence relations between input columns and grammars on the basis of the analysis result of the document analysis section 102. In the first embodiment, grammar “http://temp/long.grm#keiro” corresponds to a “form” having a name “keiro”, grammar “http://temp/station.grm#station” corresponds to an “input” having a name “departure”, and grammar “http://temp/station.grm#station” corresponds to an “input” having a name “destination”. These correspondence relations are held in a grammar/input column correspondence holding section 130 in a storage device 103 in, for example, a form shown in
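As a rough illustration of steps S101 and S102, the following Python sketch parses a small document and derives the correspondence relations between input elements and grammars. The markup shown here is hypothetical: the element names, the “name” values and the grammar locations follow the description above, but the exact attribute layout of the document data 100 is an assumption made for illustration, as are the class and function names.

```python
from html.parser import HTMLParser

# Hypothetical document data: the attribute layout is assumed for illustration.
DOCUMENT = """
<form name="keiro" grammar="http://temp/long.grm#keiro">
  <input name="departure" grammar="http://temp/station.grm#station">
  <input name="destination" grammar="http://temp/station.grm#station">
</form>
"""


class GrammarCollector(HTMLParser):
    """Collects element-name -> grammar pairs from "form" and "input" tags."""

    def __init__(self):
        super().__init__()
        self.correspondence = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("form", "input"):
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        grammar = attributes.get("grammar")
        if name and grammar:
            self.correspondence[name] = grammar


def derive_correspondence(document):
    """Analyze the document and return the grammar/input column correspondence."""
    collector = GrammarCollector()
    collector.feed(document)
    collector.close()
    return collector.correspondence


correspondence = derive_correspondence(DOCUMENT)
for name, grammar in correspondence.items():
    print(name, "->", grammar)
```

A plain dictionary keyed by element name is used here only because it is the simplest structure that can later be looked up by the selected column; the holding section 130 may of course store the relations differently.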
At step S103, grammar data 110 is read by the document reading section 101 and stored in a storage device 103. The grammar data 110 thus read are all grammars described in the document data 100. In the first embodiment, in the tags 401, 402 and 403 shown in
At step S104, an image based upon the analysis result of the document analysis section 102 is displayed in a display section/input section 104. A display example at this time is shown in
At step S105, the apparatus stands by for a speech input order from the user. The speech input order from the user is given by the display section/input section 104. As for the speech input order, for example, an input order for indicating whether the input is an input to an input element, such as the frame 501, the input column 502, or the input column 503 in
For example, in the case where it is desired to select the frame 501, a part thereof should be pressed with a pointing device. In the case where it is desired to select the input column 502 or 503, a part thereof should be depressed with the pointing device. If an input order is given from the user as heretofore described, the processing proceeds to step S106.
At the step S106, a grammar corresponding to the column selected by the input order is activated. “An activation of grammar” means that the grammar is made usable (made valid) in a speech recognition section 106. The correspondence relation between the selected column and the grammar is acquired in accordance with a correspondence relation held in the grammar/input column correspondence holding section 130.
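The activation at step S106 might be sketched in Python as follows. The `Recognizer` class is a hypothetical stand-in for the speech recognition section 106 and merely records which grammars are usable; the table repeats the correspondence relations of the first embodiment.

```python
# Correspondence relations of the first embodiment (element name -> grammar).
CORRESPONDENCE = {
    "keiro": "http://temp/long.grm#keiro",
    "departure": "http://temp/station.grm#station",
    "destination": "http://temp/station.grm#station",
}


class Recognizer:
    """Hypothetical stand-in for the speech recognition section."""

    def __init__(self):
        self.active_grammars = set()

    def activate_for(self, selected_name, correspondence):
        """Make usable only the grammar tied to the selected element."""
        self.active_grammars = {correspondence[selected_name]}
        return self.active_grammars


recognizer = Recognizer()
print(recognizer.activate_for("keiro", CORRESPONDENCE))
print(recognizer.activate_for("departure", CORRESPONDENCE))
```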
For example, in the case where the frame 501 is selected by the user, a grammar “long.grm” becomes active. Furthermore, in the same way, in the case where the input column 502 has been selected, the grammar “station.grm” becomes active. Also in the case where the input column 503 has been selected, the grammar “station.grm” becomes active. A description example of the grammar “long.grm” is shown in
In the grammar “long.grm” of
At step S107, the speech recognition section 106 conducts speech recognition on speech input by the user with the microphone 105, by using the active grammar.
At step S108, display and holding of a result of the speech recognition are conducted. The speech recognition result is basically displayed in the input column selected by the user at the step S105. If a plurality of input columns have been selected, then, on the basis of the grammar data 110 corresponding to those input columns, the destination input column of each word group obtained from the speech recognition result is determined, and each word group is displayed in its corresponding input column.
For example, if the user selects the input column 502 and utters “Tokyo”, then text data (Tokyo) corresponding to the utterance is displayed in the input column 502. If an utterance is made with the frame 501, represented by the “form” tag, selected, then, because the frame 501 includes a plurality of input columns, i.e., the input columns 502 and 503, the input column for displaying the text data corresponding to the utterance is determined in accordance with the following method. The method will now be described according to grammar description in
First, in the grammar description, a portion put in { } is analyzed, and inputting is conducted on a column in the { }. For example, if one utterance “from Tokyo to Osaka” is conducted, then “Tokyo” corresponds to {departure} and “Osaka” corresponds to {destination}. On the basis of this correspondence relation, “Tokyo” is displayed in the input column 502 named “departure” and “Osaka” is displayed in the input column 503 named “destination”. In addition, if “from Nagoya” is uttered, then it is associated with {departure} and consequently it is displayed in the input column 502. If “to Tokyo” is uttered, then it is associated with {destination} and consequently it is displayed in the input column 503.
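The mapping from an utterance to the columns named in { } can be sketched as follows. The actual grammar description is not reproduced here, so a regular expression with named groups is used as a hypothetical substitute: the group names play the role of {departure} and {destination}, and the station list contains only the names mentioned in the text.

```python
import re

# Station vocabulary limited to the names that appear in the description.
STATIONS = ("Tokyo", "Osaka", "Nagoya")
STATION = "|".join(STATIONS)

# Hypothetical stand-in for the grammar: "from <station>" fills {departure},
# "to <station>" fills {destination}; either part may be uttered alone.
PATTERN = re.compile(
    rf"^(?:from (?P<departure>{STATION}))?\s*(?:to (?P<destination>{STATION}))?$",
    re.IGNORECASE,
)


def fill_columns(utterance):
    """Return a mapping of input column name -> recognized text."""
    match = PATTERN.match(utterance.strip())
    if match is None:
        return {}
    return {slot: value for slot, value in match.groupdict().items() if value}


print(fill_columns("from Tokyo to Osaka"))
print(fill_columns("from Nagoya"))
print(fill_columns("to Tokyo"))
```

An utterance outside the grammar yields an empty mapping, so no input column is changed in that case.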
In other words, if the user has selected the frame 501, then text data corresponding to uttered content is displayed in the input column 502 and then in the input column 503, or simultaneously in the input columns 502 and 503, in accordance with the uttered content. In addition, input data (text data) of respective columns are held in an input data holding section 131 together with correspondence relations of the input columns. For example, if “from Tokyo to Osaka” is uttered, then an example of input data held in the input data holding section 131 is shown in
At step S109, at a time point when an order of input data transmission is given by the user, input data held in the input data holding section 131 is transmitted to an application 108 by an input data transmission section 107. In this case, for example, input data shown in
At step S110, operation of the application 108 is conducted on the basis of the received input data. For example, retrieval of railroad routes from Tokyo to Osaka is conducted, and a result of the retrieval is displayed in the display section/input section 104.
According to the first embodiment, even if a plurality of pieces of information are input in a lump with speech in the multi-modal interface using the GUI with speech recognition, the pieces of information can be input into optimum input columns in the GUI as heretofore described. In addition, since the multi-modal interface is provided in a description language such as a markup language, the UI can be customized simply.
Second Embodiment
In the first embodiment, the case where an input column is selected by the user has been described. However, a method in which the user does not effect a selection is also possible. An example of the document data 100 in this case is shown in
As for the grammars described in 603 and 604 in
In
At step S202, the control section 109 derives a correspondence relation between the input column and grammar on the basis of an analysis result of the document analysis section 102. However, the correspondence relation differs from
At step S203, the grammar data 110 is read by the document reading section 101. In the second embodiment, all grammars described in the document data 100, inclusive of “http://temp/long.grm#keiro” in
At step S204, an image based upon an analysis result of the document analysis section 102 is displayed in the display section/input section 104. An example of display at this time is shown in
At step S205, the apparatus stands by for a speech input order from the user. Here, in the same way as the first embodiment, the user can select the input columns 702 and 703. However, the user cannot select the input columns 702 and 703 in a lump. If there is an input order from the user, the processing proceeds to step S206.
At the step S206, a grammar corresponding to the column selected by the input order is activated. A correspondence relation between the selected column and the grammar is acquired in accordance with a correspondence relation held in the grammar/input column correspondence holding section 130. By the way, if a tag name corresponding to a grammar is blank, then the grammar is always made active. In other words, in the second embodiment, “http://temp/long.grm#keiro” becomes active.
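The rule that a grammar with a blank tag name is always active can be sketched as follows. The table contents are an assumption: only the blank entry for “http://temp/long.grm#keiro” is stated above, and the per-column grammars are carried over from the first embodiment for illustration.

```python
# Each entry pairs a tag name with a grammar. A blank tag name marks a grammar
# that is always active. The per-column entries are assumed for illustration.
CORRESPONDENCE = [
    ("", "http://temp/long.grm#keiro"),
    ("departure", "http://temp/station.grm#station"),
    ("destination", "http://temp/station.grm#station"),
]


def active_grammars(selected_name=None):
    """Return the grammars to activate for the given selection (or none)."""
    active = []
    for tag_name, grammar in CORRESPONDENCE:
        if tag_name == "" or tag_name == selected_name:
            if grammar not in active:
                active.append(grammar)
    return active


print(active_grammars())             # no column selected
print(active_grammars("departure"))  # column selected by the user
```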
Thereafter, steps S207 to S210 correspond to the steps S107 to S110 in
As heretofore described, according to the second embodiment, in a multi-modal interface using the GUI with speech recognition, when an input location is fixed in advance or when it is intentionally desired to prohibit the user from selecting an input column, respective pieces of information can be input into optimum input columns in the GUI even though selection of an input column is prohibited and a plurality of pieces of information are input in a lump with speech.
Third Embodiment
As for which input column displays the speech recognition result in the first embodiment, a portion put in { } in grammar description is analyzed and inputting is conducted to the column described in { }. Even if there is no description in { }, however, the same can be implemented. For example, if the grammar in
Subsequently, the markup language description in
Fourth Embodiment
In the first embodiment, corresponding grammar is prepared in order to specify the grammar for inputting speech inputs to a plurality of input columns in a lump. In the case where a combination of input columns or a word order is altered, however, it is necessary to newly generate a corresponding grammar.
Therefore, a fourth embodiment will now be described as an application example of the first embodiment. In the fourth embodiment, in the case where a grammar is prepared for each input column, a grammar for inputting items in a lump is generated automatically, which facilitates the alteration of a combination of input items or a word order.
In
First, an example of the document data 100 to be analyzed at step S301 of the fourth embodiment is shown in
At step S302, the control section 1209 derives correspondence relations between input columns and grammars on the basis of the analysis result of the document analysis section 1202. Processing on the “input” tags 1402 and 1403 is the same as the processing on the “input” tags 402 and 403 of the first embodiment, and consequently description thereof will be omitted. Especially in the fourth embodiment, “merge” is specified for an attribute “grammar” of a “form” having a name of “keiro”. If the “merge” is specified, then in the ensuing processing a grammar for a “form” created by using a grammar described in the “form” is associated. At this stage, the grammar for the “form” does not exist. And correspondence relations held in the grammar/input column correspondence holding section 1230 are held in, for example, a form shown in
At step S303, grammar data 1210 is read by the document reading section 1201 and stored in the storage device 103. The grammar data 1210 thus read are all grammars described in the document data 100.
If as a result of the analysis effected by the document analysis section 1202 “merge” is specified in the attribute “grammar” of the “form”, a grammar merge section 1211 newly creates a grammar for the “form” that accepts individual inputs to respective “inputs” in the “form” and a lump input of all inputs. By using attribute information of an “input” tag described in the “form”, for example, a grammar for the “form” as shown in
It is now supposed that individually read grammar data 1210 and grammar data created at the step S304 are 1221, 1222, . . . , 122n. Assuming that grammar data “keiro.grm” created at the step S304 corresponds to the grammar “long.grm”, which corresponds to the “form” described in the first embodiment, and “keiro.grm” is a grammar corresponding to the “form”, processing of subsequent steps S307 to step S311 corresponds to the steps S106 to the step S110 of the first embodiment shown in
According to the fourth embodiment, it is possible to automatically generate the grammar for the “form” from grammars used in “inputs” in the “form” as heretofore described, even if the grammar corresponding to the “form” is not previously prepared and specified. Furthermore, if a previously created grammar is specified as in the document data in
In other words, in the multi-modal interface using the GUI with speech recognition, lump inputting of a plurality of items can be implemented without previously preparing a corresponding grammar, by automatically generating a grammar for inputting a plurality of items in a lump with speech, from grammars associated with respective items. In addition, since the multi-modal interface is provided in a description language such as a markup language, the UI can be customized simply.
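The automatic generation of a grammar for the “form” can be illustrated with the following Python sketch. It is not the grammar format used by the grammar merge section 1211: each per-column grammar is modeled, as an assumption, by a carrier word (“from”, “to”) plus a word list, and the merged grammar is modeled as a list of regular expressions that accept each input individually and all inputs in one utterance. The names `InputGrammar`, `merge_grammars` and `recognize` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class InputGrammar:
    name: str     # input column name, reused as the slot name
    prefix: str   # hypothetical carrier word spoken before the value
    words: tuple  # vocabulary accepted by the column's own grammar


def merge_grammars(inputs):
    """Build patterns for each input alone and for all inputs in a lump."""

    def fragment(item):
        alternatives = "|".join(re.escape(word) for word in item.words)
        return rf"{re.escape(item.prefix)} (?P<{item.name}>{alternatives})"

    patterns = [fragment(item) for item in inputs]
    # Lump input: all fragments in document order, optionally comma separated.
    patterns.append(",? ".join(fragment(item) for item in inputs))
    return [re.compile(f"^{pattern}$") for pattern in patterns]


def recognize(utterance, patterns):
    """Return column name -> text for the first pattern that matches."""
    for pattern in patterns:
        match = pattern.match(utterance)
        if match:
            return match.groupdict()
    return {}


stations = ("Tokyo", "Osaka", "Nagoya")
merged = merge_grammars([
    InputGrammar("departure", "from", stations),
    InputGrammar("destination", "to", stations),
])
print(recognize("from Tokyo to Osaka", merged))
print(recognize("to Nagoya", merged))
```

Because the merged patterns are derived from the list of inputs, altering the combination or order of input columns only requires passing a different list.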
Fifth Embodiment
In the fourth embodiment, in the case where there is explicitly a description (“merge” in the fourth embodiment) of merging grammars in the attribute “grammar” of the “form” when the document 1200 is analyzed at the step S301, merging of the grammar data is conducted. However, merging of the grammar data is not restricted to this. For example, in the case where there is no specification of the attribute “grammar” of the “form”, merging of grammars may be automatically conducted.
Sixth Embodiment
In the fourth embodiment, grammar data in which all grammar data described in the “form” are merged is generated by referring to values of the attribute “grammar” of the “form”. However, this is not restrictive. For example, it is also possible to previously determine tags that specify the start position and end position of a range in which grammars are merged, and merge the grammars only in the range interposed between the tags. An example of document data in this case is shown in
In 1701, “merge” is specified in the “grammar” in the same way as the fourth embodiment. In the sixth embodiment, a grammar obtained by merging all grammars used in the “form” is associated with the “form”. Furthermore, a start point and an end point of a range in which grammars are partially merged are specified by 1702 and 1705. A grammar obtained by merging grammars described in the range interposed between “<merge-grammar>” and “</merge-grammar>” is created and used as a grammar to be used in the corresponding input range. An example in which
Input columns corresponding to “inputs” described in 1703, 1704 and 1706 are 1801, 1802 and 1803, respectively. Furthermore, the range in which the grammars interposed between “<merge-grammar>” and “</merge-grammar>” are merged is surrounded by a frame 1804. In addition, a region that belongs to the “form” is displayed by a frame 1805. In the same way as the first embodiment, the activated grammar is altered depending upon which region the user selects. For example, in the case where the frame 1804 is selected, it becomes possible to conduct inputting in forms “from ◯◯”, “to XX”, and “from ◯◯, to XX”. In the case where the whole “form” (1805) is selected, it becomes possible to conduct inputting in forms “Δ tickets” and “from ◯◯ to XX, Δ tickets” as well.
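The way the selectable regions nest can be sketched as follows. The reference numerals are taken from the description; the data layout, the function names, and the comma used to join the lump form are assumptions made for illustration.

```python
# Utterance form accepted by each individual input column.
LEAF_FORMS = {1801: "from ◯◯", 1802: "to XX", 1803: "Δ tickets"}

# Enclosing regions and their children: frame 1804 is the merge-grammar range,
# frame 1805 is the whole "form".
REGIONS = {1804: [1801, 1802], 1805: [1804, 1803]}


def leaves(region):
    """Return the input columns contained in a region, in document order."""
    if region in LEAF_FORMS:
        return [region]
    result = []
    for child in REGIONS[region]:
        result.extend(leaves(child))
    return result


def accepted_forms(region):
    """Return the utterance forms usable when the given region is selected."""
    if region in LEAF_FORMS:
        return [LEAF_FORMS[region]]
    forms = []
    for child in REGIONS[region]:
        for form in accepted_forms(child):
            if form not in forms:
                forms.append(form)
    # Lump form covering every input column inside the region.
    forms.append(", ".join(LEAF_FORMS[leaf] for leaf in leaves(region)))
    return forms


print(accepted_forms(1804))
print(accepted_forms(1805))
```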
Seventh Embodiment
There will now be described an example (
In either case, the grammar for the “form” generated in accordance with a result of analysis of the document data 1200 becomes the same as the grammar shown in
By the way, the present invention includes the case where the present invention is achieved by supplying a software program for implementing the function of the above described embodiments (a program corresponding to the illustrated flow chart in the embodiment) directly or remotely to a system or an apparatus and by having a computer of the system or apparatus read out and execute the supplied program code. In that case, the form need not be a program so long as it has a function of the program.
Therefore, a program code itself installed in the computer in order to implement the function processing of the present invention in the computer also implements the present invention. In other words, the present invention includes the computer program itself for implementing the function processing of the present invention as well.
In that case, the program may have any form, such as an object code, a program executed by an interpreter, or script data supplied to the OS, so long as it has a function of the program.
As a recording medium for supplying the program, there is, for example, a floppy disk, a hard disk, an optical disk, an optical magnetic disk, an MO, a CD-ROM, a CD-R, a CD-RW, magnetic tape, a nonvolatile memory card, a ROM, a DVD (DVD-ROM or DVD-R) or the like.
Besides, as a method for supplying the program, the program can also be supplied by connecting a client computer to a home page of the Internet by means of a browser of the client computer and downloading the computer program itself of the present invention, or a compressed file including an automatic installing function, from the home page onto a recording medium such as a hard disk. It can also be implemented by dividing a program code forming the program of the present invention into a plurality of files and downloading respective files from different home pages. In other words, a WWW server that allows a plurality of users to download a program file for implementing the function processing of the present invention in a computer is also included in the present invention.
Furthermore, it is also possible to encrypt the program of the present invention, store the encrypted program in a storage medium such as a CD-ROM, distribute it to users, make a user who has cleared a predetermined condition download key information for decrypting the program from a home page via the Internet, execute the encrypted program by using the key information, and thereby install the program in the computer and implement it.
The computer executes the read program, and consequently the function of the embodiments is implemented. Besides, an OS running on the computer conducts a part or whole of the actual processing on the basis of the order from the program, and consequently the function of the embodiments can also be implemented by the processing.
In addition, a program read out from a recording medium is written into a memory included in a function expansion board inserted into a computer or included in a function expansion unit connected to a computer, and then a CPU included in the function expansion board or the function expansion unit conducts a part or whole of actual processing, and consequently the function of the embodiments can also be implemented by the processing.
As heretofore described, according to the present invention, it is possible to provide a speech recognition apparatus capable of speech inputting having a degree of freedom and its method, and a program.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2001-357746 | Nov 2001 | JP | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP02/11822 | 11/13/2002 | WO | |