This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2011-044430, filed on Mar. 1, 2011; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a television apparatus and a remote operation apparatus each operable by speech.
In a conventional technique, a user's utterance is recognized and used for operating a device. If the device (the operation target) outputs a sound (a broadcast speech, an artificial speech, and so on), this sound is noise for the recognition of the user's speech. Accordingly, a technique has been proposed to improve the accuracy of speech recognition by using an echo canceller to cancel the device's output sound from an input signal in which that sound is mixed with a speech uttered by a speaker (the user). However, in this case, computing processing for the echo canceller is necessary, so this technique is difficult to realize on a device having restricted throughput.
On the other hand, a device that mutes its own sound while recognizing a user's speech is also used. In such a device, no sound is output while the user's speech is being recognized, so the user's speech is recognized without the influence of that sound. However, if the device (the operation target) is a television set, the user (viewer) cannot listen to the sound (speech) broadcast from the television set during the speech recognition.
According to one embodiment, a television apparatus includes a speech input unit, an indication input unit, a speech recognition unit, and a control unit. The speech input unit is configured to input a speech. The indication input unit is configured to input, from a user, an indication to start speech recognition. The speech recognition unit is configured to recognize the user's speech inputted after the indication is inputted. The control unit is configured to execute an operation command corresponding to a recognition result of the user's speech. If a volume of the television apparatus at the timing when the indication is inputted is larger than or equal to a threshold, the control unit temporarily sets the volume to a value smaller than the threshold while the speech recognition unit is recognizing.
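The following is a minimal sketch, in Python, of this threshold-based volume control. The class, the method names, and the two numeric constants are illustrative assumptions and are not taken from the embodiments; only the compare-save-restore behavior follows the summary above.

```python
# A minimal sketch of the volume-control behavior described above. VolumeController,
# its method names, and the numeric values are hypothetical.
RECOGNITION_VOLUME_THRESHOLD = 10   # assumed threshold value
REDUCED_VOLUME = 5                  # assumed temporary volume below the threshold


class VolumeController:
    def __init__(self, current_volume: int):
        self.current_volume = current_volume
        self._saved_volume = None

    def on_recognition_start(self):
        """Temporarily lower the volume only when it is at or above the threshold."""
        if self.current_volume >= RECOGNITION_VOLUME_THRESHOLD:
            self._saved_volume = self.current_volume
            self.current_volume = REDUCED_VOLUME

    def on_recognition_end(self):
        """Restore the volume saved before recognition started, if any."""
        if self._saved_volume is not None:
            self.current_volume = self._saved_volume
            self._saved_volume = None


# Usage example: volume 20 is above the threshold, so it is lowered and then restored.
controller = VolumeController(current_volume=20)
controller.on_recognition_start()
assert controller.current_volume == REDUCED_VOLUME
controller.on_recognition_end()
assert controller.current_volume == 20
```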
Various embodiments will be described hereinafter with reference to the accompanying drawings.
In the first embodiment, the speech recognition apparatus 100 includes a microphone 101, a speech input unit 102, a speech recognition start-detection unit 103, an utterance detection unit 104, a speech recognition completion-detection unit 105, an echo canceller 106, a speech recognition unit 107, and a signal sending unit 108. The speech input unit 102 inputs a speech from the microphone 101. The speech recognition start-detection unit 103 detects a predetermined sign from a user to start speech recognition. After the user's sign to start speech recognition is inputted, the utterance detection unit 104 detects existence (or non-existence) of the user's utterance. The speech recognition completion-detection unit 105 detects completion of speech recognition by detecting non-existence of the utterance. The echo canceller 106 cancels a speech (sound) that is outputted from a speaker 115 of the television set 110 and inputted to the speech input unit 102 via the microphone 101. The speech recognition unit 107 recognizes a speech inputted from the speech input unit 102. The signal sending unit 108 sends a predetermined signal based on a speech recognition result.
The television set 110 includes a television control unit 111, a calculation resource-monitor unit 112, a video replay unit 113, a recording unit 114, a speaker 115, and a display unit 116. The television control unit 111 controls the television-volume and executes various television operations based on signals sent from the signal sending unit 108. The calculation resource-monitor unit 112 monitors a calculation resource of a main processor of the television set 110. The recording unit 114 records a program being broadcast. The speaker 115 outputs a sound of the program being viewed. The display unit 116 displays a video of the program being viewed. The video replay unit 113 replays a broadcast program content, a recorded program content, or a video content recorded on a recording medium. As the recording medium, for example, a DVD (Digital Versatile Disc) or a BD (Blu-ray Disc) is used.
First, the speech recognition start-detection unit 103 waits for an input of a speech recognition start-indication from a user (S1). A predetermined sound is used as the speech recognition start-indication. For example, the sound of clapping hands twice in succession may be used as the indication. In this case, a clap sound (two continuous hand claps) is detected from the sounds (speeches) inputted to the microphone 101.
As another example, a specific word uttered by the user may be used. In this case, a sign-recognition dictionary for recognizing a word used as the sign and a command-recognition dictionary for recognizing words used as television operation commands are prepared. Normally, the speech recognition unit 107 performs speech recognition by using the sign-recognition dictionary. When the word of the sign is recognized, the speech recognition unit 107 switches from the sign-recognition dictionary to the command-recognition dictionary.
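A hedged sketch of this two-dictionary scheme follows. The trigger word, the dictionary contents, and the string-matching recognize() stub are illustrative assumptions; an actual recognizer would match audio against the active dictionary.

```python
# A sketch of the two-dictionary approach described above; all names and word
# lists are hypothetical placeholders.
SIGN_DICTIONARY = {"hello tv"}                      # assumed sign word(s)
COMMAND_DICTIONARY = {"channel-change", "volume-change",
                      "input-switch", "screen mode-switch"}


class TwoStageRecognizer:
    def __init__(self):
        self.active_dictionary = SIGN_DICTIONARY    # normally use the sign dictionary

    def recognize(self, utterance: str):
        """Return the recognized word if it is in the active dictionary."""
        word = utterance.strip().lower()
        if word not in self.active_dictionary:
            return None
        if self.active_dictionary is SIGN_DICTIONARY:
            # The sign was recognized: switch to the command-recognition dictionary.
            self.active_dictionary = COMMAND_DICTIONARY
            return "sign"
        return word                                  # a television operation command


recognizer = TwoStageRecognizer()
print(recognizer.recognize("hello tv"))       # -> "sign" (switches dictionaries)
print(recognizer.recognize("volume-change"))  # -> "volume-change"
```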
As another example, a speech recognition-start button may be provided on a remote controller, and a press of this button may be used as the sign.
When the speech recognition start-detection unit 103 detects the sign of speech recognition-start (S2), the signal sending unit 108 sends a speech recognition-start signal to the television control unit 111 of the television set 110 (S3). In this case, in order to feed back the start of speech recognition to the user, this may be indicated by lighting an LED (Light Emitting Diode) or by an OSD (On-Screen Display).
The television set 110 waits for a signal from the signal sending unit 108 of the speech recognition apparatus 100 (S101). When any signal is received from the signal sending unit 108, the television control unit 111 decides whether this signal is a speech recognition start-command (S102). If this signal is the speech recognition start-command, the video replay unit 113 of the television set 110 decides whether the video being displayed is a broadcast content or a stored content (S103). The broadcast content is a video broadcast by digital terrestrial television broadcast, BS digital broadcast, CS digital broadcast, or CATV. The stored content is a program recorded by the recording unit 114 or a video recorded on a medium (DVD, BD).
If the video being viewed is a broadcast wave, the calculation resource-monitor unit 112 measures the calculation load of the CPU in the control unit 130 of the television set 110 (S104), and decides whether the calculation load is larger than a predetermined threshold (S105). This decision may be based on a ratio of the calculation load to all resources of the CPU. Furthermore, by defining a calculation quantity for each processing to be executed by the television set 110, the decision may be based on the sum of the calculation quantities of the processing presently being executed.
The threshold is determined by previously examining the calculation quantity required for the echo cancel processing, i.e., based on whether the CPU has enough performance to execute the echo cancel processing. Accordingly, if the calculation load of the CPU is smaller than the threshold, the CPU has enough performance to execute the echo cancel processing. When the calculation load is smaller than the threshold, the echo cancel processing is executed (S106), and the speech recognition unit 107 starts to input a speech signal as a target of speech recognition (S4). In this case, the television control unit 111 does not change the television-volume.
When the calculation load is larger than the threshold, the television control unit 111 reads the present value of the television-volume (S107). The operation to change the television-volume is switched depending on whether this value is larger than a predetermined value.
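The decision flow up to this point can be summarized by the following minimal sketch. The function name, the return labels, and the two numeric thresholds are assumptions for illustration only; the specific volume-change operation applied when the volume is lowered is the one the embodiment selects based on the present volume.

```python
# A hedged sketch of the decision made when speech recognition starts (S103-S109).
LOAD_THRESHOLD = 0.6     # assumed fraction of CPU resources needed for echo cancel
VOLUME_THRESHOLD = 10    # assumed volume value used in the S107 decision


def decide_action(viewing_stored_content: bool, cpu_load: float, volume: int) -> str:
    """Return which measure to take before speech recognition starts."""
    if viewing_stored_content:
        return "pause_playback"          # S109: temporarily stop the replayed video
    if cpu_load < LOAD_THRESHOLD:
        return "run_echo_canceller"      # S106: volume is left unchanged
    if volume >= VOLUME_THRESHOLD:
        return "lower_volume"            # temporarily change the television-volume
    return "keep_volume"                 # volume already below the predetermined value


print(decide_action(False, cpu_load=0.3, volume=20))  # -> "run_echo_canceller"
print(decide_action(False, cpu_load=0.8, volume=20))  # -> "lower_volume"
```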
In this way, after executing the volume change based on the television-volume, the speech recognition unit 107 executes input of a speech to be recognized (S4).
On the other hand, if a stored content is being viewed, the video replay unit 113 temporarily stops the video being replayed (S109), and the speech recognition unit 107 executes input of a speech to be recognized (S4). The stored content is, for example, a program recorded by the recording unit 114, or a video recorded on a medium such as a DVD or BD.
The utterance detection unit 104 of the speech recognition apparatus 100 detects whether the user starts to utter. In case the user erroneously utters the sign to start speech recognition, or the speech recognition start-detection unit 103 erroneously detects the sign, a time-out for automatically returning to the original status is preferably set. Furthermore, this status may be indicated to the user by a display 1101, as shown in the corresponding figure.
The speech recognition completion-detection unit 105 decides whether the speech recognition is completed (S5). For example, “a silent period continues over a predetermined time” is one condition of speech recognition completion. The speech recognition unit 107 executes speech recognition, and obtains a recognition result of the speech recognition (S6). Based on the recognition result, the signal sending unit 108 sends an operation command of the television set 110 to the television control unit 111 (S7).
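One way to realize the silence-based completion condition mentioned above is sketched below; the level threshold, the duration, and the per-frame interface are illustrative assumptions, not values given in the embodiment.

```python
# A minimal sketch of detecting completion when "a silent period continues over a
# predetermined time". All numeric values and the frame interface are assumed.
SILENCE_LEVEL = 0.05       # assumed amplitude below which a frame counts as silence
SILENCE_DURATION = 1.5     # assumed seconds of continuous silence meaning "completed"


class CompletionDetector:
    def __init__(self):
        self._silence_started = None

    def update(self, frame_level: float, now: float) -> bool:
        """Feed one audio frame's level at time `now`; return True on completion."""
        if frame_level >= SILENCE_LEVEL:
            self._silence_started = None          # speech present: reset the timer
            return False
        if self._silence_started is None:
            self._silence_started = now
        return (now - self._silence_started) >= SILENCE_DURATION


# Usage example: speech until t=1.0, then silence; completion is detected at t=3.0.
detector = CompletionDetector()
for t, level in [(0.5, 0.3), (1.0, 0.2), (1.5, 0.01), (2.5, 0.01), (3.0, 0.01)]:
    if detector.update(level, t):
        print("speech recognition completed at", t)   # -> 3.0
```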
In this case, an operation command corresponding to a specific speech command (the recognition result), such as "channel-change", "volume-change", "input-switch", or "screen mode-switch", is sent. Examples of the correspondence between operation commands and speech commands are shown in a table 1300 of the corresponding figure.
When the television set 110 receives a command other than the speech recognition start-command (No at S102), the television control unit 111 decides whether the received command is a cancel command (S110). If it is the cancel command (Yes at S110), the television control unit 111 resets the television-volume to the value prior to the speech recognition start without executing a television operation (S112). If it is not the cancel command (No at S110), the television control unit 111 executes the television operation corresponding to the received operation command (S111), and then resets the television-volume to the value prior to the speech recognition start (S112).
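The handling of steps S110 to S112 can be summarized as in the following sketch; the command name "cancel" and the callback parameters are hypothetical stand-ins for the television control unit's internal processing.

```python
# A hedged sketch of the command handling in S110-S112.
def handle_command(command: str, execute_operation, restore_volume):
    if command == "cancel":
        # S110/S112: no television operation, just restore the volume.
        restore_volume()
        return
    # S111: perform the operation corresponding to the recognized speech command,
    # then restore the volume to its value before recognition started (S112).
    execute_operation(command)
    restore_volume()


# Usage example with simple stand-in callbacks.
handle_command("volume-change",
               execute_operation=lambda c: print("execute:", c),
               restore_volume=lambda: print("restore volume"))
```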
As mentioned above, in the television apparatus of the first embodiment, the television-volume during speech recognition processing is temporarily controlled based on the television-volume prior to the start of speech recognition. As a result, the speech recognition is accurately executed with little calculation load, and disturbance of the user's viewing by the speech operation is avoided.
Furthermore, when a stored content is being replayed, the replay is temporarily stopped during the speech recognition. As a result, the user is prevented from viewing the stored content under an incomplete condition while operating the television by speech.
The television apparatus of the second embodiment is explained by referring to the figures. The same reference numerals are assigned to the same processing/components as in the first embodiment and explanation thereof is omitted; only parts different from the first embodiment are explained.
After receiving the speech recognition start-command, the television set 110 changes its processing operation based on the medium presently being viewed (S103). If the presently viewed medium is a broadcast, the television control unit 111 freezes the screen and mutes the sound (S201). Afterwards, the recording unit 114 immediately begins to record the program (S202).
After the speech recognition is completed, the television control unit 111 receives an operation command based on the speech recognition result and executes the television operation corresponding to the operation command (S111). The television control unit 111 then decides whether the following two conditions are satisfied (S203).
(1) The medium being viewed before the speech recognition started is a broadcast.
(2) The television operation executed by the television control unit 111 is not a channel-change of the broadcast wave.
If both conditions (1) and (2) are satisfied, the television control unit 111 starts chasing playback from the frame at which the screen was frozen (S203). Typically, this is the case where an operation other than channel-change (for example, volume-change) is executed.
On the other hand, if at least one of the two conditions (1) and (2) is not satisfied, the television control unit 111 resets the volume to the value prior to the speech recognition start without chasing playback (S112). If the viewing channel is changed after the recording has been started (S202), the recording may be stopped. If the recording is stopped, the recorded data may be erased.
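The decision of step S203 reduces to the following small predicate; the function and parameter names are hypothetical, and only the two conditions themselves come from the description above.

```python
# A minimal sketch of the chasing-playback decision (S203) in the second embodiment.
def should_start_chasing_playback(viewing_was_broadcast: bool,
                                  executed_operation: str) -> bool:
    """Chasing playback starts only if the user was watching a broadcast and the
    executed operation was not a channel change."""
    return viewing_was_broadcast and executed_operation != "channel-change"


print(should_start_chasing_playback(True, "volume-change"))   # -> True
print(should_start_chasing_playback(True, "channel-change"))  # -> False (recording may be stopped)
```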
In the television set of the second embodiment, the speech recognition is executed under a condition in which the sound is muted. As a result, the speech recognition can be accurately executed at little calculation cost. Furthermore, the broadcast content during the speech recognition is recorded, and, after the speech recognition, the broadcast content is replayed by chasing playback. As a result, even if the user operates the television by his/her speech, the user's viewing is not disturbed.
The television apparatus of the third embodiment is explained by referring to the figures. The same reference numerals are assigned to the same processing/components as in the first and second embodiments and explanation thereof is omitted; only parts different from the first and second embodiments are explained.
In the third embodiment, in order to estimate an ambient sound at a position where the speech recognition apparatus 100 is located, the speech recognition apparatus 100 includes a television-volume estimation unit 120. The television-volume estimation unit 120 estimates a television-volume from an averaged volume of the ambient sound inputted for a past predetermined period by the speech input unit 102.
The signal sending unit 108 changes the volume-level of the television set 110 during the speech recognition based on the television-volume estimated by the television-volume estimation unit 120. Briefly, based on the estimated volume-level, the signal sending unit 108 calculates a volume-level to be used during the speech recognition. As the correspondence relationship between the estimated volume level and the volume level during the speech recognition, for example, the setting examples shown in the corresponding figure may be used.
The signal sending unit 108 sends an operation command to set the calculated volume level to the television set 110. The signal sending unit 108 may repeatedly send an operation command to lower the volume-level, or may send an operation command (a direct code) to directly indicate a value of the volume level. Furthermore, the signal sending unit 108 may send a special operation command to set the volume level to a half value (½ mute). Another operation command may be sent only if the volume level used during the speech recognition is lower than a specific level.
When the speech recognition start-detection unit 103 detects the speech recognition start, the television-volume estimation unit 120 estimates a television-volume from an averaged volume of the ambient sound inputted for a past predetermined period by the speech input unit 102 (S10). Based on the television-volume, the signal sending unit 108 sends an operation command to change the television volume during the speech recognition (S11). After that, the speech recognition unit 107 recognizes a speech, and acquires a recognition result of the speech (S4, S5, S6). The signal sending unit 108 sends an operation command based on the recognition result (S7). After that, the signal sending unit 108 sends an operation command (such as a mute release command) to reset the volume to a value prior to the speech recognition (S12).
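A minimal sketch of this flow (S10 to S12) is given below. The averaging window, the level samples, and the mapping from the estimated level to the level used during recognition are illustrative assumptions; the actual correspondence is given by the setting examples referenced above, and the restoration after recognition (S12) is done by a separate operation command.

```python
# A hedged sketch of the third embodiment's estimation and volume-change steps.
from collections import deque


class TelevisionVolumeEstimator:
    def __init__(self, window_size: int = 100):
        self.recent_levels = deque(maxlen=window_size)   # past predetermined period

    def add_sample(self, level: float):
        self.recent_levels.append(level)

    def estimate(self) -> float:
        """Average ambient level over the stored window (S10)."""
        if not self.recent_levels:
            return 0.0
        return sum(self.recent_levels) / len(self.recent_levels)


def volume_during_recognition(estimated_level: float) -> float:
    """Assumed mapping from the estimated level to the level used during recognition (S11)."""
    if estimated_level < 10:
        return estimated_level          # already quiet enough: no change
    return estimated_level / 2          # e.g. a half-value (1/2 mute) command


estimator = TelevisionVolumeEstimator()
for level in (18, 22, 20):
    estimator.add_sample(level)
estimated = estimator.estimate()                 # -> 20.0
print(volume_during_recognition(estimated))      # -> 10.0, sent as an operation command
```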
As mentioned above, in the third embodiment, the television-volume during the speech recognition is controlled based on the television-volume estimated by the television-volume estimation unit 120. As a result, the television-volume can be controlled within a range necessary for the speech recognition.
The television apparatus of the fourth embodiment is explained by referring to the figures. The same reference numerals are assigned to the same processing/components as in the first, second, and third embodiments and explanation thereof is omitted; only parts different from the first, second, and third embodiments are explained.
As mentioned above, in the television apparatus of the fourth embodiment, the television-volume during speech recognition processing is temporarily controlled based on the television-volume prior to the start of speech recognition. As a result, the speech recognition is accurately executed with little calculation load, and disturbance of the user's viewing by the speech operation is avoided.
While certain embodiments have been described, these embodiments have been presented by way of examples only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---
P2011-044430 | Mar 2011 | JP | national |