Mobile computing devices, such as mobile phones, may be used with auxiliary devices that offer additional functionality. For example, some earbuds offer a translation experience in which a user can touch one earbud and speak in a first language, and the spoken phrase is translated into a second language. The user may hear the translated phrase, for example, from speakers in the mobile computing device. Similarly, phrases spoken into the mobile phone in the second language may be translated into the first language and output through the earbuds. This exchange is performed serially, such that each participant must wait for the other's speech to be completed, translated, and output through speakers of the mobile computing device or auxiliary device before speaking a response. The serial nature of the exchange adds significant delays to conversational patterns and introduces awkward social cues. The interaction flow also requires both parties to pay careful attention to which person is speaking and whether a translation is being output on each device, and is therefore unintuitive.
The present disclosure provides for a more natural, conversational exchange between a user and a foreign language speaker using translation features of a system for providing translations. For example, the system may include a mobile computing device such as a mobile phone and an auxiliary device such as a pair of earbuds. Microphones of the system may always be listening for speech input and, as speech input is received, may determine whether the speech input is from the user or the foreign language speaker. The mobile device and auxiliary device can automatically determine when a spoken phrase is complete and ready for translation, send the spoken phrase for translation, and immediately begin listening again. In this regard, acknowledgements or other social cues, such as “OK,” “Yes,” “Oh no,” etc. may be captured, translated, and output throughout a conversation in response to the other user's translated speech, even if the other user is still speaking.
One aspect of the disclosure provides a method comprising: 1) listening, by a system for providing translations, for speech input spoken in a first language or a second language; 2) receiving, in response to the listening for the speech input, first speech input spoken in the first language; 3) determining that the first speech input is spoken in the first language; 4) generating and presenting, in association with the receiving of the first speech input, a translation of the first speech input in the second language; 5) detecting an endpoint in the first speech input; and 6) in response to detecting the endpoint, listening for additional speech input spoken in the first language or in the second language.
Another aspect of the disclosure provides a system comprising a memory storing instructions and one or more processors communicatively coupled to the memory and configured to execute the instructions to perform a process. For example, this process may include: 1) listening for speech input spoken in a first language or a second language; 2) receiving, in response to the listening for the speech input, first speech input spoken in the first language; 3) determining that the first speech input is spoken in the first language; 4) generating and presenting, in association with the receiving of the first speech input, a translation of the first speech input in the second language; 5) detecting an endpoint in the first speech input; 6) in response to detecting the endpoint, listening for additional speech input spoken in the first language or in the second language; 7) receiving, in response to the listening for the additional speech input, second speech input spoken in the second language; 8) determining that the second speech input is spoken in the second language; and 9) generating and presenting, in association with the receiving of the second speech input, a translation of the second speech input in the first language.
Yet another aspect of the disclosure provides a non-transitory computer-readable medium storing instructions that, when executed, cause a processor of a computing device to perform a process. For example, this process may include: 1) listening for speech input spoken in a first language or a second language; 2) receiving, in response to the listening for the speech input, first speech input spoken in the first language; 3) determining that the first speech input is spoken in the first language; 4) generating and presenting, in association with the receiving of the first speech input, a translation of the first speech input in the second language; 5) detecting an endpoint in the first speech input; and 6) in response to detecting the endpoint, listening for additional speech input spoken in the first language or in the second language.
The present disclosure provides for an improved translation experience between a first user and a second user using a mobile computing device and an auxiliary device, such as a pair of earbuds. The first user may be, for example, a foreign language speaker, and the second user may be the owner of the mobile computing device and auxiliary device. Microphones on both the mobile device and the auxiliary device simultaneously capture input from the first user and the second user, respectively, rather than alternating between the mobile device and the auxiliary device. Each device may determine when to endpoint, or send a block of speech for translation, for example based on pauses in the speech. Each device may accordingly send the received speech up to the endpoint for translation and output, such that it is provided in a natural flow of communication. Listening by the microphones automatically resumes immediately after endpointing, and therefore speech will not be lost.
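By way of illustration only, the per-device flow described above, in which each device captures continuously, detects an endpoint, hands the completed phrase off for translation, and resumes listening immediately, might be sketched as follows. The helper names (capture_frame, is_endpoint, submit_for_translation) are hypothetical placeholders rather than elements of the disclosure.

```python
# Minimal sketch of the per-device listening flow. The three callables are
# hypothetical placeholders: capture_frame() reads one frame of audio from an
# always-open microphone, is_endpoint() decides whether the phrase is
# complete, and submit_for_translation() hands the phrase to the phone or a
# translation service without blocking.
def listening_loop(capture_frame, is_endpoint, submit_for_translation):
    phrase = []
    while True:
        frame = capture_frame()                   # microphone stays open at all times
        phrase.append(frame)
        if is_endpoint(phrase):
            submit_for_translation(list(phrase))  # send the completed phrase
            phrase.clear()                        # resume listening immediately
```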
In the example shown, the auxiliary device 180 is a pair of wireless earbuds. However, it should be understood that the auxiliary device 180 may be any of a number of different types of auxiliary devices. For example, the auxiliary device 180 may be a pair of wired earbuds, a headset, a head-mounted display, a smart watch, a mobile assistant, etc.
The mobile computing device 170 may be, for example, a mobile phone, tablet, laptop, gaming system, or any other type of mobile computing device. In some examples, the mobile computing device 170 may be coupled to a network, such as a cellular network, wireless Internet network, etc. Translation capabilities may be stored on the mobile computing device 170, or accessed from a remote source by the mobile computing device 170. For example, the mobile device 170 may interface with a cloud computing environment in which the speech translations from a first language to a second language are performed and provided back to the mobile device 170.
In this example, the auxiliary device 180 is illustrated as a pair of earbuds, which may include, for example, speaker portion 187 adjacent an inner ear-engaging surface 188, and input portion 185 adjacent an outer surface. In some examples, a user may enter input by pressing the input portion 185 while speaking, or by tapping the input portion 185 prior to speaking. In other examples, manual input by a user is not required, and the user may simply begin speaking. The user's speech may be received by a microphone in the earbuds (not shown) or in the mobile device 170. The user may hear translated speech from another person through the speaker portion 187.
The auxiliary device 180 is wirelessly coupled to the mobile device 170. The wireless connections between the devices may include, for example, a short range pairing connection, such as Bluetooth. Other types of wireless connections are also possible.
In some examples, such as shown in
As mentioned above, the auxiliary device 180 can be any of various types of devices, such as earbuds, a head-mounted device, a smart watch, etc. The mobile device 170 can also take a variety of forms, such as a smart phone, tablet, laptop, game console, etc.
The one or more processors 371, 381 may be any conventional processors, such as commercially available microprocessors. Alternatively, the one or more processors may be a dedicated device such as an application specific integrated circuit (ASIC) or other hardware-based processor. Although
Memory 382 may store information that is accessible by the processors 381, including instructions 383 that may be executed by the processors 381, and data 384. The memory 382 may be of a type of memory operative to store information accessible by the processors 381, including a non-transitory computer-readable medium, or other medium that stores data that may be read with the aid of an electronic device, such as a hard-drive, memory card, read-only memory (“ROM”), random access memory (“RAM”), optical disks, as well as other write-capable and read-only memories. The subject matter disclosed herein may include different combinations of the foregoing, whereby different portions of the instructions 383 and data 384 are stored on different types of media.
Data 384 may be retrieved, stored or modified by the processors 381 in accordance with the instructions 383. For instance, although the present disclosure is not limited by a particular data structure, the data 384 may be stored in computer registers, in a relational database as a table having a plurality of different fields and records, in XML documents, or in flat files. The data 384 may also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII or Unicode. By way of further example only, the data 384 may be stored as bitmaps comprised of pixels that are stored in compressed or uncompressed form, in various image formats (e.g., JPEG), in vector-based formats (e.g., SVG), or as computer instructions for drawing graphics. Moreover, the data 384 may comprise information sufficient to identify the relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories (including other network locations), or information that is used by a function to calculate the relevant data.
The instructions 383 may be executed to facilitate translations performed by a mobile computing device. For example, the instructions 383 may provide for listening for and receiving user speech, such as through microphone 388. The microphone 388 may be beamformed, such that it is directed to receive audio coming from the direction of the user's mouth. In this regard, the auxiliary device 180 may recognize received speech as being that of the user, as opposed to a foreign language speaker who is not wearing the auxiliary device 180.
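The disclosure does not specify how the beamforming is performed; by way of illustration only, a minimal delay-and-sum sketch is shown below, assuming a small microphone array with known geometry and a fixed look direction toward the wearer's mouth.

```python
import numpy as np

def delay_and_sum(channels, mic_positions, look_dir, fs, c=343.0):
    """Steer a small microphone array toward look_dir by delaying and
    summing channels (delay-and-sum beamforming).

    channels:      (num_mics, num_samples) array of time-aligned audio
    mic_positions: (num_mics, 3) microphone coordinates in meters
    look_dir:      vector from the array toward the wearer's mouth
    fs:            sample rate in Hz; c: speed of sound in m/s
    """
    look_dir = np.asarray(look_dir, dtype=float)
    look_dir /= np.linalg.norm(look_dir)
    # A plane wave arriving from look_dir reaches microphones with a larger
    # projection onto look_dir earlier; the remaining channels lag behind.
    proj = mic_positions @ look_dir
    lags = (proj.max() - proj) / c              # lag of each channel, seconds
    shifts = np.round(lags * fs).astype(int)    # integer-sample shifts
    num_samples = channels.shape[1]
    out = np.zeros(num_samples)
    for channel, shift in zip(channels, shifts):
        # Advance each lagging channel so the wearer's speech lines up,
        # reinforcing audio from the mouth direction when summed.
        out[:num_samples - shift] += channel[shift:]
    return out / len(channels)
```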
The instructions 383 may further provide for detecting an endpoint in the received speech. For example, the endpoint may be automatically determined based on a pause in speech, key words, intonation, inflection, or any combination of these or other factors. Once the endpoint is detected, the auxiliary device 180 may buffer the received speech while immediately resuming listening. In other examples, alternatively or in addition to buffering, the auxiliary device 180 may transmit the received speech to the mobile device 170 for translation. For example, the auxiliary device 180 may transmit the speech via an RFComm or other communication link, as discussed above in connection with
While the auxiliary device 180 is executing the instructions 383, the mobile device 170 may also be executing instructions 373 stored in memory 372 along with data 374. For example, similar to the auxiliary device 180, the mobile device 170 may also include memory 372 storing data 374 and instructions 373 executable by the one or more processors 371. The memory 372 may be any of a variety of types, and the data 374 may be any of a variety of formats, similar to the memory 382 and data 384 of the auxiliary device 180. While the auxiliary device 180 is listening for and receiving speech from the user wearing the auxiliary device 180, the mobile device 170 may be listening for and receiving speech as well through microphone 378. The microphone 378 may not be beamformed, and may receive audio input from both the foreign language speaker (e.g., first user 101 of
Any of a variety of voice recognition techniques may be used. As one example, the mobile device 170 may cross reference a volume level between the auxiliary device microphone 388 and the mobile device microphone 378. If the sound received through the microphone 388 is quiet and the sound received through the microphone 378 is loud, then it may be determined that the foreign language speaker is providing speech input. Conversely, if the sound received through both microphones 388, 378 is loud, then it may be determined that the owner/user is speaking. As another example technique, a voice recognition unit may be used. The voice recognition unit may be trained to recognize a voice of the user/owner of the auxiliary device 180 and mobile device 170. Accordingly, if the voice recognition unit detects the owner's voice, it may ignore it. Similarly, the voice recognition unit may be trained to detect a language primarily spoken by the owner/user, and may filter out speech detected in that language. As yet another example technique, audio echo cancellation techniques may be used. For example, the mobile device 170 may listen to both microphones 388, 378, detect overlapping audio, and recognize that the overlapping audio belongs to the owner. The overlapping audio may be detected by identifying similar waveforms or patterns of sound input, or by detecting similar plosives or transient attacks. In some examples, any combination of the foregoing or other techniques may be used.
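By way of example only, the volume cross-referencing technique could be sketched as follows. The quiet/loud threshold values are illustrative assumptions; the disclosure only describes comparing relative loudness at the two microphones.

```python
import numpy as np

def attribute_speaker(earbud_frame, phone_frame, quiet=0.01, loud=0.05):
    """Guess who is speaking by cross-referencing RMS levels at the earbud
    microphone (worn by the owner) and the mobile device microphone.
    The quiet/loud thresholds are illustrative placeholders.
    """
    earbud_rms = np.sqrt(np.mean(np.asarray(earbud_frame, dtype=float) ** 2))
    phone_rms = np.sqrt(np.mean(np.asarray(phone_frame, dtype=float) ** 2))
    if earbud_rms >= loud and phone_rms >= loud:
        return "owner"              # loud at both microphones
    if earbud_rms < quiet and phone_rms >= loud:
        return "foreign_speaker"    # quiet at the earbud, loud at the phone
    return "unknown"                # ambiguous; fall back to other techniques
```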
When the detected speech is from the foreign language speaker, the instructions 373 may further provide for continued listening until an endpoint is detected. As mentioned above, the endpoint may be detected based on a pause, keyword, inflection, or other factor. The received speech from the foreign language speaker is buffered, and the microphone 378 resumes listening.
The mobile device 170 may perform translations of both foreign language speaker input received through the microphone 378, as well as owner input received through communication from the auxiliary device 180. Such translations may be performed on the mobile device 170 itself, or may be performed using one or more remote computing devices, such as the cloud. For example, the mobile device 170 may upload speech for translation to a remote computing network which performs the translation, and receive a response including translated speech from the remote computing network. Translations of speech from the foreign language speaker may be provided to the auxiliary device 180 for output through output 387. Translations of speech from the owner may be output through output 377 of mobile device 170. The outputs 377, 387 may each include, for example, one or more speakers adapted to provide audible output. In some examples, the outputs 377, 387 may also include one or more other types, such as displays, tactile feedback, etc.
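By way of illustration only, the routing of the two translation directions could be sketched as follows. The translate callable stands in for whichever on-device model or remote translation service is used; the function and parameter names here are hypothetical, not an actual API.

```python
def route_translation(speech, source, translate, earbud_output, phone_output,
                      owner_lang="en", other_lang="es"):
    """Translate an endpointed block of speech and send it to the device
    where the listener will hear it.

    source: "owner" (speech captured via the auxiliary device) or
            "foreign_speaker" (speech captured via the phone microphone).
    """
    if source == "foreign_speaker":
        # Foreign-language speech is translated into the owner's language
        # and output through the auxiliary device (e.g., earbud speakers).
        earbud_output(translate(speech, src=other_lang, dst=owner_lang))
    else:
        # The owner's speech is translated into the other language and
        # output through the mobile device's speakers.
        phone_output(translate(speech, src=owner_lang, dst=other_lang))
```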
It should be understood that the auxiliary device 180 and mobile device 170 may each include other components which are not shown, such as a charging input for the battery, signal processing components, etc. Such components may also be utilized in execution of the instructions 383, 373.
In addition to the operations described above and illustrated in the figures, various operations will now be described. It should be understood that the following operations do not have to be performed in the precise order described below. Rather, various steps can be handled in a different order or simultaneously, and steps may also be added or omitted.
In block 420, the mobile device determines whether the received voice input is from the mobile device owner or the foreign language speaker. For example, the mobile device may use voice recognition, language recognition, etc. As another example, the mobile device may cross reference a volume of sound received at the mobile device with a volume of sound received at the auxiliary device and relayed to the mobile device.
If in block 430 it is determined that the input is from the mobile device owner, the mobile device may ignore the input. Accordingly, the method returns to block 410 to keep listening for input from the foreign language speaker. If, however, the input is determined to be from the foreign language speaker, the method proceeds to block 440.
In block 440, the mobile device determines whether an endpoint in speech is detected. For example, if there is a pause in the speech input for a predetermined period of time, such as half a second, one second, two seconds, etc., the mobile device may determine that an endpoint in speech has been reached. Other examples of detecting endpoints may include detecting changes in intonation or inflection, or detecting keywords. Detecting the endpoint helps to ensure proper translation of complete phrases. For example, translations of each individual word are typically inaccurate. By way of example only, while in English adjectives are typically spoken before their associated nouns, in Spanish the adjectives are spoken after the noun. By endpointing the speech after a complete statement, phrases may be translated as a whole and the words rearranged as appropriate for the translated language.
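By way of example only, pause-based endpointing of the kind described in block 440 might be sketched as follows, assuming audio arrives in fixed-length frames. The frame length, pause length, and energy threshold are illustrative values, not values specified by the disclosure.

```python
import numpy as np

FRAME_MS = 20            # duration of each captured audio frame
PAUSE_MS = 1000          # a pause this long is treated as an endpoint
ENERGY_THRESHOLD = 1e-4  # RMS energy below this value counts as silence

def find_endpoint(frames):
    """Return the index of the frame at which a pause-based endpoint begins,
    or None if the talker appears to still be speaking.

    frames: sequence of 1-D numpy arrays, each holding FRAME_MS of samples.
    """
    needed_silent = PAUSE_MS // FRAME_MS
    silent_run = 0
    for i, frame in enumerate(frames):
        rms = np.sqrt(np.mean(np.asarray(frame, dtype=float) ** 2))
        silent_run = silent_run + 1 if rms < ENERGY_THRESHOLD else 0
        if silent_run >= needed_silent:
            # The pause started needed_silent frames ago; everything before
            # that point is a complete phrase ready for translation.
            return i - needed_silent + 1
    return None
```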
If no endpoint is detected, the mobile device keeps listening in block 445 and waiting for an endpoint. If an endpoint is detected, however, the mobile device buffers the received input in block 450. It should be understood that the buffering may be performed while the input is received, prior to detection of the endpoint.
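A minimal sketch of such buffering, under the assumption that audio arrives frame by frame, is shown below; the class and method names are illustrative only.

```python
class SpeechBuffer:
    """Accumulates audio frames as they arrive (block 450), so that when an
    endpoint is detected the complete phrase is already available and the
    microphone can keep listening without losing speech."""

    def __init__(self):
        self._frames = []

    def append(self, frame):
        # Called for every captured frame, before any endpoint is known.
        self._frames.append(frame)

    def flush(self):
        # Called once an endpoint is detected: return the buffered phrase
        # for translation and start a fresh buffer for the next utterance.
        phrase, self._frames = self._frames, []
        return phrase
```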
In block 460, the mobile device translates the buffered voice input up until the endpoint. The translation may be performed at the mobile device, or through remote computing devices.
In block 470, the translated input is provided to the auxiliary device for output. For example, the translated input may be provided through one of the two communication channels described in
Though not shown, the mobile device also receives speech from the auxiliary device. For example, the auxiliary device receives speech from the owner as described below in connection with
In block 510, the auxiliary device listens for and receives voice input from the owner. Where a microphone in the auxiliary device is beamformed, it may determine that any input received is from the owner. In other examples, voice recognition techniques such as those described above may be used.
In block 520, the auxiliary device determines whether an endpoint was reached. The endpoint may be automatically detected based on patterns in the speech. In other examples, the endpoint may be manually input by the owner, such as by pressing a button. If no endpoint is detected, the device continues listening in block 525 until an endpoint is detected. The received speech may be buffered at the auxiliary device during and/or after receipt. In some alternatives, rather than endpointing being performed by the auxiliary device, the auxiliary device streams the received audio continuously to the mobile device. The mobile device may run two voice recognizers simultaneously in order to detect voice and endpoint accordingly.
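By way of illustration only, running two recognizers over one continuously streamed audio feed could be sketched as follows. The recognizer objects and their process() method are hypothetical placeholders for whatever speech recognition engines are used.

```python
import queue
import threading

def run_two_recognizers(audio_frames, owner_recognizer, foreign_recognizer):
    """Fan one stream of audio out to two recognizers running in parallel,
    one tuned to the owner's language and one to the foreign language."""
    feeds = [queue.Queue(), queue.Queue()]

    def worker(recognizer, feed):
        while True:
            frame = feed.get()
            if frame is None:          # sentinel: the stream has ended
                break
            recognizer.process(frame)  # recognize speech / detect endpoints

    threads = [
        threading.Thread(target=worker, args=(rec, feed), daemon=True)
        for rec, feed in zip((owner_recognizer, foreign_recognizer), feeds)
    ]
    for thread in threads:
        thread.start()
    for frame in audio_frames:         # stream each frame to both recognizers
        for feed in feeds:
            feed.put(frame)
    for feed in feeds:
        feed.put(None)
    for thread in threads:
        thread.join()
```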
In block 530, the received speech from the owner is transmitted to the mobile device for translation and output. For example, the speech may be sent through a second communication channel, as discussed above in connection with
By keeping the microphones on both the auxiliary device and the mobile device continually open, speech from both the owner and the foreign language speaker is continuously received. In this regard, the owner and foreign language speaker may have a more natural conversation, including interjections, affirmations, acknowledgements, etc., without awkward pauses while waiting for translation. Through automatic endpointing, phrases or other blocks of speech may be detected and translated without requiring manual input from a user. Moreover, voice recognition techniques may be used to determine which user is providing the input and thus how it should be translated. Accordingly, less user manipulation of the devices is required. Rather, the users may converse naturally, and the auxiliary and mobile devices may automatically provide assistance, providing for near real-time translation.
Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.
This application is a continuation of U.S. application Ser. No. 17/493,239, filed Oct. 4, 2021, which is a continuation of U.S. application Ser. No. 16/269,207, filed Feb. 6, 2019, which claims the benefit of U.S. Provisional Application No. 62/628,421 filed Feb. 9, 2018, the disclosures of which are incorporated herein by reference in their entireties.
| Number | Date | Country |
| --- | --- | --- |
| 62628421 | Feb 2018 | US |

| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 17493239 | Oct 2021 | US |
| Child | 18766157 | | US |
| Parent | 16269207 | Feb 2019 | US |
| Child | 17493239 | | US |