Translation services are widely used throughout the world to facilitate communication across language barriers. Advancements in machine translation have increased the accuracy of translations, including the handling of punctuation, slang, idioms, and colloquialisms. On mobile devices, translation services are generally built inside an application, such as a web browser or virtual assistant, and function only within that application. These conventional translation services typically communicate with a backend server via a network connection so that the backend server can compute the translations. Accordingly, conventional translation services are generally limited to specific contexts within an application on the mobile device.
This document describes methods and systems for on-device real-time translation of media content on a mobile electronic device. The translation is managed and executed by an operating system (OS) of the electronic device rather than within a particular application (app) executing on the electronic device. The OS can translate media content, including text displayed on a display device of the electronic device or audio output by the electronic device. Because the translation is at the OS level, the translation can be implemented across a variety of (e.g., all) applications and a variety of (e.g., all) content on the electronic device to provide a consistent translation experience. The OS-level translation can be provided via a system user interface (UI) overlay that displays translated text corresponding to the media content. The system UI overlay may be applied over on-screen text to re-render the text as translated text (in a user-preferred language), which appears similar to native content in the application. Further, the system UI overlay may be usable on virtually any application on the electronic device, including first-party (1P) applications and third-party (3P) applications, without requiring special integration.
In some aspects, a method is disclosed for on-device real-time translation of media content on a mobile electronic device. The method includes identifying, at an operating-system level of the mobile electronic device, an original human language of media content that is output by an application running on the electronic device. In an example, the original human language is different than a target human language defined by a user of the mobile electronic device. Further, the method includes translating, at the operating-system level, the media content from the original human language of the media content into translated text in the target human language. The media content may be translated based on translation models stored in a memory of the mobile electronic device. In addition, the method includes generating, at the operating-system level, a system UI overlay for display via a display device of the mobile electronic device. The method also includes rendering, at the operating-system level, the system UI overlay over a portion of displayed content corresponding to the application, where the system UI overlay includes the translated text.
In other aspects, a mobile electronic device is disclosed. The mobile electronic device includes a display device, one or more processors, and memory. The memory stores translation models usable for translation of text from an original human language to a target human language. In addition, the memory stores instructions that, when executed by the one or more processors, cause the one or more processors to implement a translation-manager module to provide on-device real-time translation of media content that is output by the electronic device by performing the method disclosed above.
This summary is provided to introduce simplified concepts concerning on-device real-time translation of media content on a mobile electronic device, which is further described below in the Detailed Description and Drawings. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.
The details of one or more aspects of on-device real-time translation of media content are described in this document with reference to the following drawings. The same numbers are used throughout the drawings to reference like features and components:
This document describes methods and systems for on-device real-time translation of media content on a mobile device. The techniques described herein provide OS-level translation that can be implemented across a variety of (e.g., all) applications executed on the device, which provides a consistent user experience. These methods and systems can enable a user of the device to watch media in nearly any language, read nearly any text, and message another person in nearly any language. With a system user interface (UI) overlay, translations can be applied to video content (e.g., recorded or live) and audio content (e.g., a podcast) with a box of translated subtitles that the user can resize and move around the screen. Similarly, the user can apply the system UI overlay to on-screen text to re-render the text as translated text in another language, with the re-rendering being near-invisible and appearing as native content within an application. Providing the system UI overlay over the on-screen text may ensure that the oftentimes limited screen space of the device is efficiently utilized, and may ensure that there is minimal change to a user's experience interacting with the device. Further, the system UI overlay can be applied to a chat conversation, where incoming text can be translated and re-rendered in the user's preferred language, and outgoing text can be translated and sent in the recipient's preferred language. Because the OS-level translation can be implemented using a system UI overlay outside of a particular application, the translation can apply to first-party and third-party applications without requiring special integration. In addition, because the translation is performed on-device and not over a network, the translation functionality is privacy-friendly and does not require encryption for transmission.
Managing and executing the translation at the operating-system level of the electronic device rather than within the particular applications executing on the electronic device can mean it is not necessary for each individual application on the electronic device to have its own respective translation service built inside. This can result in the applications being simpler, smaller, and therefore taking up less storage space in the memory of the electronic device.
While features and concepts of the described methods and systems for on-device real-time translation of media content on a mobile device can be implemented in any number of different environments, aspects are described in the context of the following examples.
As described herein, these techniques for real-time translation can be implemented across different applications running on the electronic device 102, including instant-messaging applications, audio or video players, and live-stream video applications. In implementations of video playback, live-stream video rendering, or audio playback, the translated text may be rendered as captions or subtitles.
In more detail, consider
The electronic device 102 also includes one or more computer processors 202 and one or more computer-readable media 204, which includes memory media 206 and storage media 208. Applications 210 and/or the operating system 104 implemented as computer-readable instructions on the computer-readable media 204 can be executed by the computer processors 202 to provide some or all of the functionalities described herein. For example, the computer-readable media 204 can include the translation-manager module 106, which is described in
The electronic device 102 may also include a network interface 214. The electronic device 102 can use the network interface 214 for communicating data over wired, wireless, or optical networks. By way of example and not limitation, the network interface 214 may communicate data over a local-area-network (LAN), a wireless local-area-network (WLAN), a personal-area-network (PAN), a wide-area-network (WAN), an intranet, the Internet, a peer-to-peer network, a point-to-point network, or a mesh network.
Various implementations of the translation-manager module 106 can include a System-on-Chip (SoC), one or more Integrated Circuits (ICs), a processor with embedded processor instructions or configured to access processor instructions stored in memory, hardware with embedded firmware, a printed circuit board with various hardware components, or any combination thereof.
The electronic device 102 also includes one or more sensors 216, which can include any of a variety of sensors, including an audio sensor (e.g., a microphone), a touch-input sensor (e.g., a touchscreen), an image-capture device (e.g., a camera or video-camera), proximity sensors (e.g., capacitive sensors), or an ambient light sensor (e.g., photodetector).
The electronic device 102 can also include a display device, e.g., the display device 108. The display device 108 can include any suitable display device, e.g., a touchscreen, a liquid crystal display (LCD), a thin film transistor (TFT) LCD, an in-plane switching (IPS) LCD, a capacitive touchscreen display, an organic light-emitting diode (OLED) display, an active-matrix organic light-emitting diode (AMOLED) display, a super AMOLED display, and so forth. The display device 108 may be referred to as a screen, such that content may be displayed on-screen.
In
The translation-manager module 106 may also include an automatic speech recognition (ASR)-transcription module 308, optical character recognition (OCR) module 310, a language-identifier module 312, a model-manager module 314, a translation-control module 316, translation models 318, the system UI overlay 120, and rendering models 320.
The ASR-transcription module 308 is configured to transcribe the audio content 304 captured by the content-capture module 302. The language-identifier module 312 is configured to determine a language of the audio content 304 and/or the visual content 306. In some aspects, the language-identifier module 312 provides an indication (e.g., language ID) that identifies the human language of the audio content 304 to enable the ASR-transcription module 308 to transcribe the audio content 304 into visual content in the corresponding human language. The language-identifier module 312 can also provide the language ID to the translation-control module 316 to enable the translation-control module 316 to identify the original human language of the media content and initiate the translation.
The OCR module 310 is configured to convert images of text into machine-encoded text. For example, the OCR module 310 can convert the visual content 306 into a form usable by the translation-control module 316 for translation. Using OCR results output by the OCR module 310, the language-identifier module 312 can identify the language of the visual content 306 and provide the language ID to the translation-control module 316.
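For illustration only, the capture-identify-dispatch flow described above can be sketched as follows. The stopword heuristic stands in for a trained language-identification model, and all function and variable names are illustrative assumptions, not the disclosed implementation:

```python
# Illustrative sketch: captured text (from OCR or ASR transcription) is passed
# to a language identifier, whose language ID then drives translation control.
# The stopword-overlap scorer below is a toy stand-in for a trained model.

STOPWORDS = {
    "en": {"the", "and", "is", "you"},
    "pt": {"o", "e", "que", "voce"},
    "es": {"el", "y", "que", "usted"},
}

def identify_language(text: str) -> str:
    """Score each candidate language by stopword overlap and return the best."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

def route_content(audio_transcript=None, ocr_text=None):
    """Return (language_id, text) for whichever content was captured."""
    text = audio_transcript if audio_transcript is not None else ocr_text
    return identify_language(text), text
```

In a real system the language ID would be produced by trained models and forwarded to the translation-control module; this sketch only demonstrates the routing of captured content.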
The translation models 318 (e.g., a cascaded set of models) include machine learning models trained on human languages and translations between the human languages. The translation models 318 may include models trained on particular pairs of human languages (e.g., German, French, English, Spanish, Portuguese, Mandarin Chinese, Japanese, Arabic, Hindi, Armenian) to translate from one language of the pair to the other. The translation models 318 may also include models trained on semantic natural-language understanding (e.g., sentence fragments, slang, colloquialisms, and context from phrase to phrase) of a particular human language. Some human languages exhibit pronoun drop, in which the pronoun (e.g., he, she, we, I, you) can be omitted. As such, a sentence in isolation may not provide sufficient information to know whether the pronoun is “he” or “she,” for example, which may result in translation errors and deficiencies. When translating from a first language with pronoun drop (e.g., Spanish) to a second language that requires the presence of the pronoun (e.g., English), the pronoun may need to be predicted and added (or restored) to the translated text. Accordingly, some of the translation models 318 may be trained to analyze and determine the context of one or more preceding phrases to enable a pronoun to be restored in a translated phrase, making the translation a contextual translation.
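The contextual pronoun restoration described above can be sketched as follows. A real system would use trained context models; here a last-mentioned-referent heuristic, with an assumed toy referent table, stands in for that model:

```python
# Illustrative sketch of pronoun restoration for pro-drop source languages:
# when a translation begins without a subject pronoun, infer one from the
# most recently mentioned named referent in the preceding phrases.

REFERENTS = {"maria": "she", "joao": "he"}  # toy referent table (assumption)

def restore_pronoun(translated: str, preceding: list) -> str:
    """Prepend a pronoun inferred from context; return the text unchanged
    if no known referent appears in the preceding phrases."""
    for phrase in reversed(preceding):
        for name, pronoun in REFERENTS.items():
            if name in phrase.lower():
                return f"{pronoun.capitalize()} {translated}"
    return translated
```

A trained model would score candidate pronouns across a wider context window; the heuristic only demonstrates why preceding phrases are needed for a contextual translation.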
In addition, the translation models 318 may include models trained on punctuation. In some aspects, punctuation models may be trained to determine, predict, and provide punctuation corresponding to unspoken punctuation in the audio content 304, e.g., for transcription. The punctuation models also analyze the punctuation of the visual content 306 to provide appropriate punctuation in the translated text for improved accuracy of the translation.
The model-manager module 314 is configured to manage the translation models 318. For example, the model-manager module 314 can, based on user input (e.g., at device setup, at setup of translation services, or at the time of a translation request), retrieve, from one or more remote sources over a network, appropriate translation models 318 for one or more user-selected human languages. Further, the model-manager module 314 can aggregate the translation models 318 in one place for use on the electronic device 102. The model-manager module 314 can also manage updates to the translation models 318 and provide access to one or more of the translation models 318 to assist with transcription and/or translation. The model-manager module 314 can also indicate whether a requested translation model is missing (e.g., not included in the translation models 318) and therefore needs to be downloaded or otherwise retrieved from a remote source.
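The model-manager behavior described above can be sketched as follows; the class and method names are illustrative assumptions, not an actual API:

```python
# Illustrative sketch of a model manager that tracks locally installed
# translation-model language pairs and reports which requested pairs are
# missing and would need to be retrieved from a remote source.

class ModelManager:
    def __init__(self, installed_pairs):
        # installed_pairs: set of (source, target) language-code tuples
        self.installed = set(installed_pairs)

    def has_model(self, source: str, target: str) -> bool:
        """True if a model for this language pair is stored on-device."""
        return (source, target) in self.installed

    def missing_for(self, requested_pairs):
        """Return the pairs that must still be downloaded or retrieved."""
        return [p for p in requested_pairs if p not in self.installed]

mgr = ModelManager({("pt", "en"), ("es", "en")})
```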
The translation-control module 316 is configured to manage the real-time translation of the captured media content. In aspects, the translation-control module 316 communicates with the model-manager module 314 to access the translation models 318 for translation. The access is based, at least in part, on the language ID(s) provided by the language-identifier module 312. In addition to the language ID identifying the language (e.g., the original human language 114) of the captured media content, the language-identifier module 312 can also provide a target-language ID identifying a target language (e.g., user-preferred or user-selected language) for translation. In aspects, the target-language ID is obtained from system settings (e.g., the system settings 212 from
In an example, the user 116 may select one or more human languages to make available for on-device real-time translation. Based on the user selection, the model-manager module 314 may initiate a download of appropriate translation models 318 corresponding to the selected human language(s). In addition, the user 116 may select a preferred language, which may be used for automatic translation or, alternatively, as a first-suggested language when prompting the user for translation. The translation settings may be accessible in the device settings and may include a toggle control to toggle the auto-translation services on and off. Shortcuts may also be provided on the electronic device 102 to opt in to or dismiss translation, toggle translation on and off, or access preferences. These shortcuts are provided at the OS level and are not built within, and therefore limited to, a particular application (“app”) on the electronic device 102. Thus, a consistent user experience flow and implementation can be provided across applications and scenarios presented on the electronic device 102.
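The settings described above can be sketched as a simple data structure; the field names are assumptions made for illustration:

```python
# Illustrative sketch of OS-level translation settings: an auto-translate
# toggle, a user-preferred target language, and the set of languages whose
# translation models have been downloaded to the device.
from dataclasses import dataclass, field

@dataclass
class TranslationSettings:
    auto_translate: bool = False           # maps to the on/off toggle control
    preferred_language: str = "en"         # auto-translate / first-suggested target
    downloaded_languages: set = field(default_factory=set)

    def toggle_auto_translate(self) -> bool:
        """OS-level shortcut: flip auto-translation on or off."""
        self.auto_translate = not self.auto_translate
        return self.auto_translate
```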
Using the captured media content (e.g., the audio content 304 or the visual content 306), the translation models 318, the system settings 212, and input from one or more of the model-manager module 314 and the language-identifier module 312, the translation-control module 316 can translate the captured media content into translated text (e.g., the translated text 122) in the target human language 118.
The translation-manager module 106 (or the translation-control module 316) is configured to generate an overlay (e.g., system UI overlay 120) for display on the display device 108. The overlay includes the translated text 122. In aspects, the overlay may include a user-selectable control to change the translated text 122 to a different target language or revert back to the original human language 114. Further, the translation-control module 316 may access the rendering models 320 to present the translated text 122 in a substantially similar style and format as that of the originally-displayed text in the original human language 114. In an example, the rendering models 320 are used to cause the translated text to substantially match one or more visual characteristics (e.g., size, font, style, format, color) of native content of the application 210.
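The style-matching behavior described above can be sketched as follows; the rendering-model step is replaced here by a direct attribute copy, and all names are illustrative:

```python
# Illustrative sketch: the overlay adopts the visual characteristics of the
# originally displayed text so the translated text appears similar to native
# content of the application.
from dataclasses import dataclass

@dataclass
class TextStyle:
    font: str
    size_px: int
    color: str

@dataclass
class OverlaySpec:
    text: str
    style: TextStyle

def build_overlay(translated_text: str, source_style: TextStyle) -> OverlaySpec:
    """Match the original text's size, font, and color in the overlay."""
    return OverlaySpec(text=translated_text, style=source_style)
```

In the described system, rendering models would infer these characteristics from the displayed content rather than receiving them directly.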
These and other capabilities and configurations, as well as ways in which entities of
As shown in the instance 402-2, the electronic device 102 may generate one or more system UI overlays 408 (e.g., an overlay for each individual message or a single overlay having multiple (including all) of the translated messages) on top of the chat application to re-render the chat messages 404 as translated text 410 in English. In addition, the overlay 406 may indicate the original human language 114 of the chat messages and the target human language 118 of the translated text. For example, the overlay 406 shows “Portuguese→English” to indicate that the original chat messages were in Portuguese and the displayed text (e.g., the translated text 410 in the system UI overlays 408) is currently in English, which is emphasized in bold and underlined. Any suitable emphasis can be used, including highlighting, italics, color, size, font, and so on. In aspects, the overlay 406 may act as a toggle control to switch, based on user selection, back and forth between the original human language 114 and the target human language 118. In an example, if the user selects the overlay 406 or the original human language 114 (e.g., “Portuguese”) in the overlay 406, the electronic device 102 can revert the displayed text to Portuguese, as shown in instance 402-3. The displayed text in the instance 402-3 may be displayed in the original human language 114 in the system UI overlay. In another example, the system UI overlay may be removed to display the underlying chat messages 404 in the chat application in the original human language 114. The overlay 406 can also emphasize the original human language 114 (e.g., by showing “Portuguese→English”) to indicate that the displayed text (e.g., the chat messages 404) is currently in Portuguese. Using the overlay 406, the user can toggle the display back and forth (e.g., between instances 402-2 and 402-3) between the target human language 118 and the original human language 114.
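The toggle behavior of the overlay 406 can be sketched as follows; the class is an illustrative stand-in for the overlay's toggle control:

```python
# Illustrative sketch: the overlay acts as a toggle control that switches
# the displayed text between the translated and original languages on each
# user selection.

class LanguageToggle:
    def __init__(self, original: str, translated: str):
        self.original = original
        self.translated = translated
        self.showing_translation = True  # translation is shown first

    @property
    def displayed(self) -> str:
        return self.translated if self.showing_translation else self.original

    def toggle(self) -> str:
        """User selected the overlay: flip languages and return the text."""
        self.showing_translation = not self.showing_translation
        return self.displayed
```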
As shown in
The electronic device 102 can also translate a single word based on the copy-and-translate command described above and based on the translation settings being set for word-by-word translation. For example, the user may select an individual word in one of the chat messages 504. The selected word can be copied and translated to the target human language 118 either automatically in response to the user selection of the word or in response to an additional user input initiating the copy and translation. The translated word can then be presented in the overlay 506 or in a separate overlay that may be positioned proximate to the selected word. Accordingly, based on the user selection, the on-device real-time translation can be applied to a single term, multiple terms, a phrase, multiple phrases, or all text displayed on the display device 108.
If the user selects the overlay 618 with the translation 616, the electronic device 102 can replace the draft message 612 with the translation 616 prior to transmitting the outgoing message. In an example, the draft message 612 is replaced by the translation 616 in the input box 608, as shown in the instance 602-3. Then, the user can trigger a “send” button 620 to send the translation 616 as the outgoing message. In this way, the user can send outgoing messages in the native or preferred language of a recipient. In addition, the user can select a toggle command 622 to switch between the original human language 114 and the target human language 118. In some aspects, the user can select the toggle command 622 to change the target human language 118 of the outgoing message (e.g., the translated text 610 that is replacing the draft message 612) to a new target human language.
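The outgoing-message flow described above can be sketched as follows. The `translate` function is a toy stand-in for the on-device translation models, and the names are illustrative assumptions:

```python
# Illustrative sketch: a draft outgoing message is translated and, on user
# confirmation, the translation replaces the draft in the input box prior
# to sending, so the recipient receives text in their preferred language.

def translate(text: str, target: str) -> str:
    """Toy dictionary lookup standing in for the translation models."""
    toy_dict = {("thank you", "pt"): "obrigado"}
    return toy_dict.get((text.lower(), target), text)

class InputBox:
    def __init__(self):
        self.draft = ""

    def accept_translation(self, target: str) -> str:
        """User selected the translation overlay: replace the draft with
        its translation before the send button is triggered."""
        self.draft = translate(self.draft, target)
        return self.draft
```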
In the example illustrated in
The methods 1000, 1100, and 1200 are shown as a set of blocks that specify operations performed but are not necessarily limited to the order or combinations shown for performing the operations by the respective blocks. Further, any of one or more of the operations may be repeated, combined, reorganized, or linked to provide a wide array of additional and/or alternate methods. In portions of the following discussion, reference may be made to the example implementation 100 of
At 1002, an original human language of media content that is output by an application running on the electronic device is identified at an OS level of the mobile electronic device, where the original human language is different than a target human language defined by a user of the mobile electronic device. In aspects, the translation-manager module 106 of the electronic device 102 can identify the original human language 114 of visual text generated by the application 210 running on the electronic device 102. Optionally, the media content may be captured based on a user input, as described with respect to
At 1004, a target human language is identified for translation. For example, the translation-manager module 106 identifies the target human language 118 based on a user selection of a user-preferred human language. In some aspects, the user selection is received based on a prompt. In another example, the user selection was previously received as part of a user input selecting device settings.
At 1006, the media content is translated into translated text in the target human language. In an example, the translation-manager module 106 utilizes the translation models 318, stored in the memory (e.g., storage media 208) of the electronic device 102 to translate the media content into the translated text.
At 1008, a system UI overlay is generated for display via a display device of the mobile electronic device. For example, the translation-manager module 106 may generate the system UI overlay 120 for use in rendering the translated text.
At 1010, the system UI overlay is rendered over a portion of displayed content corresponding to the application, where the system UI overlay includes the translated text. In an example, the translation-manager module 106 renders the system UI overlay 120 over, or in front of, the display generated by the application 210, and the translated text is rendered within the system UI overlay 120. In some aspects, the electronic device 102 appears to visually replace visual content (e.g., incoming and outgoing text messages, captions to video) in the original human language with translated text in the target human language.
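The steps 1002 through 1010 above can be sketched end-to-end as follows. The identify, translate, and render functions are toy stand-ins for the OS-level modules, with assumed toy data:

```python
# Minimal end-to-end sketch of method 1000: identify the original language,
# translate if it differs from the user's target language, generate an
# overlay, and render it. All lookups below are illustrative toy data.

TOY_TRANSLATIONS = {("ola", "pt", "en"): "hello"}

def identify(text: str) -> str:
    """Stand-in for the language-identifier module (1002)."""
    return "pt" if text.lower() in {"ola", "obrigado"} else "en"

def translate(text: str, source: str, target: str) -> str:
    """Stand-in for translation via the on-device models (1006)."""
    return TOY_TRANSLATIONS.get((text.lower(), source, target), text)

def render(overlay: dict) -> str:
    """Stand-in for rendering the system UI overlay (1010)."""
    return f"[{overlay['source']}->{overlay['target']}] {overlay['text']}"

def method_1000(media_text: str, user_target: str) -> str:
    original = identify(media_text)                    # 1002
    if original == user_target:
        return media_text                              # nothing to translate
    translated = translate(media_text, original, user_target)        # 1006
    overlay = {"text": translated,
               "source": original, "target": user_target}            # 1008
    return render(overlay)                             # 1010
```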
As mentioned, the media content may optionally be captured based on an optional method 1100 described with respect to
At 1104, the electronic device copies text of the selected text message. This copying of the text of the selected text message may be responsive to a second user input, which may be a copy command (e.g., selection of a “copy” option or button). The electronic device 102 copies the visual content of the selected text message at the OS level.
At 1106, the electronic device uses the copied text as the media content for translation. This may be responsive to a third user input, which may be a translate command (e.g., selection of a “translate” option or button) to confirm that translation is intended for the copied text. Although 1104 and 1106 are described as actions performed based on separate user inputs (e.g., the second user input and the third user input), 1104 and 1106 may be performed automatically and sequentially in response to the first user input, which may include a single command to copy and translate. After 1106, the optional method 1100 proceeds to 1004 of
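The select-copy-translate sequence of the optional method 1100 can be sketched as follows; the class and its single-command variant are illustrative assumptions:

```python
# Illustrative sketch of method 1100: select a message (1102), copy its
# text at the OS level (1104), and hand the copy to translation (1106).
# The single-command path collapses the copy and translate inputs into one.

class ChatCapture:
    def __init__(self, messages):
        self.messages = list(messages)
        self.clipboard = None

    def select_and_copy(self, index: int) -> str:
        """Steps 1102-1104: select a message and copy its text."""
        self.clipboard = self.messages[index]
        return self.clipboard

    def copy_and_translate(self, index: int, translate_fn):
        """Single-command variant: copy and translate on one user input."""
        return translate_fn(self.select_and_copy(index))
```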
As mentioned above, the method 1000 may optionally proceed from 1002 to
At 1204, the user selection is received based on a user input associated with the prompt. For example, a user input is received that confirms the user's desire to translate the media content. In aspects, the user input may initiate the translation of the media content by causing the method 1200 to proceed to 1004 of
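The prompt-then-translate flow of method 1200 can be sketched as follows; the confirmation callback is an assumed interface for the prompt described above:

```python
# Illustrative sketch: after a non-preferred language is detected, prompt
# the user (1202) and only proceed to translation when the user's input
# confirms the request (1204).

def maybe_translate(detected, preferred, confirm_fn, translate_fn, text):
    """Translate only if languages differ and the user confirms the prompt."""
    if detected == preferred:
        return text
    if confirm_fn(f"Translate from {detected} to {preferred}?"):
        return translate_fn(text)
    return text
```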
Generally, any of the components, modules, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Some operations of the example methods may be described in the general context of executable instructions stored on computer-readable storage memory that is local and/or remote to a computer processing system, and implementations can include software applications, programs, functions, and the like. Alternatively or in addition, any of the functionality described herein can be performed, at least in part, by one or more hardware logic components, including, without limitation, Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SoCs), Complex Programmable Logic Devices (CPLDs), and the like.
Some examples are described below:
A method for on-device real-time translation of media content on a mobile electronic device, the method including identifying, at an operating-system level of the mobile electronic device, an original human language of media content that is output by an application running on the electronic device, the original human language being different than a target human language defined by a user of the mobile electronic device; translating, at the operating-system level, the media content from the original human language of the media content into translated text in the target human language, the media content translated based on translation models stored in a memory of the mobile electronic device; generating, at the operating-system level, a system UI overlay for display via a display device of the mobile electronic device; and rendering, at the operating-system level, the system UI overlay over a portion of displayed content corresponding to the application, the system UI overlay including the translated text.
The method may further comprise one or more of resizing and moving the system UI overlay on the display device based on user input.
The method may further comprise identifying the target human language for translation based on a user selection of a user-preferred human language.
The user selection may define one or more device settings of the mobile electronic device.
The method may further comprise: after identifying the original human language of the media content and prior to identifying the target human language, generating a prompt to request the user selection of the user-preferred human language; and receiving the user selection based on an additional user input associated with the prompt.
The media content may include text messages of a chat conversation conducted through an instant-messaging application, and the translating of the media content may include automatically translating the text messages of the chat conversation into the target human language.
The method may further comprise, prior to identifying the original human language: selecting, responsive to a first user input, a text message from a plurality of incoming text messages in a chat conversation conducted through an instant-messaging application; copying, responsive to a second user input, the selected text message; and using, responsive to a third user input, the selected text message as the media content for translation.
Based on the device settings being set for word-by-word translation, the method may further comprise, prior to identifying the original human language: selecting, based on a first user input, a word from a plurality of words displayed on the display device as part of the media content output by the application; copying the selected word; and using the selected word as the media content for translation.
The translating of the media content may include automatically translating one or more outgoing text messages of a chat conversation, conducted through an instant-messaging application, into a preferred human language of a recipient of the one or more outgoing text messages.
The media content may include text entered by the user via a keyboard of the mobile electronic device or via transcription by the mobile electronic device from audio spoken by the user; the target human language may correspond to a preferred human language of an intended recipient of the text entered by the user; and the translated text included in the system UI overlay may be selectable to send as an outgoing text message to the intended recipient via the application.
The rendering may include using rendering models stored in the memory to cause the translated text to substantially match one or more visual characteristics of native content of the application.
The media content may include audio content; the method may further comprise transcribing, using an automatic speech recognition transcription module, the audio content into visual text in the original human language; and the translating of the media content may include translating the visual text into the target human language for display in the system UI overlay.
The audio content may be part of video content being played back or live-streamed via the application; and the system UI overlay may be rendered to include the translated text as captions to the video content as the video content is played back or live-streamed.
The translation models may include semantic natural-language understanding.
A mobile electronic device comprising: a display device; one or more processors; and memory storing: translation models usable for translation of text from an original human language to a target human language; and instructions that, when executed by the one or more processors, cause the one or more processors to implement a translation-manager module to provide on-device real-time translation of media content that is output by the electronic device by performing the method disclosed above.
Although aspects of the on-device real-time translation of media content on a mobile electronic device have been described in language specific to features and/or methods, the subject of the appended claims is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as example implementations of the claimed on-device real-time translation of media content on a mobile electronic device or a corresponding electronic device, and other equivalent features and methods are intended to be within the scope of the appended claims. Further, various different aspects are described, and it is to be appreciated that each described aspect can be implemented independently or in connection with one or more other described aspects.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/065983 | 12/18/2020 | WO |