This application claims priority to Chinese Patent Application No. 202311222742.4, filed Sep. 20, 2023, which is hereby incorporated by reference herein as if set forth in its entirety.
The present disclosure generally relates to the technical field of smart glasses, and in particular to a translation system, smart glasses for translation, and a translation method based on artificial intelligence.
With the development of computer technology, smart glasses are becoming more and more popular. However, existing smart glasses are expensive, and apart from their basic function as glasses, they usually only provide the functions of listening to music and making or answering calls. Hence, the functionality of existing smart glasses is relatively limited, and the user stickiness of the product is low.
The present disclosure provides a translation system, smart glasses for translation, and a translation method based on artificial intelligence (AI), which aim to realize high-precision translation based on smart glasses, thereby improving the practicality, interactivity, and intelligence of the smart glasses and increasing the user stickiness of the product.
An embodiment of the present disclosure provides a translation system based on artificial intelligence (AI), including: first smart glasses, a first smart mobile terminal and a cloud server, wherein the cloud server is configured with a large language model (LLM), and the LLM comprises a generative artificial intelligence large language model or a multimodal large language model;
An embodiment of the present disclosure provides smart glasses for translation based on AI, including: a speech pickup device, an output device, a processor and a memory, wherein the processor is electrically connected to the speech pickup device, the output device and the memory;
An embodiment of the present disclosure provides a translation method based on AI, applied to a smart mobile terminal, including:
In each of the embodiments of the present disclosure, by combining the smart glasses, the smart mobile terminal, and the server, intelligent translation based on the smart glasses is realized, thereby improving the practicality, interactivity, and intelligence of the smart glasses and increasing the user stickiness of the product. Furthermore, due to the scalability and self-creativity of the LLM, the accuracy of the translation can be further improved.
In order to more clearly illustrate the technical solutions in the embodiments, the drawings used in the embodiments or in the description of the prior art will be briefly introduced below. It should be understood that the drawings in the following description are only examples of the present disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
In order to make the objects, features and advantages of the present disclosure more obvious and easier to understand, the technical solutions in this embodiment will be clearly and completely described below with reference to the drawings. Apparently, the described embodiments are part of the embodiments of the present disclosure, not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative efforts are within the scope of the present disclosure.
In the following descriptions, the terms “including”, “comprising”, “having” and their cognates that are used in the embodiments of the present disclosure are only intended to represent specific features, numbers, steps, operations, elements, components, or combinations of the foregoing items, and should not be understood as excluding the possibilities of the existence of one or more other features, numbers, steps, operations, elements, components or combinations of the foregoing items or adding one or more features, numbers, steps, operations, elements, components or combinations of the foregoing items.
In addition, in the present disclosure, the terms “first”, “second”, “third”, and the like are only used for distinguishing, and cannot be understood as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meanings as commonly understood by those skilled in the art to which the embodiments of the present disclosure belong. The terms (e.g., terms defined in commonly used dictionaries) will be interpreted as having the same meaning as their contextual meaning in the relevant technology, and will not be interpreted as having idealized or overly formal meanings, unless clearly defined in the embodiments of the present disclosure.
Referring to
In each embodiment of the present disclosure, a corresponding application (APP) is installed on each pair of smart glasses or each smart mobile terminal. The smart glasses and the smart mobile terminal establish a data connection with each other through Bluetooth, and use the APP for data interaction.
The smart glasses may be open-ear smart glasses; for the specific structure of the smart glasses, refer to the relevant descriptions in the following embodiments shown in
The smart mobile terminal may include, but is not limited to: a cellular phone, a smart phone, other wireless communication devices, a personal digital assistant (PDA), an audio player, other media players, a music recorder, a video recorder, a camera, other media recorders, a smart radio, a laptop computer, a portable multimedia player (PMP), a Moving Picture Experts Group (MPEG-1 or MPEG-2) audio layer 3 (MP3) player, a digital camera, and a smart wearable device (such as a smart watch, a smart bracelet, etc.). An Android, iOS, or other operating system is further installed on the smart mobile terminal.
The cloud server 130 may be a single server or a distributed server cluster composed of multiple servers, and is configured with at least one Large Language Model (LLM). The LLM may include at least one Generative Artificial Intelligence Large Language Model (GAILLM), or at least one Multimodal Large Language Model (MLLM), or other large language models with similar functions.
The GAILLM may be, for example but not limited to: ChatGPT of OpenAI, Bard of Google, and other models with similar functions. The LLM integrates multiple functions, for example, obtaining the task execution command(s) corresponding to the voice command(s) spoken by the user in natural language (as shown in
Specifically, the first smart glasses 110 are used to: pick up a first speech in a first language (i.e., a source language) from a preset direction, and send the first speech to the first smart mobile terminal 120 via Bluetooth. Optionally, the preset direction may point to the front of a user of the first smart glasses 110, that is, a listening mode; in the listening mode, the language spoken by the chat partner of the user of the first smart glasses 110 is the source language. Alternatively, the preset direction may point to the mouth of the user of the first smart glasses 110, that is, a talking mode; in the talking mode, the language spoken by the user of the first smart glasses 110 is the source language. In response to a voice command of the user, an action of the user pressing the control button on the first smart glasses 110 for switching the translation mode, or action(s) of the user in a User Interface (UI) of the first smart mobile terminal 120, the first smart glasses 110 switch the current translation mode to the corresponding mode. Furthermore, the source language and the target language are determined according to the current translation mode.
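As a non-limiting illustration, the following minimal Python sketch shows one way the current translation mode could determine the source and target languages as described above. The enum and function names, and the idea of passing the wearer's and partner's languages explicitly, are assumptions made for the example and are not part of the disclosure:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TranslationMode(Enum):
    LISTENING = auto()  # preset direction points to the front: the chat partner speaks the source language
    TALKING = auto()    # preset direction points to the wearer's mouth: the wearer speaks the source language


@dataclass
class LanguagePair:
    source: str  # e.g. "en"
    target: str  # e.g. "zh"


def resolve_languages(mode: TranslationMode, wearer_lang: str, partner_lang: str) -> LanguagePair:
    """Determine the source and target languages from the current translation mode."""
    if mode is TranslationMode.LISTENING:
        return LanguagePair(source=partner_lang, target=wearer_lang)
    return LanguagePair(source=wearer_lang, target=partner_lang)
```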
The first smart mobile terminal 120 is used to: convert the first speech into a first text using a speech-to-text engine, and send the first text to the cloud server 130. The speech-to-text engine is configured in the first smart mobile terminal 120 or in the cloud (such as a speech-to-text server, a model server, a translation server, or another cloud server).
The cloud server 130 is used to: translate, through the LLM, the first text to obtain a second text in a second language (i.e., the target language) according to a translation prompt, and send the second text to the first smart mobile terminal 120.
The first smart mobile terminal 120 is further used to: convert the second text into a second speech using a text-to-speech engine, and send the second speech to the first smart glasses 110 via Bluetooth. The text-to-speech engine is configured in the first smart mobile terminal 120 or in the cloud (such as a text-to-speech server, the model server, the translation server, or another cloud server).
The first smart glasses 110 are further used to play the second speech.
The translation prompt may include instruction information for instructing the LLM to perform a translation task. Furthermore, the translation prompt further includes: identification information of the first language and/or the second language. The translation prompt may be generated by a prompt generator. The prompt generator may be a software module or a micro-controller configured with the software module, and is used to generate corresponding prompt information according to the semantics parsed from each user speech. Alternatively, the prompt generator may generate the translation prompt according to other preset rules, for example, upon receiving a translation instruction, receiving a text sent by a specific object, receiving data containing identification information of the language, etc.
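As a non-limiting illustration, a prompt generator along the lines described above could be sketched in Python as follows; the exact prompt wording and the function signature are assumptions for the example:

```python
def generate_translation_prompt(source_lang: str | None, target_lang: str) -> str:
    """Build instruction text telling the LLM to perform a translation task.

    The wording is illustrative only; any prompt that identifies the
    translation task and the language(s) would serve the same purpose.
    """
    if source_lang:
        return (f"Translate the following text from {source_lang} to "
                f"{target_lang}. Output only the translation.")
    # Source language unknown: ask the model to detect it first.
    return (f"Detect the language of the following text and translate it to "
            f"{target_lang}. Output only the translation.")
```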
The prompt generator may be configured in the first smart glasses 110, the first smart mobile terminal 120, or any server in the cloud.
Optionally, in one embodiment of the present disclosure, the prompt generator obtains identification information of the first language and/or the second language from the first smart glasses 110 or the first smart mobile terminal 120.
Optionally, in other embodiments of the present disclosure, the first smart mobile terminal 120 is further used to generate the translation prompt, and send the translation prompt to the cloud server 130.
Optionally, in other embodiments of the present disclosure, a mobile application (APP) is installed on the first smart mobile terminal 120.
The first smart mobile terminal 120 is further used to: in response to a preset action of a user, send, through the mobile APP, a translation instruction to the first smart glasses 110. The preset action may be, for example but not limited to, any one of the following actions: the user pressing a virtual button for starting translation preset on the UI of the mobile APP, the user performing one or more preset sliding gestures on the screen of the first smart glasses 110 or the first smart mobile terminal 120, the user pressing a translation control button provided on the temple(s) of the first smart glasses 110, and the user speaking a preset translation voice command, etc.
The first smart glasses 110 are further used to: enter a translation mode in response to the translation instruction, and pick up the first speech from the preset direction in the translation mode.
Optionally, in other embodiments of the present disclosure, as shown in
The first smart mobile terminal 120 is further used to, in response to a user pressing a button on a screen of the first smart mobile terminal 120, send a speech picking up instruction to the first smart glasses 110 through the mobile APP. The button is generated by the mobile APP.
The first smart glasses 110 are further used to pick up the first speech from the preset direction in response to the speech picking up instruction.
The first smart mobile terminal 120 is further used to, in response to the user releasing the button, send the first text to the translation server 131 through the mobile APP.
The translation server 131 is used to: generate the translation prompt, and send the first text and the translation prompt to the model server 132.
The model server 132 is used to: translate, through the LLM, the first text to obtain the second text according to the translation prompt, and send the second text to the translation server 131.
The translation server 131 is further used to send the second text to the first smart mobile terminal 120.
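One possible shape of the translation-server side of this flow is sketched below in Python. The endpoint URL, the JSON field names, and the use of the `requests` library are all assumptions made for the example, not details disclosed above:

```python
import requests

MODEL_SERVER_URL = "https://model-server.example/translate"  # hypothetical endpoint


def handle_translation_request(first_text: str, source_lang: str, target_lang: str) -> str:
    """Translation-server side: generate the translation prompt, forward it
    together with the first text to the model server, and return the second
    text for the first smart mobile terminal."""
    prompt = (f"Translate the following text from {source_lang} to "
              f"{target_lang}. Output only the translation.")
    response = requests.post(
        MODEL_SERVER_URL,
        json={"prompt": prompt, "text": first_text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["translated_text"]  # the second text
```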
Optionally, in other embodiments of the present disclosure, as shown in
The first smart glasses 110 are further used to, in response to a user pressing a virtual button on a temple of the first smart glasses 110, pick up the first speech from the preset direction.
The first smart glasses 110 are further used to, in response to the user releasing the virtual button, send a notification message to the mobile APP via Bluetooth.
The first smart mobile terminal 120 is further used to send, through the mobile APP, the first text to the translation server 131 according to the notification message.
The translation server 131 is used to: generate the translation prompt, and send the first text and the translation prompt to the model server 132.
The model server 132 is used to: translate, through the LLM, the first text to obtain the second text according to the translation prompt, and send the second text to the translation server 131.
The translation server 131 is further used to send the second text to the first smart mobile terminal 120.
Optionally, in other embodiments of the present disclosure, a mobile APP is installed on the first smart mobile terminal 120, and the first smart mobile terminal 120 is further used to: generate, through the mobile APP, translation transcript(s) according to the second text, or the first text and the second text, and display the translation transcript(s) on a screen of the first smart mobile terminal 120; and save, through the mobile APP, the translation transcript(s) on the first smart mobile terminal 120 or a storage server.
Optionally, in other embodiments of the present disclosure, the first smart mobile terminal 120 is further used to: in response to an action of a user performed on a sharing button on the UI of the mobile APP being detected, share, through the mobile APP, the translation transcript(s) in a sharing manner indicated by the action.
The sharing may be, for example but not limited to: sharing the translation transcript(s) with friend(s) of the user of the first smart glasses 110 registered in instant messaging software such as WeChat; sharing the translation transcript(s) with other terminal equipment via email; or sharing the translation transcript(s) on social networking platform(s), such as Facebook, Instagram, etc. The user can view the translation transcript(s) on a sharing terminal in the form of a full translation or a comparison between the original text and the translation. Some or all of the translation transcript(s) generated during one or more translation periods specified by the user can be shared. A process of the smart glasses from entering the translation mode to exiting the translation mode can be regarded as one translation period.
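As an illustration of how such transcripts might be organized per translation period, consider the following Python sketch; the class and field names are assumptions made for the example:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TranscriptEntry:
    original: str    # the first text, in the source language
    translated: str  # the second text, in the target language
    timestamp: datetime = field(default_factory=datetime.now)


@dataclass
class TranslationPeriod:
    """All transcript entries between entering and exiting the translation mode."""
    entries: list[TranscriptEntry] = field(default_factory=list)

    def render(self, side_by_side: bool = True) -> str:
        """Render either an original/translation comparison or the full translation only."""
        if side_by_side:
            return "\n".join(f"{e.original}  |  {e.translated}" for e in self.entries)
        return "\n".join(e.translated for e in self.entries)
```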
Optionally, in other embodiments of the present disclosure, a mobile APP is installed on the first smart mobile terminal 120, and the first smart mobile terminal 120 is further used to: generate, through the mobile APP, the second speech with a preset playback speed, and send the second speech with the preset playback speed to the first smart glasses 110 for playback via Bluetooth.
Optionally, in other embodiments of the present disclosure, a mobile APP is installed on the first smart mobile terminal 120.
The first smart mobile terminal 120 is further used to set up the second language according to a setting action performed by a user through the mobile APP.
The first smart glasses 110 are further used to: perform a speech picking up action, and send the picked-up speech to the first smart mobile terminal 120 in real time via Bluetooth in a form of data stream.
The first smart mobile terminal 120 is further used to: detect whether a language of the received speech is the first language using the speech-to-text engine; in response to the language being the first language, send a starting translation instruction to the first smart glasses 110 via Bluetooth, convert the received speech as the first speech into a first text through the speech-to-text engine, and send the first text and the translation prompt to the model server 132, wherein the translation prompt includes information of the first language and the second language; and in response to a silence of a preset duration being detected, send a stopping translation instruction to the first smart glasses 110 via Bluetooth.
The first smart glasses 110 are further used to: enter a translation mode in response to the starting translation instruction; and in response to the stopping translation instruction, exit the translation mode and end the speech picking up action.
Specifically, after the smart glasses are powered on, or according to a listening instruction triggered by a user, the smart glasses perform a speech picking up action to listen to the user speech in real time, and send the real-time listened speech to the smart mobile terminal in the form of a data stream. The real-time listened speech is cached in the memory of the smart glasses. During the listening, the smart glasses use multiple threads to continuously overwrite the old speech in the memory with the newly picked-up speech of the preset duration (for example, overwriting the previously picked-up speech of 3 seconds with the currently picked-up speech of 3 seconds), and simultaneously send the cached newly picked-up speech to the smart mobile terminal. The smart mobile terminal stores the received speech in the local memory, and determines whether the language of the received speech is the first language. On the one hand, when the language is not the first language, the smart mobile terminal deletes the received speech from the local memory, and continues to perform the action of determining whether the language of the received speech is the first language based on the newly received speech, until no new speech sent by the smart glasses is received for a preset time period, or until a notification message of stopping listening sent by the smart glasses is received. On the other hand, when the language is the first language, the smart mobile terminal sends a starting translation instruction to the smart glasses, and performs the translation action on the received speech. The smart glasses enter the translation mode according to the starting translation instruction, and continue to perform the speech picking up action. Furthermore, when a silence of a preset duration is detected in the speech stored in the local memory, the smart mobile terminal sends a stopping translation instruction to the smart glasses, so that the smart glasses exit the translation mode and end the speech picking up action according to the stopping translation instruction.
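The two cooperating loops described above might look roughly as follows in Python. The `mic`, `bluetooth`, and `stt` objects are hypothetical interfaces standing in for the glasses' microphone, the Bluetooth link, and the speech-to-text engine, and the chunk duration and silence threshold are assumed values:

```python
import collections
import queue

CHUNK_SECONDS = 3      # duration of each cached speech chunk, per the 3-second example above
SILENCE_TIMEOUT = 2.0  # assumed preset silence duration, in seconds


def glasses_listen_loop(mic, bluetooth, cache_chunks: int = 1) -> None:
    """Glasses side: cache only the newest chunk(s), overwriting older ones,
    while simultaneously streaming every chunk to the mobile terminal."""
    cache = collections.deque(maxlen=cache_chunks)  # old chunks are overwritten automatically
    while mic.is_listening():
        chunk = mic.record(CHUNK_SECONDS)
        cache.append(chunk)    # overwrite the previously cached speech
        bluetooth.send(chunk)  # forward the cached newly picked-up speech


def terminal_detect_loop(incoming: queue.Queue, stt, first_lang: str, bluetooth) -> None:
    """Terminal side: discard speech that is not in the first language, start
    translation when it is, and stop on a silence of the preset duration."""
    translating = False
    while True:
        chunk = incoming.get()
        if not translating:
            if stt.detect_language(chunk) != first_lang:
                continue  # delete the received speech and keep checking
            bluetooth.send(b"START_TRANSLATION")  # glasses enter the translation mode
            translating = True
        elif stt.is_silence(chunk, SILENCE_TIMEOUT):
            bluetooth.send(b"STOP_TRANSLATION")   # glasses exit the translation mode
            translating = False
```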
Optionally, in other embodiments of the present disclosure, as shown in
The first smart glasses 110 are further used to send the first speech to the translation server 131 via a wireless network.
The translation server 131 is used to: convert the first speech into a first text using a speech-to-text engine, generate a translation prompt, and send the first text and the generated translation prompt to the model server 132.
The model server 132 is used to: translate, through the LLM, the first text to obtain the second text according to the translation prompt, and send the second text to the translation server 131.
The translation server 131 is further used to: convert the second text into a second speech using a text-to-speech engine, and send the second speech to the first smart glasses 110 via the wireless network.
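For this embodiment, in which the glasses communicate with the translation server directly and the engines all run in the cloud, the server-side pipeline could be sketched as follows; the engines are injected as callables because their concrete APIs are not specified above:

```python
from typing import Callable


def translate_speech_on_server(
    first_speech: bytes,
    target_lang: str,
    speech_to_text: Callable[[bytes], tuple[str, str]],  # audio -> (text, detected source language)
    llm_translate: Callable[[str, str], str],            # (prompt, text) -> translated text
    text_to_speech: Callable[[str, str], bytes],         # (text, language) -> audio
) -> bytes:
    """Cloud-side pipeline: STT, prompt generation, LLM translation, then TTS."""
    first_text, source_lang = speech_to_text(first_speech)
    prompt = (f"Translate the following text from {source_lang} to "
              f"{target_lang}. Output only the translation.")
    second_text = llm_translate(prompt, first_text)
    return text_to_speech(second_text, target_lang)  # the second speech, returned to the glasses
```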
Optionally, in other embodiments of the present disclosure, the preset direction points to the mouth of the user of the first smart glasses 110. As shown in
The first smart mobile terminal 120 is further used to send the second speech to the second smart mobile terminal 150.
The second smart mobile terminal 150 is used to send the second speech to the second smart glasses 140.
The second smart glasses 140 are used to play the second speech.
Optionally, in other embodiments of the present disclosure, the first smart mobile terminal 120 is further used to: generate, through a mobile APP on the first smart mobile terminal 120, translation transcript(s) in real time according to the first text and the second text, store the translation transcript(s) in the first smart mobile terminal 120, display the translation transcript, and synchronize the translation transcript(s) to the second smart mobile terminal 150.
The second smart mobile terminal 150 is further used to: store the translation transcript(s) in the second smart mobile terminal 150, and display the translation transcript(s) through a mobile APP on the second smart mobile terminal 150.
Specifically, the mobile APP on the second smart mobile terminal 150 displays the translation transcript(s) in the form of a list on the screen of the second smart mobile terminal 150. The user of the second smart mobile terminal 150 views different parts of the translation transcript(s) by: performing a preset scrolling gesture on the screen, using physical or virtual button(s) on the second smart mobile terminal 150, or using voice control instruction(s).
Optionally, in other embodiments of the present disclosure, as shown in
The first smart glasses 110 are further used to: in response to a user of the first smart glasses 110 pressing a button on a temple of the first smart glasses 110, pick up a first speech of a user of the first smart mobile terminal 120, and send the first speech to the first smart mobile terminal 120 via Bluetooth.
The first smart mobile terminal 120 is further used to convert, through a mobile APP on the first smart mobile terminal 120, the first speech into a first text using a speech-to-text engine, and send the first text to the translation server 131.
The translation server 131 is used to: generate a first translation prompt, translate, through the LLM, the first text to obtain a plurality of third texts according to the first translation prompt, and distribute each of the third texts to corresponding second smart glasses, wherein a language of each of the third texts corresponds to a language spoken by a user of each of the second smart glasses 140.
The second smart glasses 140 are used to: convert the received third text into a third speech using a text-to-speech engine, and play the third speech.
Optionally, in other embodiments of the present disclosure, the second smart glasses 140 are further used to: in response to a user of the second smart glasses 140 pressing a virtual button on a temple of the second smart glasses 140, pick up a fourth speech of a user of the second smart glasses 140, convert the fourth speech into a fourth text through a speech-to-text engine, and send the fourth text to the translation server 131.
The translation server 131 is used to: generate a second translation prompt, translate, through the LLM, the fourth text to obtain a fifth text according to the second translation prompt, and send the fifth text to the first smart glasses 110, wherein a language of the fifth text corresponds to a language spoken by a user of the first smart glasses 110.
The first smart glasses 110 are further used to: convert the fifth text into a fifth speech using a text-to-speech engine, and play the fifth speech.
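In this multi-party embodiment, distributing one utterance to listeners who speak different languages can be illustrated with the following Python sketch; the listener map and the injected `llm_translate` callable are assumptions made for the example:

```python
from typing import Callable


def fan_out_translation(
    first_text: str,
    source_lang: str,
    listeners: dict[str, str],                      # second-smart-glasses id -> wearer's language
    llm_translate: Callable[[str, str, str], str],  # (text, source, target) -> translated text
) -> dict[str, str]:
    """Translate one utterance into every listener's language and return a
    glasses-id -> third-text map; each target language is translated once
    and the result is reused for all wearers who share that language."""
    by_language: dict[str, str] = {}
    result: dict[str, str] = {}
    for glasses_id, lang in listeners.items():
        if lang not in by_language:
            by_language[lang] = llm_translate(first_text, source_lang, lang)
        result[glasses_id] = by_language[lang]
    return result
```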
Optionally, in other embodiments of the present disclosure, as shown in
Referring to
In steps 1 to 6 of
Referring to
In steps 1 to 5 of
It is understandable that the aforementioned text-to-speech engine and speech-to-text engine can also be configured in one or more cloud servers, and the mobile APP performs the actions of text-to-speech and speech-to-text by data interaction with the one or more cloud servers.
Referring to
In steps 1 to 6 of
Furthermore, the client APP further generates translation transcript(s) based on the native text and the translated text, and allows the user to view the translation transcript(s) on the screen of smart glasses 401 by performing preset action(s). Furthermore, the client APP further shares the translation transcript(s) with other terminal(s) or social networking platform(s) specified by the user according to the sharing voice command of the user.
It is understandable that the aforementioned text-to-speech engine, speech-to-text engine and prompt generator can also be configured in one or more cloud servers, and the client APP performs the actions of text-to-speech, speech-to-text and translation prompt generation by data interaction with the one or more cloud servers.
Referring to
In steps 1 to 6 of
Furthermore, when the mobile APP determines that the smart glasses 401 have a screen based on a device information record, the mobile APP sends the translated text to the smart glasses 401 for display on the screen of the smart glasses 401.
Furthermore, the mobile APP further generates translation transcript(s) based on the native text and the translated text, and allows the user to view the translation transcript(s) on the screen of the smart glasses 401 or the screen of the smart mobile terminal 402 by performing preset action(s). Furthermore, the mobile APP further shares the translation transcript(s) with other terminal(s) or social networking platform(s) specified by the user, according to the sharing voice command of the user, or the preset sharing action(s) of the user performed on the UI of the mobile APP.
It is understandable that the aforementioned text-to-speech engine, speech-to-text engine and prompt generator can also be configured in one or more cloud servers, and the mobile APP performs the actions of text-to-speech, speech-to-text and translation prompt generation by data interaction with the one or more cloud servers.
In the embodiment, by combining the smart glasses, the smart mobile terminal, and the server, intelligent translation based on the smart glasses is realized, thereby improving the practicality, interactivity, and intelligence of the smart glasses and increasing the user stickiness of the product. Moreover, due to the scalability and self-creativity of the LLM, the accuracy of translation can be further improved.
Referring to
The at least one processor 203 is electrically connected to the speech pickup device 201, the output device 202 and the at least one memory 204.
The speech pickup device 201 includes at least one microphone. Specifically, the speech pickup device 201 may be a single microphone or a microphone array composed of multiple microphones, for example, a microphone array consisting of at least one directional microphone and/or at least one omni-directional microphone. The output device 202 includes at least one speaker.
The processor 203 includes: a CPU (Central Processing Unit) and a DSP (Digital Signal Processor). The DSP is used to process the speech data (or the voice data) obtained by the speech pickup device 201. Preferably, the CPU is an MCU (Micro-Controller Unit).
The memory 204 is a non-transitory memory, and specifically may include: a RAM (Random Access Memory) and a flash memory component. One or more computer programs executable on the processor 203 are stored in the memory 204, and the one or more computer programs include instructions. The instructions are used to: pick up a first speech in a first language from a preset direction through the speech pickup device 201; translate, through an LLM configured locally or in the cloud, the first speech to obtain a second speech in a second language; and output, through the output device 202, the second speech.
Optionally, in other embodiments of the present disclosure, the preset direction points to the front of a user.
Optionally, in other embodiments of the present disclosure, a client program is installed on the smart glasses 200, and the instructions are further used to set up the first language and/or the second language according to a first preset action of the user performed on the client program.
Optionally, in other embodiments of the present disclosure, the instructions are further used to set up a playback speed of the second speech according to a second preset action of the user performed on the client program.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a Bluetooth component 211 electrically connected to the processor 203, and the instructions are further used to: receive, by the Bluetooth component 211, a translation instruction sent by a mobile application on a smart mobile terminal, enter a translation mode in response to the translation instruction, and pick up the first speech from the preset direction through the speech pickup device 201 in the translation mode.
The Bluetooth component 211 includes a Bluetooth signal transceiver and its peripheral circuits, and may be specifically provided in an inner cavity of the front frame 209 and/or the at least one temple 205. The Bluetooth component 211 can be linked with smart mobile terminals such as smartphones or smart watches for phone calls, music playback, and data communication.
Optionally, in other embodiments of the present disclosure, as shown in
The at least one temple 205 may include, for example, a left temple and a right temple.
Optionally, in other embodiments of the present disclosure, the instructions are further used to: perform, through a speech-to-text engine configured locally or in the cloud, a language detection on the first speech to determine whether the first language is a preset language; in response to the first language being the preset language, enter a translation mode, and convert, through the speech-to-text engine, the first speech into a first text; generate a translation prompt, and translate, through the LLM, the first text to obtain a second text in the second language, wherein the translation prompt includes information of the first language and the second language; convert, through a text-to-speech engine configured locally or in the cloud, the second text into the second speech; and in response to a silence of a preset duration being detected by the speech pickup device 201, exit the translation mode.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a wireless communication component 207 electrically connected to the processor 203, and the instructions are further used to: send, through the wireless communication component 207, the first speech to a translation server, so as to translate, by the translation server, the first speech into the second speech using an LLM configured in the translation server or a model server; and receive, through the wireless communication component 207, the second speech sent by the translation server.
The wireless communication component 207 includes a wireless signal transceiver and its peripheral circuits, and may be specifically provided in an inner cavity of the front frame 209 and/or at least one temple 205 of the smart glasses 200. The wireless signal transceiver may, but is not limited to, use at least one of the WiFi (Wireless Fidelity) protocol, the NFC (Near Field Communication) protocol, the ZigBee protocol, the UWB (Ultra-Wide Band) protocol, the RFID (Radio Frequency Identification) protocol, and the cellular mobile communication protocol (such as 3G/4G/5G, etc.) to perform the data transmission.
Optionally, in other embodiments of the present disclosure, the output device 202 includes front speaker(s), the preset direction points to the front of a user, and the instructions are further used to play the second speech through the front speaker(s).
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a Bluetooth component 211 electrically connected to the processor 203, the preset direction points to the mouth of a user, and the instructions are further used to send, through the Bluetooth component 211, the second speech to an external player for playback.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include an inertial measurement unit (IMU) sensor 208 electrically connected to the processor 203, and the instructions are further used to: detect whether the first speech ends using a VAD (voice activity detection) algorithm and/or sensing data of the IMU; and in response to the end of the first speech being detected, translate, through the LLM, the first speech to obtain the second speech.
The working principle of the VAD algorithm is to detect whether the user is speaking based on changes in the energy of the speech. When the user speaks, the energy of the speech increases; hence, the VAD algorithm can distinguish between voice and no voice by detecting the changes in speech energy.
Detecting whether the first speech ends using the sensing data of the IMU sensor mainly includes: using the IMU sensor to detect the vibration generated when the user speaks (for example, through acceleration data), and when the vibration stops for longer than a preset time, it means that the user has finished speaking.
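A minimal Python sketch of such an end-of-speech check is given below, combining an energy-based VAD decision with an optional IMU confirmation; the threshold and frame-count values are assumed tuning parameters, not values from the disclosure:

```python
import numpy as np

ENERGY_THRESHOLD = 1e-4  # assumed energy floor separating voice from silence
HANGOVER_FRAMES = 30     # assumed number of consecutive low-energy frames ending the speech


def speech_ended(frames: list[np.ndarray], imu_vibrating: bool | None = None) -> bool:
    """Return True when the first speech is judged to have ended.

    Each frame is a short window of audio samples. Speech is considered ended
    when the last HANGOVER_FRAMES frames all fall below the energy threshold
    and, if IMU data is available, the speaking vibration has also stopped.
    """
    if len(frames) < HANGOVER_FRAMES:
        return False
    recent = frames[-HANGOVER_FRAMES:]
    energies = [float(np.mean(f.astype(np.float64) ** 2)) for f in recent]
    vad_ended = all(e < ENERGY_THRESHOLD for e in energies)
    if imu_vibrating is None:
        return vad_ended                     # VAD only
    return vad_ended and not imu_vibrating   # VAD confirmed by the IMU
```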
Optionally, in addition to the IMU sensor 208, the smart glasses 200 may further include other data sensing components electrically connected to the processor 203, such as at least one of: a positioning component, a touch sensor, a temperature sensor, a proximity sensor, a humidity sensor, an electronic compass, a timer, a camera, and a pedometer. The positioning component may be, but is not limited to, a positioning component based on GPS (Global Positioning System) or the BeiDou satellite system.
Furthermore, the smart glasses 200 may further include a frame 209 and a battery 210. The frame 209 may be, for example, a front frame with a lens (e.g., a sunglass lens, a clear lens, or a corrective lens). Preferably, the at least one temple 205 is detachably connected to the front frame 209.
Furthermore, at least one physical control button is provided on the at least one temple 205 and the frame 209, such as a power button, volume control button(s), translation mode switching button(s), etc.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 may further include a display screen electrically connected to the processor 203, and the display screen is used to display the aforementioned second text or translation transcript(s).
The battery 210 is used for providing power to the aforementioned electronic components on the smart glasses 200, such as the speech pickup device 201, the output device 202, the at least one processor 203, the at least one memory 204, the virtual button 206, the wireless communication component 207, the IMU sensor 208, the Bluetooth component 211, etc.
The various electronic components of the aforementioned smart glasses are connected through a bus.
It should be noted that the relationship between the components of the aforementioned smart glasses may be a substitution relationship or a superposition relationship. That is, all of the aforementioned components in the embodiment may be installed on the smart glasses, or only some of the aforementioned components may be selectively installed according to requirements. When the relationship is a substitution relationship, the smart glasses are further provided with at least one peripheral connection interface, for example, a PS/2 interface, a serial interface, a parallel interface, an IEEE 1394 interface, or a USB (Universal Serial Bus) interface. The function of the replaced component is then realized through a peripheral device connected to the connection interface, such as an external speaker, an external sensor, etc.
For unspecified details about the smart glasses in this embodiment, please refer to the relevant descriptions in the embodiments of the translation system shown in
In the embodiment, by using the LLM(s) through the smart glasses, intelligent translation based on the smart glasses is realized, thereby improving the practicality, interactivity, and intelligence of the smart glasses and increasing the user stickiness of the product. Moreover, due to the scalability and self-creativity of the LLM, the accuracy of translation can be further improved.
Referring to
S501, receiving a first speech in a first language sent by a smart wearable device, converting the first speech into a first text using a speech-to-text engine configured locally or in the cloud, and obtaining a translation prompt;
S502, translating, through a large language model (LLM) configured locally or in the cloud, the first text to obtain a second text in a second language according to the translation prompt, wherein the LLM includes a generative artificial intelligence large language model (GAILLM) or a multimodal large language model (MLLM);
S503, converting the second text into a second speech using a text-to-speech engine configured locally or in the cloud, and sending the second speech to the smart wearable device for playback.
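An end-to-end sketch of steps S501 to S503 on the smart mobile terminal side is shown below in Python. Because the disclosure does not fix the engines' APIs, they are injected as callables, and the prompt wording is an assumption:

```python
from typing import Callable


def translate_for_wearable(
    first_speech: bytes,
    second_lang: str,
    speech_to_text: Callable[[bytes], tuple[str, str]],  # audio -> (first text, first language)
    llm_translate: Callable[[str, str], str],            # (prompt, text) -> second text
    text_to_speech: Callable[[str, str], bytes],         # (text, language) -> second speech
    send_to_wearable: Callable[[bytes], None],           # playback on the smart wearable device
) -> None:
    """Terminal-side sketch of steps S501 to S503."""
    # S501: receive the first speech, convert it into the first text, obtain the prompt.
    first_text, first_lang = speech_to_text(first_speech)
    prompt = (f"Translate the following text from {first_lang} to "
              f"{second_lang}. Output only the translation.")
    # S502: translate the first text through the LLM according to the prompt.
    second_text = llm_translate(prompt, first_text)
    # S503: convert the second text into the second speech and send it for playback.
    second_speech = text_to_speech(second_text, second_lang)
    send_to_wearable(second_speech)
```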
The smart wearable device may include, but is not limited to: smart safety helmets, smart earphones, smart earrings, smart watches, and the smart glasses shown in
Optionally, the first smart mobile terminal 120 generates a translation prompt through a built-in prompt generator; or, the first smart mobile terminal 120 generates, through a prompt generator configured in a cloud server, the translation prompt.
Optionally, in other embodiments of the present disclosure, a mobile application is installed on the smart mobile terminal, and the method further includes:
Optionally, in other embodiments of the present disclosure, a mobile application is installed on the smart mobile terminal, and before receiving the first speech in the first language sent by the smart wearable device, the method further includes:
The LLM is configured in a model server, and the step of translating, through the LLM configured locally or in the cloud, the first text to obtain the second text in the second language according to the translation prompt, includes:
Optionally, in other embodiments of the present disclosure, a mobile application is installed on the smart mobile terminal, and the method further includes:
Optionally, in other embodiments of the present disclosure, the method further includes: in response to an action of a user performed on a sharing button in a user interface (UI) of the mobile application being detected, sharing, through the mobile application, the translation transcript according to a sharing manner indicated by the action.
Optionally, in other embodiments of the present disclosure, a mobile application is installed on the smart mobile terminal, and before sending the second speech to the smart wearable device for playback, the method further includes:
Optionally, in other embodiments of the present disclosure, a mobile application is installed on the smart mobile terminal, and the LLM is configured in the model server. The smart wearable device performs a speech picking up action, and sends the picked-up speech to the smart mobile terminal in real time via Bluetooth in the form of a data stream, and the method further includes:
The method further includes: in response to a silence of a preset duration being detected, sending, via Bluetooth, a stopping translation instruction to the smart wearable device to instruct the smart wearable device to exit the translation mode and end the speech picking up action.
Optionally, in other embodiments of the present disclosure, the first speech is a speech from a user of the smart wearable device, and the method further includes:
Optionally, in other embodiments of the present disclosure, a mobile application is installed on the smart mobile terminal, and the method further includes:
In the aforementioned translation method provided by the present disclosure, the smart wearable device has a plurality of translation modes, which specifically may include: a listening mode, a talking mode, a paired mode, and a multi-point mode. The multi-point mode can be applied to application scenarios such as tour guides, conferences, and the like.
As an application example, taking the smart glasses and the mobile phone as examples, in scenario A of the listening mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario B of the listening mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario C of the listening mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario D of the listening mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario E of the talking mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario F of the talking mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario G of the talking mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario H of the talking mode, the following actions can further be performed:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario I of the talking mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario J of the paired mode:
As an application example, taking the smart glasses and the mobile phone as examples, in scenario K of the multi-point mode:
In order to further illustrate the aforementioned four translation modes, the following will be described with reference to
As shown in
When the user B presses and holds a virtual button on the right temple of the smart glasses worn by user B, the user A starts speaking, and the smart glasses pick up a speech of user A through the microphone and send the picked-up speech in language A of user A to the mobile APP. When the user B releases the virtual button, the mobile APP uses the LLM to translate the received speech into a speech in language B, and sends the speech in language B to the smart glasses, so that the smart glasses play the speech in language B for the user B through the built-in speaker.
At the same time, the mobile APP displays the original text in language A corresponding to the speech of user A and the translated text in language B on the screen of the mobile phone of user B.
As shown in
When the user B presses and holds a virtual button on the right temple of the smart glasses worn by user B, the user B starts speaking, and the smart glasses pick up a speech of user B through the microphone and send the picked-up speech in language B of user B to the mobile APP. When the user B releases the virtual button, the mobile APP uses the LLM to translate the received speech into a speech in language A, and sends the speech in language A to the smart glasses, so that the smart glasses play the speech in language A for the user A through the front speaker.
At the same time, the mobile APP displays the original text in language B corresponding to the speech of user B and the translated text in language A on the screen of the mobile phone of user B.
Furthermore, the listening mode shown in
On the one hand, when the user B presses and holds the virtual button on the right temple of the smart glasses worn by user B, the smart glasses enter the listening mode, and the smart glasses pick up the speech of user A through the microphone and send the picked-up speech in language A of user A to the mobile APP.
When the user B releases the virtual button, the smart glasses translate, through the mobile APP, the speech in language A into the speech in language B using the LLM, and play the speech in language B for the user B through the built-in speaker. At the same time, the mobile APP displays the original text in language A corresponding to the speech in language A and the translated text in language B on the screen of the mobile phone of user B.
On the other hand, when the smart glasses detect, through the VAD algorithm and/or the IMU sensor on the smart glasses, that the user B starts speaking, the smart glasses switch the current translation mode from the listening mode to the talking mode, pick up the speech of the user B through the microphone, and send the picked-up speech in language B of the user B to the mobile APP.
When the smart glasses detect, through the VAD algorithm and/or the IMU sensor on the smart glasses, that the user B stops speaking, the smart glasses translate, through the mobile APP, the speech in language B into the speech in language A using the LLM, and play the speech in language A for the user A through the front speaker of the smart glasses, or the external speaker, or the mobile phone. At the same time, the mobile APP displays the original text in language B corresponding to the speech in language B and the translated text in language A on the screen of the mobile phone of user B.
As shown in
Through the mobile APP on the mobile phone A, the user A can set up the language A spoken by the user A and the language B spoken by the user B, and switch the current translation mode to the paired mode. In the paired mode, in the mobile phone A, the language A spoken by the user A is the first language (i.e., the source language), and the language B spoken by the user B is the second language (i.e., the target language).
Through the mobile APP on the mobile phone B, the user B can set up the language B spoken by the user B and the language A spoken by the user A, and switch the current translation mode to the paired mode. In the paired mode, in the mobile phone B, the language B spoken by the user B is the first language (i.e., the source language), and the language A spoken by the user A is the second language (i.e., the target language).
When the user B presses and holds the virtual button on the right temple of the smart glasses B worn by user B, the user B starts speaking, and the smart glasses B pick up the speech of user B through the microphone and send the picked-up speech in language B to the mobile APP on the mobile phone B. When the user B releases the virtual button, the mobile APP on the mobile phone B translates the speech in language B into the text in language A using the LLM, and sends the text in language A to the mobile APP on the mobile phone A. The mobile APP on the mobile phone A converts the text in language A into the corresponding speech in language A, and sends the speech in language A to the smart glasses A for playback.
When the user A presses and holds the virtual button on the right temple of the smart glasses A worn by user A, the smart glasses A and the mobile phone A translate the picked-up speech in language A of user A into the text in language B and send the text in language B to the mobile APP on the mobile phone B, in a similar manner to the user B. The mobile APP on the mobile phone B converts the text in language B into the corresponding speech in language B, and sends the speech in language B to the smart glasses B for playback.
Furthermore, the mobile APP on the mobile phone A displays the original text and the translated text processed by itself on the screen of mobile phone A. The mobile APP on the mobile phone B displays the original text and the translated text processed by itself on the screen of mobile phone B.
Furthermore, the mobile phones A and B can further display the original text and the translated text processed by the other party on their respective screens through sharing.
As shown in
When the user A presses the virtual button on the smart glasses A, the smart glasses A pick up the speech in language A of the user A, and then use the translation method in the aforementioned embodiments to translate, through the data interaction with the mobile phone A, the speech in language A into the speech in language B, the speech in language C and the speech in language D, and send the speech in language B, the speech in language C and the speech in language D to the smart glasses B, C and D respectively.
When the user B presses the virtual button on the smart glasses B, the smart glasses B pick up the speech in language B of the user B, and then use the translation method in the aforementioned embodiments to translate, through the data interaction with the mobile phone B, the speech in language B into the speech in language A, and send the speech in language A to the smart glasses A.
Furthermore, the terminal responsible for translation can further generate translation transcripts based on the original text and the translated text, and distribute the translation transcripts to each smart glasses and/or each mobile phone for sharing.
For unspecified details about the translation method in this embodiment, please refer to the relevant descriptions in the embodiments shown in
In the embodiment, by using the smart mobile terminal combined with the LLM, intelligent translation based on the smart glasses is realized, thereby improving the practicality, interactivity, and intelligence of the smart glasses and increasing the user stickiness of the product. Moreover, due to the scalability and self-creativity of the LLM, the accuracy of translation can be further improved.
The present disclosure further provides a non-transitory computer-readable storage medium, which may be provided in the smart glasses, the smart wearable device, or the smart mobile terminal in the above-mentioned embodiments, and may be the memory 204 in the embodiment shown in
It should be understood that in the above-described embodiments of the present disclosure, the above-mentioned smart glasses, translation system and translation method may be implemented in other manners. For example, multiple units/modules may be combined or be integrated into another system, or some of the features may be ignored or not performed. In addition, the above-mentioned mutual coupling/connection may be direct coupling/connection or communication connection, and may also be indirect coupling/connection or communication connection through some interfaces/devices, and may also be electrical, mechanical or in other forms.
It should be noted that, for the various method embodiments described above, for the sake of simplicity, they are described as a series of action combinations. However, those skilled in the art should understand that the present disclosure is not limited by the order of the described actions, as certain steps can be performed in a different order or simultaneously. Additionally, it should be understood that the embodiments described in the present disclosure are preferred embodiments, and the actions and modules involved are not necessarily required for the present disclosure.
In the above-mentioned embodiments, the descriptions of each embodiment have different focuses. For portions not described in a particular embodiment, reference can be made to relevant descriptions in other embodiments.
The above is a description of the smart glasses, translation system and translation method provided by the present disclosure. Those skilled in the art should understand that based on the embodiments of the present disclosure, there may be changes in specific implementation methods and application scope. Therefore, the content of this specification should not be construed as limiting the present disclosure.