This application claims priority to Chinese Patent Application No. 202310875349.9, filed Jul. 14, 2023, which is hereby incorporated by reference herein as if set forth in its entirety.
The present disclosure generally relates to the technical field of smart glasses, and in particular to an artificial intelligence-based natural language command control system, a control method, and smart glasses.
With the development of computer technology, smart glasses are becoming more and more popular. However, existing smart glasses are expensive, and beyond their basic functions as glasses, they usually only support listening to music and making or answering calls. Hence, the functions of existing smart glasses are relatively limited, and their degree of intelligence is low.
The embodiments of the present disclosure provide a natural language command control system, smart glasses for natural language commands, and a natural language command control method based on generative artificial intelligence large language models (GAILLM), which aim to improve the convenience of device control through the smart glasses system, as well as the intelligence and interactivity of the smart glasses.
An embodiment of the present disclosure provides a natural language command control system based on GAILLM, including a smart glasses system and a model server, where the model server is configured with the GAILLM.
The smart glasses system is configured to obtain a first user speech, perform a semantic parsing on the first user speech, generate at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the model server.
The model server is configured to obtain at least one task execution command through the GAILLM based on the at least one first prompt message from the smart glasses system, and send the at least one task execution command to the smart glasses system.
The smart glasses system is further configured to execute at least one action corresponding to the at least one task execution command.
An embodiment of the present disclosure further provides smart glasses for natural language commands based on GAILLM, including: a front frame, a temple, a microphone, a processor and a memory.
The temple is connected to the front frame, and the processor is electrically connected to the microphone and the memory. One or more computer programs executable on the processor are stored in the memory, and the one or more computer programs include instructions to: obtain a first user speech through the microphone; perform a semantic parsing on the first user speech; obtain, through the GAILLM, at least one task execution command based on the parsed semantics; and execute at least one action corresponding to the at least one task execution command.
An embodiment of the present disclosure further provides a natural language command control method based on GAILLM, applied to a smart wearable device system, including: obtaining a first user speech, performing a semantic parsing on the first user speech, and obtaining a parsing result; obtaining at least one task execution command through the GAILLM based on the parsing result; and executing at least one action corresponding to the at least one task execution command.
In each embodiment of the present disclosure, the smart glasses system utilizes the GAILLM(s) to realize device control based on natural language voice command(s), thereby improving the convenience of device control through the smart glasses system; and due to the scalability and self-creativity of the GAILLM, the intelligence and interactivity of the smart glasses system are further improved.
In order to more clearly illustrate the technical solutions in the embodiments, the drawings used in the embodiments or in the description of the prior art will be briefly introduced below. It should be understood that the drawings in the following description are only examples of the present disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
In order to make the objects, features and advantages of the present disclosure more obvious and easier to understand, the technical solutions in the embodiments will be clearly and completely described below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the scope of the present disclosure.
In the following descriptions, the terms “including”, “comprising”, “having” and their cognates that are used in the embodiments of the present disclosure are only intended to represent specific features, numbers, steps, operations, elements, components, or combinations of the foregoing items, and should not be understood as excluding the possibilities of the existence of one or more other features, numbers, steps, operations, elements, components or combinations of the foregoing items or adding one or more features, numbers, steps, operations, elements, components or combinations of the foregoing items.
In addition, in the present disclosure, the terms “first”, “second”, “third”, and the like are only used for distinguishing, and cannot be understood as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meanings as commonly understood by those skilled in the art to which the embodiments of the present disclosure belong. Terms such as those defined in commonly used dictionaries will be interpreted as having the same meaning as their contextual meaning in the relevant technology, and will not be interpreted as having idealized or overly formal meanings, unless clearly defined in the embodiments of the present disclosure.
The model server 120 may be a single server or a distributed server cluster composed of a plurality of servers, and one or more GAILLMs are configured on the model server 120 (for ease of understanding, hereinafter collectively referred to as the GAILLM server 120).
Specifically, the smart glasses system 110 is used to obtain a first user speech, perform a semantic parsing on the first user speech, generate at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the model server 120.
The model server 120 is used to obtain at least one task execution command through the GAILLM(s) based on the at least one first prompt message sent by the smart glasses system 110, and send the at least one task execution command to the smart glasses system 110.
The smart glasses system 110 is further used to execute at least one action corresponding to the at least one task execution command.
The first user speech includes at least one user voice command. The smart glasses system 110 performs the semantic parsing on the first user speech through NLP (Natural Language Processing), and generates at least one first prompt message according to the parsed semantics. The first prompt message includes the parsed semantics. Specifically, the smart glasses system 110 uses each parsed semantic as one prompt message, or combines a plurality of associated semantics into one prompt message according to the relationship between the semantics.
In some embodiments, the GAILLM may be, for example but not limited to: OpenAI's ChatGPT, Google's Bard, and other models with similar functions. The GAILLM is trained with a large number of semantics and corresponding task execution commands as samples. Optionally, the same task execution command may correspond to a plurality of similar semantics, and the same semantics may correspond to different task execution commands. The task execution command is used to instruct the target object to execute at least one target task. One target task is associated with at least one corresponding action. The task execution command includes description information of the target task. The description information of the target task is used to indicate the content of the target task, such as what actions need to be executed.
For example, assuming that the first user speech contains semantics of “I want to call Simon” or “I want to talk to Simon immediately”, the GAILLM obtains the corresponding task execution command according to the semantics to instruct the smart glasses system 110 to execute the action of calling Simon.
Further, the task execution command may include description information of the executor of each of the target tasks, such as a name, a type, or function(s) of the executor.
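For ease of understanding, the following is a minimal sketch of a task execution command carrying such description information; the Python representation, field names, and example values are illustrative assumptions only and do not limit the format actually used by the GAILLM.

```python
# Illustrative sketch only: one possible shape of a task execution command,
# with description information of the target task, its associated action(s),
# and description information of the executor.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskExecutionCommand:
    task_description: str                              # content of the target task
    actions: List[str] = field(default_factory=list)   # action(s) associated with the task
    executor: Optional[str] = None                     # name/type/function(s) of the executor

# Hypothetical command for the "I want to call Simon" example:
cmd = TaskExecutionCommand(
    task_description="place a phone call to Simon",
    actions=["look up Simon's phone number", "dial the number"],
    executor="smart mobile terminal",
)
```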
As shown in
The prompt generator may be a software module, or a microcontroller configured with the software module. The prompt generator is used to generate the corresponding prompt message(s) (such as the first prompt message(s) and the second prompt message(s) below) according to the parsed semantics in the user speech, where the parsed semantics are obtained by parsing each user speech.
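As a sketch, the prompt generator's behavior may be approximated as follows, under the assumption that each parsed semantic is a short text span tagged with a topic and that semantics sharing a topic are associated; the data shape and grouping rule are illustrative assumptions, not part of the present disclosure.

```python
from typing import Dict, List

def generate_prompt_messages(parsed_semantics: List[Dict[str, str]]) -> List[str]:
    """Turn parsed semantics into prompt messages for the GAILLM.

    Each parsed semantic is assumed to look like {"topic": ..., "text": ...};
    associated semantics (same topic) are merged into one prompt message.
    """
    grouped: Dict[str, List[str]] = {}
    for semantic in parsed_semantics:
        grouped.setdefault(semantic["topic"], []).append(semantic["text"])
    # One prompt message per semantic, or per group of associated semantics.
    return ["; ".join(texts) for texts in grouped.values()]

# Example: two associated semantics become a single first prompt message.
print(generate_prompt_messages([
    {"topic": "call", "text": "I want to call Simon"},
    {"topic": "call", "text": "use his mobile number"},
]))   # -> ['I want to call Simon; use his mobile number']
```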
Optionally, in some embodiments of the present disclosure, the prompt generator is instead configured on the GAILLM server 120, and the smart glasses 111 send the parsed semantics to the GAILLM server 120. The GAILLM server 120 generates at least one first prompt message based on the parsed semantics using the prompt generator, and then inputs the at least one first prompt message into the GAILLM.
The smart glasses 111 may be open smart glasses, and the specific structure of the smart glasses 111 may refer to related descriptions in the following embodiments shown in
The smart mobile terminal 112 may include, but is not limited to: a cellular phone, a smartphone, other wireless communication devices, a personal digital assistant (PDA), an audio player, other media players, a music recorder, a video recorder, a camera, other media recorders, a smart radio, a laptop computer, a portable multimedia player (PMP), a Moving Picture Experts Group (MPEG-1 or MPEG-2) audio layer 3 (MP3) player, a digital camera, and a smart wearable device (such as a smart watch, a smart bracelet, etc.). An operating system such as Android or iOS is further installed on the smart mobile terminal 112.
Specifically, the smart glasses 111 are used to: obtain the first user speech through a built-in microphone, and send the first user speech to the smart mobile terminal 112 through Bluetooth.
The smart mobile terminal 112 is used to: convert the first user speech into a first text through a speech-to-text engine, perform the semantic parsing on the first text, generate the at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the model server 120. The speech-to-text engine is configured on the smart mobile terminal 112 or on a cloud server, such as a speech-to-text server, a prompt server, or the GAILLM server 120. The smart mobile terminal 112 converts the first user speech into the corresponding first text by sending the first user speech to a server configured with the speech-to-text engine. The prompt generator may be configured on the smart mobile terminal 112, and is used to generate the first prompt message based on the parsed semantics.
The GAILLM server 120 is further used to: obtain the at least one task execution command through the GAILLM based on the at least one first prompt message sent by the smart mobile terminal 112, and send the at least one task execution command to the smart mobile terminal 112.
The smart mobile terminal 112 is further used to execute the at least one action corresponding to the at least one task execution command.
Optionally, in other embodiments of the present disclosure, the smart mobile terminal 112 is further used to: generate a plurality of first prompt messages according to the parsed semantics, and send the plurality of first prompt messages, together with the appearance order in the first text of the semantics corresponding to each first prompt message, to the GAILLM server 120.
The GAILLM server 120 is further used to: obtain a plurality of task execution commands through the GAILLM based on the plurality of first prompt messages and the appearance order sent by the smart mobile terminal 112, and send the plurality of task execution commands and the execution order of each task execution command to the smart mobile terminal 112. The execution order corresponds to the appearance order.
The smart mobile terminal 112 is further used to execute the actions corresponding to each task execution command according to the execution order.
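A sketch of this order-preserving behavior on the terminal side is given below; the (order, command) pair representation is an illustrative assumption.

```python
from typing import Callable, List, Tuple

def execute_in_order(commands_with_order: List[Tuple[int, str]],
                     execute: Callable[[str], None]) -> None:
    # Sort by the execution order received from the GAILLM server, which
    # corresponds to the appearance order of the semantics in the first text.
    for _, command in sorted(commands_with_order, key=lambda pair: pair[0]):
        execute(command)

# e.g. execute_in_order([(2, "notify participants"), (1, "reserve room")], print)
# prints "reserve room" before "notify participants".
```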
Optionally, in other embodiments of the present disclosure, the smart mobile terminal 112 is further used to: generate a plurality of first prompt messages based on the parsed semantics, and send the first prompt messages to the GAILLM server 120 one by one according to the appearance order in the first text of the semantics corresponding to each first prompt message.
Specifically, after the smart mobile terminal 112 sends one of the first prompt messages, it sends the next first prompt message to the GAILLM server 120 once it receives, from the GAILLM server 120, the at least one task execution command corresponding to the sent prompt message. Alternatively, after sending one of the first prompt messages, the smart mobile terminal 112 receives the at least one task execution command corresponding to the sent prompt message from the GAILLM server 120, executes the corresponding at least one task execution command, and then sends the next first prompt message to the GAILLM server 120.
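The two one-by-one variants just described may be sketched as follows; send_prompt and execute are hypothetical helpers standing in for the network exchange with the GAILLM server 120 and the local command execution.

```python
from typing import Callable, List

def run_prompts_one_by_one(
    prompts: List[str],
    send_prompt: Callable[[str], List[str]],
    execute: Callable[[str], None],
    execute_before_next: bool = True,
) -> None:
    pending: List[str] = []
    for prompt in prompts:              # prompts are already in appearance order
        commands = send_prompt(prompt)  # the next prompt is only sent after this returns
        if execute_before_next:
            for command in commands:    # second variant: execute before the next prompt
                execute(command)
        else:
            pending.extend(commands)    # first variant: receipt alone releases the next prompt
    for command in pending:
        execute(command)
```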
Optionally, in other embodiments of the present disclosure, on the basis of the above-mentioned embodiments, the GAILLM server 120 is further used to:
The smart mobile terminal 112 is further used to: convert the text into a speech through a text-to-speech engine, and send the speech to the smart glasses 111. The text-to-speech engine is configured on the smart mobile terminal 112, or is configured on a cloud server, such as a text-to-speech server, a prompt server, or the GAILLM server 120. The smart mobile terminal 112 converts the text into the corresponding speech by sending the text to a server configured with the text-to-speech engine.
The smart glasses 111 are further used to: receive the speech through Bluetooth, play the speech through a built-in speaker of the smart glasses 111, obtain a second user speech through the microphone, and send the second user speech to the smart mobile terminal 112 through Bluetooth.
The smart mobile terminal 112 is further used to: convert the second user speech into a second text using the speech-to-text engine, perform a semantic parsing on the second text, generate the second prompt message(s) based on the parsed semantics in the second text, and send the second prompt message(s) to the GAILLM server 120.
Alternatively, when there is information that needs to be supplemented or confirmed, a response generated by the GAILLM may include: at least one task execution command and the text containing the prompt information of the information that needs to be supplemented or confirmed, so as to ask the user for more information, or to seek the user's confirmation, while executing the action(s) corresponding to the at least one task execution command; more task execution commands are then obtained according to the user's reply, thereby further improving the flexibility and intelligence of the task execution.
For example, assuming that the first user speech is to reserve the first conference room at 3 p.m., the first task execution command and a text containing the prompt message of “Who are the participants? Do you need to notify them?” are obtained through the GAILLM. The smart mobile terminal 112 converts the text into a speech and sends the speech to the smart glasses 111 for playing. At the same time, the smart mobile terminal 112 executes the action of reserving the first conference room at 3:00 p.m. through the conference management server in the cloud according to the first task execution command.
After the speech is played, the smart glasses 111 obtain the second user speech and send the second user speech to the smart mobile terminal 112, such as: “call Peter and Simon to inform them that they will have a meeting in the first conference room at 3:00 p.m.”. The smart mobile terminal 112 converts the second user speech into a second text, performs the semantic parsing on the second text, generates a corresponding prompt message according to the parsed semantics, and sends the corresponding prompt message to the GAILLM. The GAILLM generates the second task execution command according to the corresponding prompt message, so that the smart mobile terminal 112 performs the following actions according to the second task execution command: obtaining the phone numbers of Peter and Simon, generating a notification speech such as “at 3 p.m. there is a meeting in the first conference room, please attend on time”, and calling Peter and Simon using the phone numbers and playing the notification speech.
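For ease of understanding, a sketch of handling such a mixed response is given below; the response shape and helper names (execute, text_to_speech, play_on_glasses) are illustrative assumptions.

```python
from typing import Callable, List, NamedTuple, Optional

class ModelResponse(NamedTuple):
    commands: List[str]             # task execution command(s), possibly empty
    clarification: Optional[str]    # text prompting for info to supplement/confirm

def handle_response(resp: ModelResponse,
                    execute: Callable[[str], None],
                    text_to_speech: Callable[[str], bytes],
                    play_on_glasses: Callable[[bytes], None]) -> None:
    for command in resp.commands:       # execute the action(s) right away
        execute(command)
    if resp.clarification:              # meanwhile, ask the user for more information
        play_on_glasses(text_to_speech(resp.clarification))

# For the conference-room example, resp might be:
# ModelResponse(commands=["reserve the first conference room at 15:00"],
#               clarification="Who are the participants? Do you need to notify them?")
```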
Optionally, in other embodiments of the present disclosure, as shown in
The smart glasses 111 are used to obtain the first user speech through a built-in microphone of the smart glasses 111, and send the first user speech to the prompt server 113.
The prompt server 113 is used to send the first user speech to the speech-to-text server 130.
The speech-to-text server 130 is used to convert the first user speech into a first text, and send the first text to the prompt server 113.
The prompt server 113 is further used to perform the semantic parsing on the first text, generate the at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the GAILLM server 120. A prompt generator may be configured on the prompt server 113 to generate the at least one first prompt message based on the parsed semantics.
The GAILLM server 120 is further used to: determine, through the GAILLM, whether there is information that needs to be supplemented or confirmed based on the at least one first prompt message; when there is no information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command to the prompt server 113; and when there is information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command and prompt information of the information that needs to be supplemented or confirmed to the prompt server 113.
The prompt server 113 is further used to send one or more control instructions to at least one device in an Internet of Things (IoT) according to the at least one task execution command to control the at least one device to execute one or more actions specified by the at least one task execution command, generate a corresponding text according to the prompt information, and send the corresponding text to the text-to-speech server 140.
The text-to-speech server 140 is used to convert the corresponding text into a speech, and send the speech to the prompt server 113.
The prompt server 113 is further used to send the speech to the smart glasses 111.
The smart glasses 111 are further used to play the speech through a built-in speaker.
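A sketch of the prompt server's branching just described follows; the response dictionary and the helper names (query_gaillm, send_iot_command, tts, play_on_glasses) are illustrative assumptions rather than APIs defined by this disclosure.

```python
from typing import Callable, Dict, List

def process_prompt_messages(prompt_messages: List[str],
                            query_gaillm: Callable[[List[str]], Dict],
                            send_iot_command: Callable[[str], None],
                            tts: Callable[[str], bytes],
                            play_on_glasses: Callable[[bytes], None]) -> None:
    response = query_gaillm(prompt_messages)        # response from the GAILLM server 120
    for command in response.get("commands", []):    # control instruction(s) to IoT device(s)
        send_iot_command(command)
    prompt_information = response.get("prompt_information")
    if prompt_information:                          # info that needs to be supplemented/confirmed
        speech = tts(prompt_information)            # via the text-to-speech server 140
        play_on_glasses(speech)                     # played by the smart glasses 111
```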
Optionally, as shown in
The smart glasses 111 are used to obtain the first user speech by a built-in microphone of the smart glasses 111, and send the first user speech to the smart mobile terminal 112.
The smart mobile terminal 112 is used to send the first user speech to the speech-to-text server 130.
The speech-to-text server 130 is used to convert the first user speech into a first text and send the first text to the smart mobile terminal 112.
The smart mobile terminal 112 is further used to send the first text to the prompt server 113.
The prompt server 113 is used to perform the semantic parsing on the first text, generate the at least one first prompt message according to the parsed semantics, and send the at least one first prompt message to the GAILLM server 120.
The GAILLM server 120 is further used to: determine, through the GAILLM, whether there is information that needs to be supplemented or confirmed based on the at least one first prompt message; when there is no information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command to the prompt server 113; and when there is information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command and prompt information of the information that needs to be supplemented or confirmed to the prompt server 113.
The prompt server 113 is further used to send one or more control instructions to at least one device in the IoT according to the at least one task execution command to control the at least one device to execute one or more actions specified by the at least one task execution command, generate a corresponding text according to the prompt information, and send the corresponding text to the smart mobile terminal 112.
The smart mobile terminal 112 is further used to send the corresponding text to the text-to-speech server 140.
The text-to-speech server 140 is used to convert the corresponding text into a speech and send the speech to the smart mobile terminal 112.
The smart mobile terminal 112 is further used to send the speech to the smart glasses 111.
The smart glasses 111 are further used to play the speech through the built-in speaker of the smart glasses 111.
Optionally, in other embodiments of the present disclosure, as shown in
The smart glasses 111 are used to obtain the first user speech through the built-in microphone, and send the first user speech to the smart mobile terminal 112.
The smart mobile terminal 112 is used to convert the first user speech into a first text and send the first text to the prompt server 113.
The prompt server 113 is further used to perform the semantic parsing on the first text, generate the at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the GAILLM server 120.
The GAILLM server 120 is further used to: determine, through the GAILLM, whether there is information that needs to be supplemented or confirmed based on the at least one first prompt message; when there is no information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command to the prompt server 113; and when there is information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command and prompt information of the information that needs to be supplemented or confirmed to the prompt server 113.
The prompt server 113 is further used to send one or more control instructions to at least one device in the IoT according to the at least one task execution command to control the at least one device to execute one or more actions specified by the at least one task execution command, and generate a corresponding text according to the prompt information and send the corresponding text to the smart mobile terminal 112.
The smart mobile terminal 112 is further used to convert the corresponding text into a speech, and send the speech to the smart glasses 111.
The smart glasses 111 are further used to play the speech through the built-in speaker.
Optionally, in other embodiments of the present disclosure, the smart glasses system 110 includes smart glasses 111 and a smart mobile terminal 112, and the control system 100 further includes a prompt server. The prompt server is configured with a speech-to-text engine, and the smart mobile terminal 112 is configured with a text-to-speech engine.
The smart glasses 111 are used to obtain the first user speech through a built-in microphone, and send the first user speech to the prompt server through a wireless network.
The prompt server is used to convert the first user speech into a first text through the speech-to-text engine, perform the semantic parsing on the first text, generate the at least one first prompt message according to the parsed semantics, and send the at least one first prompt message to the smart mobile terminal 112. The smart glasses 111 further send the identification information of the smart mobile terminal 112 to the prompt server while sending the first user speech, so that the prompt server sends the at least one first prompt message to the smart mobile terminal 112 according to the identification information. Alternatively, the prompt server pre-stores the associated information between the smart glasses 111 and the smart mobile terminal 112. While sending the first user speech, the smart glasses 111 further send the identity information of the smart glasses 111 to the prompt server, so that the prompt server sends the at least one first prompt message to the smart mobile terminal 112 according to the identity information and the associated information. Alternatively, the prompt server sends the at least one first prompt message to the smart glasses 111, so as to forward the at least one first prompt message to the smart mobile terminal 112 through the smart glasses 111.
The smart mobile terminal 112 is used to send the at least one first prompt message to the GAILLM server 120.
The GAILLM server 120 is further used to: determine, through the GAILLM, whether there is information that needs to be supplemented or confirmed based on the at least one first prompt message; when there is no information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the at least one task execution command to the smart mobile terminal 112; and when there is information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the at least one first prompt message, and send the prompt information of the information that needs to be supplemented or confirmed and the at least one task execution command to the smart mobile terminal 112. The prompt information is generated by the GAILLM server 120 according to the determined information that needs to be supplemented or confirmed, and is used to indicate or prompt what information needs to be supplemented or confirmed.
The smart mobile terminal 112 is further used to execute at least one action corresponding to the at least one task execution command.
The smart mobile terminal 112 is further used to generate a corresponding text according to the prompt information, convert the corresponding text into a speech through the text-to-speech engine, and send the speech to the smart glasses 111. For example, the prompt information may be “confirm which Peter in the contact list needs to be called”, and the corresponding text may be “Which Peter do you want to call: Peter (1), phone number 12345; Peter (2), phone number 23456; or Peter (3), phone number 34567?”.
The smart glasses 111 are further used to play the speech through the built-in speaker.
Optionally, in other embodiments of the present disclosure, the smart glasses system 110 includes: smart glasses 111 and a smart mobile terminal 112, and the smart mobile terminal 112 is configured with the GAILLM.
The smart glasses 111 are used to obtain the first user speech through the built-in microphone, and send the first user speech to the smart mobile terminal 112 through Bluetooth.
The smart mobile terminal 112 is used to convert the first user speech into a first text through a speech-to-text engine and perform the semantic parsing on the first text, obtain the at least one task execution command through the GAILLM configured on the smart mobile terminal 112 based on the parsed semantics, and execute the at least one action corresponding to the at least one task execution command. The speech-to-text engine is configured on the smart mobile terminal 112 or the speech-to-text server 130. When the speech-to-text engine is configured on the speech-to-text server 130, the smart mobile terminal 112 sends the first user speech to the speech-to-text server 130, and the speech-to-text server 130 converts the first user speech into the first text, or converts the first user speech into the first text and also performs the semantic parsing on the first text.
The smart mobile terminal 112 is further used to: determine, through the GAILLM, whether there is information that needs to be supplemented or confirmed based on the parsed semantics; when there is no information that needs to be supplemented or confirmed, obtain the at least one task execution command through the GAILLM based on the parsed semantics; and when there is information that needs to be supplemented or confirmed, generate a text including prompt information of the information that needs to be supplemented or confirmed through the GAILLM, convert the text into a speech through a text-to-speech engine, and send the speech to the smart glasses 111.
The smart glasses 111 are further used to receive the speech through Bluetooth, play the speech through the built-in speaker, obtain the second user speech through the microphone, and send the second user speech to the smart mobile terminal 112 through Bluetooth.
The smart mobile terminal 112 is further used to: convert the second user speech into a second text through the speech-to-text engine, perform a semantic parsing on the second text, and determine, through the GAILLM, whether there is information that needs to be supplemented or confirmed based on the parsed semantics in the second text (or based on all parsed semantics, such as the semantics in the first text and the semantics in the second text); when there is no information that needs to be supplemented or confirmed, obtain, through the GAILLM, the at least one task execution command based on all parsed semantics; and when there is information that needs to be supplemented or confirmed, return to the step of generating the text including the prompt information of the information that needs to be supplemented or confirmed through the GAILLM.
The smart mobile terminal 112 is further used to determine whether the smart mobile terminal 112 is configured with the GAILLM according to an installation log or a preset mark when the first user speech is received by the smart mobile terminal 112. When the smart mobile terminal 112 is configured with the GAILLM, the parsed semantics are input into the GAILLM on the smart mobile terminal 112, so as to obtain the at least one task execution command through the GAILLM on the smart mobile terminal 112. Otherwise, the at least one first prompt message is generated based on the parsed semantics by the smart mobile terminal 112, and the generated at least one first prompt message is sent to the GAILLM server 120.
Optionally, in other embodiments of the present disclosure, the GAILLM is further configured on the smart glasses 111. The smart glasses 111 are further used to: obtain the first user speech, perform the semantic parsing on the first user speech to obtain parsed semantics, input the parsed semantics into the GAILLM to obtain the at least one task execution command through the GAILLM, and execute the at least one action corresponding to the at least one task execution command. Specifically, the smart glasses 111 may utilize a local speech-to-text engine, or the speech-to-text server 130, to convert the first user speech into the corresponding first text, and perform the semantic parsing on the first text. The smart glasses 111 may be configured with a prompt generator for generating the at least one first prompt message based on the parsed semantics, and the smart glasses 111 input the at least one first prompt message into the GAILLM to generate the at least one task execution command through the GAILLM.
The smart glasses 111 are further used to: determine whether the GAILLM is installed on the smart glasses 111 according to an installation log; when the GAILLM is installed on the smart glasses 111, input the parsed semantics into the GAILLM to generate the at least one task execution command through the GAILLM; and when the GAILLM is not installed on the smart glasses 111, generate the at least one first prompt message according to the parsed semantics, and send the at least one first prompt message to the GAILLM server 120 or the smart mobile terminal 112, so as to obtain the at least one task execution command through the GAILLM configured on the GAILLM server 120 or the smart mobile terminal 112.
Alternatively, the smart glasses 111 are further used to: determine the installation location of the GAILLM according to the pre-stored model configuration information; when the GAILLM is installed locally on the smart glasses 111, input the parsed semantics into the GAILLM to generate the at least one task execution command through the GAILLM; when the GAILLM is installed on the GAILLM server 120, generate the at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the GAILLM server 120, or forward the at least one first prompt message to the GAILLM server 120 through the smart mobile terminal 112, so as to obtain the at least one task execution command through the GAILLM on the GAILLM server 120; and when the GAILLM is installed on the smart mobile terminal 112, generate the at least one first prompt message based on the parsed semantics, and send the at least one first prompt message to the smart mobile terminal 112, so as to obtain the at least one task execution command through the GAILLM on the smart mobile terminal 112.
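A sketch of this location-based dispatch is given below, assuming the pre-stored model configuration information reduces to a simple location string; the helper names are illustrative.

```python
from typing import Callable, List

def dispatch_semantics(parsed_semantics: List[str],
                       model_location: str,                      # from model configuration info
                       run_local_gaillm: Callable[[List[str]], List[str]],
                       make_prompts: Callable[[List[str]], List[str]],
                       send_to_server: Callable[[List[str]], List[str]],
                       send_to_terminal: Callable[[List[str]], List[str]]) -> List[str]:
    if model_location == "glasses":        # GAILLM installed locally on the smart glasses
        return run_local_gaillm(parsed_semantics)
    prompts = make_prompts(parsed_semantics)   # otherwise, build first prompt message(s)
    if model_location == "server":         # GAILLM on the GAILLM server 120
        return send_to_server(prompts)
    if model_location == "terminal":       # GAILLM on the smart mobile terminal 112
        return send_to_terminal(prompts)
    raise ValueError(f"unknown GAILLM location: {model_location}")
```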
Further, in other embodiments of the present disclosure, the smart glasses 111, the smart mobile terminal 112, or the prompt server 113 are further used to: after the at least one task execution command is received, determine one or more executors of one or more target tasks according to the description information of the target task, or the description information of the executor, in each task execution command, and distribute each task execution command to the corresponding executor(s), so that each executor executes at least one action associated with the corresponding task execution command.
The smart glasses 111 obtain the first user speech and send the first user speech to the smart mobile terminal 112. The smart mobile terminal 112 converts the first user speech into a first text through a speech-to-text engine and performs the semantic parsing on the first text, generates the at least one first prompt message based on the parsed semantics, and sends the at least one first prompt message to the GAILLM server 120. The GAILLM server 120 obtains the at least one task execution command through the GAILLM based on the at least one first prompt message, and sends the at least one task execution command to the smart mobile terminal 112. The smart mobile terminal 112 executes the at least one action corresponding to the at least one task execution command.
A user speech in natural language (i.e., user voice command in audio format shown in
According to the execution result, when there is only one Peter in the contact list, the GAILLM outputs the second task execution command, so that the mobile APP makes a call according to Peter's phone number in the contact list.
When there is no Peter in the contact list, the GAILLM outputs a response including prompt information and the third task execution command, where the prompt information is used to prompt the mobile APP to ask the user for Peter's phone number, so that the mobile APP generates a text containing the prompt information, such as “please enter Peter's phone number”, converts the text into a synthesized speech in audio format through the text-to-speech engine, and sends the synthesized speech to the smart glasses 111 for playback; the mobile APP then analyzes the user speech obtained by the smart glasses 111 to obtain the phone number spoken by the user, and, according to the third task execution command, makes a call based on the phone number spoken by the user.
When there are a plurality of Peters in the contact list, such as Peter Zhang, Peter Wang, and Peter Lee, the GAILLM outputs a response including prompt information used to prompt the mobile APP to confirm which Peter the user needs to call, so that the mobile APP generates a text containing the prompt information, converts the text into a synthesized speech in audio format through the text-to-speech engine, and sends the synthesized speech to the smart glasses 111 for playback. After the synthesized speech is played, the smart glasses 111 obtain the second user speech (for example: “Peter Lee”) and send the second user speech to the mobile APP. The mobile APP converts the second user speech into a text through the speech-to-text engine and performs the semantic parsing on the converted text, generates the second prompt message according to the parsed semantics, and inputs the second prompt message into the GAILLM. According to the second prompt message, the GAILLM determines the phone number of Peter Lee to be called and outputs the fourth task execution command, so that the mobile APP makes a call based on the determined phone number according to the fourth task execution command.
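The zero/one/many branching of this example may be sketched as follows; the contact-list shape and the ask_user helper are illustrative assumptions.

```python
from typing import Callable, Dict, List

def resolve_callee(name: str,
                   contacts: List[Dict[str, str]],
                   ask_user: Callable[[str], str]) -> str:
    """Return the phone number to dial, asking the user when ambiguous."""
    matches = [c for c in contacts if name in c["name"]]
    if len(matches) == 1:                       # exactly one Peter: dial directly
        return matches[0]["phone"]
    if not matches:                             # no Peter: ask for the number
        return ask_user(f"Please enter {name}'s phone number")
    options = ", ".join(f'{c["name"]} ({c["phone"]})' for c in matches)
    reply = ask_user(f"Which {name} do you want to call? {options}")
    return next(c["phone"] for c in matches if c["name"] in reply or reply in c["name"])

# Example with the three Peters above:
contacts = [{"name": "Peter Zhang", "phone": "12345"},
            {"name": "Peter Wang", "phone": "23456"},
            {"name": "Peter Lee", "phone": "34567"}]
```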
A user speech in natural language (i.e., user voice command in audio format shown in
A user speech in natural language (i.e., user voice command in audio format shown in
A user speech in natural language (i.e., user voice command in audio format shown in
A voice command in natural language is spoken by a user wearing the smart glasses 111, for example, a user command in audio data format of “turn on the light in my bedroom after 10 minutes”. The smart glasses 111 obtain the user command in audio data format, and send the user command in audio data format to the prompt server 113. The prompt server 113 converts the user command in audio data format into the first text through the speech-to-text server 130, performs the semantic parsing on the first text, generates a prompt message based on the parsed semantics, and then sends the prompt message to the GAILLM server 120. The GAILLM server 120 inputs the prompt message into the GAILLM to obtain a response outputted by the GAILLM, and sends the response to the prompt server 113. The response includes a task execution command or prompt information of the information that needs to be supplemented or confirmed.
For example, when there is only one light in the bedroom, the response includes a task execution command, so that the prompt server 113 executes, according to the task execution command, the action of sending a command to the light in the bedroom (as the IoT device) through WIFI or a cellular network after 10 minutes, so as to control the light in the bedroom to turn on.
When there are a plurality of lamps in the bedroom, the response includes prompt information for asking the user to confirm which lamp to turn on. The prompt server 113 generates a text including the prompt information, converts the text into a synthesized speech in audio data format through the text-to-speech server 140, and sends the synthesized speech to the smart glasses 111 for playback, so as to confirm with the user, in an audio manner, which light in the bedroom needs to be turned on. After the synthesized speech is played, the smart glasses 111 obtain a user speech in audio data format (for example, “turn on the bedside lamp in the bedroom”) and send the user speech to the prompt server 113. The prompt server 113 converts the user speech in audio data format into a second text through the speech-to-text server 130, performs the semantic parsing on the second text, generates a prompt message based on the parsed semantics, and then sends the prompt message to the GAILLM server 120. The GAILLM server 120 inputs the prompt message into the GAILLM to obtain a response outputted by the GAILLM, and sends the response to the prompt server 113. The response includes a task execution command, and the prompt server 113 executes, according to the task execution command, an action of sending a command to the bedside lamp (as the IoT device) through WIFI or a cellular network after 10 minutes, to control the bedside lamp to turn on.
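For ease of understanding, the delayed IoT action in this example may be sketched as below; the device addressing and command payload are illustrative assumptions.

```python
import threading
from typing import Callable, Dict

def schedule_iot_command(device_id: str,
                         command: Dict[str, str],
                         delay_seconds: float,
                         send_command: Callable[[str, Dict[str, str]], None]) -> threading.Timer:
    """Send `command` to the IoT device after `delay_seconds` (e.g. over WIFI)."""
    timer = threading.Timer(delay_seconds, send_command, args=(device_id, command))
    timer.start()
    return timer   # the caller may cancel() the timer if the user changes their mind

# e.g. schedule_iot_command("bedroom/bedside_lamp", {"power": "on"}, 600, send)
```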
A voice command in natural language is spoken by a user wearing the smart glasses 111, for example, a user command in audio data format of “turn on the light in my bedroom after 10 minutes”. The smart glasses 111 obtain the user command in audio data format, and send the user command in audio data format to the smartphone 112. The smartphone 112 converts the user command in audio data format into a command in text, and sends the command in text to the prompt server 113. The prompt server 113 performs the semantic parsing on the command in text, generates a prompt message according to the parsed semantics, and then sends the prompt message to the GAILLM server 120. The GAILLM server 120 inputs the prompt message into the GAILLM to obtain a response outputted by the GAILLM, and sends the response to the prompt server 113. The response includes a task execution command and prompt information of the information that needs to be supplemented or confirmed. The prompt server 113 executes, according to the task execution command, an action of sending a command to the light in the bedroom (as the IoT device) through WIFI or a cellular network after 10 minutes, to control the light in the bedroom to turn on. At the same time, the prompt server 113 generates a text-format response to be sent to the user, which includes the prompt information of the information that needs to be supplemented or confirmed. The prompt server 113 sends the text-format response to the smartphone 112. The smartphone 112 converts the text-format response into a synthesized speech, and sends the synthesized speech to the smart glasses 111. The smart glasses 111 play the synthesized speech, so as to ask the user for more information, or to seek the user's confirmation, in an audio manner.
In the above-mentioned embodiments, the smart glasses system utilizes the GAILLM(s) to realize device control based on natural language voice command(s), thereby improving the convenience of device control through the smart glasses system; and due to the scalability and self-creativity of the GAILLM, the intelligence and interactivity of the smart glasses system are further improved.
The front frame 201 may be, for example, a front frame with lenses (e.g., sunglasses lenses, clear lenses, or corrective lenses). The at least one temple 202 may include, for example, a left temple 202A and a right temple 202B.
The temple 202 is connected to the front frame 201, and the processor 205 is electrically connected to the microphone 203, the speaker 204 and the memory 206. The microphone 203, the speaker 204, the processor 205 and the memory 206 are arranged on the at least one temple 202 and/or the front frame 201. Preferably, the at least one temple 202 is detachably connected to the front frame 201.
The processor 205 includes a CPU (Central Processing Unit) and a DSP (Digital Signal Processor). The DSP is used to process the voice data obtained by the microphone 203. The CPU is preferably an MCU (Microcontroller Unit).
The memory 206 is a non-transitory memory, and specifically may include: a RAM (Random Access Memory) and a flash memory component. One or more programs executable by the processor 205 are stored in the memory 206, and the one or more programs include a plurality of instructions. The instructions are used to obtain the first user speech through the microphone 203, perform a semantic parsing on the first user speech, obtain at least one task execution command through the GAILLM based on the parsed semantics, and execute at least one action corresponding to the at least one task execution command. The GAILLM is configured on the smart glasses 200 or other devices, such as a smart mobile terminal or other cloud server. The instructions obtain the at least one task execution command by inputting the parsed semantics into the GAILLM on the smart glasses 200, or by sending the parsed semantics to other devices configured with the GAILLM.
Optionally, in other embodiments of the present disclosure, the GAILLM is configured on a model server (e.g., the GAILLM server 120 in the control system shown in
The wireless communication component 207 includes a wireless signal transceiver and peripheral circuits, which can be specifically arranged in the inner cavity of the front frame 201 and/or the at least one temple 202. The wireless signal transceiver may, but is not limited to, use at least one of the WIFI (Wireless Fidelity) protocol, the NFC (Near Field Communication) protocol, the ZigBee protocol, the UWB (Ultra-Wide Band) protocol, the RFID (Radio Frequency Identification) protocol, and the cellular mobile communication protocol (such as 3G/4G/5G, etc.) to perform the data transmission.
Optionally, in other embodiments of the present disclosure, the GAILLM is configured on a model server, the smart glasses 200 further include a wireless communication component 207 electrically connected to the processor 205, and the instructions are further used to: send, through the wireless communication component 207, the first user speech to the model server, so as to convert, through the speech-to-text engine on the model server, the first user speech into a first text and perform the semantic parsing on the first text, and to obtain, through the GAILLM on the model server, the at least one task execution command based on the parsed semantics; and receive, through the wireless communication component 207, the at least one task execution command from the model server.
Optionally, the speech-to-text engine is configured on the model server or a speech-to-text server. The smart glasses 200 or the model server converts the first user speech into the first text by the data interaction with the speech-to-text server.
Optionally, in other embodiments of the present disclosure, the instructions are further used to: receive, through the wireless communication component 207, the task execution commands and the execution order of each of the task execution commands from the model server; and execute the actions corresponding to each of the task execution commands according to the execution order.
Optionally, in other embodiments of the present disclosure, the model server sends the task execution commands one by one to the smart glasses 200, and the instructions are further used to send a notification message to the model server after executing an action corresponding to one task execution command, so that the model server sends a next task execution command to the smart glasses 200 according to the notification message.
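This notify-then-next exchange may be sketched, from the glasses' side, as follows; receive_command and notify_done are hypothetical transport helpers.

```python
from typing import Callable, Optional

def run_command_stream(receive_command: Callable[[], Optional[str]],
                       execute: Callable[[str], None],
                       notify_done: Callable[[str], None]) -> None:
    while True:
        command = receive_command()   # next task execution command, or None when done
        if command is None:
            break
        execute(command)              # execute the corresponding action(s)
        notify_done(command)          # the model server then releases the next command
```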
Optionally, in other embodiments of the present disclosure, the instructions are further used to: receive, through the wireless communication component 207, a speech including prompt information of information that needs to be supplemented or confirmed from the model server, and play the speech through the speaker 204; and obtain a second user speech through the microphone 203, and send, through the wireless communication component 207, the second user speech to the model server, so that the model server obtains the at least one task execution command through the GAILLM based on the first user speech and the second user speech.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a wireless communication component 207 electrically connected to the processor 205, and the instructions are further used to send one or more control instructions to at least one device in an Internet of Things (IoT) according to the at least one task execution command through the wireless communication component 207, to control the at least one device to execute one or more actions specified by the at least one task execution command.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a wireless communication component 207 electrically connected to the processor 205, the GAILLM is configured on a model server, and the instructions are further used to send, through the wireless communication component 207, the first user speech to a prompt server, so that the prompt server converts the first user speech into a first text through a speech-to-text engine, performs the semantic parsing on the first text, generates at least one first prompt message based on the parsed semantics, and sends the at least one first prompt message to the model server. The model server obtains the at least one task execution command through the GAILLM based on the at least one first prompt message, and sends the at least one task execution command to the prompt server. The instructions are further used to receive, through the wireless communication component 207, the at least one task execution command sent by the prompt server.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a wireless communication component 207 electrically connected to the processor 205, the GAILLM is configured on a model server, and the instructions are further used to: send, through the wireless communication component 207, the first user speech to a prompt server, so that the prompt server converts the first user speech into a first text through a speech-to-text engine, performs the semantic parsing on the first text, and generates at least one first prompt message based on the parsed semantics; receive, through the wireless communication component 207, the at least one first prompt message sent by the prompt server, and send the at least one first prompt message to the model server, so that the model server obtains the at least one task execution command through the GAILLM based on the at least one first prompt message; and receive, through the wireless communication component 207, the at least one task execution command sent by the model server.
Optionally, in other embodiments of the present disclosure, the GAILLM is configured on the smart glasses 200 or the smart mobile terminal. The smart glasses 200 further include a Bluetooth component 208 electrically connected to the processor 205. The Bluetooth component 208 includes a Bluetooth signal transceiver and peripheral circuits, which can be specifically arranged in the inner cavity of the front frame 201 and/or the at least one temple 202. The Bluetooth component 208 can be connected to smart mobile terminals such as smartphones or smart watches, and is used for phone calls, music and data communication.
The instructions are further used to obtain the at least one task execution command through the GAILLM configured on the smart glasses 200 based on the parsed semantics.
The instructions are further used to perform data interaction with the smart mobile terminal through the Bluetooth component 208, so as to forward, through the smart mobile terminal, the at least one first prompt message to the model server, and to receive, through the smart mobile terminal, the at least one task execution command sent by the model server, thereby obtaining the at least one task execution command.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include input devices 209 electrically connected to the processor 205. The input devices 209 include a touch sensor and a key. The key may be a power-on button on the smart glasses 200.
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include an indicator light and/or a buzzer electrically connected to the processor 205, and the instructions are further used to output, through the indicator light and/or the buzzer, prompt information. The prompt information is used to indicate a state of the smart glasses 200, the state includes: a working state and an idle state. The working state includes: a starting speech pickup status, a speech pickup status, a completing speech pickup status, and a speech processing status. The indicator light may be an LED (Light Emitting Diode) light.
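As an illustration only, the states named above can be modeled as an enumeration; the mapping from state to indicator light behavior is an assumption for the example.

```python
from enum import Enum

class GlassesState(Enum):
    IDLE = "idle state"
    STARTING_PICKUP = "starting speech pickup"
    PICKING_UP = "speech pickup"
    PICKUP_DONE = "completing speech pickup"
    PROCESSING = "speech processing"

# Hypothetical mapping from state to indicator light pattern:
LED_PATTERN = {
    GlassesState.IDLE: "off",
    GlassesState.STARTING_PICKUP: "single blink",
    GlassesState.PICKING_UP: "solid on",
    GlassesState.PICKUP_DONE: "double blink",
    GlassesState.PROCESSING: "slow blink",
}
```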
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a data sensing component electrically connected to the processor 205, and the data sensing component includes at least one of: a position sensor, an inertial measurement unit (IMU) sensor, a temperature sensor, a proximity sensor, a humidity sensor, an electronic compass, a timer, a camera and a pedometer. The position sensor may be, but is not limited to, a positioning component based on GPS (Global Positioning System) or the BeiDou satellite system.
The instructions are further used to obtain the sensing data of the at least one component, and obtain, through the GAILLM, the at least one task execution command based on the sensing data and the parsed semantics.
Alternatively, the instructions are further used to obtain the sensing data of the at least one component, and send, through the wireless communication component 207, the sensing data of the at least one component and the at least one prompt message to the model server or the smart mobile terminal, so that the GAILLM on the model server or the smart mobile terminal obtains at least one task execution command according to the sensing data and the at least one prompt message. For example, assuming that the first user speech contains semantics of “remind me when I walk near the movie theater”, when the user of the smart glasses 200 is about to arrive at the movie theater, the GAILLM obtains a task execution command based on the real-time positioning data sent by the smart glasses 200 and the first prompt message containing the semantics, and sends the task execution command to the smart glasses 200, so as to instruct the smart glasses 200 to play the prompt speech of “you have arrived near the movie theater”.
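A sketch of the movie-theater reminder follows, assuming positioning data arrives as raw (latitude, longitude) pairs and using a crude fixed-threshold proximity test (adequate only as an illustration over short distances).

```python
from typing import Callable, Tuple

Position = Tuple[float, float]   # (latitude, longitude) in degrees

def near(position: Position, target: Position, threshold_deg: float = 0.002) -> bool:
    """Rough proximity test on raw coordinates; illustrative only."""
    return (abs(position[0] - target[0]) < threshold_deg
            and abs(position[1] - target[1]) < threshold_deg)

def check_reminder(position: Position, theater: Position,
                   play_prompt: Callable[[str], None]) -> None:
    # Called with each real-time positioning update from the smart glasses 200.
    if near(position, theater):
        play_prompt("you have arrived near the movie theater")
```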
Optionally, in other embodiments of the present disclosure, the smart glasses 200 further include a voice biometric recognition module. The voice biometric recognition module is used to identify the user using the acquired voiceprint of the user of the smart glasses 200, so as to enable the above-mentioned voice control function of the smart glasses 200.
The smart glasses 200 further include a battery 210 for providing power to the above-mentioned electronic components (such as the microphone 203, the speaker 204, the processor 205, the memory 206, etc.) on the smart glasses 200.
The various electronic components of the above-mentioned smart glasses are connected through a bus.
It should be noted that the relationship between the components of the above-mentioned smart glasses is a substitution relationship or a superposition relationship. That is, either all the above-mentioned components in the embodiment are installed on the smart glasses, or some of the above-mentioned components are selectively installed according to requirements. When the relationship is a substitution relationship, the smart glasses are further provided with at least one peripheral connection interface, for example, a PS/2 interface, a serial interface, a parallel interface, an IEEE 1394 interface, or a USB (Universal Serial Bus) interface. The function of the replaced component is realized through a peripheral device connected to the connection interface, such as an external speaker, an external sensor, etc.
For details not covered in the embodiments about the smart glasses, reference may also be made to the relevant descriptions in the embodiments shown in the above-mentioned
In the above-mentioned embodiments, the smart glasses utilize the GAILLM(s) to realize the device control based on the natural language voice command(s), thereby improving the convenience of the device control based on the smart glasses system, and due to the scalability and self-creativity of the GAILLM, the intelligence and interactivity of the smart glasses system are further improved.
The natural language command control method based on GAILLM, applied to a smart wearable device system, includes the following steps:
S301, obtaining a first user speech, performing a semantic parsing on the first user speech, and obtaining a parsing result;
S302, obtaining at least one task execution command through the GAILLM based on the parsing result; and
S303, executing at least one action corresponding to the at least one task execution command.
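By way of illustration, steps S301 to S303 can be sketched as follows. Every function here (`record_speech`, `parse_semantics`, `query_gaillm`) and the command/action vocabulary are placeholders standing in for the device's actual speech, parsing, and GAILLM interfaces.

```python
def record_speech() -> bytes:
    """S301 (stub): capture the first user speech from the microphone."""
    return b"..."

def parse_semantics(audio: bytes) -> str:
    """S301 (stub): semantic parsing; returns the parsing result."""
    return "turn up the volume"

def query_gaillm(parsing_result: str) -> list[dict]:
    """S302 (stub): the GAILLM maps the parsing result to task execution command(s)."""
    return [{"name": "set_volume", "args": {"delta": 1}}]

# Assumed dispatch table from command names to device actions.
ACTIONS = {
    "set_volume": lambda delta: print(f"volume {delta:+d}"),
}

def control_method() -> None:
    audio = record_speech()                  # S301
    parsing_result = parse_semantics(audio)  # S301
    commands = query_gaillm(parsing_result)  # S302
    for cmd in commands:                     # S303
        ACTIONS[cmd["name"]](**cmd["args"])

control_method()
```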
The GAILLM(s) can be configured on any device in the smart wearable device system, such as the smart glasses 111, the smart mobile terminal 112, or the prompt server 113 in the above-mentioned smart glasses system 110.
Optionally, in other embodiments of the present disclosure, the smart wearable device system includes a smart wearable device and a smart mobile terminal, and the GAILLM is configured on a model server. The steps of obtaining the first user speech, performing the semantic parsing on the first user speech and obtaining the parsing result, obtaining the at least one task execution command through the GAILLM based on the parsing result, and executing the at least one action corresponding to the at least one task execution command include:
Optionally, in other embodiments of the present disclosure, the smart wearable device system includes a smart wearable device and a smart mobile terminal, and the GAILLM is configured on the smart mobile terminal. The steps of obtaining the first user speech, performing the semantic parsing on the first user speech and obtaining the parsing result, obtaining the at least one task execution command through the GAILLM based on the parsing result, and executing the at least one action corresponding to the at least one task execution command include:
Optionally, in other embodiments of the present disclosure, the smart mobile terminal generates a plurality of first prompt messages based on the parsed semantics, and sends, to the model server, the first prompt messages and an appearance order, in the first text, of the semantics corresponding to each of the first prompt messages. The model server obtains a plurality of task execution commands through the GAILLM based on the first prompt messages and the appearance order, and sends, to the smart mobile terminal, the task execution commands and an execution order of each of the task execution commands. The execution order corresponds to the appearance order.
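A sketch of this ordered batch exchange, assuming simple dictionaries for the messages: only the pairing of each prompt with its appearance order, and the sorting of commands by execution order, is taken from the paragraph above.

```python
def build_batch(prompts: list[str]) -> list[dict]:
    """Pair each first prompt message with the appearance order of its semantics."""
    return [{"appearance_order": i, "prompt": p} for i, p in enumerate(prompts)]

def execute_in_order(commands: list[dict]) -> None:
    """Execute commands in the execution order returned by the model server."""
    for cmd in sorted(commands, key=lambda c: c["execution_order"]):
        print("executing:", cmd["name"])

# Example: the execution order mirrors the appearance order in the first text.
print(build_batch(["look up the contact", "send the message"]))
execute_in_order([
    {"execution_order": 1, "name": "send_message"},
    {"execution_order": 0, "name": "lookup_contact"},
])
```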
The step of receiving, by the smart mobile terminal, the at least one task execution command from the model server, and executing the at least one action corresponding to the at least one task execution command further includes:
Optionally, in other embodiments of the present disclosure, the smart mobile terminal generates a plurality of first prompt messages based on the parsed semantics, and sends the first prompt messages one by one to the model server.
Optionally, in other embodiments of the present disclosure, the model server determines, through the GAILLM, whether there is information that needs to be supplemented or confirmed, based on the at least one first prompt message. When there is no information that needs to be supplemented or confirmed, the model server obtains the at least one task execution command through the GAILLM based on the at least one first prompt message. When there is information that needs to be supplemented or confirmed, the model server generates, through the GAILLM, a text including prompt information of the information that needs to be supplemented or confirmed, and sends the text to the smart mobile terminal.
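The branch just described might look like the following, where the reply fields (`needs_supplement_or_confirm`, `question`, `commands`) are assumed names for whatever the GAILLM actually returns.

```python
def handle_first_prompt(gaillm_reply: dict) -> dict:
    """Branch on whether information needs to be supplemented or confirmed."""
    if gaillm_reply.get("needs_supplement_or_confirm"):
        # Send a text back so the smart mobile terminal can prompt the user.
        return {"type": "clarify", "text": gaillm_reply["question"]}
    # Otherwise, return the task execution command(s) directly.
    return {"type": "commands", "commands": gaillm_reply["commands"]}

print(handle_first_prompt({
    "needs_supplement_or_confirm": True,
    "question": "Which day should the reminder be set for?",
}))
```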
The method further includes the following steps:
Optionally, in other embodiments of the present disclosure, the steps of performing the semantic parsing on the first user speech and obtaining the parsing result, obtaining the at least one task execution command through the GAILLM based on the parsing result, and executing the at least one action corresponding to the at least one task execution command include:
Optionally, in other embodiments of the present disclosure, the smart wearable device system includes a smart wearable device and a prompt server. The steps of obtaining the first user speech, performing the semantic parsing on the first user speech and obtaining the parsing result, obtaining the at least one task execution command through the GAILLM based on the parsing result, and executing the at least one action corresponding to the at least one task execution command include:
Optionally, in other embodiments of the present disclosure, the smart wearable device system includes a smart wearable device, a smart mobile terminal and a prompt server, the GAILLM is configured on the smart mobile terminal. The steps of obtaining the first user speech, performing the semantic parsing on the first user speech and obtaining the parsing result, obtaining the at least one task execution command through the GAILLM based on the parsing result, and executing the at least one action corresponding to the at least one task execution command include:
Optionally, in other embodiments of the present disclosure, the smart wearable device system includes a smart wearable device, a smart mobile terminal and a prompt server, the GAILLM is configured on a model server. The steps of obtaining the first user speech, performing the semantic parsing on the first user speech and obtaining the parsing result, obtaining the at least one task execution command through the GAILLM based on the parsing result, and executing the at least one action corresponding to the at least one task execution command include:
Hereinafter, the working mechanism of the above-mentioned control method will be described with reference to the accompanying flowchart. As shown in blocks 404 and 413 of the flowchart, a response is obtained through the GAILLM.
Further, when the response indicates that the user should be asked for more information, the user voice command is obtained again after the prompt speech is played, and the above process is repeated until the interaction ends.
For example, if the user voice command is “I want to call Simon”, the smart glasses system generates a first prompt message to ask the GAILLM what the intention of the user voice command is. The response output by the GAILLM indicates that the intention is “call” and includes a first task execution command for obtaining Simon's phone number. According to the first task execution command, the smart glasses system executes the action of looking up Simon's phone number in the contact list of the smartphone and sends the execution result to the GAILLM. When there is only one Simon in the contact list, the GAILLM outputs a second task execution command, so that the smart glasses system makes a call to Simon's phone number in the contact list. When there is no Simon in the contact list, the GAILLM outputs a response including a third task execution command and prompt information for asking the user for Simon's phone number, so that the smart glasses system asks the user for Simon's phone number by voice according to the response, and executes, according to the third task execution command, the action of making a call based on the phone number answered by the user. When there are a plurality of contacts named Simon in the contact list, the GAILLM outputs a response including a fourth task execution command and prompt information for confirming which Simon the user needs to call, so that the smart glasses system confirms to the user by voice which Simon is to be called, and performs, according to the fourth task execution command, the call action based on the phone number confirmed by the user.
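The three branches of this example (one match, no match, several matches) can be condensed into the following sketch; `dial` and `ask_user` are stubs for the call action and the prompt-speech round trip, and the command numbering in the comments maps back to the example above.

```python
def dial(number: str) -> None:
    """Stub for the call action."""
    print(f"dialing {number}")

def ask_user(question: str) -> str:
    """Stub for the prompt-speech round trip; the spoken answer is typed here."""
    print(f"(prompt speech) {question}")
    return input("> ")

def call_by_name(name: str, contacts: dict[str, list[str]]) -> None:
    numbers = contacts.get(name, [])
    if len(numbers) == 1:
        dial(numbers[0])                                      # one match: second command
    elif not numbers:
        number = ask_user(f"What is {name}'s phone number?")  # no match: third command
        dial(number)
    else:
        choice = ask_user(f"There are several contacts named {name}: {numbers}. "
                          "Which number should I call?")      # several matches: fourth command
        dial(choice)

call_by_name("Simon", {"Simon": ["+1-555-0100", "+1-555-0199"]})
```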
For details not covered in the embodiments about the control method, reference may further be made to the relevant descriptions in the above-mentioned embodiments.
In the above-mentioned embodiments, the smart wearable device system utilizes the GAILLM(s) to realize the device control based on the natural language voice command(s), thereby improving the convenience of the device control based on the smart glasses system, and due to the scalability and self-creativity of the GAILLM, the intelligence and interactivity of the smart glasses system are further improved.
The present disclosure further provides a non-transitory computer-readable storage medium, which can be set in the smart glasses or the smart wearable device in the above-mentioned embodiments, and may be the memory 206 in the above-mentioned embodiment.
It should be understood that, in the above-described embodiments of the present disclosure, the above-mentioned smart glasses, control system, and control method may be implemented in other manners. For example, multiple units/modules may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the above-mentioned mutual coupling/connection may be direct coupling/connection or communication connection, may be indirect coupling/connection or communication connection through some interfaces/devices, and may also be electrical, mechanical or in other forms.
It should be noted that, for the sake of simplicity, the various method embodiments described above are described as a series of action combinations. However, those skilled in the art should understand that the present disclosure is not limited by the order of the described actions, as certain steps can be performed in a different order or simultaneously. Additionally, it should be understood that the embodiments described in the present disclosure are preferred embodiments, and the actions and modules involved are not necessarily required for the present disclosure.
In the above-mentioned embodiments, the descriptions of each embodiment have different focuses. For portions not described in a particular embodiment, reference can be made to relevant descriptions in other embodiments.
The above is a description of the smart glasses, control system, and control method provided by the present disclosure. Those skilled in the art should understand that, based on the embodiments of the present disclosure, the specific implementations and the application scope may vary. Therefore, the content of this specification should not be construed as limiting the present disclosure.