This application claims priority to Chinese Patent Application No. 201911417229.4, filed with the China National Intellectual Property Administration on Dec. 31, 2019 and entitled “SPEECH CONTROL METHOD AND APPARATUS, SERVER, TERMINAL DEVICE, AND STORAGE MEDIUM”, which is incorporated herein by reference in its entirety.
This application pertains to the field of terminal technologies, and in particular, to a speech control method and apparatus, a server, a terminal device, and a storage medium.
In a human-machine natural language dialog system, a speech assistant is an intelligent application, and may be loaded on an intelligent terminal device such as a mobile phone, a television, a tablet, a computer, or a sound box. The speech assistant receives an audio signal of a user, performs speech recognition, and makes a determination or provides a response. A dialog process including speech assistant wakeup, speech recognition, and responding requires cloud support from a speech database. A dialog manager (Dialog Manager, DM) may serve as a cloud service, and is responsible for maintaining and updating a process and a status of a dialog. An input of the dialog manager is an utterance (utterance) and a related context. After understanding the utterance, the dialog manager outputs a system response.
With the development of the internet and the internet of things, based on a network connection between a plurality of devices, a cross-device joint dialog may be performed by using the plurality of devices through mutual speech control, to form an all-scenario session scenario. For example, speech interaction with a mobile phone is performed, and a television is controlled by using the mobile phone to perform a corresponding task operation.
Currently, when the cross-device joint dialog is performed by using the plurality of devices, the dialog manager repeatedly processes a task instruction of a user in a plurality of phases for the plurality of devices. This prolongs a response time of the system and increases a dialog delay.
Embodiments of this application provide a speech control method and apparatus, a server, a terminal device, and a storage medium, to resolve a problem that a system response time is prolonged and a dialog delay is increased because a dialog manager repeatedly processes a task instruction of a user in a plurality of phases during a joint dialog of a plurality of devices.
According to a first aspect, an embodiment of this application provides a speech control method, including:
receiving a speech instruction recognition result sent by a first terminal; performing semantic processing on the speech instruction recognition result, to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction; sending the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal; and receiving an execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and sending, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
According to the speech control method provided in this application, a server is used as an execution body. The server receives the speech instruction recognition result sent by the first terminal, performs semantic processing on the speech instruction recognition result to obtain to-be-executed operation information in the speech instruction recognition result, and sends the operation information to the first terminal. The first terminal executes the first semantic instruction in the operation information, and sends the second semantic instruction in the operation information to the second terminal. After the second terminal recognizes the second semantic instruction, the server may directly receive the execution command fed back by the second terminal, invoke, according to the execution command, the service logic corresponding to the second semantic instruction, and send the service logic to the second terminal. In this way, a repeated processing procedure for the second semantic instruction is omitted, a dialog delay is reduced, and a response time of the dialog system is shortened.
In a possible implementation of the first aspect, the performing semantic processing on the speech instruction recognition result, to obtain operation information includes: recognizing the speech instruction recognition result, to obtain a target intent and a target sub-intent of the speech instruction recognition result; pre-verifying the target sub-intent based on the target intent, to obtain response logic of the target intent and a pre-run result of the target sub-intent; and using the response logic as the first semantic instruction of the operation information, and using the target sub-intent and the pre-run result as the second semantic instruction of the operation information.
In this possible implementation, after the speech instruction recognition result (that is, text information corresponding to a speech instruction entered by a user) sent by the first terminal is received, semantic recognition is performed on the speech instruction recognition result, to obtain the target intent and the target sub-intent in the speech instruction recognition result. The response logic of the target intent and the pre-run result of the target sub-intent are obtained by pre-verifying the target sub-intent based on the target intent, and when the response logic is sent to the first terminal as the first semantic instruction, the target sub-intent and the pre-run result are further sent to the first terminal as the second semantic instruction. The first semantic instruction is executed on the first terminal, and the second semantic instruction is sent to the second terminal, so as to provide an information basis for the dialog system and improve the response speed of the dialog system.
In a possible implementation of the first aspect, the sending the first semantic instruction and the second semantic instruction to the first terminal includes:
sending the first semantic instruction and the second semantic instruction to the first terminal in a semantic representation form.
In a possible implementation of the first aspect, the sending, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal includes:
parsing the pre-run result according to the execution command; and invoking the service logic based on the parsed pre-run result, and sending the service logic to the second terminal in a semantic representation form.
In this possible implementation, after the execution command sent by the second terminal is received, the pre-run result may be parsed directly according to the execution command, and the corresponding service logic is invoked based on the parsing result. Processes such as performing semantic processing on the target sub-intent and selecting a corresponding execution manner do not need to be performed, thereby shortening the response time of the dialog system.
According to a second aspect, an embodiment of this application provides a speech control method, including:
receiving a speech instruction entered by a user, and performing speech recognition on the speech instruction to obtain a speech instruction recognition result; sending the speech instruction recognition result to a server; receiving operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, where the operation information includes a first semantic instruction and a second semantic instruction; and executing the first semantic instruction, and sending the second semantic instruction to a second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and receive service logic that is fed back by the server and that is corresponding to the second semantic instruction.
According to the speech control method provided in this application, a first terminal is used as an execution body. After performing speech recognition on the speech instruction entered by the user, the first terminal sends the obtained speech instruction recognition result to the server, receives the operation information obtained after the server performs semantic processing on the speech instruction recognition result, executes the first semantic instruction in the operation information, and sends the second semantic instruction to the second terminal, so that the second terminal directly invokes an execution interface of the server according to the second semantic instruction, sends the execution command to the server, and receives the service logic that is fed back by the server and that is corresponding to the second semantic instruction. In this way, an information basis is provided for a dialog system to further respond to the second semantic instruction, and a processing procedure for the second semantic instruction is omitted, so that a response time of the dialog system can be shortened.
In a possible implementation of the second aspect, the receiving operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result includes:
receiving response logic fed back by the server for a target intent in the speech instruction recognition result, and receiving a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result.
In a possible implementation of the second aspect, the first semantic instruction is response logic fed back by the server for a target intent in the speech instruction recognition result, and the second semantic instruction is a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result and the target sub-intent; and
correspondingly, the executing the first semantic instruction, and sending the second semantic instruction to a second terminal includes:
executing the response logic fed back by the server, and sending, to the second terminal, the target sub-intent and the pre-run result that are fed back by the server.
In this possible implementation, when the response logic fed back by the server for the target intent in the speech instruction recognition result is received, the pre-run result fed back by the server for the target sub-intent in the speech instruction recognition result is also received, and the pre-run result of the target sub-intent is used as intermediate data to be transmitted to the second terminal, so as to provide a data basis for the second terminal. By executing the response logic fed back by the server, when the target sub-intent is sent to the second terminal, the pre-run result is also sent to the second terminal, so that the second terminal may directly invoke an execution interface of the server based on the pre-run result, and there is no need to upload the target sub-intent to the server for processes such as semantic processing and determining execution, thereby reducing a data processing procedure and shortening the response time of the dialog system.
According to a third aspect, an embodiment of this application provides a speech control method, including:
receiving a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction, where the first semantic instruction and the second semantic instruction are operation information that is fed back by a server based on a speech instruction recognition result and that is received by the first terminal after the first terminal sends the speech instruction recognition result to the server; recognizing the second semantic instruction, to obtain a recognition result of the second semantic instruction; sending an execution command to the server based on the recognition result; and receiving service logic that is fed back by the server based on the execution command and that is corresponding to the second semantic instruction, and executing the service logic.
According to the speech control method provided in this application, a second terminal is used as an execution body. The second terminal recognizes the received second semantic instruction, and directly invokes an execution interface of the server based on the recognition result, to instruct the server to feed back service logic corresponding to the second semantic instruction, so that the server does not need to perform semantic processing on the second semantic instruction. This reduces a data processing procedure, improves a response speed of the second terminal, and shortens a delay of the dialog system.
In a possible implementation of the third aspect, the operation information includes response logic fed back by the server for a target intent in the speech instruction recognition result, and a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result; and
correspondingly, the receiving a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction includes: receiving the target sub-intent and the pre-run result that are sent by the first terminal when the first terminal executes the response logic.
In a possible implementation of the third aspect, the second semantic instruction includes a pre-run result obtained by the server by pre-verifying a target sub-intent in the speech instruction recognition result; and
correspondingly, the recognizing the second semantic instruction, to obtain a recognition result of the second semantic instruction includes: recognizing the second semantic instruction, to obtain the pre-run result of the target sub-intent.
In a possible implementation of the third aspect, the sending an execution command to the server based on the recognition result includes:
sending the execution command corresponding to the pre-run result to the server based on the recognition result.
For example, the pre-run result includes a skill identifier, an intent identifier, and a slot list, where a slot includes a slot name, a slot type, and a slot value.
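As an illustration only, the pre-run result described here could be modeled as follows; the class and field names in this sketch are hypothetical and are not defined by this application:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Slot:
    # One parameter required for intent execution.
    name: str   # slot name, e.g. "movieName"
    type: str   # slot type, e.g. "string", "date", "number"
    value: str  # slot value, e.g. "Ne Zha"


@dataclass
class PreRunResult:
    # Result of pre-verifying a target sub-intent on the server.
    skill_id: str    # identifies the skill
    intent_id: str   # identifies the unique intent within the skill
    slots: List[Slot] = field(default_factory=list)  # may be empty


# Example: a pre-run result for "Play the movie Ne Zha".
result = PreRunResult(
    skill_id="play_movie",
    intent_id="play_movie_by_name",
    slots=[Slot(name="movieName", type="string", value="Ne Zha")],
)
```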
It should be understood that the server, the first terminal, and the second terminal may be interconnected with each other in a networked state, and implement data transmission with each other by using a data transmission protocol. Alternatively, the server, the first terminal, and the second terminal are separately connected to a cloud-side service to exchange data.
For example, the server, the first terminal, and the second terminal may be connected to each other through mutual confirmation of addresses and interfaces between the devices over Wi-Fi or a cellular network, to form a device circle of a dialog system, and implement mutual control by using a speech instruction.
For example, the server sends the first semantic instruction in the operation information to the first terminal, and directly sends the second semantic instruction to the second terminal.
According to a fourth aspect, an embodiment of this application provides a speech control apparatus, including:
a first receiving module, configured to receive a speech instruction recognition result sent by a first terminal;
a semantic processing module, configured to perform semantic processing on the speech instruction recognition result, to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction;
a first sending module, configured to send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal; and
a command execution module, configured to: receive an execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and send, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
In a possible implementation, the semantic processing module includes:
a semantic recognition submodule, configured to recognize the speech instruction recognition result, to obtain a target intent and a target sub-intent of the speech instruction recognition result; and
a task execution submodule, configured to: pre-verify the target sub-intent based on the target intent, to obtain response logic of the target intent and a pre-run result of the target sub-intent; and use the response logic as the first semantic instruction of the operation information, and use the target sub-intent and the pre-run result as the second semantic instruction of the operation information.
In a possible implementation, the first sending module is further configured to send the first semantic instruction and the second semantic instruction to the first terminal in a semantic representation form.
In a possible implementation, the first sending module includes:
a first submodule, configured to parse the pre-run result according to the execution command; and
a second submodule, configured to invoke the service logic based on the parsed pre-run result, and send the service logic to the second terminal in the semantic representation form.
According to a fifth aspect, an embodiment of this application provides a speech control apparatus, including:
a speech recognition module, configured to: receive a speech instruction entered by a user, and perform speech recognition on the speech instruction to obtain a speech instruction recognition result;
a second sending module, configured to send the speech instruction recognition result to a server;
a second receiving module, configured to receive operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, where the operation information includes a first semantic instruction and a second semantic instruction; and
an instruction execution module, configured to: execute the first semantic instruction; and send the second semantic instruction to a second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and receive service logic that is fed back by the server and that is corresponding to the second semantic instruction.
In a possible implementation, the second receiving module is further configured to receive response logic fed back by the server for a target intent in the speech instruction recognition result, and receive a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result.
In a possible implementation, the first semantic instruction is the response logic fed back by the server for the target intent in the speech instruction recognition result, and the second semantic instruction is the pre-run result fed back by the server for the target sub-intent in the speech instruction recognition result and the target sub-intent. The instruction execution module is further configured to execute the response logic fed back by the server, and send, to the second terminal, the target sub-intent and the pre-run result that are fed back by the server.
According to a sixth aspect, an embodiment of this application provides a speech control apparatus, including:
a third receiving module, configured to receive a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction, where the first semantic instruction and the second semantic instruction are operation information that is fed back by a server based on a speech instruction recognition result and that is received by the first terminal after the first terminal sends the speech instruction recognition result to the server;
an instruction recognition module, configured to recognize the second semantic instruction, to obtain a recognition result of the second semantic instruction;
a third sending module, configured to send an execution command to the server based on the recognition result; and
a service execution module, configured to: receive service logic that is fed back by the server based on the execution command and that is corresponding to the second semantic instruction, and execute the service logic.
In a possible implementation, the operation information includes response logic fed back by the server for a target intent in the speech instruction recognition result, and a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result. The third receiving module is further configured to receive the target sub-intent and the pre-run result that are sent by the first terminal when the first terminal executes the response logic.
In a possible implementation, the second semantic instruction includes a pre-run result obtained by the server by pre-verifying a target sub-intent in the speech instruction recognition result. The instruction recognition module is further configured to recognize the second semantic instruction, to obtain the pre-run result of the target sub-intent.
In a possible implementation, the third sending module is further configured to send the execution command corresponding to the pre-run result to the server based on the recognition result.
According to a seventh aspect, an embodiment of this application provides a server. The server includes a memory, a processor, a natural language understanding module, and a dialog management module. The memory is configured to store a computer program, the computer program includes instructions, and when the instructions are executed by the server, the server is enabled to perform the speech control method.
According to an eighth aspect, an embodiment of this application provides a terminal device. The terminal device includes a memory, a processor, and a speech assistant. The memory is configured to store a computer program, the computer program includes instructions, and when the instructions are executed by the terminal device, the terminal device is enabled to perform the speech control method.
According to a ninth aspect, an embodiment of this application provides a terminal device. The terminal device includes a memory and a processor. The memory is configured to store a computer program, the computer program includes instructions, and when the instructions are executed by the terminal device, the terminal device is enabled to perform the speech control method.
According to a tenth aspect, an embodiment of this application provides a computer storage medium. The computer-readable storage medium stores a computer program, the computer program includes instructions, and when the instructions are run on a terminal device, the terminal device is enabled to perform the speech control method.
According to an eleventh aspect, an embodiment of this application provides a computer program product including instructions. When the computer program product is run on a terminal device, the terminal device is enabled to perform the speech control method according to any one of the possible implementations of the first aspect.
It may be understood that for beneficial effects of the second aspect to the eleventh aspect, refer to technical effects of the first aspect or the implementations of the first aspect. Details are not described herein again.
Compared with the current technology, the embodiments of this application have the following beneficial effect: According to the speech control method provided in this application, semantic processing is performed on the speech instruction recognition result received from the first terminal, to obtain the to-be-executed operation information in the speech instruction recognition result, and the operation information is sent to the first terminal. The first terminal executes the first semantic instruction in the operation information, and sends the second semantic instruction in the operation information to the second terminal. After the second terminal recognizes the second semantic instruction, the server may directly receive the execution command fed back by the second terminal, invoke, according to the execution command, the service logic corresponding to the second semantic instruction, and send the service logic to the second terminal. In this embodiment, after the second terminal receives the second semantic instruction, the server may directly receive the execution command that is fed back by the second terminal based on task information included in the second semantic instruction, and does not need to perform semantic processing again on the second semantic instruction received by the second terminal. The corresponding service logic may be invoked according to the execution command that is fed back, and be sent to the second terminal through the execution interface. In this way, the processing procedure for the second semantic instruction is omitted, the dialog delay is reduced, and the response time of the dialog system is shortened.
In the following description, to illustrate rather than limit, specific details such as a particular system structure and technology are provided to make a thorough understanding of the embodiments of this application. However, persons skilled in the art should know that this application may also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted, so that this application is described without being obscured by unnecessary details.
It should be understood that, when used in the specification and the appended claims of this application, the terms “comprises” and/or “comprising” indicate presence of the described features, entireties, steps, operations, elements, and/or components, but do not exclude presence or addition of one or more other features, entireties, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the term “and/or” used in the specification and the appended claims of this application refers to any combination and all possible combinations of one or more associated listed items, and includes these combinations.
As used in the specification and the appended claims of this application, according to the context, the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting”. Similarly, according to the context, the phrase “if it is determined that” or “if (a described condition or event) is detected” may be interpreted as a meaning of “once it is determined that” or “in response to determining” or “once (a described condition or event) is detected” or “in response to detecting (a described condition or event)”.
In addition, in the specification and the appended claims of this application, the terms “first”, “second”, “third”, and the like are merely used for distinguishing description, and shall not be understood as an indication or implication of relative importance.
Reference to “an embodiment”, “some embodiments”, or the like described in the specification of this application indicates that one or more embodiments of this application include a specific feature, structure, or characteristic described with reference to the embodiments. Therefore, in this specification, statements, such as “in an embodiment”, “in some embodiments”, “in some other embodiments”, and “in other embodiments”, that appear at different places do not necessarily mean referring to a same embodiment, instead, they mean “one or more but not all of the embodiments”, unless otherwise specifically emphasized in other ways. The terms “include”, “comprise”, “have”, and their variants all mean “include but are not limited to”, unless otherwise specifically emphasized in other ways.
A speech control method provided in this application may be applied to an all-scenario session scenario in which a plurality of devices perform cross-device joint dialogs and control each other by using speeches. For example, speech interaction with a mobile phone is performed, and a television is controlled by using the mobile phone to execute corresponding service logic.
Currently, in the all-scenario session scenario in which the plurality of devices perform mutual speech control, each device in the scenario needs to have a networking function. The devices may communicate with each other in a wired or wireless manner through mutual confirmation of addresses and interfaces, or each device accesses a cloud-side service and implements communication by using the cloud-side service. The wireless manner includes the internet, a Wi-Fi network, or a mobile network. The mobile network may include existing 2G (for example, a global system for mobile communications (English: Global System for Mobile Communication, GSM)), 3G (for example, a universal mobile telecommunications system (English: Universal Mobile Telecommunications System, UMTS)), 4G (for example, FDD LTE and TDD LTE), 4.5G, 5G, and the like. The devices transmit data by using a transmission protocol, for example, a communications protocol such as HTTP. The devices each may be a mobile phone, a television, a tablet, a sound box, a computer, or the like, and the devices may have functions such as networking and a speech assistant.
In an actual application scenario, when a plurality of devices perform cross-device joint dialogs and control each other by using speeches, a dialog manager (Dialog Manager, DM) needs to serve as a cloud service to maintain and update a process and a status of the dialog, receive an input utterance (utterance) corresponding to a speech instruction, and output a system response by understanding the utterance with reference to a related context.
The dialog manager (Dialog Manager, DM) obtains, based on semantics of the input speech instruction, a task corresponding to the speech instruction, determines information required by the task, and then connects to a service platform to complete the task, or requests the user to further input more speech instruction information, or obtains service logic corresponding to the task on the service platform, and finally returns an execution result to the user.
DMs with different functions may be interconnected to different service platforms. The service platform may be a service platform preset by a system, or may be a third-party platform. For example, semantics of listening to a song or an e-book may be interconnected to a platform such as NetEase Cloud Music or Himalaya, and semantics of watching a video may be interconnected to a third-party platform such as iQIYI or Bilibili.
As shown in
It should be noted that a multi-device interconnection speech control system may include a plurality of devices, and the implemented speech control may include any type of speech instruction for cross-device control, for example, an instruction for controlling playing of a television through cross-device control, an instruction for controlling an air conditioner to adjust temperature through cross-device control, or an instruction for controlling a cooking mode of a cooking tool through cross-device control.
In a human-machine natural language dialog system, a dialog manager is responsible for controlling a process and a status of a dialog. With an utterance and a related context as an input, the dialog manager outputs a system response after multi-channel parallel skill discovery, pre-run, sorting and selection, execution, and session connection.
As shown in
For example, when the user says “Play the movie Ne Zha on the television” to the mobile phone, the dialog management server performs semantic recognition based on the mobile phone context to determine that the skill is a “switch” skill, and when the “switch” skill is executed, it is required to pre-verify whether the television supports “Play the movie Ne Zha”. In the dialog management server, the utterance “Play the movie Ne Zha” is recognized based on a simulated (analog) television context, and a processing procedure of skill discovery, pre-run, and selection is performed. If a skill can be selected, it indicates that the television supports “Play the movie Ne Zha”. Otherwise, it indicates that the television does not support the task, and a corresponding semantic processing result needs to be returned or further confirmation with the user is required.
When a verification result is “support”, the dialog management server returns response logic obtained after semantic processing, that is, returns skill=switch, target=television, and utterance=play the movie Ne Zha to the mobile phone. When receiving the response logic of “switch”, the mobile phone executes switching logic: sending “Play the movie Ne Zha” to the television. After receiving “Play the movie Ne Zha”, the television recognizes the text information of “Play the movie Ne Zha”, invokes the dialog management server again, performs semantic processing (skill discovery, pre-run, and selection) on “Play the movie Ne Zha” based on a real television context, then invokes an execution interface of the server based on a selected pre-run result, and sends an execution command to the server. The server interconnects to service logic of “Play the movie Ne Zha” according to the execution command, feeds back the service logic to the television, and returns skill=play a movie and name=Ne Zha. The television plays the movie.
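The duplication in this baseline flow can be pictured with the following minimal sketch; the payload values reuse the semantic representations quoted above, while the dictionary keys are assumptions made for illustration:

```python
# Baseline flow: "Play the movie Ne Zha" is semantically processed twice --
# once against the simulated television context when the phone asks for
# pre-verification, and again when the television calls the server itself.

# 1. Response logic returned to the mobile phone after the first pass:
response_logic = {"skill": "switch", "target": "television",
                  "utterance": "play the movie Ne Zha"}

# 2. The phone forwards only the utterance; the television re-submits it,
#    so the server repeats skill discovery, pre-run, and selection:
second_pass_input = {"utterance": "play the movie Ne Zha",
                     "context": "real television context"}

# 3. Service logic finally returned to the television:
service_logic = {"skill": "play a movie", "name": "Ne Zha"}
```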
Currently, in the dialog management server, context information of a simulated (analog) target terminal (a television) may be set to pre-verify whether a target terminal supports an intent of a current utterance. Only a verification result is obtained, but a task is not executed.
It can be learned from the foregoing process that, in the semantic processing performed by the dialog management server for the mobile phone and for the television, the procedure of “skill discovery, pre-run, and selection” is repeated for “Play the movie Ne Zha”. Consequently, a relatively long delay is caused in a speech interaction process of the dialog system, a response time of the system is prolonged, a running load of the dialog management server is increased, and user experience is relatively poor.
Based on the foregoing problem, according to the speech control method provided in this application, in an all-scenario multi-device cooperative dialog, by controlling information exchange between devices, when device switching is recognized, a pre-run result of pre-verification for a target device is used as intermediate data, and an intermediate device transmits the intermediate data to the target terminal, or the intermediate data is directly transmitted to the target terminal by the dialog management server.
For example, in the system architecture of multi-device interconnection speech control shown in
When the server feeds back the response logic to the first terminal and directly sends the pre-run result to the second terminal, the first terminal may respond to the user that switching is being performed or a command is being executed. The server invokes the second terminal, and directly sends the pre-run result to the second terminal. The second terminal recognizes the pre-run result, directly invokes an execution interface of the server, and sends an execution command to the server. The server interconnects to a service platform according to the execution command, invokes corresponding service logic, and feeds back the service logic to the second terminal. The second terminal executes the service logic. In this way, a repeated processing process performed by the server on an utterance is omitted, thereby improving a response speed of the target device, shortening a response time of the dialog system, and reducing a delay of human-machine speech interaction.
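By contrast, the improved flow can be sketched as follows; the key names are illustrative, and only the idea of carrying the pre-run result as intermediate data comes from this application:

```python
# Improved flow: the pre-run result from pre-verification travels with the
# second semantic instruction, so the second semantic pass is skipped.
operation_info = {
    # First semantic instruction: response logic executed by the first terminal.
    "first_instruction": {"skill": "switch", "target": "television",
                          "utterance": "play the movie Ne Zha"},
    # Second semantic instruction: the target sub-intent plus its pre-run
    # result, delivered to the second terminal directly or via the first.
    "second_instruction": {"utterance": "play the movie Ne Zha",
                           "pre_run_result": {"skill": "play a movie",
                                              "name": "Ne Zha"}},
}

# The second terminal recognizes the pre-run result and immediately invokes
# the server's execution interface, without re-running semantic processing.
execution_command = {"pre_run_result":
                     operation_info["second_instruction"]["pre_run_result"]}
```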
Step S301: Receive a speech instruction recognition result sent by a first terminal.
In this embodiment, the server receives the speech instruction recognition result sent by the first terminal. The speech instruction recognition result is text information of a speech instruction, obtained by performing speech recognition on audio information of the speech instruction after the first terminal receives the speech instruction entered by a user. The first terminal may be a terminal device on which a speech assistant is disposed, for example, a mobile phone, a computer, a tablet, a television, or a sound box. The audio information of the user is received by using a microphone of the first terminal. For example, the user says “Play the movie Ne Zha on the television” to a speech assistant of the mobile phone.
Specifically, after recognizing the speech instruction, the first terminal obtains the text information corresponding to the speech instruction, and transmits the text information to the server over Wi-Fi or a cellular mobile network. The server then performs semantic recognition and processing.
The speech instruction may be a speech control instruction of a task type, and the speech instruction recognition result may include a target intent and a target sub-intent. For example, in “Play the movie Ne Zha on the television” or “Play a song of Beatles on the sound box”, “on the television” or “on the sound box” corresponds to the target intent, and “Play the movie Ne Zha” or “Play a song of Beatles” may be correspondingly recognized as the target sub-intent.
It should be noted that, in a state in which both the server and the first terminal are connected to a network, the first terminal and the server may implement networking communication through mutual confirmation of addresses and interfaces, or may communicate with each other through a gateway or a route. Information transmission between the server and the first terminal conforms to a data transmission protocol, for example, the HTTP protocol.
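As a minimal sketch of step S301 from the first terminal's side, the upload might look like the following; the endpoint, payload fields, and reply shape are assumptions rather than an interface defined by this application:

```python
import json
import urllib.request


def send_recognition_result(server_url: str, text: str, device_id: str) -> dict:
    """Upload the speech instruction recognition result (text) to the dialog
    management server over HTTP and return its reply (operation information)."""
    payload = json.dumps({"deviceId": device_id, "utterance": text}).encode("utf-8")
    request = urllib.request.Request(
        server_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# e.g. send_recognition_result("http://dm.example.com/v1/dialog",
#                              "Play the movie Ne Zha on the television",
#                              device_id="phone-001")
```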
Step S302: Perform semantic processing on the speech instruction recognition result, to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction.
In this embodiment, as a dialog management system in a speech interaction process, the server may perform semantic recognition on the speech instruction recognition result through natural language understanding, to obtain a semantic representation that can be recognized by a machine. The server obtains the target intent and the target sub-intent in the speech instruction recognition result based on the semantic representation, and performs parallel processing in a plurality of phases to obtain the operation information for replying to the first terminal, so as to respond to the speech instruction recognition result.
The operation information may be an execution result of implementing the target intent in the speech instruction recognition result by the server, that is, response logic, for example, service logic invoked based on the speech instruction recognition result; or may be a request that requires the client to input more information to implement the target intent.
For example, when the server receives “Play the movie Ne Zha on the television” sent by the mobile phone, the server performs processes such as skill discovery, pre-run, and selection based on a specified mobile phone context, and determines a “switch” skill. Based on semantic recognition, it may be determined that the target intent is “switch”, and the target sub-intent is “Play the movie Ne Zha”. Based on the semantic recognition, if a target device television needs to be switched to, whether the television supports “Play the movie Ne Zha” is pre-verified, so as to avoid a case in which the television replies “not supported” or “cannot understand” after the switching is performed. Simulated (analog) television context information is set on the server, and includes a domain and a target object in a current dialog, and slot information, a sequence, and a pronoun mentioned in a previous dialog. Based on the simulated television context information, the utterance “Play the movie Ne Zha” is pre-verified, that is, a processing procedure of skill discovery, pre-run, and skill selection and determining is performed. If a playing skill can be determined, it indicates that the television supports the target sub-intent. In this case, the server generates corresponding operation information based on a “switch” action to be performed by the mobile phone and a pre-run result of the pre-verification process, performs session connection, and replies to the mobile phone.
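One way to picture the pre-verification described above is the toy sketch below; the skill registry, function names, and parsing are invented for illustration and greatly simplify the multi-phase processing of a real dialog manager:

```python
# Toy registry of skills available in the simulated television context.
# Each entry pre-runs the utterance and returns a result dict if the skill
# can handle it, or None otherwise.
TV_SKILLS = {
    "play_movie": lambda u: ({"skill": "play a movie",
                              "name": u[len("play the movie "):].strip()}
                             if u.startswith("play the movie ") else None),
}


def pre_verify(sub_utterance):
    """Skill discovery, pre-run, and selection against the simulated context.
    Returns (supported, pre_run_result)."""
    utterance = sub_utterance.lower()
    candidates = [run(utterance) for run in TV_SKILLS.values()]  # pre-run all
    matches = [r for r in candidates if r is not None]           # discovery
    if not matches:
        return False, None      # the television does not support the sub-intent
    return True, matches[0]     # selection: keep the best (here, first) match


supported, pre_run_result = pre_verify("Play the movie Ne Zha")
# supported == True, pre_run_result == {"skill": "play a movie", "name": "ne zha"}
```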
Specifically, when a cross-device control “switch” action is determined based on the mobile phone context information, the operation information may be divided into an operation instruction that needs to be executed by the mobile phone currently and an operation instruction that needs to be executed by the target device currently, that is, the operation information for replying to the mobile phone is divided into the first semantic instruction and the second semantic instruction. The first semantic instruction corresponds to reply logic responding to the current mobile phone, and corresponds to the target intent in the speech instruction recognition result. The second semantic instruction is logic that needs to be executed by the target device, and corresponds to the target sub-intent in the speech instruction recognition result.
It should be noted that in a process of recognizing a task, executing the task, and replying with a result based on the speech instruction recognition result, the dialog management server may further set a plurality of slots to perform a plurality of rounds of speech interaction with the client, to clarify the target intent or the target sub-intent. For example, after receiving an utterance “Play on the television” sent by the mobile phone, the server may return a question “What to play”, and then receive “the movie Ne Zha”. Through a plurality of rounds of dialogs, a task of a target utterance is clarified, so that the dialog system can accurately reply or respond.
In a possible implementation, the performing semantic processing on the speech instruction recognition result, to obtain operation information includes:
3.1: Recognize the speech instruction recognition result, to obtain the target intent and the target sub-intent of the speech instruction recognition result.
3.2: Pre-verify the target sub-intent based on the target intent, to obtain the response logic of the target intent and a pre-run result of the target sub-intent.
3.3: Use the response logic as the first semantic instruction of the operation information, and use the target sub-intent and the pre-run result as the second semantic instruction of the operation information.
In this embodiment, the server performs semantic processing on the speech instruction recognition result, and recognizes semantic information in the text information of the speech instruction recognition result, to obtain the target intent and the target sub-intent of the speech instruction recognition result. The target intent may be an operation that needs to be performed by the first terminal and that is determined based on the speech instruction recognition result, and the target sub-intent may be an operation that needs to be performed to control the target device across devices and that is determined based on the speech instruction recognition result. The server determines the target intent of the speech instruction recognition result based on the mobile phone context, for example, determines a “switch” intent. The server performs pre-verification and pre-run on the target sub-intent, to determine whether the target terminal supports execution of the target sub-intent. Through this execution process, the response logic {skill=switch, target=television, utterance=play the movie Ne Zha} of the target intent, and a verification result and the pre-run result of the target sub-intent are determined. The verification result is used to indicate whether the target terminal supports execution of the target sub-intent, and the pre-run result is used to indicate a processing result obtained by performing a simulated run of the target sub-intent.
Specifically, the response logic and the pre-run result may include a skill identifier, an intent identifier, and slot information. The skill identifier determines a skill. The skill is a set of capabilities and can support a plurality of intents. For example, a weather skill supports an intent of querying weather and PM2.5. The intent identifier determines a unique intent in the skill. The slot information is a list of parameters required for intent execution. There may be any quantity of parameters in the slot information, for example, there may be zero or a plurality of parameters. The slot information includes a slot name, a slot type, and a slot value. The slot name determines a parameter name of the slot, and the slot type determines a type of the slot parameter, such as a date, a number, or a character string. The slot value is a parameter value.
For example, the server uses the response logic and the pre-run result as a result reply, uses the response logic as the first semantic instruction of the operation information, and uses the utterance corresponding to the target sub-intent and the pre-run result as the second semantic instruction of the operation information.
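Concretely, such a result reply might serialize as follows; the JSON field names are illustrative only:

```python
import json

# Illustrative serialization of the result reply: response logic as the
# first semantic instruction, and the sub-intent utterance plus pre-run
# result (skill/intent/slot structure) as the second semantic instruction.
reply = {
    "firstInstruction": {"skill": "switch", "target": "television",
                         "utterance": "play the movie Ne Zha"},
    "secondInstruction": {
        "utterance": "play the movie Ne Zha",
        "preRunResult": {
            "skillId": "play_movie",
            "intentId": "play_movie_by_name",
            "slots": [{"slotName": "movieName", "slotType": "string",
                       "slotValue": "Ne Zha"}],
        },
    },
}
print(json.dumps(reply, indent=2, ensure_ascii=False))
```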
Step S303: Send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal.
In this embodiment, in a wired or wireless manner, the server uses the first semantic instruction and the second semantic instruction as the result reply, and sends both the first semantic instruction and the second semantic instruction to the first terminal.
Specifically, the first semantic instruction includes the response logic for replying to the first terminal. For example, in the foregoing scenario, the response logic corresponding to the first terminal may be {skill=switch, target=television, utterance=play the movie Ne Zha}. The second semantic instruction includes the utterance corresponding to the target sub-intent and the pre-run result of the target sub-intent in the speech instruction recognition result. For example, the pre-run result may be {skill=play a movie, name=Ne Zha}. The first terminal executes the first semantic instruction, and sends the second semantic instruction to the second terminal. The second terminal recognizes the second semantic instruction, and may further recognize the pre-run result of the target sub-intent while recognizing the target sub-intent from the second semantic instruction. The server does not need to perform the processing procedure for skill discovery, pre-run, and selection on the utterance of the target sub-intent.
Alternatively, in another possible implementation, the server may send the first semantic instruction to the first terminal in a wired or wireless manner, and directly send the second semantic instruction to the second terminal (namely, the target terminal) in a wired or wireless manner. The first terminal executes the switching skill, and determines to switch to the second terminal (the target terminal). The second terminal (the target terminal) directly obtains the second semantic instruction sent by the server. The second semantic instruction includes the pre-run result of the target sub-intent. The second terminal may recognize the pre-run result in the second semantic instruction, directly send an execution command to the server based on the pre-run result, and invoke an execution interface of the server. The server invokes, according to the execution command, service logic corresponding to the target sub-intent, so that a processing process in which the server performs skill discovery, pre-run, and selection again on the utterance of the target sub-intent in the second semantic instruction is omitted, thereby improving a response speed of the dialog system.
It should be noted that, in a state in which the server, the first terminal, and the second terminal are all connected to a network, the server and the first terminal, the server and the second terminal, and the first terminal and the second terminal may implement networking communication through mutual confirmation of addresses and interfaces, or may communicate with each other through a gateway or a route. Therefore, the pre-run result in the second semantic instruction may be used as an intermediate result and transmitted by the first terminal to the second terminal, or may be directly sent by the server to the second terminal to invoke the second terminal.
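From the first terminal's side, the forwarding path can be sketched as follows; the transport stub and key names are assumptions:

```python
def send_to_device(target, payload):
    # Stub transport: a real implementation would use the networking
    # described above (direct connection, or via a gateway or route).
    print(f"-> {target}: {payload}")


def handle_operation_info(operation_info):
    """First-terminal sketch for step S303: execute the first semantic
    instruction and forward the second semantic instruction to the target."""
    first = operation_info["first_instruction"]
    second = operation_info["second_instruction"]
    if first.get("skill") == "switch":
        print(f"Switching to {first['target']} ...")  # local response to the user
        send_to_device(first["target"], second)       # hand over the sub-intent
```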
In a possible implementation, the sending the first semantic instruction and the second semantic instruction to the first terminal includes:
sending the first semantic instruction and the second semantic instruction to the first terminal in a semantic representation form.
In this embodiment, the semantic representation form is a machine-readable language representation manner, and the server returns, to the first terminal or the second terminal in the semantic representation form, the result obtained after semantic processing of the speech instruction recognition result.
Correspondingly, the server may further send the first semantic instruction to the first terminal in the semantic representation form, for example, {skill=switch, target=television, utterance=play the movie Ne Zha}. The server may further send the pre-run result in the second semantic instruction to the second terminal in the semantic representation form, for example, {skill=play a movie, name=Ne Zha}.
Step S304: Receive the execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and send, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
In this embodiment, the second semantic instruction includes the target sub-intent and the pre-run result obtained by pre-verifying the target sub-intent. After receiving the second semantic instruction, the second terminal obtains the pre-run result by recognizing the second semantic instruction. The second terminal directly invokes the execution interface of the server based on the pre-run result, and sends the execution command to the server. The server receives the execution command sent by the second terminal, interconnects, according to the execution command, to the service logic corresponding to the second semantic instruction, and sends the service logic to the second terminal device. For example, movie data in the server is invoked, and the movie data is sent to the second terminal as response logic, where the response logic may be {skill=play a movie, name=Ne Zha}. The second terminal executes the corresponding service logic, that is, plays the movie Ne Zha.
In a possible implementation, the sending, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal includes:
3.4: Parse the pre-run result according to the execution command.
3.5: Invoke the service logic based on the parsed pre-run result, and send the service logic to the second terminal in the semantic representation form.
In this embodiment, the server receives the execution command sent by the second terminal, parses the pre-run result of the target sub-intent, invokes, based on the parsed result, the service logic corresponding to the target sub-intent, and sends the service logic to the second terminal in the semantic representation form. For example, the server returns {skill=play a movie, name=Ne Zha} to the second terminal.
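A server-side sketch of this shortened path is given below; the service-platform mapping and function names are hypothetical:

```python
# Hypothetical mapping from skill identifier to service-platform logic.
SERVICE_PLATFORMS = {
    "play a movie": lambda pre_run: {"skill": "play a movie",
                                     "name": pre_run["name"]},
}


def handle_execution_command(execution_command):
    """Server sketch for step S304: parse the pre-run result carried in the
    execution command and invoke the corresponding service logic directly,
    skipping a second round of skill discovery, pre-run, and selection."""
    pre_run = execution_command["pre_run_result"]   # parse the pre-run result
    invoke = SERVICE_PLATFORMS[pre_run["skill"]]    # interconnect to the platform
    return invoke(pre_run)                          # service logic, semantic form


# handle_execution_command({"pre_run_result": {"skill": "play a movie",
#                                              "name": "Ne Zha"}})
# returns {"skill": "play a movie", "name": "Ne Zha"}
```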
It should be noted that a dialog management server corresponding to the first terminal and a dialog management server corresponding to the second terminal may be a same server, or two servers having a same function.
According to the speech control method provided in this application, the server is used as the execution body. The server receives the speech instruction recognition result sent by the first terminal, performs semantic processing on the speech instruction recognition result to obtain the to-be-executed operation information in the speech instruction recognition result, and sends the operation information to the first terminal. The first terminal executes the first semantic instruction in the operation information, and sends the second semantic instruction in the operation information to the second terminal. After the second terminal recognizes the second semantic instruction, the server may directly receive the execution command fed back by the second terminal, invoke, according to the execution command, the service logic corresponding to the second semantic instruction, and send the service logic to the second terminal. In this embodiment, after the second terminal receives the second semantic instruction, the server may directly receive the execution command that is fed back by the second terminal based on task information included in the second semantic instruction, and does not need to perform semantic processing again on the second semantic instruction received by the second terminal. The corresponding service logic may be invoked according to the execution command that is fed back, and be sent to the second terminal through the execution interface. In this way, the processing procedure for the second semantic instruction is omitted, a dialog delay is reduced, and a response time of the dialog system is shortened.
Step S401: Receive a speech instruction entered by a user, and perform speech recognition on the speech instruction to obtain a speech instruction recognition result.
Step S402: Send the speech instruction recognition result to a server.
Step S403: Receive operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, where the operation information includes a first semantic instruction and a second semantic instruction.
Step S404: Execute the first semantic instruction, and send the second semantic instruction to a second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and receive service logic that is fed back by the server and that is corresponding to the second semantic instruction.
In some embodiments of this application, a speech assistant may be disposed in the first terminal. The speech assistant receives, by using a microphone, the speech instruction entered by the user, and performs speech recognition (ASR) on the speech instruction, to obtain the speech instruction recognition result, that is, text information corresponding to the speech instruction. The speech assistant sends the speech instruction recognition result to the server in a wired or wireless manner, and receives the operation information fed back by the server. The operation information may include the first semantic instruction corresponding to the first terminal and the second semantic instruction corresponding to the second terminal. The first terminal executes the first semantic instruction in the operation information, invokes and switches to the second terminal, and sends the second semantic instruction to the second terminal at the same time. The second semantic instruction may include a pre-run result of a target sub-intent in the speech instruction recognition result. The second terminal may recognize the pre-run result in the second semantic instruction, directly send the execution command to the server based on the pre-run result, and invoke an execution interface of the server. The server connects, according to the execution command, to service logic corresponding to the target sub-intent, and feeds back the service logic to the second terminal, so that the second terminal completes the service logic. In this way, a repeated processing process performed by the server on an utterance of the target sub-intent is omitted, thereby improving a response speed of a target device, shortening a response time of a dialog system, and reducing a delay of human-machine speech interaction.
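Steps S401 to S404 can be condensed into the following first-terminal sketch; the ASR and server stubs stand in for real components and return canned values:

```python
def asr(audio):
    # Placeholder for on-device speech recognition (S401).
    return "Play the movie Ne Zha on the television"


def call_server(text):
    # Placeholder for the server round trip (S402 + S403); in practice this
    # is an HTTP exchange like the one sketched earlier.
    return {"first_instruction": {"skill": "switch", "target": "television",
                                  "utterance": "play the movie Ne Zha"},
            "second_instruction": {"utterance": "play the movie Ne Zha",
                                   "pre_run_result": {"skill": "play a movie",
                                                      "name": "Ne Zha"}}}


def first_terminal_flow(audio):
    """Steps S401-S404 from the first terminal's point of view."""
    text = asr(audio)                                         # S401
    operation_info = call_server(text)                        # S402 + S403
    print("Executing:", operation_info["first_instruction"])  # S404: switch
    return operation_info["second_instruction"]               # S404: forward
```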
In a possible implementation, the receiving operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result includes:
receiving response logic fed back by the server for a target intent in the speech instruction recognition result, and receiving the pre-run result fed back by the server for the target sub-intent in the speech instruction recognition result.
In a possible implementation, the first semantic instruction is response logic fed back by the server for the target intent in the speech instruction recognition result, and the second semantic instruction includes the target sub-intent in the speech instruction recognition result and the pre-run result fed back by the server for the target sub-intent.
Correspondingly, the executing the first semantic instruction, and sending the second semantic instruction to a second terminal includes:
executing the response logic fed back by the server, and sending, to the second terminal, the target sub-intent and the pre-run result that are fed back by the server.
According to this embodiment of this application, when obtaining the response logic fed back by the server based on a first terminal context, the first terminal also obtains the pre-run result of the target sub-intent in the speech instruction recognition result, and sends the pre-run result to the second terminal when invoking the second terminal. In this way, the second terminal can directly obtain the pre-run result of the target sub-intent, and the server does not need to perform a series of semantic processing on the utterance of the target sub-intent, thereby optimizing a data processing procedure of the dialog system and improving a response speed of the dialog system.
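Purely as an illustration of this implementation, the operation information could be serialized along the following lines; the field names are assumptions and are not the application's actual semantic representation form.

```python
import json

# Hypothetical wire format for the operation information: the response
# logic answers the target intent (switch to the television), while the
# pre-run result lets the target device skip semantic processing entirely.
operation_info = {
    "first_semantic_instruction": {     # response logic for the target intent
        "skill": "switch",
        "target": "television",
    },
    "second_semantic_instruction": {    # target sub-intent plus pre-run result
        "utterance": "Play the movie Ne Zha",
        "verification": "support",
        "pre_run_result": "Object",
    },
}
print(json.dumps(operation_info, indent=2))
```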
Step S501: Receive a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction, where the first semantic instruction and the second semantic instruction are operation information that is fed back by a server based on a speech instruction recognition result and that is received by the first terminal after the first terminal sends the speech instruction recognition result to the server.
Step S502: Recognize the second semantic instruction, to obtain a recognition result of the second semantic instruction.
Step S503: Send an execution command to the server based on the recognition result.
Step S504: Receive service logic that is fed back by the server based on the execution command and that corresponds to the second semantic instruction, and execute the service logic.
In some embodiments of this application, after receiving the second semantic instruction that the server feeds back by using the first terminal, the second terminal recognizes the second semantic instruction to obtain a pre-run result of a target sub-intent in the speech instruction recognition result. Based on the pre-run result, semantic recognition processing does not need to be performed on an utterance of the target sub-intent; instead, an execution command is directly sent to the server to invoke an execution interface of the server, so that the server connects to a corresponding service platform based on the pre-run result and invokes corresponding service logic. The second terminal receives the service logic fed back by the server, and executes the service logic.
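The corresponding second-terminal side of steps S501 to S504 might be sketched as follows, reusing the hypothetical `DialogServer` from above.

```python
# Sketch of the second-terminal (e.g. television) side; names hypothetical.
class SecondTerminal:
    def __init__(self, server):
        self.server = server

    def receive(self, instruction: dict) -> None:
        # Steps S501/S502: the pre-run result arrives already attached, so
        # the utterance is never sent back for semantic processing.
        pre_run = instruction["pre_run_result"]
        # Step S503: invoke the server's execution interface directly.
        response = self.server.handle_execution_command(
            {"pre_run_result": pre_run})
        # Step S504: execute the service logic fed back by the server.
        self.execute(response["service_logic"])

    def execute(self, service_logic: str) -> None:
        print(f"running service logic: {service_logic}")
```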
In a possible implementation, the operation information includes response logic fed back by the server for a target intent in the speech instruction recognition result, and the pre-run result fed back by the server for the target sub-intent in the speech instruction recognition result.
Correspondingly, the receiving a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction includes:
receiving the target sub-intent and the pre-run result that are sent by the first terminal when the first terminal executes the response logic.
In a possible implementation, the second semantic instruction includes the pre-run result obtained by the server by pre-verifying the target sub-intent in the speech instruction recognition result.
Correspondingly, the recognizing the second semantic instruction, to obtain a recognition result of the second semantic instruction includes:
recognizing the second semantic instruction, to obtain the pre-run result of the target sub-intent.
According to this embodiment of this application, when receiving the pre-run result of the target sub-intent in the speech instruction recognition result, the second terminal may directly invoke the execution interface of the server based on the pre-run result, and does not need to perform semantic recognition processing on the utterance of the target sub-intent. After receiving the execution command of the second terminal, the server connects to the service platform corresponding to the target sub-intent, invokes the corresponding service logic, and feeds back the service logic to the second terminal, so that the second terminal executes the service logic. In this way, a repeated semantic processing procedure on the utterance corresponding to the target sub-intent in the speech instruction recognition result is omitted, and a response speed of a dialog system is improved.
1: A first terminal receives a speech instruction entered by a user, and performs speech recognition on the speech instruction to obtain a speech instruction recognition result.
2: The first terminal sends the speech instruction recognition result to a server.
3: The server performs semantic processing on the speech instruction recognition result, to obtain operation information.
4: The server sends the operation information to the first terminal, where the operation information includes a first semantic instruction and a second semantic instruction.
5: The first terminal executes the first semantic instruction.
6: The first terminal sends the second semantic instruction to a second terminal.
7: The second terminal recognizes the second semantic instruction.
8: The second terminal sends an execution command to the server, and invokes an execution interface of the server.
9: The server invokes, according to the execution command, service logic corresponding to the second semantic instruction.
10: The server sends the service logic to the second terminal.
11: The second terminal executes the service logic.
An execution principle of steps in this embodiment is the same as that in the foregoing embodiment, and details are not described again.
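Tying the hypothetical sketches above together, steps 1 through 11 can be exercised end to end in a toy run such as this:

```python
# Toy end-to-end run of steps 1-11, wiring up the hypothetical classes above.
server = DialogServer()
television = SecondTerminal(server)
phone = FirstTerminal(server, second_terminal=television)

phone.on_speech(audio=b"...")  # steps 1-11 unfold from this single call
# Illustrative output:
#   executing switch -> television
#   running service logic: play(Object)
```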
As shown in the figure, the mobile phone receives a speech instruction "Play the movie Ne Zha on the television" entered by a user, performs speech recognition on the speech instruction to obtain text information of the speech instruction, and sends the text information to the dialog management server in a wired or wireless manner. The dialog management server performs semantic recognition on "Play the movie Ne Zha on the television" based on a mobile phone context, and determines, through skill discovery, pre-run, and selection of an optimal skill "switch", that the target is "television" and the utterance is "Play the movie Ne Zha". When the switching intent is determined, whether the television supports the playing needs to be pre-verified. After skill discovery, pre-run, and selection are performed based on a simulated television context, a verification result of "support" and a pre-run result of "target object Object" are obtained. The skill "switch", the determined target "television", and the utterance "Play the movie Ne Zha" are fed back to the mobile phone as response logic. After receiving the response logic, the mobile phone executes the switching instruction, and sends "Play the movie Ne Zha" and the pre-run result "Object" to the television. The television recognizes the pre-run result "Object", directly sends an execution command to the dialog management server, and invokes the execution interface of the dialog management server. The dialog management server connects to the service logic corresponding to "Play the movie Ne Zha", and feeds back the service logic to the television. The television performs the operation of playing the movie Ne Zha based on the fed-back service logic.
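A rough sketch of how the skill discovery, pre-run, and selection steps in this example could be strung together; every function name, skill name, and scoring rule below is invented for illustration.

```python
# Hypothetical skill-selection pipeline for the utterance
# "Play the movie Ne Zha on the television".
def discover_skills(utterance: str) -> list[str]:
    # Candidate skills whose trigger patterns match the utterance.
    if "on the television" in utterance:
        return ["switch", "local_play"]
    return ["local_play"]

def pre_run(skill: str, utterance: str, device_context: str) -> dict:
    # Dry-run the skill against a simulated target-device context to
    # pre-verify feasibility ("does the television support the playing?").
    if skill == "switch" and device_context == "television":
        return {"verified": "support", "pre_run_result": "Object",
                "target": "television",
                "sub_utterance": "Play the movie Ne Zha"}
    return {"verified": "unsupported"}

utterance = "Play the movie Ne Zha on the television"
candidates = discover_skills(utterance)
results = {skill: pre_run(skill, utterance, "television") for skill in candidates}
# Select the optimal skill: here, the candidate whose pre-run verified support.
best = max(results, key=lambda skill: results[skill]["verified"] == "support")
print(best, results[best])  # -> switch {'verified': 'support', ...}
```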
In a possible implementation,
In another possible implementation, after the server side performs processing based on the mobile phone context, the response logic corresponding to the target intent and the pre-run result of the target sub-intent are obtained. The server may directly invoke the television, and at the same time send the utterance "Play the movie Ne Zha" of the target sub-intent and the pre-run result to the television. The television recognizes the utterance corresponding to the target sub-intent and the pre-run result, directly invokes the execution interface of the dialog management server based on the pre-run result, and sends the execution command to the dialog management server. The dialog management server connects to the service logic corresponding to "Play the movie Ne Zha" and feeds back the service logic to the television. The television performs the operation of playing the movie Ne Zha based on the fed-back service logic.
As shown in
According to this embodiment of this application, the first half of the processing procedure for the target device is eliminated, and a response delay of a dialog system is significantly shortened (in actual application, the delay may be shortened by more than 50%), so that dialog experience is improved.
Corresponding to the speech control method described in the foregoing embodiments and the embodiments of the application scenario,
Referring to
The first receiving module 101 is configured to receive a speech instruction recognition result sent by a first terminal.
The semantic processing module 102 is configured to perform semantic processing on the speech instruction recognition result, to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction.
The first sending module 103 is configured to send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal.
The command execution module 104 is configured to: receive an execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and send, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
In a possible implementation, the semantic processing module includes:
a semantic recognition submodule, configured to recognize the speech instruction recognition result, to obtain a target intent and a target sub-intent of the speech instruction recognition result; and
a task execution submodule, configured to: pre-verify the target sub-intent based on the target intent, to obtain response logic of the target intent and a pre-run result of the target sub-intent; and use the response logic as the first semantic instruction of the operation information, and use the target sub-intent and the pre-run result as the second semantic instruction of the operation information.
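A minimal sketch of how such a semantic recognition submodule might split an utterance into a target intent and a target sub-intent; the regular expression and dictionary layout are assumptions for illustration only.

```python
import re

# Hypothetical decomposition: "<sub-intent utterance> on the <target>" is
# split into a target intent (switch to <target>) and a target sub-intent
# (the utterance the target device should fulfil).
def split_intents(text: str) -> tuple[dict, dict]:
    match = re.match(r"(?P<sub>.+) on the (?P<target>\w+)$", text)
    if not match:
        return {"intent": "local", "utterance": text}, {}
    target_intent = {"intent": "switch", "target": match.group("target")}
    target_sub_intent = {"utterance": match.group("sub")}
    return target_intent, target_sub_intent

target_intent, target_sub_intent = split_intents(
    "Play the movie Ne Zha on the television")
print(target_intent)      # {'intent': 'switch', 'target': 'television'}
print(target_sub_intent)  # {'utterance': 'Play the movie Ne Zha'}
```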
In a possible implementation, the first sending module is further configured to send the first semantic instruction and the second semantic instruction to the first terminal in a semantic representation form.
In a possible implementation, the first sending module includes:
a first submodule, configured to parse the pre-run result according to the execution command; and
a second submodule, configured to invoke the service logic based on the parsed pre-run result, and send the service logic to the second terminal in the semantic representation form.
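For the two submodules just described, a sketch (with invented names) of parsing the pre-run result from an execution command and dispatching the corresponding service logic:

```python
# Hypothetical command-execution path on the server: the execution command
# carries the pre-run result, which maps directly onto service logic.
SERVICE_REGISTRY = {
    # pre-run result -> callable producing the service logic payload
    "Object": lambda: {"action": "play", "content": "Ne Zha"},
}

def on_execution_command(command: dict) -> dict:
    pre_run_result = command["pre_run_result"]          # first submodule: parse
    service_logic = SERVICE_REGISTRY[pre_run_result]()  # second submodule: invoke
    return {"service_logic": service_logic}             # sent in semantic form

print(on_execution_command({"pre_run_result": "Object"}))
```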
Corresponding to the speech control method described in the foregoing embodiments and the embodiments of the application scenario,
Referring to
The speech recognition module 111 is configured to: receive a speech instruction entered by a user, and perform speech recognition on the speech instruction to obtain a speech instruction recognition result.
The second sending module 112 is configured to send the speech instruction recognition result to a server.
The second receiving module 113 is configured to receive operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, where the operation information includes a first semantic instruction and a second semantic instruction.
The instruction execution module 114 is configured to: execute the first semantic instruction; and send the second semantic instruction to a second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and receive service logic that is fed back by the server and that corresponds to the second semantic instruction.
In a possible implementation, the second receiving module is further configured to receive response logic fed back by the server for a target intent in the speech instruction recognition result, and receive a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result.
In a possible implementation, the first semantic instruction is the response logic fed back by the server for the target intent in the speech instruction recognition result, and the second semantic instruction includes the target sub-intent in the speech instruction recognition result and the pre-run result fed back by the server for the target sub-intent. The instruction execution module is further configured to execute the response logic fed back by the server, and send, to the second terminal, the target sub-intent and the pre-run result that are fed back by the server.
Corresponding to the speech control method described in the foregoing embodiments and the embodiments of the application scenario,
Referring to
The third receiving module 121 is configured to receive a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction, where the first semantic instruction and the second semantic instruction are operation information that is fed back by a server based on a speech instruction recognition result and that is received by the first terminal after the first terminal sends the speech instruction recognition result to the server.
The instruction recognition module 122 is configured to recognize the second semantic instruction, to obtain a recognition result of the second semantic instruction.
The third sending module 123 is configured to send an execution command to the server based on the recognition result.
The service execution module 124 is configured to: receive service logic that is fed back by the server based on the execution command and that corresponds to the second semantic instruction, and execute the service logic.
In a possible implementation, the operation information includes response logic fed back by the server for a target intent in the speech instruction recognition result, and a pre-run result fed back by the server for a target sub-intent in the speech instruction recognition result. The third receiving module is further configured to receive the target sub-intent and the pre-run result that are sent by the first terminal when the first terminal executes the response logic.
In a possible implementation, the second semantic instruction includes a pre-run result obtained by the server by pre-verifying a target sub-intent in the speech instruction recognition result. The instruction recognition module is further configured to recognize the second semantic instruction, to obtain the pre-run result of the target sub-intent.
In a possible implementation, the third sending module is further configured to send the execution command corresponding to the pre-run result to the server based on the recognition result.
According to this embodiment, a speech control method is used. The speech instruction recognition result sent by the first terminal is received, semantic processing is performed on the speech instruction recognition result to obtain the to-be-executed operation information, and the operation information is sent to the first terminal. The first terminal executes the first semantic instruction in the operation information, and sends the second semantic instruction in the operation information to the second terminal. After the second terminal recognizes the second semantic instruction, the server may directly receive the execution command that the second terminal feeds back based on the task information included in the second semantic instruction, without performing semantic processing again on the second semantic instruction received by the second terminal. The corresponding service logic may be invoked according to the execution command that is fed back, and sent to the second terminal through the execution interface. In this way, the processing procedure for the second semantic instruction is omitted, a dialog delay is shortened, and a response time of the dialog system is reduced.
It may be clearly understood by persons skilled in the art that, for the purpose of convenient and brief description, division of the foregoing function units and modules is used as an example for illustration. In actual application, the foregoing functions can be allocated to different function units and modules and implemented based on a requirement, that is, an inner structure of the apparatus is divided into different function units and modules to implement all or some of the functions described above. Function units and modules in the embodiments may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in a form of hardware, or may be implemented in a form of a software function unit. In addition, specific names of the function units and modules are merely for ease of distinguishing between the function units and modules, but are not intended to limit the protection scope of this application. For a specific working process of the units and modules in the foregoing system, refer to a corresponding process in the foregoing method embodiments. Details are not repeatedly described herein.
When a processor executes a computer program stored in a memory of the server, the server is enabled to: receive a speech instruction recognition result sent by a first terminal; perform semantic processing on the speech instruction recognition result, to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction; send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal; and receive an execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and send, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
When a processor executes a computer program stored in a memory of the first terminal, the first terminal is enabled to: receive a speech instruction entered by a user, and perform speech recognition on the speech instruction to obtain a speech instruction recognition result; send the speech instruction recognition result to a server; receive operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, where the operation information includes a first semantic instruction and a second semantic instruction; and execute the first semantic instruction, and send the second semantic instruction to a second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and receive service logic that is fed back by the server and that corresponds to the second semantic instruction.
When a processor executes a computer program stored in a memory of the second terminal, the second terminal is enabled to: receive a second semantic instruction sent by a first terminal when the first terminal executes a first semantic instruction, where the first semantic instruction and the second semantic instruction are operation information that is fed back by a server based on a speech instruction recognition result and that is received by the first terminal after the first terminal sends the speech instruction recognition result to the server; recognize the second semantic instruction, to obtain a recognition result of the second semantic instruction; send an execution command to the server based on the recognition result; and receive service logic that is fed back by the server based on the execution command and that corresponds to the second semantic instruction, and execute the service logic.
The server 13 may be a device such as a cloud server or a local physical server. The terminal device 14 and the terminal device 15 may be devices such as desktop computers, laptops, palmtop computers, mobile phones, televisions, and sound boxes. The server 13, the terminal device 14, and the terminal device 15 may include, but are not limited to, a processor and a memory. Persons skilled in the art may understand that
The processor may be a central processing unit (Central Processing Unit, CPU). The processor may alternatively be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In some embodiments, the memory may be an internal storage unit of the server 13, the terminal device 14, or the terminal device 15, for example, a hard disk or a memory. In some other embodiments, the memory may alternatively be an external storage device of the server 13, the terminal device 14, or the terminal device 15, for example, a pluggable hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card). Further, the memory may include both the internal storage unit and the external storage device of the server 13, the terminal device 14, or the terminal device 15. The memory is configured to store an operating system, an application, a bootloader (BootLoader), data, and another program, for example, program code of the computer program. The memory may be further configured to temporarily store data that has been output or is to be output.
According to an embodiment of this application, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program, the computer program includes instructions, and when the instructions are run on a terminal device, the terminal device is enabled to perform the speech control method.
According to an embodiment of this application, a computer program product including instructions is provided. When the computer program product is run on a terminal device, the terminal device is enabled to perform the speech control method according to any one of the possible implementations of the first aspect.
When the integrated unit is implemented in the form of a software function unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on such an understanding, all or some of the processes of the methods in the embodiments of this application may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by the processor, the steps of the foregoing method embodiments may be implemented. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file form, some intermediate forms, or the like. The computer-readable medium may include at least any entity or apparatus that can carry the computer program code to a photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, the computer-readable medium cannot be the electrical carrier signal or the telecommunications signal according to legislation and patent practice.
In the foregoing embodiments, the description of each embodiment has respective focuses. For a part that is not described in detail or recorded in an embodiment, refer to related descriptions in other embodiments.
Persons of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. Persons skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the described apparatus/network device embodiment is merely an example. For example, the module or unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communications connections may be implemented through some interfaces. The indirect couplings or communications connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.
The foregoing embodiments are merely intended for describing the technical solutions of this application, but not for limiting this application. Although this application is described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacements to some technical features thereof, without departing from the spirit and scope of the technical solutions of the embodiments of this application, and these modifications and replacements shall fall within the protection scope of this application.
Number | Date | Country | Kind
---|---|---|---
201911417229.4 | Dec 2019 | CN | national
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CN2020/125215 | 10/30/2020 | WO |