This disclosure relates to the artificial intelligence field, and in particular, to a data processing method and a related device.
Artificial intelligence (AI) is a theory, a method, a technology, and an application system that simulates, extends, and expands human intelligence by using a digital computer or a machine controlled by a digital computer, to perceive an environment, obtain knowledge, and obtain an optimal result by using the knowledge. In other words, artificial intelligence is a branch of computer science that is intended to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have perception, inference, and decision-making functions.
A language model is a model that can predict an unknown word in a sentence based on a given semantic segment. For example, given the natural language sequence segment “Huawei are very good”, in which words are missing, the language model may generate the missing words “mobile phones” based on the given segment, to obtain the sentence “Huawei mobile phones are very good”.
In an existing natural language generation model (with reference to
According to a first aspect, this disclosure provides a data processing method. The method includes:
The target data is data in which some data is missing. The target data includes non-missing data (referred to as a known data unit in this embodiment of this disclosure) and missing data (referred to as a to-be-predicted data unit in this embodiment of this disclosure, for example, a first to-be-predicted data unit and a second to-be-predicted data unit). The known data unit is a data unit in the non-missing data. For example, the target data may be text data. In this case, the known data unit in the target data may be a known word or a known letter in the text data, and the to-be-predicted data unit may be a to-be-predicted word or a to-be-predicted letter in the text data. Alternatively, the target data may be speech data. In this case, the known data unit in the target data may be a known audio sequence in the speech data, and the to-be-predicted data unit may be a to-be-predicted audio sequence in the speech data. Alternatively, the target data may be image data. In this case, the known data unit in the target data may be a known sample in the image data, and the to-be-predicted data unit may be a to-be-predicted sample in the image data. It should be understood that data granularities of the known data unit and the to-be-predicted data unit are related to a type of the target data. Each of the known data unit and the to-be-predicted data unit may be a minimum data unit in the target data, or may be a group of a plurality of minimum data units. The granularities of the known data unit and the to-be-predicted data unit are not limited herein.
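The split between known data units and to-be-predicted data units can be illustrated with text data. The following is a minimal sketch, not the disclosed implementation: the `MISSING` marker, the function name `split_target_data`, and the word-level granularity are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical convention: None marks a missing (to-be-predicted) data unit.
MISSING = None


def split_target_data(
    units: List[Optional[str]],
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Separate target data into known units and to-be-predicted positions.

    Returns (known, to_predict): `known` holds (position, unit) pairs and
    `to_predict` holds the positions whose units are missing.
    """
    known = [(pos, unit) for pos, unit in enumerate(units) if unit is not MISSING]
    to_predict = [pos for pos, unit in enumerate(units) if unit is MISSING]
    return known, to_predict


# Word-level example: two words are missing after "Huawei".
target_data = ["Huawei", MISSING, MISSING, "are", "very", "good"]
known_units, to_predict_positions = split_target_data(target_data)
```

Only the positions of the missing units are retained, which matches the description that a to-be-predicted data unit is invisible while its position is known.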
The method further includes: processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units, where a first output vector corresponding to each known data unit is generated based on the M first embedding vectors.
Each first output vector is obtained based on the M first embedding vectors. It may be understood that each first output vector may use the M first embedding vectors as a reference. In other words, when each first output vector is generated, each first embedding vector is visible, or each first output vector has a dependency relationship with the M first embedding vectors.
In one embodiment, the target encoder may be a transformer layer, and that each first output vector is obtained based on the M first embedding vectors may be understood as that there is an attention association between any two of the M first embedding vectors.
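The dependency of every first output vector on all M first embedding vectors can be sketched with a single attention head. This is a simplified illustration under stated assumptions: the query, key, and value projections are taken to be the identity, and the names `full_self_attention`, `dot`, and `softmax` are hypothetical helpers rather than part of the disclosure.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_self_attention(embeddings: List[Vector]) -> List[Vector]:
    """Map M input vectors to M output vectors.

    Every output attends to every input, so there is an attention
    association between any two of the M embedding vectors.
    Assumption: identity projections instead of learned Q/K/V matrices.
    """
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [dot(query, key) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, embeddings)) for j in range(dim)]
        )
    return outputs


first_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first_outputs = full_self_attention(first_embeddings)
```

Changing any single input vector changes all M outputs, which is the sense in which each first embedding vector is visible when each first output vector is generated.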
The method further includes: processing the M first output vectors and the second embedding vector by using a target prediction network, to obtain the first to-be-predicted data unit.
In this embodiment of this disclosure, for the M first embedding vectors corresponding to the M known data units, the target encoder may use the M first embedding vectors as input. The first embedding vectors include position information of the known data units and data information of the known data units. M pieces of additional position information do not need to be separately set as the input of the target encoder, and a quantity of latent variables of intermediate output of the target encoder is also consistent with a quantity of input embedding vectors, thereby reducing a computation amount and memory consumption of the target encoder.
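The prediction step can be sketched as follows. The sketch assumes that the target prediction network uses the second embedding vector, which carries only position information, as a query over the M first output vectors, and then scores a small set of candidate data units. The function `predict_unit`, the identity projections, and the toy candidate embeddings are illustrative assumptions, not the disclosed network.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_unit(
    first_outputs: List[Vector],
    second_embedding: Vector,
    candidate_embeddings: Dict[str, Vector],
) -> str:
    """Predict one missing data unit from the encoder outputs.

    The query is the position-only second embedding; keys and values are
    the M first output vectors. Assumption: identity projections.
    """
    dim = len(second_embedding)
    scores = [dot(second_embedding, h) / math.sqrt(dim) for h in first_outputs]
    weights = softmax(scores)
    context = [
        sum(w * h[j] for w, h in zip(weights, first_outputs)) for j in range(dim)
    ]
    # Score every candidate unit against the attended context vector.
    logits = {unit: dot(context, emb) for unit, emb in candidate_embeddings.items()}
    return max(logits, key=logits.get)


prediction = predict_unit(
    first_outputs=[[1.0, 0.0], [0.0, 1.0]],
    second_embedding=[4.0, 0.0],
    candidate_embeddings={"phones": [1.0, 0.0], "cars": [0.0, 1.0]},
)
```

In this arrangement the encoder processes only the M vectors of the known data units, and the position of the to-be-predicted data unit enters only at the prediction network.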
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
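One way to picture the relative position relationships is as signed offsets between positions in the target data. The representation below is an assumption made for illustration only; the disclosure does not prescribe how the first position and the second position are encoded, and the names `offsets_from`, `first_position_info`, and `second_position_info` are hypothetical.

```python
from typing import Dict, List


def offsets_from(anchor: int, others: List[int]) -> List[int]:
    """Signed offsets of `others` relative to `anchor`."""
    return [other - anchor for other in others]


known_positions = [0, 3, 4, 5]  # positions of the known data units
predicted_position = 1          # position of the first to-be-predicted unit

# First position: each known unit relative to the other known units
# and relative to the first to-be-predicted unit.
first_position_info: Dict[int, Dict[str, object]] = {
    p: {
        "to_known": offsets_from(p, [q for q in known_positions if q != p]),
        "to_predicted": predicted_position - p,
    }
    for p in known_positions
}

# Second position: the to-be-predicted unit relative to each known unit.
second_position_info = offsets_from(predicted_position, known_positions)
```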
In one embodiment, the target encoder is a first transformer layer, and the target prediction network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units includes:
In other words, input of each transformer sub-layer includes M eigenvectors corresponding to the M known data units, and output of each transformer sub-layer includes M output vectors corresponding to the M known data units. In this way, the quantity of latent variables of the intermediate output of the target encoder is also consistent with the quantity of input embedding vectors, thereby reducing the computation amount and the memory consumption of the target encoder.
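The property that each serial sub-layer takes M vectors and returns M vectors can be sketched as below. The `mixing_sublayer` function is a toy stand-in for a transformer sub-layer, not a real one, and `run_encoder` is a hypothetical name; the point of the sketch is only that the number of intermediate vectors stays equal to M at every depth.

```python
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[List[Vector]], List[Vector]]


def mixing_sublayer(vectors: List[Vector]) -> List[Vector]:
    """Toy sub-layer: blend each vector with the mean of all M vectors."""
    m = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / m for j in range(dim)]
    return [[0.5 * v[j] + 0.5 * mean[j] for j in range(dim)] for v in vectors]


def run_encoder(embeddings: List[Vector], sublayers: List[SubLayer]) -> List[Vector]:
    """Apply the sub-layers in series, checking that M is preserved."""
    hidden = embeddings
    for sublayer in sublayers:
        hidden = sublayer(hidden)
        if len(hidden) != len(embeddings):
            raise ValueError("a sub-layer changed the number of vectors")
    return hidden


first_outputs = run_encoder(
    [[2.0, 0.0], [0.0, 2.0]],
    [mixing_sublayer, mixing_sublayer],
)
```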
In one embodiment, the target encoder includes an attention head, and the processing the M first embedding vectors by using a target encoder includes:
In one embodiment, the method further includes:
The method further includes: obtaining a position vector of each of the M known data units, where the position vector indicates the first position. The first position indicates a position of a known data unit in the target data. Specifically, the first position may indicate the relative position relationship between the known data unit and each other known data unit in the target data, and the relative position relationship between the known data unit and the first to-be-predicted data unit.
The method further includes: integrating each of the M third embedding vectors and a corresponding position vector, to obtain the M first embedding vectors. It should be understood that an integration manner may be performing an addition operation on the third embedding vector and the position vector, or performing another operation so that the first embedding vector can carry a known data unit in the target data and information about a first position of the known data unit in the target data. A specific integration manner is not limited herein.
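The addition-based integration mentioned above can be sketched as an element-wise sum. The function name `integrate` and the toy two-dimensional vectors are assumptions for illustration; other integration manners are equally permitted by the description.

```python
from typing import List

Vector = List[float]


def integrate(
    third_embeddings: List[Vector], position_vectors: List[Vector]
) -> List[Vector]:
    """Add each data-unit embedding to its position vector element-wise."""
    if len(third_embeddings) != len(position_vectors):
        raise ValueError("each embedding needs exactly one position vector")
    return [
        [t + p for t, p in zip(third, position)]
        for third, position in zip(third_embeddings, position_vectors)
    ]


first_embeddings = integrate(
    third_embeddings=[[1.0, 2.0], [0.0, 4.0]],
    position_vectors=[[0.5, -1.0], [1.0, 1.0]],
)
```

The result has the same count and dimension as the inputs, so each first embedding vector carries both the data unit and its first position without a separate position input.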
In one embodiment, the target data further includes a second to-be-predicted data unit, and a prediction order of the second to-be-predicted data unit and the first to-be-predicted data unit is randomly determined.
In one embodiment, if the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, the method further includes:
In this embodiment of this disclosure, prediction is performed in a random order manner. Order information of a to-be-predicted data unit is fully used, and the order information is explicitly integrated into an output vector.
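Random-order prediction can be sketched as shuffling the positions of the to-be-predicted data units and then predicting them one at a time. The helper names below are hypothetical, and the assumption that units predicted earlier become visible to later predictions is an interpretation of the surrounding description rather than a statement of the disclosed method.

```python
import random
from typing import List, Optional


def sample_prediction_order(
    missing_positions: List[int], seed: Optional[int] = None
) -> List[int]:
    """Return the missing positions in a randomly determined order."""
    rng = random.Random(seed)
    order = list(missing_positions)
    rng.shuffle(order)
    return order


def visible_positions(
    known_positions: List[int], order: List[int], step: int
) -> List[int]:
    """Positions available when predicting order[step].

    Assumption: all known units plus the units already predicted
    in earlier steps are visible.
    """
    return sorted(known_positions) + order[:step]


order = sample_prediction_order([1, 2, 7], seed=0)
context_at_first_step = visible_positions([0, 3], order, step=0)
context_at_last_step = visible_positions([0, 3], order, step=2)
```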
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
According to a second aspect, this disclosure provides a data processing method. The method includes:
In one embodiment, the first position indicates a relative position relationship between the data unit and another data unit.
In one embodiment, the target encoder is a first transformer layer, and the task network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target encoder includes an attention head, and the processing the M first embedding vectors by using a target encoder includes:
In one embodiment, the target data is text data, and the data unit is a word in the text data;
In one embodiment, the target processing task includes short text classification, long text classification, natural language inference, text similarity matching, or text emotion classification.
According to a third aspect, this disclosure provides a data processing method. The method includes:
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
In one embodiment, the first encoder is a first transformer layer, and the first prediction network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using the first encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target data further includes a second to-be-predicted data unit, and a prediction order of the second to-be-predicted data unit and the first to-be-predicted data unit is randomly determined.
In one embodiment, if the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, the method further includes:
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
According to a fourth aspect, this disclosure provides a data processing apparatus, including:
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
In one embodiment, the target encoder is a first transformer layer, and the target prediction network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target encoder includes an attention head, and the encoding module is configured to: obtain attention information, where the attention information indicates that there is an attention association between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; and process the M first embedding vectors based on the attention information by using the target encoder.
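The attention information can be pictured as an M×M matrix in which every entry allows attention. The sketch below assumes a 0/1 mask representation and a masked softmax; both `full_attention_info` and `masked_softmax` are hypothetical names, and the restricted mask in the usage lines is shown only for contrast.

```python
import math
from typing import List


def full_attention_info(m: int) -> List[List[int]]:
    """M x M mask in which any two embedding vectors may attend to each other."""
    return [[1] * m for _ in range(m)]


def masked_softmax(scores: List[float], allowed: List[int]) -> List[float]:
    """Softmax over the allowed entries; disallowed entries get weight 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    if not kept:
        raise ValueError("at least one position must be allowed")
    top = max(kept)
    exps = [math.exp(s - top) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


attention_info = full_attention_info(3)
# With the full mask, all three vectors contribute to the first output.
weights = masked_softmax([0.0, 0.0, 0.0], attention_info[0])
# For contrast only: a restricted mask removes the second vector.
restricted = masked_softmax([0.0, 5.0, 0.0], [1, 0, 1])
```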
In one embodiment, the apparatus further includes:
In one embodiment, the target data further includes a second to-be-predicted data unit, and a prediction order of the second to-be-predicted data unit and the first to-be-predicted data unit is randomly determined.
In one embodiment, the second to-be-predicted data unit is predicted after the first to-be-predicted data unit,
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
According to a fifth aspect, this disclosure provides a data processing apparatus, including:
In one embodiment, the first position indicates a relative position relationship between the data unit and another data unit.
In one embodiment, the target encoder is a first transformer layer, and the task network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target encoder includes an attention head, and the encoding module is configured to: obtain attention information, where the attention information indicates that there is an attention association between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; and
In one embodiment, the target data is text data, and the data unit is a word in the text data;
In one embodiment, the target processing task includes short text classification, long text classification, natural language inference, text similarity matching, or text emotion classification.
According to a sixth aspect, this disclosure provides a data processing apparatus, including:
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
In one embodiment, the first encoder is a first transformer layer, and the first prediction network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using the first encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target data further includes a second to-be-predicted data unit, and a prediction order of the second to-be-predicted data unit and the first to-be-predicted data unit is randomly determined.
In one embodiment, the second to-be-predicted data unit is predicted after the first to-be-predicted data unit,
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
According to a seventh aspect, an embodiment of this disclosure provides an execution device that may include a memory, a processor, and a bus system. The memory is configured to store a program. The processor is configured to execute the program in the memory, to perform the method in the first aspect and any optional implementation of the first aspect, or the method in the second aspect and any optional implementation of the second aspect.
According to an eighth aspect, an embodiment of this disclosure provides a training device that may include a memory, a processor, and a bus system. The memory is configured to store a program. The processor is configured to execute the program in the memory, to perform the method in the third aspect and any optional implementation of the third aspect.
According to a ninth aspect, an embodiment of this disclosure provides a computer-readable storage medium. The computer-readable storage medium stores a computer program. When the computer program is run on a computer, the computer is enabled to perform the method in the first aspect and any optional implementation of the first aspect, the method in the second aspect and any optional implementation of the second aspect, and the method in the third aspect and any optional implementation of the third aspect.
According to a tenth aspect, an embodiment of this disclosure provides a computer program. When the computer program is run on a computer, the computer is enabled to perform the method in the first aspect and any optional implementation of the first aspect, the method in the second aspect and any optional implementation of the second aspect, and the method in the third aspect and any optional implementation of the third aspect.
According to an eleventh aspect, this disclosure provides a chip system. The chip system includes a processor, configured to support an execution device or a training device in implementing functions in the foregoing aspects, for example, send or process data or information in the foregoing methods. In a possible design, the chip system further includes a memory. The memory is configured to store program instructions and data that are necessary for the execution device or the training device. The chip system may include a chip, or may include a chip and another discrete component.
An embodiment of this disclosure provides a data processing method. The method includes: obtaining M first embedding vectors and a second embedding vector, where each first embedding vector indicates one known data unit in target data and a first position of the known data unit in the target data, the second embedding vector indicates a second position, in the target data, of a first to-be-predicted data unit in the target data, and M is a positive integer; processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units, where a first output vector corresponding to each known data unit is generated based on the M first embedding vectors; and processing the M first output vectors and the second embedding vector by using a target prediction network, to obtain the first to-be-predicted data unit. In the foregoing manner, for the M first embedding vectors corresponding to the M known data units, the target encoder may use the M first embedding vectors as input. The first embedding vectors include position information and data information of the known data units. M pieces of additional position information do not need to be separately set as the input of the target encoder, and a quantity of latent variables of intermediate output of the target encoder is also consistent with a quantity of input embedding vectors, thereby reducing a computation amount and memory consumption of the target encoder.
The following describes embodiments of this disclosure with reference to the accompanying drawings. Terms used in embodiments of this disclosure are merely used to explain specific embodiments, and are not intended to limit this disclosure. A person of ordinary skill in the art may learn that, with development of technologies and emergence of new scenarios, the technical solutions provided in embodiments of this disclosure are also applicable to a similar technical problem.
In the specification, claims, and accompanying drawings of this disclosure, the terms “first”, “second”, and the like are intended to distinguish between similar objects but do not necessarily indicate a specific order or sequence. It should be understood that the terms used in such a way are interchangeable in appropriate circumstances, and are merely a way of distinguishing between objects having a same attribute in embodiments of this disclosure. In addition, the terms “include”, “contain”, and any other variants are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units not expressly listed or inherent to such a process, method, product, or device.
An overall working procedure of an artificial intelligence system is first described.
(1) Infrastructure
The infrastructure provides computing capability support for the artificial intelligence system, implements communication with the external world, and implements support by using a basic platform. The infrastructure communicates with the outside by using a sensor. A computing capability is provided by an intelligent chip (a hardware acceleration chip such as a CPU, an NPU, a GPU, an ASIC, or an FPGA). The basic platform includes related platform assurance and support such as a distributed computing framework and a network, and may include cloud storage and computing, an interconnection and interworking network, and the like. For example, the sensor communicates with the outside to obtain data, and the data is provided, for computing, to an intelligent chip in a distributed computing system provided by the basic platform.
(2) Data
Data at an upper layer of the infrastructure indicates a data source in the artificial intelligence field. The data relates to graphs, images, speech, and text, and further relates to Internet of things data of conventional devices. The data includes service data of an existing system and perception data such as force, displacement, a liquid level, a temperature, and humidity.
(3) Data Processing
Data processing usually includes a manner such as data training, machine learning, deep learning, searching, inference, or decision-making.
Machine learning and deep learning may mean performing symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Inference is a process in which a human intelligent inference manner is simulated in a computer or an intelligent system, and machine thinking and problem solving are performed by using formalized information according to an inference control policy. A typical function is searching and matching.
Decision-making is a process in which a decision is made after intelligent information is inferred, and usually provides functions such as classification, ranking, and prediction.
(4) General Capability
After data processing mentioned above is performed on data, some general capabilities may further be formed based on a data processing result, for example, an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent Product and Industry Application
The intelligent product and the industry application are products and applications of an artificial intelligence system in various fields, and are a packaging of an overall artificial intelligence solution, so that decision-making for intelligent information is productized and applied. Application fields mainly include an intelligent terminal, intelligent transportation, intelligent health care, autonomous driving, a safe city, and the like.
This disclosure may be applied to the natural language processing field, the image processing field, and the audio and video processing field in the artificial intelligence field. The following uses natural language processing as an example to describe a plurality of application scenarios of implementing a plurality of products.
To better understand the solutions in embodiments of this disclosure, the following briefly describes possible application scenarios of embodiments of this disclosure with reference to
The data processing device may be a device or a server with a data processing function, such as a cloud server, a network server, an application server, or a management server. The data processing device receives a query statement, speech, text, or the like from the intelligent terminal through an interaction interface; then performs, by using a memory storing data and a processor processing data, language data processing in a manner of machine learning, deep learning, searching, inference, decision-making, or the like; and feeds back a processing result to the user equipment. The memory in the data processing device is a general term, and may include a local storage and a database storing historical data. The database may be in the data processing device, or may be in another network server.
In the natural language processing system shown in
Natural language generation is used as an example. Natural language generation may also be referred to as a text prediction task or a natural language synthesis task, and is a task of generating a missing text or a subsequent text when a text segment is given. Natural language generation is widely used in scenarios such as a search engine and an input method. When the user inputs a part of a text, the subsequent input of the user may be predicted, to greatly improve efficiency of using the product by the user. In addition, a text with a missing part can be restored. For example, in this embodiment of this disclosure, the user equipment may receive a segment of text data (for example, target data described in embodiments of this disclosure) input by the user. The text data includes a known word and a to-be-predicted word. The to-be-predicted word is invisible. Only a position of the to-be-predicted word in the text data is known. Then, the user equipment may initiate a request (the request carries the text data) to the data processing device. The data processing device then predicts the to-be-predicted word in the text data, and feeds back the predicted word to the user equipment.
For example, the user equipment may receive a segment of text data input by the user, and then initiate a request to the data processing device. The data processing device then performs entity classification on the segment of text data to obtain an entity classification result for the segment of text data, and feeds back the entity classification result to the user equipment.
For example, the user equipment may receive a segment of text data (the text data is a Chinese text) input by the user, and then initiate a request to the data processing device. The data processing device then translates the segment of text data into English to obtain an English translated text for the segment of text data, and feeds back the English translated text to the user equipment.
In
The user equipment in
The processor in
It should be understood that this embodiment of this disclosure may be further applied to the image processing field and the audio/video processing field, and the data processing device processes the target data by using the data processing method in embodiments of this disclosure.
It should be understood that the data processing device may also be referred to as a data processing apparatus, an execution device, a server, a terminal device, or the like in subsequent embodiments.
The following describes in detail a system architecture provided in an embodiment of this disclosure with reference to
The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The computing module 511 may include a target model/rule 501. The preprocessing module 513 and the preprocessing module 514 are optional.
The data collection device 560 is configured to collect training data. In a natural language synthesis task, the training data may be text data with a missing text and complete text data corresponding to the text data with the missing text. In an audio synthesis task, the training data may be speech data with a missing audio sequence and complete speech data corresponding to the speech data with the missing audio sequence. In an image synthesis (or referred to as image reconstruction) task, the training data may be image data or video data with a missing pixel and complete image data or video data corresponding to the image data or video data with the missing pixel. After collecting the training data, the data collection device 560 stores the training data in the database 530. The training device 520 obtains a target model/rule 501 through training based on the training data maintained in the database 530.
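For the natural language synthesis task, a training sample pairs text data with a missing text and the corresponding complete text data. The following sketch builds such a pair by hiding randomly chosen positions. Random selection of the hidden positions, the `None` marker, and the function name `make_training_pair` are assumptions for illustration; the description does not state how the missing text is chosen.

```python
import random
from typing import List, Optional, Tuple


def make_training_pair(
    complete: List[str], num_missing: int, rng: random.Random
) -> Tuple[List[Optional[str]], List[str], List[int]]:
    """Build (text with missing units, complete text, missing positions)."""
    if not 0 < num_missing <= len(complete):
        raise ValueError("num_missing must be between 1 and len(complete)")
    missing_positions = sorted(rng.sample(range(len(complete)), num_missing))
    hidden = set(missing_positions)
    corrupted: List[Optional[str]] = [
        None if i in hidden else unit for i, unit in enumerate(complete)
    ]
    return corrupted, list(complete), missing_positions


sentence = ["Huawei", "mobile", "phones", "are", "very", "good"]
corrupted, complete, missing_positions = make_training_pair(
    sentence, num_missing=2, rng=random.Random(0)
)
```

The same pairing idea carries over to the audio and image tasks, with audio sequences or pixels in place of words.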
For example, the target model/rule 501 is used to implement the natural language synthesis task. In this case, the target model/rule 501 (for example, a target encoder or a target prediction network in embodiments of this disclosure) can be used to implement the natural language synthesis task. To be specific, the text data with the missing text is input to the target model/rule 501, to obtain the missing text (for example, a first to-be-predicted data unit and a second to-be-predicted data unit in embodiments of this disclosure).
For example, the target model/rule 501 is used to implement a target processing task (for example, short text classification, long text classification, natural language inference, text similarity matching, and text emotion classification). In this case, the target model/rule 501 (for example, the target encoder and the task network in embodiments of this disclosure) can be used to implement the target processing task. To be specific, target data is input to the target model/rule 501, to obtain a task processing result.
It should be noted that, during actual application, the training data maintained in the database 530 is not necessarily collected by the data collection device 560, and may also be received from another device. It should further be noted that the training device 520 may not necessarily train the target model/rule 501 completely based on the training data maintained in the database 530, and may obtain training data from a cloud or another place to perform model training. The foregoing description should not be construed as a limitation on embodiments of this disclosure.
The target model/rule 501 obtained through training by the training device 520 may be applied to different systems or devices, for example, the execution device 510 shown in
The preprocessing module 513 and the preprocessing module 514 are configured to perform preprocessing based on the input data received by the I/O interface 512 (for example, obtaining positions of a known data unit and a to-be-predicted data unit in the target data, or generating attention information). It should be understood that both preprocessing modules may be absent, or only one of them may be present. If neither the preprocessing module 513 nor the preprocessing module 514 exists, the computing module 511 may be directly used to process the input data.
In a process in which the execution device 510 preprocesses the input data, or the computing module 511 of the execution device 510 performs processing related to computing or the like, the execution device 510 may invoke data, code, and the like in the data storage system 550 for corresponding processing, and may further store, in the data storage system 550, data, instructions, and the like that are obtained through the corresponding processing.
Finally, the I/O interface 512 presents, to the customer device 540, a processing result, for example, a missing text, a missing audio sequence, or a missing pixel (for example, the first to-be-predicted data unit, the second to-be-predicted data unit, and the task processing result in embodiments of this disclosure) obtained through the processing, to provide the processing result to the user.
In a case shown in
It should be noted that
It should be understood that the execution device 510 may be alternatively deployed in the customer device 540.
From a perspective of model inference, in this embodiment of this disclosure, the data storage system 550 may store related code for implementing the data processing method in embodiments of this disclosure, and the computing module 511 may obtain, from the data storage system 550, the related code for implementing the data processing method in embodiments of this disclosure, to perform the data processing method in embodiments of this disclosure.
In this embodiment of this disclosure, the computing module 511 may include a hardware circuit (for example, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller), or a combination of these hardware circuits. For example, the computing module 511 may be a hardware system having an instruction execution function, for example, a CPU or a DSP; or a hardware system having no instruction execution function, for example, an ASIC or an FPGA; or a combination of the foregoing hardware system having no instruction execution function and the foregoing hardware system having the instruction execution function.
Specifically, the computing module 511 may be the hardware system having the instruction execution function. The data processing method provided in embodiments of this disclosure may be software code stored in the data storage system 550. The computing module 511 may obtain the software code from the data storage system 550, and execute the obtained software code to implement the data processing method provided in embodiments of this disclosure.
It should be understood that the computing module 511 may be the combination of the hardware system having no instruction execution function and the hardware system having the instruction execution function. Some operations of the data processing method provided in embodiments of this disclosure may be alternatively implemented by using the hardware system having no instruction execution function in the computing module 511, or by using the preprocessing module 513 or the preprocessing module 514. This is not limited herein.
Because embodiments of this disclosure relate to extensive application of neural networks, for ease of understanding, the following first describes terms related to embodiments of this disclosure and concepts related to the neural network and the like.
(1) Neural Network
The neural network may include neurons. The neuron may be an operation unit that uses $x_s$ (namely, input data) and an intercept of 1 as input. Output of the operation unit may be as follows:
$$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
where $s = 1, 2, \ldots, n$; n is a natural number greater than 1; $W_s$ is a weight of $x_s$; and b is a bias of the neuron. f indicates an activation function of the neuron. The activation function is used for introducing a non-linear feature into the neural network, to convert an input signal in the neuron into an output signal. The output signal of the activation function may be used as input of a next convolutional layer. The activation function may be a sigmoid function. The neural network is a network formed by connecting a plurality of single neurons together. To be specific, output of a neuron may be input of another neuron. Input of each neuron may be connected to a local receptive field of a previous layer to extract a feature of the local receptive field. The local receptive field may be a region including several neurons.
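The neuron output described above can be sketched in a few lines. This is a minimal illustration only; the function names `sigmoid` and `neuron_output` and the numeric values are assumptions chosen for the example and do not come from the disclosure.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, bias, activation=sigmoid):
    """Compute f(sum_s W_s * x_s + b) for a single neuron."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Illustrative values only (not from the disclosure).
output = neuron_output(weights=[0.5, -0.5], inputs=[1.0, 1.0], bias=0.0)
print(output)  # the weighted sum is 0.0, so the sigmoid gives 0.5
```

Passing a different `activation` callable shows how the non-linearity can be swapped without touching the weighted sum.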
(2) Transformer Layer
(3) Attention Mechanism
The attention mechanism simulates an internal process of biological observation behavior, and is a mechanism that aligns internal experience with external feeling to increase observation precision of some regions. The mechanism can quickly select highly valuable information from a large amount of information by using limited attention resources. The attention mechanism is widely used in natural language processing tasks, especially machine translation, because the attention mechanism can quickly extract an important feature of sparse data. A self-attention mechanism is an improvement of the attention mechanism. The self-attention mechanism becomes less dependent on external information and is better at capturing an internal correlation of data or features. An essential idea of the attention mechanism can be expressed by using the following formula:
$$\mathrm{Attention}(\mathit{Query}, \mathit{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathit{Query}, \mathit{Key}_i) \cdot \mathit{Value}_i$$
$L_x = \|\mathit{Source}\|$ represents a length of a source. A meaning of the formula is that the constituent elements in the source are considered as a series of <Key, Value> data pairs. In this case, an element Query in a target is given; similarity or a correlation between Query and each key is calculated, to obtain a weight coefficient of a value corresponding to each key; and then weighted summation is performed on the values, to obtain a final attention value. Therefore, essentially, the attention mechanism is to perform weighted summation on the values of the elements in the source. Query and the key are used to calculate a weight coefficient of a corresponding value. Conceptually, the attention mechanism may be understood as a mechanism for selecting a small amount of important information from a large amount of information, and focusing on the important information and ignoring most unimportant information. A focusing process is reflected in calculation of a weight coefficient. A larger weight indicates that a value corresponding to the weight is more focused on. In other words, the weight indicates importance of information, and the value indicates information corresponding to the value. The self-attention mechanism may be understood as an intra-attention mechanism. The attention mechanism is used between the element Query in the target and each element in the source. The self-attention mechanism indicates an attention mechanism used between elements in the source or between elements in the target, and may also be understood as an attention calculation mechanism in a special case in which Target=Source. A specific calculation process of the self-attention mechanism is the same except that a calculation object changes.
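The weighted summation described above can be illustrated as follows. This is a sketch under stated assumptions: the similarity is taken to be a dot product and the weight coefficients are normalized with a softmax, neither of which is fixed by the description; the names `softmax` and `attention` are illustrative.

```python
import math

def softmax(scores):
    """Normalize raw similarity scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Weighted sum of values; weights come from query-key similarity."""
    # Assumption: similarity is the dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return attended, weights

# Toy source with two (key, value) pairs; the numbers are illustrative.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
attended, weights = attention([2.0, 0.0], keys, values)
```

The query is closer to the first key, so the first value receives the larger weight. Self-attention corresponds to drawing the query from the same sequence that supplies the keys and values.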
(4) Natural Language Processing
A natural language is a human language. Natural language processing (NLP) is processing for the human language. Natural language processing is a process of performing systematic analysis, understanding, and information extraction on text data in an intelligent and efficient manner. By using NLP and components of NLP, large chunks of text data can be managed or a large quantity of automated tasks can be performed, and various problems can be resolved, for example, automatic summarization, machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), emotion analysis, speech recognition, a question answering system, and topic segmentation.
(5) Pre-Trained Language Model
The pre-trained language model is a natural language sequence encoder, and encodes each word in a natural language sequence into a vector representation to perform a prediction task. Training for the pre-trained language model includes two stages. At a pre-training stage, the model is trained on a language model task over large-scale unlabeled text to learn word representations. At a fine-tuning stage, the model is initialized by using the parameters learned at the pre-training stage, and is trained for a small number of steps on downstream tasks such as text classification and sequence labeling, so that semantic information obtained through pre-training can be successfully transferred to the downstream tasks.
(6) Autoregressive Language Model
The autoregressive language model is a model that can predict, based on a given context (for example, “a mobile phone is very”), a next word (for example, “good”) that may follow. The model is usually used to predict a right-side following word when a left-side preceding text is given, and may also be used to predict a specific middle word when a left-side preceding text and a right-side following text are given.
The data processing method provided in embodiments of this disclosure is first described by using a model inference stage as an example.
601: Obtain M first embedding vectors and a second embedding vector, where each first embedding vector indicates one known data unit in target data and a first position of the known data unit in the target data, the second embedding vector indicates a second position, in the target data, of a first to-be-predicted data unit in the target data, and M is a positive integer.
The target data is data with missing data. The target data includes non-missing data (referred to as a known data unit in this embodiment of this disclosure) and missing data (referred to as a to-be-predicted data unit in this embodiment of this disclosure, for example, a first to-be-predicted data unit and a second to-be-predicted data unit). The known data unit is a data unit in the non-missing data. For example, the target data may be text data. In this case, the known data unit in the target data may be a known word or a known letter in the text data, and the to-be-predicted data unit may be a to-be-predicted word or a to-be-predicted letter in the text data. For example, the target data may be speech data. In this case, the known data unit in the target data may be a known audio sequence in the speech data, and the to-be-predicted data unit may be a to-be-predicted audio sequence in the speech data. For example, the target data may be image data. In this case, the known data unit in the target data may be a known sample in the image data, and the to-be-predicted data unit may be a to-be-predicted sample in the image data. It should be understood that data granularities of the known data unit and the to-be-predicted data unit are related to a type of the target data. The data granularities of the known data unit and the to-be-predicted data unit may be a minimum data unit in the target data or a plurality of data units including minimum data units. The granularities of the known data unit and the to-be-predicted data unit are not limited herein.
Specifically, in this embodiment of this disclosure, the target data may include the M known data units and the at least one to-be-predicted data unit (including the first to-be-predicted data unit). The to-be-predicted data unit is invisible data in the target data, and the to-be-predicted data unit needs to be determined based on the M known data units.
For example, the target data is text data. In this embodiment of this disclosure, the text data may include M known words and at least one to-be-predicted word (including a first to-be-predicted word). The text data may be a Chinese text, or may be an English text, or may be a text in another language. The text data may be a sentence, a paragraph, a chapter, or the like.
For example, the target data may be “_ _ sat on the mat”. “sat”, “on”, “the”, and “mat” are known data units, and “_” and “_” are invisible in the target data and are to-be-predicted data units. It should be understood that the symbol “_” herein means empty rather than an underline.
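As an illustration of the example above, the following sketch separates the known data units from the to-be-predicted positions. The `MISSING` placeholder, the helper name `split_target_data`, and the 1-based position numbering are assumptions made for this example.

```python
MISSING = None  # placeholder for an invisible (to-be-predicted) data unit

def split_target_data(units):
    """Separate known units from to-be-predicted positions (1-based)."""
    known = [(pos, unit) for pos, unit in enumerate(units, start=1)
             if unit is not MISSING]
    to_predict = [pos for pos, unit in enumerate(units, start=1)
                  if unit is MISSING]
    return known, to_predict

target_data = [MISSING, MISSING, "sat", "on", "the", "mat"]
known_units, predict_positions = split_target_data(target_data)
print(known_units)        # [(3, 'sat'), (4, 'on'), (5, 'the'), (6, 'mat')]
print(predict_positions)  # [1, 2]
```

Here M is 4, and the two empty positions are the first and second to-be-predicted data units.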
In this embodiment of this disclosure, the M first embedding vectors may be obtained. Each first embedding vector indicates one known data unit in target data and a first position of the known data unit in the target data.
The following first describes how to generate the M first embedding vectors.
In an implementation, embedding processing may be performed on the M known data units in the target data by using an embedding layer, to obtain M third embedding vectors.
The embedding layer may be referred to as an input embedding layer. Current input may be the M known data units. After obtaining the current input, the embedding layer may perform embedding processing on each known data unit in the current input, to obtain the embedding vector (that is, the third embedding vector) corresponding to each known data unit.
In some embodiments, a position vector of each of the M known data units may be further obtained. The position vector indicates the first position. The first position indicates a position of the known data unit in the target data. Specifically, the first position indicates a relative position relationship between the known data unit and another known data unit and between the known data unit and the first to-be-predicted data unit.
In an implementation, the embedding layer may include the input embedding layer and a positional encoding layer. At the input embedding layer, word embedding processing may be performed on each known data unit in the current input, to obtain the third embedding vector of each known data unit. At the positional encoding layer, the position of each known data unit in the current input may be obtained, to generate the position vector for the position of each known data unit.
In some examples, the first position of each known data unit in the target data may be an absolute position of each known data unit in the target data. For example, the current input is “what date should the Ant Credit Pay be paid back”. A position of “what” may be represented as a first position, a position of “date” may be represented as a second position, and the like. In some examples, the first position of each known data unit in the target data may be a relative position of each known data unit in the target data. Still in the example in which the current input is “what date should the Ant Credit Pay be paid back”, the position of “what” may be represented as before “date”, and the position of “date” may be represented as after “what” and before “should”, and the like. When the third embedding vector and the position vector of each known data unit in the current input are obtained, the position vector and the corresponding third embedding vector of each known data unit may be integrated to obtain the first embedding vector of each known data unit. In this way, the plurality of first embedding vectors corresponding to the current input are obtained. It should be understood that an integration manner may be performing an addition operation on the third embedding vector and the position vector, or performing another operation so that the first embedding vector carries a known data unit in the target data and information about a first position of the known data unit in the target data. A specific integration manner is not limited herein. The plurality of first embedding vectors may be represented as an embedding matrix having a preset dimension. It may be set that a quantity of the plurality of first embedding vectors is M, and the preset dimension is H dimensions. In this case, the plurality of first embedding vectors may be represented as an M×H embedding matrix.
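The integration by addition described above can be sketched as follows. The lookup tables are randomly initialized stand-ins for a trained input embedding layer and positional encoding layer; the names `make_table`, `word_table`, `position_table`, and `first_embedding`, and the dimension H = 4, are assumptions for illustration.

```python
import random

H = 4  # preset embedding dimension (illustrative)
rng = random.Random(0)

def make_table(keys, dim):
    """Hypothetical lookup table: one randomly initialized vector per key."""
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for key in keys}

word_table = make_table(["sat", "on", "the", "mat"], H)  # input embedding layer
position_table = make_table(range(1, 7), H)              # positional encoding layer

def first_embedding(word, position):
    """Integrate by addition: third embedding vector + position vector."""
    third = word_table[word]
    pos_vec = position_table[position]
    return [t + p for t, p in zip(third, pos_vec)]

known_units = [(3, "sat"), (4, "on"), (5, "the"), (6, "mat")]
embedding_matrix = [first_embedding(word, pos) for pos, word in known_units]
print(len(embedding_matrix), len(embedding_matrix[0]))  # M x H = 4 4
```

Each row of `embedding_matrix` carries both the data unit and its position, so no separate position input is needed downstream.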
In this embodiment of this disclosure, the second embedding vector may be obtained. The second embedding vector indicates the second position, in the target data, of the first to-be-predicted data unit in the target data. The second position may indicate the relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
The following describes how to generate the second embedding vector.
In an implementation, embedding processing may be performed on the second position of the first to-be-predicted data unit in the target data by using the embedding layer, to obtain the second embedding vector for representing the second position, in the target data, of the first to-be-predicted data unit in the target data. The second embedding vector may be used as input of a subsequent target prediction network. The second position indicates the relative position relationship between the first to-be-predicted data unit and each known data unit in the target data. For description of the second position, refer to the description of the first position in the foregoing embodiment. Similarities are not described herein again.
Further, the M first embedding vectors for the M known data units and the second embedding vector for the first to-be-predicted data unit may be obtained.
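The difference between the two kinds of embedding vectors is that the second embedding vector encodes a position only, because the data unit at that position is unknown. A minimal sketch follows, assuming a hypothetical positional lookup table and representing the relative position relationship as signed offsets; both choices are illustrative and are not specified by the disclosure.

```python
import random

H = 4  # embedding dimension (illustrative)
rng = random.Random(1)
# Hypothetical positional lookup table for a sequence with 6 positions.
position_table = {pos: [rng.uniform(-1.0, 1.0) for _ in range(H)]
                  for pos in range(1, 7)}

def second_embedding(position):
    """Embed only the position; the data unit at that position is unknown."""
    return list(position_table[position])

def relative_offsets(predict_position, known_positions):
    """Signed offset from the to-be-predicted position to each known position."""
    return [known - predict_position for known in known_positions]

query_vector = second_embedding(1)
offsets = relative_offsets(1, [3, 4, 5, 6])
print(offsets)  # [2, 3, 4, 5]
```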
602: Process the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units, where a first output vector corresponding to each known data unit is generated based on the M first embedding vectors.
In this embodiment of this disclosure, the target encoder may process the M first embedding vectors to obtain the M first output vectors corresponding to the M known data units, that is, may obtain one first output vector corresponding to each known data unit.
In an existing natural language generation model (with reference to
In this embodiment of this disclosure, in a process in which the target encoder processes the M first embedding vectors, a quantity of hidden states is consistent with a quantity of hidden states in each of the autoencoder language model and the autoregressive language model. Specifically, for the M first embedding vectors corresponding to the M known data units, the target encoder may use the M first embedding vectors as input. The first embedding vectors include position information and data information of the known data units. M pieces of additional position information do not need to be separately set as the input of the target encoder, and a quantity of latent variables of intermediate output of the target encoder is also consistent with a quantity of input embedding vectors, thereby reducing a computation amount and memory consumption of the target encoder.
For details, refer to
In this embodiment of this disclosure, each first output vector is obtained based on the M first embedding vectors. It may be understood that each first output vector may use the M first embedding vectors as a reference. In other words, when each first output vector is generated, each first embedding vector is visible, or each first output vector has a dependency relationship with the M first embedding vectors.
In an implementation, the target encoder may be a first transformer layer, and that each first output vector is obtained based on the M first embedding vectors may be understood as that there is an attention association between any two of the M first embedding vectors.
With reference to
Data output by a previous transformer sub-layer adjacent to each transformer sub-layer may be processed by using the transformer sub-layer, to obtain M intermediate vectors. The M intermediate vectors are output to a next transformer sub-layer adjacent to the transformer sub-layer. If the transformer sub-layer is the transformer sub-layer closest to an input side in the plurality of transformer sub-layers, input data of the transformer sub-layer is the M first embedding vectors. If the transformer sub-layer is the transformer sub-layer closest to an output side in the plurality of transformer sub-layers, output data of the transformer sub-layer is the M first output vectors.
In other words, input of each transformer sub-layer includes M feature vectors corresponding to the M known data units, and output of each transformer sub-layer includes M output vectors corresponding to the M known data units. In this way, the quantity of latent variables of the intermediate output of the target encoder is also consistent with the quantity of input embedding vectors, thereby reducing the computation amount and the memory consumption of the target encoder.
A core feature of the transformer layer is a unique attention mechanism used by the transformer layer. When a natural language, for example, a sentence, is processed, a transformer model uses the attention mechanism to assign different attention coefficients to embedding vectors of various words in the sentence, to more comprehensively consider impact of a context of the sentence on the words. Specifically, the transformer layer may include a multi-head attention layer, an addition and normalization (add & norm) layer, a feed-forward layer, and an addition and normalization layer that are sequentially adjacent. The attention layer is connected to the embedding layer. The M embedding vectors are obtained from the embedding layer as input vectors. The embedding vectors are synthesized based on association degrees between the M embedding vectors to obtain the M output vectors. Then, the M output vectors are output to a subsequent transformer layer. The transformer layer obtains the output of the previous layer as the input vectors, and performs an operation similar to that of the previous transformer layer.
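The layer structure described above (attention, add & norm, feed-forward, add & norm) and the stacking of sub-layers can be sketched as follows. This is a heavily simplified illustration: a single attention head with identity projections is used, the feed-forward step is a bare ReLU without learned weights, and all function names are assumptions. Its purpose is only to show that every sub-layer takes M vectors in and produces M vectors out.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def self_attention(vectors):
    """Single head with identity projections (a simplification)."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(len(query))])
    return outputs

def feed_forward(vec):
    """Stand-in feed-forward step: ReLU only, no learned weights."""
    return [max(0.0, x) for x in vec]

def add_and_norm(inputs, outputs):
    """Residual addition followed by layer normalization, per vector."""
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(inputs, outputs)]

def transformer_sub_layer(vectors):
    attended = add_and_norm(vectors, self_attention(vectors))
    return add_and_norm(attended, [feed_forward(v) for v in attended])

def encode(first_embedding_vectors, num_sub_layers=3):
    hidden = first_embedding_vectors
    for _ in range(num_sub_layers):
        hidden = transformer_sub_layer(hidden)  # M vectors in, M vectors out
    return hidden

embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, -0.2, 0.0], [0.0, -0.3, 0.2, 0.1]]
first_output_vectors = encode(embeddings)
print(len(first_output_vectors))  # 3, the same as the number of inputs
```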
The multi-head attention layer obtains M input vectors X1 from a previous layer of the multi-head attention layer. The M input vectors X1 may also be represented as a matrix X. The vectors are transformed by using a self-attention mechanism based on an association degree between the vectors, to obtain M output vectors. The M output vectors may also be represented as a matrix Y. It may be understood that, when the multi-head attention layer is a layer directly connected to the embedding layer, for example, the transformer layer directly connected to the embedding layer in
Therefore, each association degree $\alpha_{i,j}$ between the i-th input vector $X_i$ and each input vector $X_j$ may be used as a weight factor to perform weighted combination on a third intermediate vector (v vector, $v_j$) corresponding to each input vector $X_j$, thereby obtaining an i-th combined vector $C_i$ corresponding to the i-th input vector $X_i$:
$$C_i = \sum_{j=1}^{N} \alpha_{i,j} v_j$$
Therefore, a vector sequence $\langle C_1, C_2, \ldots, C_N \rangle$ or a matrix C of the N combined vectors corresponding to the N input vectors may be obtained, where N = M in this embodiment. The M output vectors may be obtained based on the combined vector sequence. Specifically, in an embodiment, the vector sequence of the N combined vectors may be directly used as the M output vectors, that is, $Y_i = C_i$. In this case, the output matrix Y is the combined vector matrix C, and may also be written as follows:
The foregoing describes the processing performed by a single attention head. In a multi-head attention (MHA) architecture, the MHA layer maintains m sets of transformation matrices, and each set of transformation matrices includes the first transformation matrix Q, the second transformation matrix K, and the third transformation matrix V. Therefore, the foregoing operations may be performed in parallel to obtain m combined vector sequences (that is, m matrices C), and each vector sequence includes N combined vectors obtained based on one set of transformation matrices. In this case, the MHA layer concatenates the obtained m combined vector sequences to obtain a concatenated matrix, and then transforms the concatenated matrix by using a fourth transformation matrix W to obtain the final output matrix Y. The output matrix Y is split into, that is, corresponds to, the output vectors $\langle Y_1, Y_2, \ldots, Y_N \rangle$, where N = M in this embodiment. According to the foregoing operation process, at the MHA layer, the transformation operation is performed based on the association degrees between the N input vectors to obtain the M output vectors.
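The multi-head procedure described above (m heads in parallel, concatenation, then a fourth transformation matrix W) can be sketched as follows. The association degree is assumed here to be a scaled dot product normalized with a softmax, and the matrices are random placeholders rather than trained parameters; the names `attention_head`, `multi_head_attention`, and `random_matrix` are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, Wq, Wk, Wv):
    """One head: returns the N combined vectors C_i = sum_j alpha_ij * v_j."""
    q, k, v = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(q[0]))
    combined = []
    for q_i in q:
        # Assumption: association degree = softmax of scaled dot products.
        alpha = softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        combined.append([sum(a_ij * v_j[d] for a_ij, v_j in zip(alpha, v))
                         for d in range(len(v[0]))])
    return combined

def multi_head_attention(X, heads, W):
    """Run every head, concatenate per input vector, then transform with W."""
    per_head = [attention_head(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concatenated = [[x for C in per_head for x in C[i]] for i in range(len(X))]
    return matmul(concatenated, W)

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

N, H, m, d = 3, 4, 2, 2  # illustrative sizes
X = random_matrix(N, H)
heads = [(random_matrix(H, d), random_matrix(H, d), random_matrix(H, d))
         for _ in range(m)]
W = random_matrix(m * d, H)
Y = multi_head_attention(X, heads, W)
print(len(Y), len(Y[0]))  # N x H = 3 4
```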
As shown in
In this embodiment of this disclosure, the target encoder includes an attention head. Because the known data units in the target data are visible to each other, when the M first embedding vectors are processed, there is an attention association between any two of the M first embedding vectors. Specifically, attention information may be obtained. The attention information indicates that there is an attention association between any two of the M first embedding vectors when the attention head processes the M first embedding vectors. In this way, the M first embedding vectors may be processed based on the attention information by using the target encoder, so that each output vector has a dependency relationship with the M first embedding vectors.
603: Process the M first output vectors and the second embedding vector by using a target prediction network, to obtain the first to-be-predicted data unit.
In this embodiment of this disclosure, after the M first output vectors are obtained, the M first output vectors may be input into the target prediction network, and the M first output vectors and the second embedding vector are processed by using the target prediction network, to obtain the first to-be-predicted data unit. The target prediction network may be a transformer layer.
The target prediction network may use the M first output vectors and the second embedding vector as input, to obtain a vector representation of the first to-be-predicted data unit. It should be understood that the first to-be-predicted data unit may be restored based on the vector representation of the first to-be-predicted data unit by using a classifier (for example, a support vector machine, a softmax classifier, or a K-nearest neighbors algorithm).
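Restoring a data unit from its vector representation with a softmax classifier can be sketched as follows. The vocabulary, the output weight rows, and the representation values are made up for illustration; the helper name `restore_data_unit` is an assumption.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def restore_data_unit(representation, vocabulary, output_weights):
    """Score every vocabulary entry and return the most probable one."""
    logits = [sum(r * w for r, w in zip(representation, row))
              for row in output_weights]
    probs = softmax(logits)
    best = max(range(len(vocabulary)), key=lambda i: probs[i])
    return vocabulary[best], probs[best]

# Hypothetical three-word vocabulary with one weight row per word.
vocabulary = ["the", "cat", "that"]
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word, prob = restore_data_unit([0.9, 0.8], vocabulary, output_weights)
print(word)  # that
```

The logits are 0.9, 0.8, and 1.7, so the third entry wins.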
Text data is used as an example. In a data processing process of the target prediction network, a first to-be-predicted word may be obtained based on a position vector (the second embedding vector) corresponding to the first to-be-predicted word and each known word (the first embedding vector). Therefore, the target prediction network may use the M first output vectors and the second embedding vector as the input, to obtain a word vector representation of the first to-be-predicted word.
For example, it is known that the words at position 3 to position 6 in the target data are “sat on the mat”, and the target is to predict the first two words of the sentence. The target prediction network may first determine, based on the four input vectors corresponding to “sat on the mat” and prediction position 1, that the first to-be-predicted word is “that”. Similarly, the target prediction network then predicts the word at position 2 based on “that” and “sat on the mat”.
In this embodiment of this disclosure, the target data further includes a second to-be-predicted data unit. Before the M first embedding vectors are processed by using the target encoder, a prediction order of the first to-be-predicted data unit and the second to-be-predicted data unit may be randomly determined. If the prediction order indicates that the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, a fourth embedding vector and a fifth embedding vector may be obtained after the first to-be-predicted data unit is obtained. The fourth embedding vector indicates the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data. The fifth embedding vector indicates a third position, in the target data, of the second to-be-predicted data unit in the target data. The M first embedding vectors and the fourth embedding vector are processed by using the target encoder, to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit. The M+1 second output vectors and the fifth embedding vector are processed by using the target prediction network, to obtain the second to-be-predicted data unit.
A second output vector corresponding to each known data unit is generated based on the M first embedding vectors. The second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
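The visibility rule stated above can be expressed as an attention mask. In this sketch, `mask[i][j]` is True when vector i may attend to vector j. The rule for known units and for the first predicted unit follows the text; extending it to further predicted units (each sees the known units and the units predicted up to and including itself) is an assumption, as is the function name.

```python
def build_attention_mask(num_known, num_predicted):
    """mask[i][j] is True when vector i may attend to vector j."""
    total = num_known + num_predicted
    mask = []
    for i in range(total):
        if i < num_known:
            # A known unit depends only on the known units.
            row = [j < num_known for j in range(total)]
        else:
            # A predicted unit depends on all known units and on the
            # predicted units up to and including itself (assumption).
            row = [j <= i for j in range(total)]
        mask.append(row)
    return mask

mask = build_attention_mask(num_known=4, num_predicted=1)
for row in mask:
    print(["x" if visible else "." for visible in row])
```

With M = 4 and one predicted unit, the first four rows hide the fifth column, and the fifth row sees all five columns.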
The following uses text data as an example to describe the data processing method in this embodiment of this disclosure with reference to a specific example.
With reference to
The autoregressive word vector encoding module may be the target encoder in the foregoing embodiment. The query module is configured to generate the second embedding vector and the fifth embedding vector. The prediction module may be the target prediction network in the foregoing embodiment. The predicted token may be the to-be-predicted data unit in the foregoing embodiment.
With reference to
The autoregressive word vector encoding module may learn context information corresponding to each word, and finally obtain, through learning for each word in a sentence, a word vector sequence (that is, the output vector in the foregoing embodiment) including the context information of the word.
The autoregressive word vector encoding module may be shown in a left-side diagram of
With reference to
In this embodiment of this disclosure, prediction is performed in a random order manner. Order information of a to-be-predicted data unit is fully used, and the order information is explicitly integrated into an output vector.
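Random-order prediction can be sketched as a loop in which each newly predicted unit joins the context used for the next prediction. The predictor below is a hard-coded stand-in for the target encoder plus the target prediction network, and the word returned for position 2 is invented purely for illustration; all names are assumptions.

```python
import random

def predict_in_random_order(known, missing_positions, predict_fn, seed=None):
    """Predict missing units one by one in a randomly chosen order."""
    rng = random.Random(seed)
    order = list(missing_positions)
    rng.shuffle(order)       # the prediction order is determined randomly
    context = dict(known)    # position -> data unit
    for position in order:
        # Each prediction may use the known units and earlier predictions.
        context[position] = predict_fn(context, position)
    return order, [context[p] for p in sorted(context)]

def dummy_predict(context, position):
    """Hypothetical stand-in for the encoder and the prediction network."""
    return {1: "that", 2: "cat"}[position]  # illustrative outputs only

known = {3: "sat", 4: "on", 5: "the", 6: "mat"}
order, sentence = predict_in_random_order(known, [1, 2], dummy_predict, seed=0)
print(" ".join(sentence))  # that cat sat on the mat
```

The completed sequence is the same for any order here only because the stand-in predictor ignores its context; a real model would condition on it.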
It should be understood that the foregoing describes a method for predicting a to-be-predicted word by using text data as an example. The data processing method in this embodiment of this disclosure may be further applied to the computer vision field or the speech field. Specifically, the target text may be replaced with an image sequence or a speech sequence. Correspondingly, the preprocessing module performs operations such as reordering and block division on the sequence, to obtain a vector sequence of rearranged image or speech units and position information of a to-be-predicted position. The vector sequence and the position information are input to the autoregressive encoding module and the query module. Finally, the prediction module obtains the corresponding image or speech unit at the to-be-predicted position.
This embodiment of this disclosure may further be presented in a form of a service or software on a cloud side. With reference to
This embodiment of this disclosure provides the data processing method. The method includes: obtaining the M first embedding vectors and the second embedding vector, where each first embedding vector indicates the known data unit in target data and the first position of the known data unit in the target data, the second embedding vector indicates the second position, in the target data, of the first to-be-predicted data unit in the target data, and M is a positive integer; processing the M first embedding vectors by using the target encoder, to obtain the M first output vectors corresponding to the M known data units, where the first output vector corresponding to each known data unit is generated based on the M first embedding vectors; and processing the M first output vectors and the second embedding vector by using the target prediction network, to obtain the first to-be-predicted data unit. In the foregoing manner, for the M first embedding vectors corresponding to the M known data units, the target encoder may use the M first embedding vectors as input. The first embedding vectors include position information and data information of the known data units. M pieces of additional position information do not need to be separately set as the input of the target encoder, and a quantity of latent variables of intermediate output of the target encoder is also consistent with a quantity of input embedding vectors, thereby reducing a computation amount and memory consumption of the target encoder.
The foregoing describes a model inference process. The following describes, from a perspective of model training, the data processing method provided in embodiments of this disclosure.
1501: Obtain a first encoder, a first prediction network, M first embedding vectors, and a second embedding vector, where each first embedding vector indicates one known data unit in target data and a first position of the known data unit in the target data, the second embedding vector indicates a second position, in the target data, of a first to-be-predicted data unit in the target data, and M is a positive integer.
In this embodiment of this disclosure, the first encoder and the first prediction network are to-be-trained neural network models.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
In one embodiment, the first encoder is a first transformer layer, and the first prediction network is a second transformer layer.
For more description of operation 1501, refer to the description of operation 601. Details are not described herein again.
1502: Process the M first embedding vectors by using the first encoder, to obtain M first output vectors corresponding to M known data units, where a first output vector corresponding to each known data unit is generated based on the M first embedding vectors.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers. Data output by a previous transformer sub-layer adjacent to each transformer sub-layer may be processed by using the transformer sub-layer, to obtain M intermediate vectors. The M intermediate vectors are output to a next transformer sub-layer adjacent to the transformer sub-layer. If the transformer sub-layer is a transformer layer closest to an input side in the plurality of transformer sub-layers, input data of the transformer sub-layer is the M first embedding vectors.
If the transformer sub-layer is a transformer layer closest to an output side in the plurality of transformer sub-layers, output data of the transformer sub-layer is the M first output vectors.
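The serial arrangement of sub-layers described above can be sketched as a simple loop. In the sketch, each sub-layer is replaced by a hypothetical element-wise function (make_sublayer), which is not a real transformer sub-layer. The sketch only shows how M vectors flow from the input side to the output side.

```python
def make_sublayer(scale, shift):
    """Hypothetical stand-in for a transformer sub-layer: maps M vectors to M vectors."""
    def sublayer(vectors):
        return [[scale * v + shift for v in vec] for vec in vectors]
    return sublayer


def run_encoder(sublayers, first_embeddings):
    # The sub-layer closest to the input side receives the M first embedding vectors.
    hidden = first_embeddings
    for sublayer in sublayers:
        # Each sub-layer outputs M intermediate vectors to the next sub-layer.
        hidden = sublayer(hidden)
        assert len(hidden) == len(first_embeddings)
    # The sub-layer closest to the output side yields the M first output vectors.
    return hidden


encoder = [make_sublayer(2.0, 0.0), make_sublayer(1.0, 1.0), make_sublayer(0.5, 0.0)]
first_output_vectors = run_encoder(encoder, [[1.0, 2.0], [3.0, 4.0]])
```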
For more description of operation 1502, refer to the description of operation 602. Details of similarities are not described herein again.
1503: Process the M first output vectors and the second embedding vector by using the first prediction network, to obtain a third predicted data unit.
The third predicted data unit is a result of prediction performed by the first prediction network.
For more description of operation 1503, refer to the description of operation 603. Details of similarities are not described herein again.
1504: Update the first encoder and the first prediction network based on a difference between the third predicted data unit and the first to-be-predicted data unit, to obtain a target encoder and a target prediction network.
The third predicted data unit is the result of prediction performed by the first prediction network. Therefore, a loss needs to be constructed based on the difference between the third predicted data unit and the first to-be-predicted data unit, and the first encoder and the first prediction network are updated based on the constructed loss, to obtain the target encoder and the target prediction network. It should be understood that another network structure such as an embedding layer may also be updated based on the foregoing loss. This is not limited herein.
In one embodiment, the target data further includes a second to-be-predicted data unit. Before the M first embedding vectors are processed by using the first encoder to obtain a first output vector corresponding to each known data unit, a prediction order of the first to-be-predicted data unit and the second to-be-predicted data unit may be randomly determined. If the prediction order indicates that the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, a fourth embedding vector and a fifth embedding vector are obtained after the third predicted data unit is obtained. The fourth embedding vector indicates the first to-be-predicted data unit and the second position of the first to-be-predicted data unit in the target data. The fifth embedding vector indicates a third position, in the target data, of the second to-be-predicted data unit. The M first embedding vectors and the fourth embedding vector are processed by using the first encoder, to obtain M+1 second output vectors corresponding to the M known data units and the first to-be-predicted data unit. The M+1 second output vectors and the fifth embedding vector are processed by using the first prediction network, to obtain a fourth predicted data unit. Further, the first encoder and the first prediction network may be updated based on the difference between the third predicted data unit and the first to-be-predicted data unit and a difference between the fourth predicted data unit and the second to-be-predicted data unit, to obtain the target encoder and the target prediction network.
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
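The randomly determined prediction order can be sketched as follows. The function predict is a hypothetical stand-in for the first encoder followed by the first prediction network. The sketch assumes that, during training, the ground-truth data unit (not the predicted one) is appended to the encoder input after each prediction, which matches the description of the fourth embedding vector.

```python
import random


def training_predictions(known, targets, predict, seed=None):
    """known: position -> known data unit; targets: position -> ground-truth data unit."""
    rng = random.Random(seed)
    order = list(targets)
    rng.shuffle(order)  # the prediction order is randomly determined
    context = dict(known)  # starts with the M known data units
    predictions = []
    for position in order:
        predictions.append((position, predict(context, position)))
        # Assumption: the ground-truth unit joins the encoder input (M, then M+1 vectors).
        context[position] = targets[position]
    return order, predictions


def toy_predict(context, position):
    # Hypothetical predictor that only reports how many units it could see.
    return len(context)


order, predictions = training_predictions(
    {0: "a", 1: "b", 3: "d"}, {2: "c", 4: "e"}, toy_predict, seed=1
)
```

Whichever order is drawn, the first prediction sees M = 3 units and the second sees M + 1 = 4 units.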
For example, the target data is text data. Parameter optimization in a training stage may be performed by using a standard back propagation algorithm in deep learning. A loss function of this stage may be as follows:
$L(\theta_1)=\log P(y\mid x;\theta_1)=\sum_{i\in S}\log p(y_i\mid x;\theta_1).$
$\theta_1$ is all parameters (including a transformer parameter, a position vector parameter, and a classifier parameter) of the model, $x$ is an entire input sequence including several elements, $y$ represents a sequence including all words that need to be predicted (that is, an original word corresponding to each to-be-predicted position), $S$ represents a set of positions of all words in $y$, and $y_i$ represents a word that needs to be predicted at an $i$-th position.
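The sum over to-be-predicted positions in the foregoing function can be evaluated as in the following sketch. The probability tables are made-up illustrative values, not model output. Taking the negative of the sum as the quantity to minimize is a common convention and is an assumption here.

```python
import math


def log_likelihood(predicted_probs, targets):
    """predicted_probs: position i -> {word: p(word | x)}; targets: position i -> y_i."""
    return sum(math.log(predicted_probs[i][y_i]) for i, y_i in targets.items())


# Made-up distributions at two to-be-predicted positions (the set S = {2, 5}).
probs = {
    2: {"mobile": 0.5, "phones": 0.25, "cars": 0.25},
    5: {"good": 0.8, "bad": 0.2},
}
targets = {2: "mobile", 5: "good"}

objective = log_likelihood(probs, targets)
loss = -objective  # assumption: the negative log-likelihood is what gets minimized
```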
With reference to
1701: Obtain M first embedding vectors and a second embedding vector, where each first embedding vector indicates one data unit in target data and a first position of the data unit in the target data, the second embedding vector indicates a target processing task, and M is a positive integer.
Different from the embodiment corresponding to
In one embodiment, the first position indicates a relative position relationship between the data unit and another data unit.
In one embodiment, the target data is text data, and the data unit is a word in the text data;
For more specific description of operation 1701, refer to the description of operation 601 in the foregoing embodiment. Details are not described herein again.
1702: Process the M first embedding vectors by using a target encoder, to obtain M output vectors corresponding to M data units, where an output vector corresponding to each data unit is generated based on the M first embedding vectors.
The target encoder in this embodiment of this disclosure may be obtained by using the target encoder in the embodiment corresponding to
In one embodiment, the target encoder is a first transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers. Data output by a previous transformer sub-layer adjacent to each transformer sub-layer may be processed by using the transformer sub-layer, to obtain M intermediate vectors. The M intermediate vectors are output to a next transformer sub-layer adjacent to the transformer sub-layer. If the transformer sub-layer is a transformer layer closest to an input side in the plurality of transformer sub-layers, input data of the transformer sub-layer is the M first embedding vectors. If the transformer sub-layer is a transformer layer closest to an output side in the plurality of transformer sub-layers, output data of the transformer sub-layer is the M output vectors.
In one embodiment, the target encoder includes an attention head. Attention information may be obtained. The attention information indicates that there is an attention association between any two of the M first embedding vectors when the attention head processes the M first embedding vectors. The M first embedding vectors are processed based on the attention information by using the target encoder.
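The role of the attention information can be sketched with a simplified single attention head. The sketch omits the learned query, key, and value projections of a real transformer and uses the input vectors directly, which is a simplification made only for illustration. The name masked_self_attention is hypothetical.

```python
import math


def masked_self_attention(vectors, attention_info):
    """attention_info[i][j] is True when vector i has an attention association with vector j."""
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        scores = []
        for j, key in enumerate(vectors):
            if attention_info[i][j]:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(dim))
            else:
                scores.append(float("-inf"))  # no association: weight becomes zero
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return outputs


M = 3
first_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Attention association between any two of the M first embedding vectors.
full_attention = [[True] * M for _ in range(M)]
outputs = masked_self_attention(first_embeddings, full_attention)
```

With full attention information, every output vector is a weighted mixture of all M input vectors, so each known data unit is encoded with the context of every other known data unit.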
For more specific description of operation 1702, refer to the description of operation 602 in the foregoing embodiment. Details are not described herein again.
1703: Perform, by using a task network, processing corresponding to the target processing task on the M output vectors and the second embedding vector, to obtain a task processing result.
In one embodiment, the task network is a second transformer layer.
The following describes the data processing method provided in this embodiment of this disclosure by using two example target processing tasks: a text classification task and a reading comprehension task.
The autoregressive module may use a transformer layer as an autoregressive word vector encoder. The module adds each word vector in the rearranged word vector sequence and a position vector corresponding to the word vector (each position corresponds to one position vector, and is a part of parameters of a model). The attention matrix provided by the preprocessing module is used in the modeling process. The matrix defines whether each word is visible to another word in a process of modeling a word representation by the transformer layer. A solid line in
During training in a fine tuning stage, the model predicts a token corresponding to the sentence. Parameter optimization in the fine tuning stage may be performed by using a standard back propagation algorithm in deep learning. A loss function of this stage may be as follows:
$L(\theta_2)=\log P(y\mid x;\theta_2).$
$\theta_2$ is all parameters (including a transformer parameter, a position vector parameter, a task encoding parameter, and a classifier parameter) of the model, $x$ is the entire input sequence including several elements, and $y$ indicates the token corresponding to the sentence.
The autoregressive module may use a transformer as an autoregressive word vector encoder. The module adds each word vector in the rearranged word vector sequence and a position vector corresponding to the word vector (each position corresponds to one position vector, and is a part of parameters of a model). The attention matrix provided by the preprocessing module is used in the modeling process. The matrix defines whether each word is visible to another word in a process of modeling a word representation by the transformer. A solid line in the figure indicates visible. The transformer finally obtains a word vector representation that integrates context information for each word, and outputs the word vector representation to a prediction module. The query module outputs a task vector corresponding to a task type, and outputs the task vector to the prediction module. The prediction module still uses a transformer model. The model models a vector representation of the sentence. Each finally modeled word vector passes through two classifiers (probabilities whether each word is START and END are respectively output, as shown in a table in
During training in a fine tuning stage, the model predicts probabilities of START and END corresponding to each word in the passage. Parameter optimization in the fine tuning stage is performed by using a standard back propagation algorithm in deep learning. A loss function of this stage may be as follows:
$L(\theta_3)=\log P(y_{\mathrm{START}}\mid x;\theta_3)+\log P(y_{\mathrm{END}}\mid x;\theta_3).$
$\theta_3$ is all parameters of the model (including a transformer parameter, a position vector parameter, a task encoding parameter, and a classifier parameter), $x$ is the entire input sequence including several elements, $P(y_{\mathrm{START}}\mid x;\theta_3)$ indicates a probability that the model predicts a word in the START position in the answer to be START, and $P(y_{\mathrm{END}}\mid x;\theta_3)$ indicates a probability that the model predicts a word in the END position in the answer to be END.
In an inference stage, the fine-tuned model may be used for prediction for a downstream task. Using the text classification task and the reading comprehension task as examples, the prediction manner of the model is the same as that in the fine tuning stage. A token of a sentence or a word is obtained by using the four modules and a classifier. In the reading comprehension task, the model uses, as the word in the start position of the span, the word with the maximum START probability predicted by the classifier, and then uses, as the word in the end position of the span, the word with the maximum END probability after the start position.
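The span selection in the reading comprehension task can be sketched as follows. The probabilities are made-up illustrative values. The sketch assumes that the end position may coincide with the start position, so that a one-word answer is possible. The text does not state whether the search is inclusive.

```python
def select_span(start_probs, end_probs):
    # Start position: the word with the maximum START probability.
    start = max(range(len(start_probs)), key=lambda i: start_probs[i])
    # End position: maximum END probability at or after the start (inclusive by assumption).
    end = max(range(start, len(end_probs)), key=lambda i: end_probs[i])
    return start, end


words = ["the", "capital", "is", "Paris", "today"]
start_probs = [0.05, 0.10, 0.05, 0.70, 0.10]
end_probs = [0.60, 0.05, 0.05, 0.25, 0.05]

start, end = select_span(start_probs, end_probs)
answer = words[start:end + 1]
```

In this example the largest END probability overall is at index 0, but it is ignored because it precedes the start position.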
Based on the embodiments corresponding to
The obtaining module 2001 is configured to obtain M first embedding vectors and a second embedding vector. Each first embedding vector indicates one known data unit in target data and a first position of the known data unit in the target data. The second embedding vector indicates a second position, in the target data, of a first to-be-predicted data unit in the target data. M is a positive integer.
For specific description of the obtaining module 2001, refer to the description of operation 601 in the foregoing embodiment. This is not described herein again.
The encoding module 2002 is configured to process the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units. A first output vector corresponding to each known data unit is generated based on the M first embedding vectors.
For specific description of the encoding module 2002, refer to the description of operation 602 in the foregoing embodiment. This is not described herein again.
The prediction module 2003 is configured to process the M first output vectors and the second embedding vector by using a target prediction network, to obtain the first to-be-predicted data unit.
For specific description of the prediction module 2003, refer to the description of operation 603 in the foregoing embodiment. This is not described herein again.
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
In one embodiment, the target encoder is a first transformer layer, and the target prediction network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target encoder includes an attention head, and the encoding module is configured to: obtain attention information, where the attention information indicates that there is an attention association between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; and
In one embodiment, the apparatus further includes:
In one embodiment, the target data further includes a second to-be-predicted data unit, and a prediction order of the second to-be-predicted data unit and the first to-be-predicted data unit is randomly determined.
In one embodiment, if the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, the method further includes:
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
Specifically,
The obtaining module 2101 is configured to obtain M first embedding vectors and a second embedding vector. Each first embedding vector indicates one data unit in target data and a first position of the data unit in the target data. The second embedding vector indicates a target processing task. M is a positive integer.
For specific description of the obtaining module 2101, refer to the description of operation 1701 in the foregoing embodiment. This is not described herein again.
The encoding module 2102 is configured to process the M first embedding vectors by using a target encoder, to obtain M output vectors corresponding to M data units. An output vector corresponding to each data unit is generated based on the M first embedding vectors.
For specific description of the encoding module 2102, refer to the description of operation 1702 in the foregoing embodiment. This is not described herein again.
The task processing module 2103 is configured to perform, by using a task network, processing corresponding to the target processing task on the M output vectors and the second embedding vector, to obtain a task processing result.
For specific description of the task processing module 2103, refer to the description of operation 1703 in the foregoing embodiment. This is not described herein again.
In one embodiment, the first position indicates a relative position relationship between the data unit and another data unit.
In one embodiment, the target encoder is a first transformer layer, and the task network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using a target encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target encoder includes an attention head, and the encoding module is configured to: obtain attention information, where the attention information indicates that there is an attention association between any two of the M first embedding vectors when the attention head processes the M first embedding vectors; and
In one embodiment, the target data is text data, and the data unit is a word in the text data;
In one embodiment, the target processing task includes short text classification, long text classification, natural language inference, text similarity matching, or text emotion classification.
Specifically,
The obtaining module 2201 is configured to obtain a first encoder, a first prediction network, M first embedding vectors, and a second embedding vector. Each first embedding vector indicates one known data unit in target data and a first position of the known data unit in the target data. The second embedding vector indicates a second position, in the target data, of a first to-be-predicted data unit in the target data. M is a positive integer.
For specific description of the obtaining module 2201, refer to the description of operation 1501 in the foregoing embodiment. This is not described herein again.
The encoding module 2202 is configured to process the M first embedding vectors by using the first encoder, to obtain M first output vectors corresponding to M known data units. A first output vector corresponding to each known data unit is generated based on the M first embedding vectors.
For specific description of the encoding module 2202, refer to the description of operation 1502 in the foregoing embodiment. This is not described herein again.
The prediction module 2203 is configured to process the M first output vectors and the second embedding vector by using the first prediction network, to obtain a third predicted data unit.
For specific description of the prediction module 2203, refer to the description of operation 1503 in the foregoing embodiment. This is not described herein again.
The model training module 2204 is configured to update the first encoder and the first prediction network based on a difference between the third predicted data unit and the first to-be-predicted data unit, to obtain a target encoder and a target prediction network.
For specific description of the model training module 2204, refer to the description of operation 1504 in the foregoing embodiment. This is not described herein again.
In one embodiment, the first position indicates a relative position relationship between the known data unit and another known data unit and a relative position relationship between the known data unit and the first to-be-predicted data unit, and the second position indicates a relative position relationship between the first to-be-predicted data unit and each known data unit in the target data.
In one embodiment, the first encoder is a first transformer layer, and the first prediction network is a second transformer layer.
In one embodiment, the first transformer layer includes a plurality of serial transformer sub-layers, and the processing the M first embedding vectors by using the first encoder, to obtain M first output vectors corresponding to M known data units includes:
In one embodiment, the target data further includes a second to-be-predicted data unit, and a prediction order of the second to-be-predicted data unit and the first to-be-predicted data unit is randomly determined.
In one embodiment, if the second to-be-predicted data unit is predicted after the first to-be-predicted data unit, the method further includes:
The updating the first encoder and the first prediction network based on a difference between the third predicted data unit and the first to-be-predicted data unit, to obtain a target encoder and a target prediction network includes:
In one embodiment, a second output vector corresponding to each known data unit is generated based on the M first embedding vectors, and the second output vectors corresponding to the first to-be-predicted data unit are generated based on the M first embedding vectors and the fourth embedding vector.
In one embodiment, the target data is text data, the known data unit is a known word in the text data, and the first to-be-predicted data unit is a to-be-predicted word in the text data;
The following describes an execution device provided in an embodiment of this disclosure.
The memory 2304 may include a read-only memory and a random access memory, and provides instructions and data for the processor 2303. A part of the memory 2304 may further include a non-volatile random access memory (NVRAM). The memory 2304 stores processor-executable operation instructions, an executable module or a data structure, a subset thereof, or an extended set thereof. The operation instructions may include various operation instructions to implement various operations.
The processor 2303 controls an operation of the execution device. In specific application, components of the execution device are coupled by using a bus system. In addition to a data bus, the bus system may further include a power bus, a control bus, a status signal bus, and the like. However, for clear description, various types of buses in the figure are all referred to as the bus system.
The methods disclosed in the foregoing embodiments of this disclosure may be applied to the processor 2303, or may be implemented by the processor 2303. The processor 2303 may be an integrated circuit chip, and has a signal processing capability. In an implementation process, the operations in the foregoing methods may be implemented by using a hardware integrated logic circuit in the processor 2303, or by using instructions in a software form. The foregoing processor 2303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 2303 may implement or perform the methods, operations, and logical block diagrams that are disclosed in embodiments of this disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The operations of the methods disclosed with reference to embodiments of this disclosure may be directly performed and completed by a hardware decoding processor, or may be performed and completed by a combination of hardware and a software module in the decoding processor. The software module may be located in a mature storage medium in the art, for example, a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2304, and the processor 2303 reads information in the memory 2304 and completes the operations in the foregoing methods in combination with hardware of the processor 2303.
The receiver 2301 may be configured to: receive input digital or character information, and generate signal input related to settings and function control of the execution device. The transmitter 2302 may be configured to output digital or character information through a first interface. The transmitter 2302 may be further configured to send instructions to a disk group through the first interface, to modify data in the disk group. The transmitter 2302 may further include a display device such as a display.
In this embodiment of this disclosure, in a case, the processor 2303 is configured to perform the data processing method described in the embodiments corresponding to
An embodiment of this disclosure further provides a training device.
The training device 2400 may further include one or more power supplies 2426, one or more wired or wireless network interfaces 2450, one or more input/output interfaces 2458, or one or more operating systems 2441, for example, Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In this embodiment of this disclosure, the central processing unit 2424 is configured to perform the data processing method described in the embodiment corresponding to
An embodiment of this disclosure further provides a computer program product. When the computer program product is run on a computer, the computer is enabled to perform the operations performed by the foregoing execution device, or the computer is enabled to perform the operations performed by the foregoing training device.
An embodiment of this disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a program for signal processing. When the program is run on a computer, the computer is enabled to perform the operations performed by the foregoing execution device, or the computer is enabled to perform the operations performed by the foregoing training device.
The execution device, the training device, or the terminal device provided in embodiments of this disclosure may be specifically a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor. The communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that a chip in the execution device performs the data processing method described in the foregoing embodiments, or a chip in the training device performs the data processing method described in the foregoing embodiments. In one embodiment, the storage unit is a storage unit in the chip, for example, a register or a buffer. Alternatively, the storage unit may be a storage unit in a wireless access device but outside the chip, for example, a read-only memory (ROM), another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically,
The data processing methods described in the embodiments corresponding to
In some implementations, the operation circuit 2503 includes a plurality of processing engines (PE) inside. In some implementations, the operation circuit 2503 is a two-dimensional systolic array. The operation circuit 2503 may be alternatively a one-dimensional systolic array or another electronic circuit that can perform mathematical operations such as multiplication and addition. In some implementations, the operation circuit 2503 is a general-purpose matrix processor.
For example, it is assumed that there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches data corresponding to the matrix B from a weight memory 2502, and buffers the data on each PE in the operation circuit. The operation circuit fetches data of the matrix A from an input memory 2501, performs a matrix operation on the data of the matrix A and the matrix B, and stores an obtained partial result or final result of the matrix in an accumulator 2508.
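The accumulation of partial results can be illustrated in software with a minimal sketch. The sketch does not model the PEs, the memories, or the systolic array. It only shows a matrix product C = A x B computed tile by tile along the inner dimension, with each partial result added into an accumulator. The tile size and the name matmul_with_accumulator are hypothetical.

```python
def matmul_with_accumulator(a, b, tile=2):
    """Compute C = A x B, consuming the inner dimension in tiles of size `tile`."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0.0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)
        for i in range(rows):
            for j in range(cols):
                # Partial result for this tile of the inner dimension.
                partial = sum(a[i][k] * b[k][j] for k in range(k0, k1))
                accumulator[i][j] += partial
    return accumulator  # holds the final result after the last tile


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # input matrix
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # weight matrix
C = matmul_with_accumulator(A, B)
```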
A unified memory 2506 is configured to store input data and output data. Weight data is directly transferred to the weight memory 2502 through a direct memory access controller (DMAC) 2505. The input data is also transferred to the unified memory 2506 by using the DMAC.
The bus interface unit (BIU) 2510 is configured to implement interaction between an AXI bus and each of the DMAC 2505 and the instruction fetch buffer (IFB) 2509.
A bus interface unit (BIU) 2510 is used by an instruction fetch buffer 2509 to obtain instructions from an external memory, and is further used by the direct memory access controller 2505 to obtain original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly configured to transfer input data in the external memory DDR to the unified memory 2506, transfer weight data to the weight memory 2502, or transfer input data to the input memory 2501.
A vector calculation unit 2507 includes a plurality of operation processing units. If required, further processing is performed on an output of the operation circuit, for example, vector multiplication, vector addition, an exponential operation, a logarithmic operation, or size comparison. The vector calculation unit 2507 is mainly configured to perform network calculation at a non-convolutional/fully connected layer in a neural network, for example, batch normalization, pixel-level summation, and upsampling on a feature plane.
In some implementations, the vector calculation unit 2507 can store a processed output vector in the unified memory 2506. For example, the vector calculation unit 2507 may apply a linear function or a non-linear function to the output of the operation circuit 2503, for example, perform linear interpolation on a feature plane extracted at a convolutional layer. For another example, the linear function or the non-linear function is applied to a vector of an accumulated value to generate an activation value. In some implementations, the vector calculation unit 2507 generates a normalized value, a pixel-level summation value, or both. In some implementations, the processed output vector can be used as activation input to the operation circuit 2503, for example, to be used in a subsequent layer in the neural network.
The instruction fetch buffer 2509 connected to the controller 2504 is configured to store instructions used by the controller 2504.
The unified memory 2506, the input memory 2501, the weight memory 2502, and the instruction fetch buffer 2509 are all on-chip memories. The external memory is private to the hardware architecture of the NPU.
The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control program execution.
It should be further noted that the apparatus embodiments described above are merely examples, and units described as separate components may or may not be physically separate. A component displayed as a unit may or may not be a physical unit, and may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to an actual requirement, to achieve the objectives of the solutions in embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided in this disclosure, connection relationships between modules indicate that there are communication connections between the modules, and may be specifically implemented as one or more communication buses or signal cables.
Based on the description of the foregoing implementations, a person skilled in the art may clearly understand that this disclosure may be implemented by software in addition to necessary universal hardware, or by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Usually, any function implemented by a computer program can be easily implemented by using corresponding hardware. In addition, the specific hardware structures used to implement a same function may vary, for example, an analog circuit, a digital circuit, or a dedicated circuit. However, in this disclosure, a software program implementation is the better implementation in most cases. Based on such an understanding, the technical solutions of this disclosure essentially, or the part contributing to the conventional technology, may be implemented in a form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a training device, or a network device) to perform the methods in embodiments of this disclosure.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or some procedures or functions in embodiments of this disclosure are generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device, such as a training device or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Number | Date | Country | Kind |
---|---|---|---|
202110415349.1 | Apr 2021 | CN | national |
This application is a continuation of International Application No. PCT/CN2022/087028, filed on Apr. 15, 2022, which claims priority to Chinese Patent Application No. 202110415349.1, filed on Apr. 18, 2021. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.