The present application claims priority to Chinese Patent Application No. CN202411328132.7, filed with the China National Intellectual Property Administration on Sep. 23, 2024, the disclosure of which is hereby incorporated herein by reference in its entirety.
The present disclosure relates to the technical field of data processing, and especially to the technical fields of artificial intelligence, big data, deep learning and large models.
At present, the natural language field is developing towards the era of hyperscale models. Training models with massive numbers of parameters on massive textual data by means of super computing power can enable the resulting language models to have general semantic understanding and generation capabilities for multi-task and few-shot learning. While demonstrating strong generalization capabilities, large models have computational consumption and GPU memory footprints that grow quadratically with the length of the input, which brings significant cost overhead for training and deploying models, and additionally limits their capability to solve long-text tasks.
The present disclosure provides a model training method and apparatus, a model reasoning method and apparatus, and an electronic device.
According to an aspect of the present disclosure, provided is a model training method, including:
According to another aspect of the present disclosure, provided is a model reasoning method, including:
According to another aspect of the present disclosure, provided is a model training apparatus, including:
According to another aspect of the present disclosure, provided is a model reasoning apparatus, including:
According to another aspect of the present disclosure, provided is an electronic device, including:
According to another aspect of the present disclosure, provided is a non-transitory computer-readable storage medium storing a computer instruction thereon, where the computer instruction is used to cause a computer to perform the method of any one of embodiments of the present disclosure.
According to another aspect of the present disclosure, provided is a computer program product, including a computer program that, when executed by a processor, implements the method of any one of embodiments of the present disclosure.
In this way, according to the solution provided in the present disclosure, the initial token sequence can be folded based on the folding feature value to obtain a token sequence subjected to the folding (i.e., the first token sequence) with a length less than that of the initial token sequence. Further, the preset model is trained by using the token sequence subjected to the folding. As such, the model training efficiency is improved by compressing the input of the model, which lays the foundation for improving the model reasoning efficiency in the future.
It should be understood that the content described in this part is not intended to identify critical or essential features of embodiments of the present disclosure, nor is it used to limit the scope of the present disclosure. Other features of the present disclosure will be easily understood through the following description.
Accompanying drawings are intended for a better understanding of this solution and do not constitute a limitation on the present disclosure. In the figures:
Exemplary embodiments of the present disclosure will be described hereinafter in conjunction with accompanying drawings, which include various details of the embodiments of the present disclosure to aid in understanding and which should be considered merely exemplary. Accordingly, those ordinarily skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of the present disclosure. Similarly, descriptions of well-known features and structures will be omitted from the following description for the sake of clarity and brevity.
The term “and/or” as used herein merely describes an association relation of associated objects, indicating that there may be three types of relations; for example, A and/or B may indicate that only A exists, both A and B exist, or only B exists. The term “at least one” as used herein indicates any one of a plurality or any combination of at least two of the plurality; for example, including at least one of A, B and C may indicate the inclusion of any one or more elements selected from a set consisting of A, B and C. The terms “first” and “second” herein are used to refer to and distinguish between a plurality of similar technical terms, and are not meant to define an order or to limit the count to two. For example, a first feature and a second feature refer to two types/two features; the first feature may mean one or more, and the second feature may mean one or more.
In addition, in order to better illustrate the present disclosure, numerous specific details will be presented in the following detailed description. It should be understood by those skilled in the art that the present disclosure can be equally implemented without certain specific details. In some instances, methods, means, elements and circuits well known to those skilled in the art are not described in detail, in order to highlight the subject of the present disclosure.
The solution provided in the present disclosure provides a model training method to improve training efficiency of a large model.
Particularly,
Further, the method includes at least a portion of the following content. As shown in
Step S101: An initial token sequence for training a model is folded based on a folding feature value for folding a token sequence to obtain at least a first token sequence subjected to the folding.
Here, the initial token sequence represents a token sequence composed of T1 tokens, and the first token sequence obtained after the folding is performed includes some tokens of T1 tokens. Further, the first token sequence has a sequence length less than that of the initial token sequence.
For instance, in an example, the initial token sequence is folded in depth to obtain the first token sequence subjected to the folding. As such, the length of the input token is effectively compressed. In other words, the number of the tokens input to a preset model for the first time is reduced, which lays the foundation for improving the model training efficiency and the model reasoning efficiency in the future. For example, as shown in
In this way, according to the solution provided in the present disclosure, the initial token sequence can be folded according to the folding feature value to obtain a token sequence subjected to the folding (i.e., the first token sequence) with a length less than that of the initial token sequence. Further, the preset model is trained by using the token sequence subjected to the folding. As such, the model training efficiency is improved by compressing the input of the model, which lays the foundation for improving the model reasoning efficiency in the future.
Further, in the solution provided in the present disclosure, since the input length of the model is compressed, the computational complexity and GPU memory footprint can be effectively reduced, and further the cost required for training and deploying the model can be lowered. In addition, according to the solution provided in the present disclosure, the capability of the model to solve long-text tasks (for instance, long-text summarization and long-text question answering) is effectively improved.
In a specific example, the preset model can be particularly a large model. Furthermore, the preset model can be particularly a large language model. Alternatively, other models are also possible, and are not limited here.
Further, the method includes at least a portion of the following content. As shown in
Here, the initial token sequence represents the token sequence composed of T1 tokens, and the first token sequence has a sequence length less than that of the initial token sequence.
Further, the first token sequence includes t11 tokens displayed after being subjected to the folding, where the t11 tokens are a part of the T1 tokens. Further, the second token sequence includes t12 tokens hidden by the tokens displayed after being subjected to the folding, where the t12 tokens are a part of the T1 tokens. Here, t11+t12=T1.
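For illustration only, the following minimal Python sketch (the function name fold_sequence and the list-based token representation are assumptions made here, not part of the disclosure) shows one way to split a sequence by a folding feature value s so that tokens at indices 0, s, 2s, ... remain displayed and the remaining tokens are hidden, consistent with the x_{j+i×s} indexing used in the examples below.

```python
def fold_sequence(tokens, s):
    """Fold a token sequence by a folding feature value s (illustrative).

    Tokens at indices 0, s, 2*s, ... stay displayed (the first token
    sequence, t11 = T1/s tokens); the remaining tokens are hidden by the
    fold (the second token sequence, t12 = T1 - T1/s tokens) and are fed
    to deeper layers later. Here t11 + t12 = T1.
    """
    if len(tokens) % s != 0:
        raise ValueError("sequence length must be divisible by s")
    first = tokens[::s]                                    # displayed tokens
    second = [t for k, t in enumerate(tokens) if k % s]    # hidden tokens
    return first, second

# Folding the sequence {x0, x1, x2, x3} twice (s = 2):
first, second = fold_sequence(["x0", "x1", "x2", "x3"], s=2)
print(first)   # ['x0', 'x2'] -> input positions of the first layer
print(second)  # ['x1', 'x3'] -> fed to deeper layers (token x_{j+i*s} to layer j)
```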
For example, continuing with
For instance, in an example, each token in the first token sequence is inputted into a corresponding position of the first layer in the preset model in an order of each token in the first token sequence. Here, “position” can particularly refer to an input position of the token.
For example, continuing with the initial token sequence shown in
In this way, according to the solution provided in the present disclosure, each token in the first token sequence subjected to the folding is used as the input of the first layer in the preset model, and each token in the second token sequence is used as an input of the other layers in the preset model except for the first layer. As such, the integrity of the information of the input can be effectively ensured after the input of the model is compressed, and further the model training speed is increased while the model training effect is preserved, which lays the foundation for improving the model reasoning efficiency and reducing the cost required for training and deploying the model in the future.
Further, in an example, the N network layers are connected in series, and an output of a jth layer of the N network layers serves as an input of a (j+1)th layer of the N network layers. For instance, as shown in
In a specific example of the solution provided in the present disclosure, the input of each layer of the N network layers can be obtained in the following way, and particularly includes the following step:
Further, in an example, determining the input of the jth layer based on the folding feature value and the position where the jth layer of the N network layers is located in the N network layers may particularly include:
That is, in this example, the input of the first layer (for instance, when the value of j is 0) of the N network layers included in the preset model is the first token sequence, and inputs of other layers of the N network layers except for the first layer (i.e., when the value of j is an integer greater than 0 and less than N) are determined according to the numerical relationship between the number where the jth layer is located in the N network layers and the folding feature value. As such, provided is a refined solution for determining how the token sequences in the initial token sequence other than the first token sequence are input to the preset model, which is simple and efficient.
In this way, the solution provided in the present disclosure provides a refined solution for determining how to input other token sequences (i.e., tokens covered by the folding) except for the first token sequence into the preset model based on a folding degree of the information of the input, which is simple and efficient. As such, the information loss caused by folding the information of the input of the model is effectively avoided, the computational load of each network layer is balanced, and the burden on the server when the information is processed by the model is reduced, which lays the foundation for increasing the model training speed and reducing the cost required for training and deploying the model.
Further, in a specific example, the input of the jth layer can be obtained in the following way. Particularly, determining the input of the jth layer based on the numerical relationship between the number where the jth layer is located in the N network layers and the folding feature value can include at least one of the following two manners.
In a first manner: the input of a non-first layer (for instance, when j has a value ranging from 1 to (N−1)) is determined. Particularly, in a case where the number (for instance, which can be understood as a value of j) where the jth layer is located in the N network layers is less than the folding feature value (for instance, in an example, if (the depth-of-layer minus 1) is less than the folding feature value s), the input of the jth layer is obtained based on an implicit output result of a (j−1)th layer and at least one token in the second token sequence.
It should be noted that in this example, the number where the jth layer is located in the N network layers can particularly refer to the value of j. Further, the depth-of-layer can particularly refer to a position where the layer is located in all the layers. For instance, for the first layer of the four layers, the depth of the first layer can be 1, and the depth of the next layer is 2.
Here, the implicit output result can be particularly understood as the implicit result obtained after the token is processed by the network layer.
It should be noted that in an example, j has a value ranging from 0 to (N−1). Further, the solution provided in the present disclosure gives an exemplary explanation using j having a value ranging from 0 to (N−1) as an example. It can be understood that j may instead have a value ranging from 1 to N. In this case, the indices can be adjusted accordingly based on the actual values, and are not limited herein.
It can be understood that the input of the first layer (for instance, the value of j is 0) is each token in the first token sequence. The input of a non-first layer can be obtained based on the first manner. For instance, if the value of j is 1 and the folding feature value is 2, in this case, j is less than s, or (the depth-of-layer minus 1) (for instance, 2 minus 1) is less than 2; then the input of this layer (j=1) can particularly be the implicit output result of the 0th layer and at least one token (i.e., one token covered by the folding) in the second token sequence. In other words, in a scenario, for some layers meeting the conditions, in addition to using the implicit output result output from the previous layer (for instance, the (j−1)th layer) as the input of the next layer (for instance, the jth layer), it is necessary to introduce the token hidden by the token displayed after being subjected to the folding as another input of the next layer (for instance, the jth layer). As such, the foundation for effectively avoiding the loss of information of the original input is laid.
Further, in an example, obtaining the input of the jth layer based on the implicit output result of the (j−1)th layer and the at least one token in the second token sequence as described in the first manner may particularly include:
Here, j represents the number where a layer is located; i represents the input position of a token, and has a value depending on the value of T1 and the folding feature value. For instance, in an example, i has a value ranging from 0 to [(T1/s)−1]. s represents the folding feature value.
For example, taking a preset model consisting of four network layers (which can be denoted as {L0, L1, L2, L3}) for example and continuing with the initial token sequence {x0, x1, x2, x3} being folded twice to obtain the first token sequence {x0, x2} and the second token sequence {x1, x3} as an example, j has a value ranging from 0 to 3, and i has a value of 0 or 1. Further, as shown in
In this way, according to the solution provided in the present disclosure, at least one token in the second token sequence is additionally introduced into the input of the network layer meeting a depth-of-layer requirement. As such, the loss of the information of the original input is effectively avoided. In other words, according to the solution provided in the present disclosure, the specific input position required by each compressed token is obtained by utilizing the depth dimension of the network layer while the input of the preset model is compressed, such that the loss of the information of the original input is avoided, and the model training effect is ensured while the foundation for improving the model reasoning efficiency and reducing the cost required for training and deploying the model in the future is laid.
Further, in a specific example, obtaining the input of the ith position of the jth layer based on the implicit output result h_i^{j−1} of the ith position of the (j−1)th layer and the token x_{j+i×s} may particularly include:
For example, in an example, the input of the ith position of the jth layer can be denoted as x_i^j, and the depth folding function can be denoted as F(·,·), representing a sequence depth folding function for fusing the implicit output result of the previous layer (for instance, represented by a vector) with the additionally compressed input x_{j+i×s}. For instance, the F(·,·) function can easily be implemented as an element-wise addition operation.
In this case, if the number where the jth layer is located in the N network layers is less than the folding feature value s (for instance, if (the depth-of-layer minus 1) is less than the folding feature value s), the input x_i^j of the ith position of the jth layer can be expressed as: x_i^j = F(h_i^{j−1}, x_{j+i×s}), j < s.
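As a concrete reading of this expression, the sketch below implements F(·,·) as the element-wise addition mentioned above; the function name depth_fold, the use of NumPy vectors, and the toy values are assumptions for illustration, not the disclosed implementation.

```python
import numpy as np

def depth_fold(h_prev, x_emb):
    """Sequence depth folding function F(., .) realized as element-wise
    addition: fuses the previous layer's implicit output with the
    embedding of the hidden token (vectors of the same dimension)."""
    return h_prev + x_emb

# Input of position i at layer j (for j < s): x_i^j = F(h_i^{j-1}, x_{j+i*s}).
h_prev = np.array([0.1, 0.2, 0.3])   # implicit output h_i^{j-1}
x_emb = np.array([1.0, 0.0, -1.0])   # embedding of hidden token x_{j+i*s}
x_in = depth_fold(h_prev, x_emb)     # -> array([ 1.1,  0.2, -0.7])
```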
In this way, the solution provided in the present disclosure provides a specific solution for determining the input of the ith position of the jth layer. As such, the additionally introduced token of the second token sequence required for the input of the ith position of the jth layer can be quickly determined by using the solution, and it is further ensured that the information of the compressed token is not lost. As such, the model training effect is ensured while the foundation for improving the model reasoning efficiency and reducing the cost required for training and deploying the model in the future is laid.
In a second manner: the input of the non-first layer (for instance, if j has a value ranging from 1 to (N−1)) is determined. In a case where it is determined that the number where the jth layer is located in the N network layers is greater than or equal to the folding feature value (for instance, in an example, if (the depth-of-layer minus 1) is greater than or equal to the folding feature value s), the input of the ith position of the jth layer is obtained based on an implicit output result of the (j−1)th layer.
That is, in this example, if the number where the jth layer (the value of j is an integer greater than 0 and less than N) is located in the N network layers is greater than or equal to the folding feature value (for instance, if (the depth-of-layer minus 1) is greater than or equal to the folding feature value s), the input of the ith position of the jth layer can be obtained directly based on the implicit output result of the (j−1)th layer, without additionally introducing a token of the second token sequence. For instance, in an example, in this case the implicit output result h_i^{j−1} of the ith position of the (j−1)th layer can be directly used as the input of the ith position of the jth layer.
For example, continuing with a preset model consisting of four network layers {L0, L1, L2, L3} and the initial token sequence {x0, x1, x2, x3} being folded twice to obtain the first token sequence {x0, x2} and the second token sequence {x1, x3} as examples, j has a value ranging from 0 to 3, and i has a value of 0 or 1. Further, as shown in
For example, in an example, if the input of the ith position of the jth layer is denoted as x_i^j, and the number (for instance, a value of j) where the jth layer (a non-first layer, for instance, a value of j being an integer greater than 0 and less than N) is located in the N network layers is greater than or equal to the folding feature value s, the input x_i^j of the ith position of the jth layer can be expressed as: x_i^j = h_i^{j−1}, j ≥ s.
That is, when the number where a layer is located is greater than or equal to the folding feature value (which can also be referred to as a sequence depth folding multiple) s, the input of the ith position of the jth layer is the output of the previous layer, and no additionally folded and compressed tokens need to be fused.
It should be noted that according to the solution provided in the present disclosure, the length of the input required to be received by the model is linearly compressed. In order to ensure that information after the input is compressed is not lost, the compressed tokens are sequentially used as additional inputs of other layers except for the first layer in an original order of the inputs. Such a manner of the inputs fully utilizes the depth-of-layer dimension, effectively reduces the length of the input and avoids the loss of the information.
In this way, the solution provided in the present disclosure provides a specific solution for determining the input of the ith position of the jth layer, as also sketched below. In the solution, the specific information of the inputs required by different network layers is determined by fully utilizing the depth-of-layer dimension of the network layer. As such, the loss of the information of the original input due to the compression is avoided while the input of the model is effectively compressed. Further, the model training speed is increased while the model training effect is ensured, and the foundation for improving the model reasoning ability and reducing the cost required for training and deploying the model in the future is laid.
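Putting the two manners together, the following sketch traces how the inputs of all layers could be assembled in code; the helper name forward_folded, the treatment of each network layer as a plain callable, and the identity-layer toy usage are assumptions for illustration under the x_{j+i×s} indexing described above.

```python
import numpy as np

def forward_folded(token_embs, layers, s):
    """Forward pass over depth-folded inputs (illustrative sketch).

    token_embs: list of T embedding vectors for tokens x_0 .. x_{T-1}.
    layers: list of N callables, each mapping a list of input vectors to
            a list of implicit output vectors (stand-ins for real
            Transformer layers).
    """
    T = len(token_embs)
    positions = T // s
    # The first layer consumes the displayed tokens x_{i*s}.
    h = layers[0]([token_embs[i * s] for i in range(positions)])
    for j in range(1, len(layers)):
        if j < s:
            # First manner (j < s): fuse the hidden token x_{j+i*s}
            # into position i via element-wise addition.
            x_in = [h[i] + token_embs[j + i * s] for i in range(positions)]
        else:
            # Second manner (j >= s): the previous layer's output passes
            # straight through, with no extra token fused.
            x_in = h
        h = layers[j](x_in)
    return h

# Toy usage: four identity "layers", T = 4 tokens of dimension 2, s = 2.
layers = [lambda xs: xs for _ in range(4)]
embs = [np.full(2, float(k)) for k in range(4)]  # embeddings of x0..x3
print(forward_folded(embs, layers, s=2))
```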
Further, the method includes at least a portion of the following content. As shown in
Here, the initial token sequence represents a token sequence composed of T1 tokens, and the first token sequence has a sequence length less than that of the initial token sequence.
Further, the first token sequence includes t11 tokens displayed after being subjected to the folding, and the second token sequence includes t12 tokens hidden by the tokens displayed after being subjected to the folding.
Here, relevant explanations on the initial token sequence, the first token sequence and the second token sequence can refer to the above examples, and will be omitted here.
Here, relevant explanations on specific inputs of different network layers refer to the above examples, and will be omitted here.
In this way, the solution provided in the present disclosure provides a refined training solution for a preset model to obtain a trained model (i.e., the target model). As such, computational resources required for training the model are effectively reduced, and further the model training efficiency is improved. Moreover, the capability of the trained model to solve long-text tasks is improved, and the foundation for improving the model reasoning efficiency and reducing the cost required for training and deploying the model in the future is laid.
Further, in a specific example, the predicted token sequence includes a predicted token output by the last layer of the N network layers included in the preset model and predicted tokens output by other layers of the N network layers except for the last layer (for instance, the predicted tokens output by the last s (folding feature value) layers). For example, continuing with the preset model consisting of four network layers {L0, L1, L2, L3} and the initial token sequence {x0, x1, x2, x3} being folded twice to obtain the first token sequence {x0, x2} and the second token sequence {x1, x3} as examples, as shown in
Further, in an example, except for the last layer, the number where a network layer that outputs a predicted token is located is related to the folding feature value. For instance, the number (for instance, a value of j) where a network layer that outputs a predicted token is located is greater than or equal to the difference between the total number of the layers and the folding feature value. That is, not all network layers need to output final prediction results; only some network layers (for instance, the last s layers, which include the last layer) output prediction results. As such, the foundation for improving the model reasoning efficiency and reducing the cost required for training and deploying the model is laid.
For instance, in an example, in a case where j is greater than or equal to the difference between the total number of the layers and the folding feature value s (for instance, for j having a value ranging from 0 to (N−1), particularly, j > N−1−s), the output of the ith position of the jth layer is used for predicting x_{((s−1)−(N−1−j))+(i+1)×s}.
Here, i has a value depending on the value of T1 and the folding feature value.
Here, the relevant explanation of the value of i refers to the above example, and will be omitted here.
For example, continuing with
Or, in a case where j has a value ranging from 0 to 3, the output y_i^j of the ith position of the jth layer for the predicted token can particularly have the expression: y_i^j predicts x_{((s−1)−(N−1−j))+(i+1)×s} for j ≥ N−s; in this example (N=4, s=2), y_i^2 predicts x_{2i+2} and y_i^3 predicts x_{2i+3}.
That is, in the last s layers (for instance, from the (N−1−s+1)th layer to the (N−1)th layer, a total of s layers) of the preset model, each layer needs to predict the next token of the input token. For instance, for the input of the ith position, the token required for an (i+1)th position is predicted.
It should be noted that y_i^j is used for the predicted token x_{((s−1)−(N−1−j))+(i+1)×s}, which can particularly be understood as: the token y_i^j is the predicted next token of the token x_{((s−1)−(N−1−j))+(i+1)×s−1}.
For example, continuing with
That is, in an output prediction phase of the solution of the present disclosure, since one input position includes a plurality of compressed tokens (for instance, as shown in
In this way, the solution provided in the present disclosure provides a specific solution for determining the predicted token required at the output of the ith position of the jth layer. In this solution, the predicted tokens required at the outputs of different network layers are determined by fully utilizing the depth-of-layer dimension of the network layer, such that prediction reasoning can be completed. As such, the model reasoning and predicting efficiency can be effectively improved, and further the foundation for improving the model reasoning ability and reducing the cost required for training and deploying the model in the future is laid.
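The index expression above can be checked mechanically; the small sketch below (the helper name prediction_target is an assumption for illustration) enumerates which token each of the last s layers predicts in the four-layer, s = 2 example.

```python
def prediction_target(j, i, N, s):
    """Index of the token predicted by output y_i^j of position i at
    layer j; only the last s layers (j >= N - s) produce predictions,
    per the expression x_{((s-1)-(N-1-j)) + (i+1)*s}."""
    assert j >= N - s, "only the last s layers output predictions"
    return ((s - 1) - (N - 1 - j)) + (i + 1) * s

# Four layers (N = 4), folding feature value s = 2, input {x0, x1, x2, x3}:
for j in (2, 3):
    for i in (0, 1):
        print(f"layer {j}, position {i} predicts x{prediction_target(j, i, N=4, s=2)}")
# layer 2, position 0 predicts x2    layer 2, position 1 predicts x4
# layer 3, position 0 predicts x3    layer 3, position 1 predicts x5
```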
Further, the method includes at least a portion of the following content. As shown in
Here, the initial to-be-reasoned token sequence represents a token sequence composed of T2 tokens, and the first target to-be-reasoned token sequence obtained after the folding is performed includes some tokens of T2 tokens. Further, the first target to-be-reasoned token sequence has a sequence length less than that of the initial to-be-reasoned token sequence.
Here, an example of folding the initial to-be-reasoned token sequence refers to an example corresponding to
Here, the target reasoning result is the next token sequence, obtained by prediction, of the to-be-reasoned token sequence.
In this way, according to the solution provided in the present disclosure, the initial to-be-reasoned token sequence can be folded according to the folding feature value to obtain a token sequence subjected to the folding (i.e., the first target to-be-reasoned token sequence) having a length less than that of the initial to-be-reasoned token sequence. Further, reasoning can be performed by the target model on the folded token sequence. As such, the model reasoning efficiency is improved, and further the user experience is effectively enhanced, by compressing the input of the model.
Further, in the solution provided in the present disclosure, since the length of the input of the model is compressed, the computational complexity and GPU memory footprint can be effectively reduced, and further the cost required for model reasoning is lowered. In addition, according to the solution provided in the present disclosure, the capability of the model to solve long-text tasks (for instance, long-text summarization and long-text question answering) is effectively improved.
In a specific example, the target model can be particularly a large model. Furthermore, the target model can be particularly a large language model. Alternatively, other models are also possible, and are not limited herein.
Further, the target model is trained by using any one of the model training methods described above.
Further, in a specific example, folding the initial to-be-reasoned token sequence to obtain at least the first target to-be-reasoned token sequence (for instance, step S902) may particularly include: folding the initial to-be-reasoned token sequence to obtain the first target to-be-reasoned token sequence and a second target to-be-reasoned token sequence.
Correspondingly, inputting at least the first target to-be-reasoned token sequence into the target model to obtain a target reasoning result (for instance, step S903) particularly includes:
For instance, in an example, each token in the first target to-be-reasoned token sequence can be input to a corresponding position of the first layer of the N target network layers included in the target model in an order of each token in the first target to-be-reasoned token sequence. Here, “position” can particularly refer to an input position of the token. Specific examples can refer to an example as shown in
Further, the first target to-be-reasoned token sequence includes t21 tokens displayed after being subjected to the folding, where the t21 tokens are a part of the T2 tokens. Further, the second target to-be-reasoned token sequence includes t22 tokens hidden by the tokens displayed after being subjected to the folding, where the t22 tokens are a part of the T2 tokens. Here, t21+t22=T2.
Here, relevant contents of the first target to-be-reasoned token sequence and the second target to-be-reasoned token sequence can refer to an example as shown in
In this way, according to the solution provided in the present disclosure, each token in the first target to-be-reasoned token sequence subjected to the folding is used as the input of the first layer in the target model, and each token in the second target to-be-reasoned token sequence is used as inputs of other layers in the target model except for the first layer. As such, the integrity of the information of the input can be effectively ensured after the input of the model is compressed, and further the model reasoning effect is ensured while the model reasoning speed is increased.
Further, in a specific example, obtaining the target reasoning result particularly includes:
Here, in an example, n has a value depending on the folding feature value s; further, n is equal to the folding feature value s. In this case, the target reasoning result output by the last position of each of the last s target network layers of the N target network layers can be obtained.
Further, the target reasoning result is at least the predicted next token of the last token in the initial to-be-reasoned token sequence. Further, the number of the tokens included in the target reasoning result is related to the folding feature value s. For instance, the number of the tokens included in the target reasoning result is equal to the folding feature value s.
For example, with a target model consisting of four target network layers {L0*, L1*, L2*, L3*} and the initial to-be-reasoned token sequence {x0, x1, x2, x3} being folded twice to obtain a first target to-be-reasoned token sequence {x0, x2} and a second target to-be-reasoned token sequence {x1, x3} as examples, as shown in
In this way, the solution provided in the present disclosure can quickly obtain a target model reasoning result, effectively saving the computational resources required for reasoning the model. Moreover, a response can be provided to a user in real time and an accurate reasoning result is quickly provided, such that the user experience is improved.
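To make the collection of the target reasoning result concrete, the following sketch (the helper name target_reasoning_result and the dict-of-predictions layout are assumptions for illustration) reads the s next tokens from the last input position of each of the last s target network layers.

```python
def target_reasoning_result(predictions, N, s):
    """Assemble the target reasoning result (illustrative sketch).

    predictions: dict mapping a layer number j (only the last s layers,
    j = N-s .. N-1) to the list of predicted tokens at each position.
    The s next tokens are read from the last input position, shallower
    layer first, since by x_{((s-1)-(N-1-j)) + (i+1)*s} a shallower
    layer predicts the earlier token.
    """
    last = len(predictions[N - 1]) - 1  # index of the last input position
    return [predictions[j][last] for j in range(N - s, N)]

# With N = 4, s = 2 and the input {x0, x1, x2, x3} folded into {x0, x2}
# and {x1, x3}: the result is the next s = 2 tokens after x3.
preds = {2: ["x2_hat", "x4_hat"], 3: ["x3_hat", "x5_hat"]}
print(target_reasoning_result(preds, N=4, s=2))  # ['x4_hat', 'x5_hat']
```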
Further, in an example, the N target network layers described above are connected in series. The output of the jth layer of the N target network layers serves as the input of the (j+1)th layer of the N target network layers. It should be noted that relevant contents of the example can refer to an example in
In a specific example of the solution provided in the present disclosure, the input of each layer in the N target network layers can be obtained in the following way and particularly includes:
Further, in an example, determining the input of the jth layer of the N target network layers based on the folding feature value and the position of the jth layer of the N target network layers in the N target network layers may particularly include:
That is, in this example, the input of the first layer (for instance, when j has a value of 0) of the N target network layers included in the target model is the first target to-be-reasoned token sequence. Inputs of other layers of the N target network layers except for the first layer (i.e., when a value of j is an integer greater than 0 and less than N) need to be determined according to the numerical relationship between the number where the jth layer of the N target network layers is located in the N target network layers and the folding feature value. As such, provided is a refined solution for determining how the token sequences in the initial to-be-reasoned token sequence other than the first target to-be-reasoned token sequence are input into the target model, which is simple and efficient.
In this way, the solution provided in the present disclosure provides a refined solution for determining how to input other token sequences (i.e., tokens covered by the folding) except for the first target to-be-reasoned token sequence into the target model according to a folding degree of the information of the input, which is simple and efficient. As such, the loss of the information due to the folding of the information of the input of the model is effectively avoided, the computational load of each target network layer is balanced, and the burden on the server when the information is processed by the model is reduced, which lays the foundation for increasing the model reasoning speed and reducing the cost required for reasoning the model.
Further, in a specific example, the input of the jth layer of the N target network layers can be obtained in the following way. Particularly, determining the input of the jth layer of the N target network layers based on the numerical relationship between the number where the jth layer of the N target network layers is located in the N target network layers and the folding feature value may include at least one of the following two manners.
In a first manner: the input of the non-first layer (for example, when j has a value ranging from 1 to (N−1)) of the N target network layers is determined. Particularly, in a case where it is determined that the number (for instance, which can be understood as a value of j) where the jth layer of the N target network layers is located in the N target network layers is less than the folding feature value (for instance, in an example, if (the depth-of-layer minus 1) is less than the folding feature value s), the input of the jth layer of the N target network layers is obtained based on the implicit output result of the (j−1)th layer of the N target network layers and at least one token in the second target to-be-reasoned token sequence.
Here, relevant contents on the number where the jth layer of the N target network layers is located in the N target network layers, the depth-of-layer and the implicit output result refer to the above example, and will be omitted here.
It should be noted that in an example, j has a value ranging from 0 to (N−1). Further, the solution provided in the present disclosure is exemplarily explained by using j having a value ranging from 0 to (N−1) as an example. It can be understood that j can instead have a value ranging from 1 to N. In this case, the indices can be adjusted accordingly based on the actual values, and are not limited herein.
It can be understood that the input of the first layer (for instance, j has a value of 0) of the N target network layers is each token in the first target to-be-reasoned token sequence. The input of a non-first layer of the N target network layers can be obtained based on the first manner. For instance, with j having a value of 1 and the folding feature value being 2 as an example, j is less than s, or (the depth-of-layer minus 1) (for instance, 2 minus 1) is less than 2; the input of this layer (j=1) of the N target network layers can particularly be the implicit output result of the 0th layer of the N target network layers and at least one token in the second target to-be-reasoned token sequence (i.e., one token covered by the folding). In other words, in a scenario, for some layers meeting the conditions, in addition to using the implicit output result of the previous layer (for instance, the (j−1)th layer of the N target network layers) as the input of the next layer (for instance, the jth layer of the N target network layers), it is necessary to additionally introduce the token hidden by the token displayed after being subjected to the folding as another input of the next layer (for instance, the jth layer of the N target network layers). As such, the foundation for effectively avoiding the loss of information of the original input is laid.
Further, in an example, obtaining the input of the jth layer of the N target network layers based on the implicit output result of the (j−1)th layer of the N target network layers and at least one token in the second target to-be-reasoned token sequence as described in the first manner can particularly include:
Here, j represents the number of a layer; i represents the input position of the token, and has a value depending on the value of T2 and the folding feature value. For instance, in an example, i has a value ranging from 0 to [(T2/s)−1]. s represents the folding feature value.
Here, relevant contents on this section refer to the example shown in
In this way, according to the solution provided in the present disclosure, at least one token of the second target to-be-reasoned token sequence is additionally introduced into the input of the target network layer meeting a depth-of-layer requirement. As such, the loss of information of the original input is effectively avoided. In other words, according to the solution provided in the present disclosure, the specific input position required by each compressed token is obtained by utilizing the depth dimension of the target network layer while the input of the model is compressed, such that the loss of the information of the original input is effectively avoided, and further the model reasoning speed is increased while the model training effect is ensured.
Further, in a specific example, obtaining the input of the ith position of the jth layer of the N target network layers based on the implicit output result h_i^{j−1} of the ith position of the (j−1)th layer of the N target network layers and the token x_{j+i×s} may particularly include:
For example, in an example, the input of the ith position of the jth layer of the N target network layers can be denoted as x_i^j, and the depth folding function can be denoted as F(·,·), representing a sequence depth folding function for fusing the implicit output result of the previous layer (for instance, represented by a vector) with the additionally compressed input x_{j+i×s}. For instance, the F(·,·) function can easily be implemented as an element-wise addition operation.
In this case, in a case where the number where the jth layer of the N target network layers is located in the N target network layers is less than the folding feature value s (for instance, in a case where (the depth-of-layer minus 1) is less than the folding feature value s), the input x_i^j of the ith position of the jth layer of the N target network layers can be expressed as: x_i^j = F(h_i^{j−1}, x_{j+i×s}), j < s.
In this way, the solution provided in the present disclosure provides a specific solution for determining the input of the ith position of the jth layer of the N target network layers. As such, the additionally introduced token of the second target to-be-reasoned token sequence required for the input of the ith position of the jth layer of the N target network layers can be quickly determined by using the solution, and it is further ensured that the information of the compressed token is not lost. As such, the model reasoning efficiency is improved while the model training effect is ensured.
In a second manner: the input of the non-first layer (for instance, when j has a value ranging from 1 to (N−1)) of the N target network layers is determined. Particularly, in a case where it is determined that the number where the jth layer of the N target network layers is located in the N target network layers is greater than or equal to the folding feature value (for instance, in an example, if (the depth-of-layer minus 1) is greater than or equal to the folding feature value s), the input of the ith position of the jth layer of the N target network layers is obtained based on an implicit output result of the (j−1)th layer of the N target network layers.
That is, in this example, in a case where the number where the jth layer of the N target network layers (a value of j is an integer greater than 0 and less than N) is located in the N target network layers is greater than or equal to the folding feature value (for instance, if (the depth-of-layer minus 1) is greater than or equal to the folding feature value s), the input of the ith position of the jth layer of the N target network layers can be obtained directly based on the implicit output result of the (j−1)th layer of the N target network layers, without additionally introducing a token of the second target to-be-reasoned token sequence. For instance, in an example, in this case the implicit output result h_i^{j−1} of the ith position of the (j−1)th layer of the N target network layers can be directly used as the input of the ith position of the jth layer of the N target network layers. That is, when the number where a layer is located is greater than or equal to the folding feature value (which can also be referred to as the sequence depth folding multiple) s, the input of the ith position of the jth layer is the output of the previous layer, and no additionally folded and compressed tokens need to be fused.
Here, relevant contents on this example can refer to the example shown in
It should be noted that according to the solution provided in the present disclosure, the length of the input required to be received by the model is linearly compressed. In order to ensure that the information after the input is compressed is not lost, the compressed tokens are sequentially used as additional inputs of other layers except for the first layer in an order of original inputs. Such a manner of the inputs fully utilizes the depth-of-layer dimension, effectively reduces the length of the input and avoids the loss of the information.
In this way, the solution provided in the present disclosure provides a specific solution for determining the input of the ith position of the jth layer of the N target network layers. In the solution, the specific information of the inputs required by different network layers is determined by fully utilizing the depth-of-layer dimension of the target network layer. As such, the loss of the information of the original input due to the compression is avoided while the input of the model is effectively compressed. Further, the model reasoning speed is increased while the model training effect is ensured.
In summary, the solution provided in the present disclosure has the following advantages.
Firstly, the efficiency is relatively high. Compared with improved solutions based on efficient Transformer structures, the solution provided in the present disclosure is simple to implement and can improve the model training and reasoning efficiency in practical application scenarios. Moreover, the solution provided in the present disclosure is not affected by the compression of the information of the input, thereby supporting arbitrarily long text inputs.
Secondly, both the training efficiency and the reasoning efficiency are improved. Compared with efficient reasoning solutions for low-resource scenarios, the solution provided in the present disclosure not only plays an acceleration role in the training phase, but also improves the efficiency in the reasoning phase, achieving the integration of training and reasoning and thereby ensuring better results.
The solution provided in the present disclosure provides a model training apparatus, as shown in
In a specific example of the solution provided in the present disclosure, the first data processing unit is particularly configured to fold the initial token sequence to obtain the first token sequence and a second token sequence, where the first token sequence includes t11 tokens displayed after being subjected to the folding, and the second token sequence includes t12 tokens hidden by the tokens displayed after being subjected to the folding; and
the model training unit is particularly configured to adjust at least some network parameters in the N network layers by using each token in the first token sequence as an input of a first layer of N network layers included in the preset model and at least using each token in the second token sequence as partial inputs of other layers of the N network layers except for the first layer, to obtain the target model.
In a specific example of the solution provided in the present disclosure, the N network layers are connected in series, and an output of a jth layer of the N network layers serves as an input of a (j+1)th layer of the N network layers.
In a specific example of the solution provided in the present disclosure, the first data processing unit is further configured to determine an input of the jth layer based on the folding feature value and a position where the jth layer of the N network layers is located in the N network layers, where j has a value depending on the N.
In a specific example of the solution provided in the present disclosure, the first data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the first data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the first data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the first data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the first data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the model training unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the predicted token sequence includes a predicted token output by the last layer of the N network layers included in the preset model and predicted tokens output by some of the other layers of the N network layers except for the last layer.
In a specific example of the solution provided in the present disclosure, except for the last layer, the number where a network layer that outputs predicted tokens is located is related to the folding feature value.
In a specific example of the solution provided in the present disclosure, the number where a network layer that outputs predicted tokens is located is greater than or equal to the difference between the total number of the layers and the folding feature value.
In a specific example of the solution provided in the present disclosure, in a case where j is greater than or equal to a difference between the total number of the layers and the folding feature value s, the predicted token output from the ith position of the jth layer is used for predicting: x_{((s−1)−(N−1−j))+(i+1)×s},
where i has a value depending on the value of T1 and the folding feature value.
The solution provided in the present disclosure further provides a model reasoning apparatus, as shown in
In a specific example of the solution provided in the present disclosure, the second data processing unit is particularly configured to: fold the initial to-be-reasoned token sequence to obtain the first target to-be-reasoned token sequence and a second target to-be-reasoned token sequence, where the first target to-be-reasoned token sequence includes t21 tokens displayed after being subjected to the folding, and the second target to-be-reasoned token sequence includes t22 tokens hidden by the tokens displayed after being subjected to the folding; and
In a specific example of the solution provided in the present disclosure, the N target network layers are connected in series, and an output of a jth layer of the N target network layers serves as an input of a (j+1)th layer of the N target network layers.
In a specific example of the solution provided in the present disclosure, the second data processing unit is further configured to determine an input of the jth layer of the N target network layers based on the folding feature value and a position where the jth layer of the N target network layers is located in the N target network layers, where j has a value depending on the N.
In a specific example of the solution provided in the present disclosure, the second data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the second data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the second data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the second data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the second data processing unit is particularly configured to:
In a specific example of the solution provided in the present disclosure, the model reasoning unit is particularly configured to:
The description of specific functions and examples of the units of the apparatus of the embodiments of the present disclosure can be found in the relevant descriptions of the corresponding steps in the method of the embodiments, and will be omitted here.
In the technical solution provided in the present disclosure, the acquisition, storage and application of personal information of users involved are in accordance with relevant laws and regulations and do not violate public order and morals.
According to an embodiment of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.
As shown in
A plurality of components in the device 1300 are connected to the I/O interface 1305, including an input unit 1306 such as a keyboard and a mouse, an output unit 1307 such as various types of displays and speakers, the storage unit 1308 such as a disk and a CD, and a communication unit 1309 such as a network card, a modem and a wireless communication transceiver. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
The computing unit 1301 can be various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1301 include but are not limited to central processing units (CPUs), graphics processing units (GPUs), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like. The computing unit 1301 executes various methods and processes described above, such as a model training method or a model reasoning method. For example, in some embodiments, the model training method or the model reasoning method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 1308. In some embodiments, some or all of the computer programs may be loaded and/or installed onto the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer programs are loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the model training method or the model reasoning method described above can be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the model training method or the model reasoning method through any other suitable means (e.g., by means of firmware).
Various implementations of systems and technologies described above in the present disclosure can be implemented in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on chip (SOC) system, a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or a combination thereof. These various implementations may include: implementation in one or more computer programs, where the one or more computer programs are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and can receive data and instructions from a storage system, at least one input apparatus and at least one output apparatus, and transmit the data and the instructions to the storage system, the at least one input apparatus and the at least one output apparatus.
A program code for implementing the method of the present disclosure can be written in any combination of one or more programming languages. These program codes can be provided to a processor or a controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatuses, such that the program code, when executed by the processor or the controller, implements functions/operations specified in the flowchart and/or the block diagram. The program code can be executed entirely on a machine, partially on a machine, partially on a machine as a standalone software package and partially on a remote machine, or entirely on a remote machine or a server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium containing or storing a program for use by or in combination with an instruction execution system, apparatus or device. The machine-readable medium can be a machine-readable signal medium or machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus and device, or any suitable combination of the above. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard drive, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In order to provide interaction with a user, the systems and technologies described herein can be implemented on a computer equipped with a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other types of apparatuses can also be used to provide interaction with the user. For example, feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user can be received in any form (including sound input, speech input or tactile input).
A system and a technology described herein can be implemented in a computing system including a backend component (for example, as a data server), or a computing system including a middleware component (for example, an application server), or a computing system including a frontend component (for example, a user computer with a graphical user interface or a web browser, through which the user can interact with the implementation of the system and the technology described herein), or a computing system including any combination of the backend component, the middleware component or the frontend component. The components of the system can be interconnected through any form or medium of digital data communication (for example, a communication network). Examples of the communication network include a local area network (LAN), a wide area network (WAN) and the Internet.
The computer system can include a client and a server. The client and the server are generally far apart from each other and typically interact through the communication network. A client-server relationship is generated by computer programs running on corresponding computers and having the client-server relationship with each other. The server can be a cloud server, or a server of a distributed system, or a server combining a blockchain.
It should be understood that steps can be reordered, added or deleted by using various forms of processes shown above. For example, the steps described in the present disclosure can be performed in parallel, or sequentially or in different orders provided that the desired results of the technical solution disclosed in the present disclosure can be achieved, and are not limited herein.
The detailed description does not constitute a limitation on the scope of protection of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions can be made based on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the principles of the present disclosure shall be included within the scope of protection of the present disclosure.