This application claims priority to Chinese Patent Application No. 202110820258.6, filed on Jul. 20, 2021, the entire content of which is incorporated herein by reference.
The present disclosure relates to the field of computer technology, and in particular to a method of executing an operation, an electronic device, and a computer-readable storage medium, which may be used in the field of artificial intelligence, especially in the field of deep learning.
With the wide application of deep learning training, increasingly high requirements are placed on the speed of deep learning training. Various operations in the deep learning training may involve a scalar operation, a vector operation, etc. In a deep learning algorithm, a complex operation, such as a tensor operation, is usually performed for various application scenarios. The tensor operation may be decomposed, using a compiler, into multiple continuous vector operations. Executing these vector operations consumes a large amount of computing resources. As a result, it is difficult to process a large number of vector operations in time, and the system for deep learning training may even quit the execution of the operation due to insufficient computing resources. Therefore, the efficiency of a large number of continuous vector operations should be improved, so as to improve the speed of the whole deep learning training.
The present disclosure provides a method of executing an operation, an electronic device, and a computer-readable storage medium.
According to an aspect of the present disclosure, a method of executing an operation in a deep learning training is provided, including:
acquiring an instruction for the operation, the operation including a plurality of vector operations;
determining, for each vector operation of the plurality of vector operations, two source operand vectors for a comparison; and
executing the vector operation on the two source operand vectors using an instruction format for the vector operation, so as to obtain an operation result including a destination operand vector.
According to an aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to implement the method of the present disclosure described above.
According to an aspect of the present disclosure, a non-transitory computer-readable storage medium having computer instructions therein is provided, wherein the computer instructions are configured to cause a computer to implement the method of the present disclosure described above.
It should be understood that content described in this section is not intended to identify critical or important features in embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be easily understood through the following description.
The accompanying drawings are used for better understanding of the solution and do not constitute a limitation to the present disclosure.
Exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, which include various details of embodiments of the present disclosure to facilitate understanding and should be considered as merely exemplary. Therefore, those of ordinary skill in the art should realize that various changes and modifications may be made to embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
As described in the background above, with the wide application of deep learning training, increasingly high requirements are placed on the speed of deep learning training. Various operations in the deep learning algorithm may involve the scalar operation, the vector operation, etc. An existing tensor operation in the deep learning algorithm may be decomposed into multiple continuous vector operations. These vector operations involve computation for the SETcc (set on condition code) operation. For example, SETlt and SETgt each belong to a type of SETcc operation, and the main operations of the SETcc operation are shown in Table 1 below.
In the SETcc operation, the destination operand is set to 0 or 1 of a data type according to a result of comparing the values of two source operands. The data type of the destination operand is consistent with the data type of the source operands. The element-wise (EW) comparison operation is a common operation in a deep learning algorithm. In a process of training the algorithm, both SETlt and SETgt are used in a reverse gradient computation of the EW comparison operation. Table 2 below shows algorithms of the common EW comparison operation.
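Table 1 is not reproduced in this text. As a minimal sketch only, assuming the conventional SETcc semantics (the destination operand receives 1 of the source data type when the condition holds, and 0 otherwise; the operand order shown is an assumption), SETlt and SETgt and a hypothetical use of them in a reverse gradient computation may be modeled in C++ as follows:

```cpp
#include <cstdio>

// Sketch of scalar SETcc semantics (operand order is an assumption):
// the destination is set to 0 or 1 of the same data type as the sources.
template <typename T>
T set_lt(T a, T b) { return a < b ? T(1) : T(0); }  // SETlt: "less than"

template <typename T>
T set_gt(T a, T b) { return a > b ? T(1) : T(0); }  // SETgt: "greater than"

int main() {
    // Hypothetical reverse gradient of an EW maximum out = max(a, b):
    // the gradient mask for a is (a > b) and the mask for b is (a < b),
    // each of which is exactly a SETcc operation.
    float a = 2.0f, b = 3.0f, grad_out = 1.0f;
    float grad_a = set_gt(a, b) * grad_out;  // 0.0: a did not win the max
    float grad_b = set_lt(a, b) * grad_out;  // 1.0: b did win the max
    printf("grad_a=%.1f grad_b=%.1f\n", grad_a, grad_b);
    return 0;
}
```

The EW maximum and its gradient masks above are only one illustration of how SETlt and SETgt may appear in the reverse computation of an EW comparison operation.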
In the deep learning training, it is worth considering how to accelerate a vector operation in an acceleration unit of a reverse training algorithm in an artificial intelligence (AI) chip processor, so as to improve a computation speed of the deep learning training process. When there are a large number of operations, the speed of computing these operations becomes a main limitation on the computing ability of the artificial intelligence chip processor. In the deep learning training, the execution of a large number of vector operations usually requires a large amount of computing resources. As a result, it is difficult to process a large number of vector operations in time, and the system for deep learning training may even quit the execution of these operations due to insufficient computing resources. Furthermore, main deep learning algorithms in existing technologies have some problems in dealing with a large number of vector operations. For example, vector acceleration units of existing CPU and GPU processors do not support the SETcc instruction. When the training of the deep learning algorithm involves the SETcc operation, two solutions are generally adopted: (1) using a scalar unit to perform a serialization operation, and (2) accelerating by starting multiple cores in parallel. The solution (1) is usually used in a CPU processor from a manufacturer such as Intel or ARM. This kind of processor usually includes a small number of cores. In view of a programming model, it is not suitable to execute the same algorithm kernel on multiple processor cores at the same time. Therefore, it is only possible to perform serial processing by using the scalar processing unit of each core. The serial processing consumes a relatively long time, and a delay of the serial processing is N (e.g., 8 or 16) times that of a parallel processing. The solution (2) is usually used in a GPU processor. A GPU has a larger number of threads, and in view of a programming model, it tends to divide a task onto multiple threads for execution. Different from the serial processing, the speed is improved, but there is a problem of large overhead for synchronization between threads. Therefore, existing technologies underutilize the chip processor, which results in a low performance-to-power-consumption ratio of the chip processor, thereby affecting the efficiency of the deep learning.
In order to at least partially solve at least one of the above problems and other potential problems, embodiments of the present disclosure propose a solution of executing an operation in a deep learning training. In the solution, by vectorizing an instruction for the operation, a parallelism for the operation is increased, and a computing speed of the operation is improved. Furthermore, as a plurality of vector operations are executed simultaneously, the inefficiency of CPU serialization processing is avoided. In addition, threads are not required to synchronize the completion of the same computing task, which may avoid the synchronization overhead of GPU processing. By using the technical solution of the present disclosure, the artificial intelligence chip processor is effectively utilized, so as to effectively improve the speed of the deep learning training.
According to one or more embodiments of the present disclosure, when an operation for deep learning needs to be executed, associated data is provided to the computing device 110 as input data 120. Then, the scalar processing unit 113 (also referred to as a core module) in the computing device 110 processes a basic scalar operation for the input data 120, and converts the input data 120 into a form of an instruction for the operation (e.g., a SETcc instruction or a vector SETcc instruction (vSETcc instruction), although the protection scope of the present disclosure is not limited to this), through operations such as instruction fetch (IF) and instruction decode (ID). The instruction for the operation may be processed by the arithmetic logic unit (ALU) and then written back to a memory of the scalar processing unit 113, or may be distributed to the vector acceleration unit 115 (also referred to as a vector acceleration module).
In embodiments of the present disclosure, based on a 32-bit instruction set of an existing architecture, an instruction vSETcc is newly proposed to support the operation on the input data 120. An instruction format is shown in Table 3. The design of the instruction format mainly involves: (1) compatibility and (2) extensibility. With respect to the compatibility, an independent opcode field is used to avoid affecting an existing instruction format. With respect to the extensibility, possible subsequent expansion requirements are fully considered in the instruction format, and a specific field is determined as a reserved field. It should be understood that the instruction vSETcc is taken as an example of implementing an operation, and those skilled in the art may, based on the content and spirit of the present disclosure, design instructions implementing similar functions and new functions. As an example only, an implementation of the vSETlt instruction is shown in Table 3.
As shown in Table 3, in the vSETlt instruction, a specific field (for example, the xfunct field) is used as the reserved field. It should be understood that another field may also be used as the reserved field for possible subsequent expansion requirements. As also shown in Table 3, in the opcode field, an opcode indicates a specific vector operation. For example, the opcode is used to determine whether the condition code belongs to one of "an object is less than another object (Less Than)", "an object is greater than another object (Greater Than)", or "an object is equal to another object (Equal)". In addition, Table 3 further shows the data types of supported vector data, such as floating point (float), half floating point (bfloat), signed integer (int), unsigned integer (unsigned int), etc. It should be understood that although only the above data types are shown here, other data types may also be used, such as a 16-bit signed integer (short) represented in two's complement, a 64-bit signed integer (long) represented in two's complement, a double-precision 64-bit floating point (double) conforming to the IEEE 754 standard, a single 16-bit Unicode character (char), a boolean representing one bit of information, etc.
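Table 3 is likewise not reproduced in this text. Purely as a speculative illustration of the design described above (an independent opcode field for compatibility, a reserved xfunct field for extensibility, and fields for the data type, the two source operand vectors, and the destination operand vector), a 32-bit encoding may be sketched in C++ as follows. Every bit width and position here is an assumption:

```cpp
#include <cstdint>

// Speculative sketch of a 32-bit vSETcc encoding; the field names follow
// the description above, but all widths and positions are assumptions,
// since Table 3 is not reproduced here.
enum class CondCode : uint8_t { LessThan, GreaterThan, Equal };
enum class DataType : uint8_t { Float, BFloat, Int, UnsignedInt };

struct VSetccInstr {
    uint32_t opcode : 7;  // independent opcode: vSETcc plus condition code
    uint32_t vdst   : 5;  // destination operand vector register
    uint32_t dtype  : 3;  // data type of the vector elements
    uint32_t vsrc0  : 5;  // first source operand vector register
    uint32_t vsrc1  : 5;  // second source operand vector register
    uint32_t xfunct : 7;  // reserved field for subsequent expansion
};

static_assert(sizeof(VSetccInstr) == 4,
              "the encoding must fit the 32-bit instruction set");
```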
In the vector acceleration unit 115, an instruction (e.g., SETcc instruction) for the operation is vectorized, so that a plurality of vector operations (also referred to as vectorization operations) are executed in parallel and continuously. The scalar processing unit 113 interacts with the vector acceleration unit 115 through a simple interface, which achieves the independence of module development to a certain extent and reduces the impact on existing processor units.
It should be understood that the deep learning training environment 100 is only exemplary and not restrictive, and the deep learning training environment 100 is extensible, which may include more computing devices 110, and may provide more input data 120 to the computing devices 110, so that more computing devices 110 may be utilized by more users at the same time, and even more input data 120 is used to simultaneously or non-simultaneously determine and execute a plurality of operations for deep learning. In addition, the computing device 110 may include other units, such as a data storage unit, an information preprocessing unit, and the like.
At block 202, the computing device 110 acquires an instruction for the operation. The operation includes a plurality of vector operations. According to one or more embodiments of the present disclosure, the instruction for the operation may be the input data 120 or an instruction processed by the scalar processing unit 113 in the computing device 110.
At block 204, the computing device 110 determines, for each vector operation of the plurality of vector operations acquired at block 202, two source operand vectors for a comparison. According to one or more embodiments of the present disclosure, the source operands involved in each vector operation are distributed to a vector register file (VRF), a cache, or another type of temporary storage apparatus according to a data type. As the purpose of the method 200 is to accelerate the operation under the framework of the existing chip processor, the problem to be solved is to reduce the delay of serially processing scalar operations and to reduce or avoid the synchronization overhead between different threads. In this case, the above-mentioned problem is solved in the method 200 by vectorizing the instruction for the operation and using, for example, the vSETcc instruction format.
At block 206, the computing device 110 executes the vector operation on the two source operand vectors using an instruction format for the vector operation, so as to obtain an operation result including a destination operand vector. According to one or more embodiments of the present disclosure, the data to be operated, such as the data to be compared, are combined in a form of vectors, and a corresponding operation is executed for each element in the vectors. The process of obtaining the computation result in this way is the vectorization operation or vector operation. By vectorizing the instruction for the operation, the parallelism for the operation is increased, so that this method may be implemented to improve the computation speed of the operation.
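As a minimal sketch of these semantics only (a hardware vSETcc would execute the lanes in parallel, while the loop below merely models the result), the element-wise comparison producing the destination operand vector may be written in C++ as:

```cpp
#include <cstddef>
#include <vector>

// Models the vector operation at block 206: apply the comparison cc to
// each pair of elements and set the corresponding destination element
// to 0 or 1. Assumes the two source vectors have the same length.
template <typename T, typename Cmp>
std::vector<T> vsetcc(const std::vector<T>& src0,
                      const std::vector<T>& src1, Cmp cc) {
    std::vector<T> dst(src0.size());
    for (std::size_t i = 0; i < src0.size(); ++i) {
        dst[i] = cc(src0[i], src1[i]) ? T(1) : T(0);
    }
    return dst;
}

// Example: a vSETlt over two float vectors yields {1, 0, 0}.
// auto dst = vsetcc<float>({1, 5, 3}, {2, 4, 3},
//                          [](float a, float b) { return a < b; });
```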
At block 302, the computing device 110 acquires an instruction for an operation. The operation includes a plurality of vector operations. Specific contents of the step involved in block 302 are the same as those involved in block 202, which will not be repeated here.
At block 304, the computing device 110 determines, for each vector operation of the plurality of vector operations acquired at block 302, two source operand vectors for a comparison. Specific contents of the step involved in block 304 are the same as those involved in block 204, which will not be repeated here.
At block 306, the computing device 110 performs, for each element in the two source operand vectors, a second number of element-wise comparison operations in parallel according to a corresponding data type of the element using an instruction format for the vector operation, so as to obtain the operation result including the destination operand vector. Each of the two source operand vectors has a first number of elements, and the first number is greater than or equal to the second number.
According to one or more embodiments of the present disclosure, the data to be operated, such as the data to be compared, are combined in a form of vectors, and two source operand vectors are thus obtained. The operation on the two source operand vectors outperforms an operation on two scalar source operands, because elements of the same type are processed collectively. Each of the two source operand vectors has the first number of elements. Then, for each element in the two source operand vectors, a second number of element-wise comparison operations are performed in parallel according to a data type of the element. It should be understood that, for example, a chip with limited resources may have a relatively small number of processing units, and thus, for the first number of elements to be operated, the number of element operations performed in the corresponding processing unit may be equal to or less than the number of the elements. An element in the vectors on which no operation has yet been performed waits in sequence for the next parallel processing cycle. In other words, in the technical solution of the present disclosure, the number (i.e., the first number) of elements in the source operand vector may be greater than or equal to the number (i.e., the second number) of vector operations performed in parallel. Therefore, the technical solution of the present disclosure may be used not only on a next-generation chip processor with a powerful computing function, but also on an existing chip processor with limited resources, so as to improve the utilization degree of the existing chip processor.
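The relation between the first number and the second number may be sketched as follows, assuming a hypothetical acceleration unit that performs eight element-wise comparisons per parallel processing cycle (the lane count is an illustrative assumption):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// The source vectors hold the first number of elements, but only LANES
// comparisons (the second number) are assumed to run per cycle; the
// remaining elements wait for the next parallel processing cycle.
constexpr std::size_t LANES = 8;  // illustrative assumption

template <typename T>
std::vector<T> vsetlt_chunked(const std::vector<T>& src0,
                              const std::vector<T>& src1) {
    std::vector<T> dst(src0.size());
    for (std::size_t base = 0; base < src0.size(); base += LANES) {
        // One parallel processing cycle: up to LANES element-wise
        // comparisons, fewer for a final partial chunk.
        std::size_t end = std::min(base + LANES, src0.size());
        for (std::size_t i = base; i < end; ++i) {
            dst[i] = src0[i] < src1[i] ? T(1) : T(0);
        }
    }
    return dst;
}
```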
According to one or more embodiments of the present disclosure, as the comparison sub-modules of each data type perform operations for elements of all data types, it is valid to determine the specific data type after the comparison result is determined. It should be understood that the determination of the data type may also be performed at the source operand vector, so that it may be determined that only one type of comparison sub-module is executed before the operation. In addition, it should be understood that the specific data types listed here are only examples, and other data types may also be used.
It should be understood that the number of various elements and the size of physical quantities in the description with reference to accompanying drawings of the present disclosure are only examples, and are not limitations on the scope of protection of the present disclosure. The above number and size may be arbitrarily set as required without affecting the normal implementation of embodiments of the present disclosure.
The details of the method 200 of executing an operation and the method 300 of executing an operation according to embodiments of the present disclosure have been described above.
In one or more embodiments, each of the two source operand vectors has a first number of elements, and executing the vector operation on the two source operand vectors includes: performing, for each element in the two source operand vectors, a second number of element-wise comparison operations in parallel according to a corresponding data type of the element, wherein the first number is greater than or equal to the second number.
In one or more embodiments, executing the vector operation on the two source operand vectors further includes: determining a value of a corresponding element in the destination operand vector.
In one or more embodiments, the instruction format includes a field for the two source operand vectors, a field for the destination operand vector, a field for a data type, an opcode field, and/or a reserved field.
In one or more embodiments, an opcode in the opcode field includes one of: comparing whether an object is less than another object; comparing whether an object is greater than another object; or comparing whether an object is equal to another object.
In one or more embodiments, the data type of the destination operand vector includes one of: floating point, half floating point, signed integer, or unsigned integer.
In one or more embodiments, each vector operation of the plurality of vector operations is executed in an order of loading, ALU operation, and storing; and executions of two adjacent vector operations among the plurality of vector operations partially overlap each other.
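As an illustration only (the actual pipeline of the vector acceleration unit is not specified in this text), the following toy C++ program prints a schedule in which each vector operation passes through loading, ALU operation, and storing, and the executions of adjacent operations overlap by one stage:

```cpp
#include <cstdio>

// Toy three-stage schedule: operation i enters the pipeline one cycle
// after operation i-1, so adjacent executions partially overlap. The
// fixed single-cycle stage latency is an illustrative assumption.
int main() {
    const char* stages[] = {"load", "ALU", "store"};
    const int num_ops = 3, num_stages = 3;
    // At cycle c, operation i occupies stage (c - i) if that index is valid.
    for (int cycle = 0; cycle < num_ops + num_stages - 1; ++cycle) {
        printf("cycle %d:", cycle);
        for (int op = 0; op < num_ops; ++op) {
            int stage = cycle - op;
            if (stage >= 0 && stage < num_stages)
                printf("  op%d:%s", op, stages[stage]);
        }
        printf("\n");
    }
    return 0;
}
// cycle 1 prints "op0:ALU  op1:load", showing the partial overlap of
// two adjacent vector operations.
```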
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the user's personal information involved all comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good morals are not violated. In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
As shown in the figure, the electronic device 800 includes a computing unit 801, which may perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device 800 may also be stored. The computing unit 801, the ROM 802 and the RAM 803 are connected to each other, and an input/output (I/O) interface 805 is also connected to them.
Various components in the electronic device 800, including an input unit 806 such as a keyboard, a mouse, etc., an output unit 807 such as various types of displays, speakers, etc., a storage unit 808 such as a magnetic disk, an optical disk, etc., and a communication unit 809 such as a network card, a modem, a wireless communication transceiver, etc., are connected to the I/O interface 805. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 801 may be various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include but are not limited to a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and so on. The computing unit 801 may perform the method and processing described above, such as the method 200 and the method 300. For example, in some embodiments, the method 200 and the method 300 may be implemented as a computer software program that is tangibly contained on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of a computer program may be loaded and/or installed on the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method 200 and the method 300 may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method 200 and the method 300 in any other appropriate way (for example, by means of firmware).
Various embodiments of the systems and technologies described herein may be implemented in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), a computer hardware, firmware, software, and/or combinations thereof. These various embodiments may be implemented by one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a dedicated or general-purpose programmable processor, which may receive data and instructions from the storage system, the at least one input device and the at least one output device, and may transmit the data and instructions to the storage system, the at least one input device, and the at least one output device.
Program codes for implementing the method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or a controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that when the program codes are executed by the processor or the controller, the functions/operations specified in the flowchart and/or block diagram may be implemented. The program codes may be executed entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as an independent software package, or entirely on the remote machine or a server.
In the context of the present disclosure, the machine-readable medium may be a tangible medium that may contain or store programs for use by or in combination with an instruction execution system, device or apparatus. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, devices or apparatuses, or any suitable combination of the above. More specific examples of the machine-readable storage medium may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In order to provide interaction with users, the systems and techniques described here may be implemented on a computer including a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user may provide input to the computer. Other types of devices may also be used to provide interaction with users. For example, a feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and the input from the user may be received in any form (including acoustic input, voice input or tactile input).
The systems and technologies described herein may be implemented in a computing system including back-end components (for example, a data server), or a computing system including middleware components (for example, an application server), or a computing system including front-end components (for example, a user computer having a graphical user interface or web browser through which the user may interact with the implementation of the systems and technologies described herein), or a computing system including any combination of such back-end components, middleware components or front-end components. The components of the system may be connected to each other by digital data communication (for example, a communication network) in any form or through any medium. Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
The computer system may include a client and a server. The client and the server are generally far away from each other and usually interact through a communication network. The relationship between the client and the server is generated through computer programs running on the corresponding computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps of the processes illustrated above may be reordered, added or deleted in various manners. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as a desired result of the technical solution of the present disclosure may be achieved. This is not limited in the present disclosure.
The above-mentioned specific embodiments do not constitute a limitation on the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall be contained in the scope of protection of the present disclosure.