This application relates to the field of computer technologies, and in particular, to a graph instruction processing method and apparatus.
A directed graphflow computing architecture (graphflow architecture) converts a data flow and a control flow into a directed graph including N nodes. A connecting line between nodes represents a data flow or a control flow. In the directed graphflow computing architecture, a degree of parallelism of the data flow is triggered by detecting whether an input of each node in the directed graphflow computing architecture is ready, so that a plurality of nodes that are not on a same path can run concurrently.
In the directed graphflow computing architecture, a majority of graph instructions usually each have two data inputs and one or two output IDs, a minority of graph instructions each have at least three data inputs or at least three output IDs, and a buffer configured to temporarily store a data input and an output ID of one graph instruction is configured based on two data inputs and two output IDs. A graph instruction having at least three data inputs is usually a composite graph instruction whose operation is completed within one beat and that is obtained by compressing a plurality of instructions whose operation is completed within a plurality of beats, and an instruction having at least three output IDs usually needs to output an operation result of the graph instruction to an input of a plurality of other graph instructions. For example, (a) in
In the conventional technology, a graph instruction having at least three data inputs is usually processed in one of the following manners. Manner 1: An input port of an original graph instruction is reused for the corresponding composite graph instruction. For example, data inputs i, j, and k of the graph instructions add and sll are read simultaneously at a specific time point, and i, j, and k are transmitted to an operation unit of add-sll. This manner increases the pressure of reading the data inputs simultaneously. In addition, because only two data inputs need to be read for a majority of instructions, the bypass logic becomes more complex. Manner 2: An input port of an original graph instruction is rewritten to an operation unit of a composite graph instruction through rescheduling. For example, writing of i, j, and k is rescheduled to an operation unit of add-sll. This manner requires a dedicated rescheduling logic unit, which increases hardware implementation difficulty and hardware costs.
A graph instruction having at least three output IDs is usually processed in one of the following manners. Manner 1: The quantity of bits occupied by the output IDs is expanded. For example, the original 32 bits are uniformly expanded to 64 bits. However, a majority of graph instructions each have only one or two output IDs, so instruction space is wasted. Manner 2: A CISC encoding manner is used. To be specific, a quantity of output IDs is added to the graph instruction format, so that the encoding width can be adjusted. However, this manner increases decoding difficulty of hardware and results in high power consumption. Manner 3: The output IDs are restricted. To be specific, a plurality of output IDs share a same high-order bit and use different low-order bits. However, in this manner, the plurality of output IDs obtained through encoding are adjacent and their locations are restricted, which greatly limits the scheduling capability and reduces performance of the computing architecture.
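The placement restriction of Manner 3 can be illustrated with a minimal sketch. The 4-bit width of the low-order part and the function names are assumptions made only for illustration; they are not taken from any real instruction set.

```python
# Hypothetical illustration of Manner 3: several output IDs share one
# high-order part and differ only in their low-order bits.
LOW_BITS = 4  # assumed width of the per-output low-order part


def expand_output_ids(shared_high, lows):
    """Rebuild full output IDs from one shared high part and several low parts."""
    for low in lows:
        if not 0 <= low < (1 << LOW_BITS):
            raise ValueError("low-order part does not fit in LOW_BITS")
    return [(shared_high << LOW_BITS) | low for low in lows]


def can_encode(targets):
    """Targets can be encoded together only if they share the same high part."""
    return len({t >> LOW_BITS for t in targets}) == 1


ids = expand_output_ids(0x3, [0x1, 0x5, 0xA])  # all land in the block 0x30..0x3F
```

In this sketch, targets 0x31 and 0x52 cannot be encoded in one instruction because their high-order parts differ, which is the scheduling restriction described above.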
This application provides a graph instruction processing method and apparatus, to improve performance of a directed graphflow computing architecture.
To achieve the foregoing objectives, the following technical solutions are used in embodiments of this application.
According to a first aspect, a graph instruction processing method is provided, and is applied to a processor. The method includes: detecting whether a first input and a second input of a first graph instruction are in a ready-to-complete state, where the first input and/or the second input are or is a dynamic data input or dynamic data inputs of the first graph instruction; obtaining static data input information of the first graph instruction from a first register when the first input and the second input are both in the ready-to-complete state, where the static data input information is used to indicate at least one input, each of the at least one input is a constant input or a temporary constant input, and the temporary constant input is an input that does not change in a period of time; and processing the first graph instruction based on the first input, the second input, and the at least one input, to obtain a first processing result.
In the technical solution, for a first graph instruction having at least three data inputs, the at least three data inputs usually include one or two dynamic data inputs and one or more static data inputs, and the one or more static data inputs may be stored in the first register. Subsequently, the processor may process the first graph instruction by detecting whether the first input and the second input of the first graph instruction are in the ready-to-complete state, and obtaining the static data input of the first graph instruction from the first register when the first input and the second input are in the ready-to-complete state, to avoid interference of intermediate-state data of the first graph instruction to another data flow, reduce a transfer operation of the static data input of the first graph instruction, increase reuse efficiency of the static data input, and improve performance of a directed graphflow computing architecture.
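As a minimal sketch of the method of the first aspect, the following models a composite add-sll graph instruction with two dynamic inputs and one static input held in a register. The class and field names, the register file modelled as a dictionary, and the semantics (left + right) << k are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GraphInstruction:
    opcode: str
    static_reg: int              # assumed "first pointer": index of the first register
    left: Optional[int] = None   # first input (dynamic); None until it arrives
    right: Optional[int] = None  # second input (dynamic); None until it arrives

    def inputs_ready(self) -> bool:
        """Both dynamic inputs have arrived."""
        return self.left is not None and self.right is not None


def process(instr: GraphInstruction, registers: Dict[int, int]) -> Optional[int]:
    """Return the processing result, or None while a dynamic input is missing."""
    if not instr.inputs_ready():
        return None
    # The static input is read from the register only once both inputs are ready.
    k = registers[instr.static_reg]
    if instr.opcode == "add_sll":  # assumed semantics: (left + right) << k
        return (instr.left + instr.right) << k
    raise ValueError(f"unknown opcode: {instr.opcode}")


registers = {7: 2}                        # register 7 holds the static input k = 2
instr = GraphInstruction("add_sll", static_reg=7)
pending = process(instr, registers)       # None: no dynamic input has arrived
instr.left, instr.right = 3, 4
result = process(instr, registers)        # (3 + 4) << 2
```

Because the static input stays in the register, it is not transferred through the data flow each time the instruction fires.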
In a possible implementation of the first aspect, the first graph instruction is provided with an input pointer bitfield, the input pointer bitfield includes a first pointer, and the first pointer is used to indicate the first register. In the possible implementation, the first pointer is used to indicate the first register, so that the processor can quickly and effectively determine the first register, to obtain the static data input stored in the first register.
In a possible implementation of the first aspect, the at least one input includes a plurality of inputs, and the method further includes: decoding the static data input information based on a preset decoding policy associated with an operator of the first graph instruction, to obtain the plurality of inputs. In the possible implementation, when the first register stores a plurality of inputs, the processor may perform decoding based on the preset decoding policy, to obtain the plurality of inputs.
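A decoding policy of this kind could look like the following sketch, in which each operator is associated with a list of (bit offset, bit width) pairs that describe how its static inputs are packed into one register value. The policy table, the operator name three_const_op, and the field widths are hypothetical.

```python
# Hypothetical decoding policies: operator -> list of (bit offset, bit width).
DECODE_POLICY = {
    "add_sll": [(0, 6)],                               # one shift amount
    "three_const_op": [(0, 16), (16, 16), (32, 32)],   # three packed constants
}


def decode_static_inputs(opcode, raw):
    """Split the raw register value into the static inputs of this operator."""
    return [(raw >> offset) & ((1 << width) - 1)
            for offset, width in DECODE_POLICY[opcode]]


raw_value = (100 << 32) | (7 << 16) | 3
static_inputs = decode_static_inputs("three_const_op", raw_value)
```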
In a possible implementation of the first aspect, the first graph instruction has at least three output addresses, an output address bitfield of the first graph instruction is used to indicate one or two output addresses in the at least three output addresses, an output address other than the one or two output addresses in the at least three output addresses is stored in a second register, and the method further includes: sending the first processing result to the one or two output addresses; and obtaining output address information of the first graph instruction from the second register, and sending the first processing result to an output address indicated by the output address information. In the possible implementation, when the first graph instruction has at least three output addresses, one or more output addresses in the at least three output addresses may be stored in the second register, so that the one or more output addresses of the first graph instruction may be subsequently obtained from the second register, and the first processing result is sent to each output address, to implement result processing of a graph instruction having at least three output addresses.
In a possible implementation of the first aspect, the output address bitfield includes one output address and a second pointer, and the second pointer is used to indicate the second register; or the output address bitfield includes two output addresses, the first graph instruction is further provided with an output pointer bitfield, the output pointer bitfield includes a second pointer, and the second pointer is used to indicate the second register. In the possible implementation, the second pointer included in the output address bitfield or the output pointer bitfield is used to indicate the second register, to improve flexibility and diversity of indicating the second register.
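The two layouts can be sketched as follows. For simplicity, the second register is modelled as already holding a list of output addresses, and the names OutputField, resolve_outputs, and send_result are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class OutputField:
    direct: List[int]        # one or two output addresses held in the instruction
    pointer: Optional[int]   # second pointer: index of the second register, if any


def resolve_outputs(out: OutputField, second_registers: Dict[int, List[int]]) -> List[int]:
    """Collect the direct output addresses, then those kept in the second register."""
    targets = list(out.direct)
    if out.pointer is not None:
        targets.extend(second_registers[out.pointer])
    return targets


def send_result(result: int, out: OutputField,
                second_registers: Dict[int, List[int]],
                inbox: Dict[int, List[int]]) -> None:
    """Deliver the processing result to every resolved output address."""
    for address in resolve_outputs(out, second_registers):
        inbox.setdefault(address, []).append(result)


second_registers = {3: [9, 12]}
layout_a = OutputField(direct=[5], pointer=3)     # one address plus the second pointer
layout_b = OutputField(direct=[5, 6], pointer=3)  # two addresses plus an output pointer bitfield
inbox: Dict[int, List[int]] = {}
send_result(42, layout_b, second_registers, inbox)
```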
In a possible implementation of the first aspect, the method further includes: decoding the output address information, to obtain at least one output address; and correspondingly, the sending the first processing result to an output address indicated by the output address information includes: sending the first processing result to each of the at least one output address. In the possible implementation, it can be ensured that a graph instruction having at least three output addresses in an instruction set is configured in the directed graphflow computing architecture in a regular format.
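One way the output address information might be packed into and decoded from the second register is sketched below. The layout (a 4-bit count followed by 8-bit addresses) is purely an assumption for illustration; no encoding is specified here.

```python
ADDR_BITS = 8   # assumed width of one packed output address
COUNT_BITS = 4  # assumed width of the leading count field


def pack_output_addresses(addresses):
    """Pack a count and the addresses into one integer (assumed layout)."""
    if len(addresses) >= (1 << COUNT_BITS):
        raise ValueError("too many output addresses for the count field")
    raw = len(addresses)
    for i, address in enumerate(addresses):
        if not 0 <= address < (1 << ADDR_BITS):
            raise ValueError("output address does not fit in ADDR_BITS")
        raw |= address << (COUNT_BITS + i * ADDR_BITS)
    return raw


def decode_output_addresses(raw):
    """Recover the list of output addresses from the packed register value."""
    count = raw & ((1 << COUNT_BITS) - 1)
    mask = (1 << ADDR_BITS) - 1
    return [(raw >> (COUNT_BITS + i * ADDR_BITS)) & mask for i in range(count)]


packed = pack_output_addresses([9, 12, 200])
decoded = decode_output_addresses(packed)
```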
According to a second aspect, a graph instruction processing method is provided. The method includes: determining a first processing result of a first graph instruction, where the first graph instruction has at least three output addresses, an output address bitfield of the first graph instruction is used to indicate one or two output addresses in the at least three output addresses, and an output address other than the one or two output addresses in the at least three output addresses is stored in a second register; sending the first processing result to the one or two output addresses; and obtaining output address information of the first graph instruction from the second register, and sending the first processing result to an output address indicated by the output address information.
In the technical solution, when the first graph instruction has at least three output addresses, one or more output addresses in the at least three output addresses may be stored in the second register, so that it can be ensured that a graph instruction having at least three output addresses in an instruction set is configured in a directed graphflow computing architecture in a regular format. Subsequently, when the first processing result of the first graph instruction is output, one or more output addresses of the first graph instruction may be obtained from the second register, and the first processing result may be sent to each output address, to implement result processing of a graph instruction having at least three output addresses. In addition, compared with the conventional technology, in the method, instruction space is not wasted, and large power consumption is not generated, to ensure that the computing architecture has better performance.
In a possible implementation of the second aspect, the output address bitfield includes one output address and a second pointer, and the second pointer is used to indicate the second register; or the output address bitfield includes two output addresses, the first graph instruction is further provided with an output pointer bitfield, the output pointer bitfield includes a second pointer, and the second pointer is used to indicate the second register. In the possible implementation, the second pointer included in the output address bitfield or the output pointer bitfield is used to indicate the second register, to improve flexibility and diversity of indicating the second register.
In a possible implementation of the second aspect, the method further includes: decoding the output address information, to obtain at least one output address; and correspondingly, the sending the first processing result to an output address indicated by the output address information includes: sending the first processing result to each of the at least one output address. In the possible implementation, it can be ensured that a graph instruction having at least three output addresses in an instruction set is configured in the directed graphflow computing architecture in a regular format.
According to a third aspect, a graph instruction processing apparatus is provided. The apparatus includes: a detection unit, configured to detect whether a first input and a second input of a first graph instruction are in a ready-to-complete state, where the first input and/or the second input are or is a dynamic data input or dynamic data inputs of the first graph instruction; and a first operation unit, configured to obtain static data input information of the first graph instruction from a first register when the first input and the second input are both in the ready-to-complete state, where the static data input information is used to indicate at least one input, each of the at least one input is a constant input or a temporary constant input, and the temporary constant input is an input that does not change in a period of time. The first operation unit is further configured to process the first graph instruction based on the first input, the second input, and the at least one input, to obtain a first processing result.
In a possible implementation of the third aspect, the first graph instruction is provided with an input pointer bitfield, the input pointer bitfield includes a first pointer, and the first pointer is used to indicate the first register.
In a possible implementation of the third aspect, the at least one input includes a plurality of inputs, and the first operation unit is further configured to decode the static data input information based on a preset decoding policy associated with an operator of the first graph instruction, to obtain the plurality of inputs.
In a possible implementation of the third aspect, the first graph instruction has at least three output addresses, an output address bitfield of the first graph instruction is used to indicate one or two output addresses in the at least three output addresses, an output address other than the one or two output addresses in the at least three output addresses is stored in a second register, and the apparatus further includes a second operation unit. The first operation unit is further configured to send the first processing result to the one or two output addresses; and the second operation unit is configured to: obtain output address information of the first graph instruction from the second register, and send the first processing result to an output address indicated by the output address information.
In a possible implementation of the third aspect, the output address bitfield includes one output address and a second pointer, and the second pointer is used to indicate the second register; or the output address bitfield includes two output addresses, the first graph instruction is further provided with an output pointer bitfield, the output pointer bitfield includes a second pointer, and the second pointer is used to indicate the second register.
In a possible implementation of the third aspect, the first operation unit is further configured to: decode the output address information, to obtain at least one output address; and send the first processing result to each of the at least one output address.
According to a fourth aspect, a graph instruction processing apparatus is provided. The apparatus includes: a first operation unit, configured to determine a first processing result of a first graph instruction, where the first graph instruction has at least three output addresses, an output address bitfield of the first graph instruction is used to indicate one or two output addresses in the at least three output addresses, and an output address other than the one or two output addresses in the at least three output addresses is stored in a second register, where the first operation unit is further configured to send the first processing result to the one or two output addresses; and a second operation unit, configured to: obtain output address information of the first graph instruction from the second register, and send the first processing result to an output address indicated by the output address information.
In a possible implementation of the fourth aspect, the output address bitfield includes one output address and a second pointer, and the second pointer is used to indicate the second register; or the output address bitfield includes two output addresses, the first graph instruction is further provided with an output pointer bitfield, the output pointer bitfield includes a second pointer, and the second pointer is used to indicate the second register.
In a possible implementation of the fourth aspect, the first operation unit is further configured to: decode the output address information, to obtain at least one output address; and send the first processing result to each of the at least one output address.
According to another aspect of this application, a graph instruction processing apparatus is provided. The apparatus may be a graph instruction processing device or a chip built into the graph instruction processing device, the apparatus includes a processor, the processor is coupled to a memory, the memory stores instructions, and when the processor executes the instructions in the memory, the apparatus is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.
According to another aspect of this application, a graph instruction processing apparatus is provided. The apparatus may be a graph instruction processing device or a chip built into the graph instruction processing device, the apparatus includes a processor, the processor is coupled to a memory, the memory stores instructions, and when the processor executes the instructions in the memory, the apparatus is enabled to perform the method according to any one of the second aspect or the possible implementations of the second aspect.
According to another aspect of this application, a readable storage medium is provided. The readable storage medium stores instructions. When the readable storage medium runs on a device, the device is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.
According to another aspect of this application, a readable storage medium is provided. The readable storage medium stores instructions. When the readable storage medium runs on a device, the device is enabled to perform the method according to any one of the second aspect or the possible implementations of the second aspect.
According to another aspect of this application, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform the method according to any one of the first aspect or the possible implementations of the first aspect.
According to another aspect of this application, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform the method according to any one of the second aspect or the possible implementations of the second aspect.
It can be understood that the apparatus, the computer storage medium, or the computer program product of any of the graph instruction processing methods is used to perform a corresponding method provided above. Therefore, for beneficial effects that can be achieved, refer to the beneficial effects in the corresponding method provided above. Details are not described herein again.
In this application, “at least one” means one or more, and “a plurality of” means two or more. The term “and/or” describes an association relationship for describing associated objects and represents that three relationships may exist. For example, A and/or B may represent the following cases: Only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. The character “/” usually indicates an “or” relationship between the associated objects. “At least one of the following items (pieces)” or a similar expression thereof refers to any combination of these items, including any combination of singular items (pieces) or plural items (pieces). For example, at least one item (piece) of a, b, or c may indicate a, b, c, a and b, a and c, b and c, or a, b, and c. Herein, a, b, and c may be singular or plural. In addition, in embodiments of this application, the words such as “first” and “second” do not limit a quantity or an execution sequence.
It should be noted that, in embodiments of this application, a term such as “example” or “for example” is used to represent giving an example, an illustration, or a description. Any embodiment or design scheme described as an “example” or “for example” in embodiments of this application should not be explained as being more preferred or having more advantages than another embodiment or design scheme. Rather, use of the word “example”, “for example”, or the like is intended to present a related concept in a specific manner.
First, before embodiments of this application are described, a pipeline structure and a processor architecture in this application are described by using an example.
An instruction pipeline is a manner in which an operation of one instruction is divided into a plurality of small steps for processing, to improve efficiency of executing the instruction by a processor. Each step may be completed by a corresponding unit. A life cycle of one instruction in the pipeline structure may include: instruction fetch pipeline->decoding pipeline->scheduling (transmitting) pipeline->execution pipeline->memory access pipeline->write-back pipeline. In other words, in the pipeline structure, an execution process of one instruction is divided into at least six stages.
The instruction fetch pipeline is used for instruction fetching, that is, a process of reading the instruction from a memory.
The decoding pipeline is used for instruction decoding, that is, a process of translating the instruction fetched from the memory.
The scheduling (transmitting) pipeline is used for instruction scheduling and transmitting (instruction dispatch and issue). To be specific, a register is read to obtain an operand, and the instruction is sent to a corresponding execution unit for execution based on a type of the instruction.
The execution pipeline is used to perform instruction execution based on the type of instruction, to complete a computing task. Instruction execution is a process of performing a real operation on the instruction. For example, if the instruction is an addition operation instruction, an addition operation is performed on the operand; if the instruction is a subtraction operation instruction, a subtraction operation is performed; and if the instruction is graph computing, a graph computing operation is performed.
Memory access pipeline: Memory access is a process in which a memory access instruction is used to read data from the memory or write data to the memory, and is mainly executing a read/write (load/store) instruction.
Write-back pipeline: Write-back is a process of writing an instruction execution result back to a general-purpose register group. If the instruction is a normal operation instruction, a result value comes from a computing result of an “execution” stage; and if the instruction is a memory read instruction, the result comes from data read from the memory in an “access” stage.
In the pipeline structure, the operation steps are performed on each instruction in the processor, but different operation steps of a plurality of instructions can be executed simultaneously, so that an overall instruction flow speed can be accelerated and a program execution time period can be shortened.
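The benefit of overlapping stages can be made concrete with an idealized sketch. It assumes that each of the six stages takes exactly one cycle and that there are no stalls or hazards, which is a simplification; the stage names follow the pipeline described above.

```python
STAGES = ["fetch", "decode", "dispatch/issue", "execute", "memory access", "write-back"]


def cycles_sequential(n):
    """Cycles needed if each instruction finishes all stages before the next starts."""
    return n * len(STAGES)


def cycles_pipelined(n):
    """Cycles needed in an ideal pipeline with one instruction entering per cycle."""
    return len(STAGES) + n - 1 if n > 0 else 0


def schedule(n):
    """Map each cycle to the (instruction index, stage) pairs active in that cycle."""
    table = {}
    for i in range(n):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s, []).append((i, stage))
    return table


timeline = schedule(4)
```

Under these assumptions, four instructions take 24 cycles when executed one after another and 9 cycles when pipelined; in cycle 3 of the timeline, four different instructions occupy four different stages at the same time.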
Specifically, the processor 10 may include at least one processor core 101, and the processor core 101 may include an instruction scheduling unit 1011, a graph computing flow unit 1012 connected to the instruction scheduling unit 1011, and at least one generic operation unit 1013. The instruction scheduling unit 1011 runs in a transmitting pipeline stage of the processor core 101, to complete scheduling and distribution of a to-be-executed instruction. The graph computing flow unit 1012 and the at least one generic operation unit 1013 each run as an execution unit (which may also be referred to as a functional unit) of the processor 10 in an execution pipeline stage (execute stage), to complete various types of computing tasks. Specifically, the processor 10 may, by using the instruction scheduling unit 1011, directly allocate a graph computing task in the to-be-executed instruction to the graph computing flow unit 1012 for execution, to implement a function of accelerating a general-purpose processor by using a graph computing mode. A general-purpose computing task in the to-be-executed instruction is scheduled to the at least one generic operation unit 1013 for execution, to implement a general-purpose computing function. Optionally, based on different computing tasks, the processor 10 may independently invoke only the graph computing flow unit 1012 to execute a task, or independently invoke the at least one generic operation unit 1013 to execute a task, or may invoke both the graph computing flow unit 1012 and the at least one generic operation unit 1013 to execute a task in parallel. It can be understood that the instruction scheduling unit 1011 may be connected to the graph computing flow unit 1012 and the at least one generic operation unit 1013 through a bus or in another manner, to perform direct communication. A connection relationship illustrated in
In a possible implementation,
Optionally, as shown in
It should be noted that functional modules inside the processor in
It can be understood that a pipeline structure and a processor architecture of the instruction are merely some example implementations provided in embodiments of this application, and the pipeline structure and the processor architecture in embodiments of this application include but are not limited to the foregoing implementations.
The following describes in detail a directed graphflow computing architecture in this application. Refer to
1. Graph building stage (graph build): N instructions are read from an instruction memory, and each node in a graph building block is configured with one operation instruction and a maximum of two target nodes. When N is equal to 16, there are a total of 16 computing nodes, numbered 0 to 15. Once a graph is built, the operation and the connections of each node are solidified (read-only). For example, an operation instruction in a computing node 0 is an add instruction. In other words, an addition operation is performed. An operation instruction in a computing node 1 is an sll instruction. In other words, a shift operation is performed. An operation instruction in a computing node 3 is an xor instruction. In other words, an exclusive OR operation is performed. For a computing node 5, operation results of the computing node 0 and the computing node 1 are used as inputs into the computing node, to perform an ld operation (namely, a fetch operation); for a computing node 6, operation results of the computing node 2 and the computing node 3 are used as inputs into the computing node, to perform an add operation (namely, an addition operation), and so on. Operation processes of the other computing nodes are not described one by one.
2. Execution stage (graph execute): An external module transmits an input (liveIn), to initiate a data flow. All computing nodes run in parallel. For each node, once the inputs into the node arrive, the node can perform its operation and transmit a result to a next computing node. Until the inputs arrive, the node remains in a waiting state. Running continues until the data flow arrives at an end node (tm). Input parameters of some computing nodes (for example, computing nodes 0, 1, 2, and 3) are input externally. To be specific, start-up data needs to be input from the external memory unit 1017. Some other computing nodes (for example, computing nodes 5, 6, 8, 9, 10, 11, 12, 13, 14, and 15) need to internally obtain the computing results output by the computing nodes connected to them before they can perform an operation, and then input the result obtained after the operation into the computing nodes connected to them.
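The two stages can be sketched with a small software model. The three-node graph below is a hypothetical example rather than the 16-node graph described above, and the ready queue only simulates sequentially what the hardware would do in parallel.

```python
import operator
from collections import deque

# Built graph (read-only after the graph building stage):
# node id -> (operation, list of (target node, target input port)), at most two targets.
GRAPH = {
    0: (operator.add, [(2, 0)]),
    1: (operator.lshift, [(2, 1)]),
    2: (operator.xor, []),  # end node: no further targets
}


def run(graph, live_in):
    """Fire each node as soon as both of its inputs have arrived."""
    inputs = {node: [None, None] for node in graph}
    results = {}
    ready = deque()

    def deliver(node, port, value):
        inputs[node][port] = value
        if None not in inputs[node]:  # both inputs present: the node may fire
            ready.append(node)

    for (node, port), value in live_in.items():  # external liveIn values
        deliver(node, port, value)

    while ready:
        node = ready.popleft()
        op, targets = graph[node]
        results[node] = op(*inputs[node])
        for target, port in targets:
            deliver(target, port, results[node])
    return results


live_in = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 1}
results = run(GRAPH, live_in)
```

Nodes 0 and 1 do not depend on each other, so in hardware they could fire concurrently; node 2 waits until both of their results have been delivered.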
Based on the directed graphflow computing architecture provided in this application, when an instruction scheduling unit 1011 of the processor 10 schedules control instructions to a graph computing flow unit 1012 to execute an instruction task, a plurality of control instructions with different functions are included, to instruct the graph computing flow unit 1012 to execute the corresponding graph computing functions. In terms of time sequence, the control instructions provided in this application mainly include: graph building start instruction->parameter transmission instruction->graph computing start instruction->parameter return instruction.
The graph building start instruction carries a target address. When receiving the graph building start instruction, the graph computing flow unit 1012 may read graph building block information from the memory unit 1017 based on the target address. The graph building block information may include an operation method of each of N computing nodes in the directed graphflow computing architecture, a connection between the N computing nodes, and sequence information. The parameter transmission instruction may carry identifiers of M computing nodes and input parameters respectively corresponding to the identifiers of the M computing nodes. The M computing nodes are some or all of the N nodes. When receiving the parameter transmission instruction, the graph computing flow unit 1012 may respectively input, into the M computing nodes, the input parameters respectively corresponding to the identifiers of the M computing nodes. After receiving the graph computing start instruction, the graph computing flow unit 1012 may determine whether current graph building is completed, and if current graph building is completed, start to execute a corresponding computing task. The parameter return instruction may carry identifiers of K computing nodes and result registers respectively corresponding to the identifiers of the K computing nodes. The graph computing flow unit 1012 may control to separately send computing results of the K computing nodes to a result write-back unit 1014, and the result write-back unit 1014 writes the results of the corresponding computing nodes to the corresponding result registers.
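The order of the four control instructions can be sketched as method calls on a simplified software model of the graph computing flow unit. The class, the method names, the list-based graph building block, and the dictionary-based memory and register file are all hypothetical; the sequence information is modelled simply as the order of the list.

```python
import operator


class GraphFlowUnit:
    """Toy model of a graph computing flow unit driven by four control instructions."""

    def __init__(self, memory):
        self.memory = memory  # address -> graph building block information
        self.block = None     # list of (node id, operation, predecessor node ids)
        self.params = {}      # node id -> externally supplied operands
        self.values = {}      # node id -> computing result

    def graph_build_start(self, target_address):
        self.block = self.memory[target_address]
        self.params, self.values = {}, {}

    def pass_parameters(self, params):
        self.params.update(params)

    def graph_compute_start(self):
        if self.block is None:
            raise RuntimeError("graph building is not completed")
        for node, op, preds in self.block:
            # Nodes without predecessors take their operands from the passed parameters.
            operands = [self.values[p] for p in preds] if preds else self.params[node]
            self.values[node] = op(*operands)

    def return_parameters(self, node_to_register, register_file):
        for node, register in node_to_register.items():
            register_file[register] = self.values[node]


memory = {0x100: [(0, operator.add, []), (1, operator.lshift, []), (2, operator.xor, [0, 1])]}
unit = GraphFlowUnit(memory)
unit.graph_build_start(0x100)                # graph building start instruction
unit.pass_parameters({0: (1, 2), 1: (3, 1)}) # parameter transmission instruction
unit.graph_compute_start()                   # graph computing start instruction
register_file = {}
unit.return_parameters({2: "r5"}, register_file)  # parameter return instruction
```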
For example,
After a directed graphflow computing architecture is introduced, the following describes a graph instruction in the directed graph computing architecture. In the directed graph computing architecture, a majority of graph instructions each usually have two data inputs and one or two output addresses, a minority of graph instructions each have at least three data inputs or at least three output addresses, and a buffer configured to temporarily store a data input of one graph instruction and an instruction format used to indicate an output address are configured based on two data inputs and two output addresses. In other words, one graph instruction is provided with buffer space of only two inputs. For example, if buffer space of one input is 64 bits, buffer space corresponding to one graph instruction is 128 bits. An instruction format of one graph instruction is usually an instruction ID+opcode (operator)+dest0+dest1, and dest0 and dest1 are two output addresses.
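The regular instruction format and the buffer sizing can be sketched as follows. The 64-bit input width and the two-input buffer come from the example above; the 8-bit width of each instruction field is an assumption for illustration only.

```python
from typing import NamedTuple

INPUT_BITS = 64
INPUTS_PER_INSTRUCTION = 2
BUFFER_BITS = INPUT_BITS * INPUTS_PER_INSTRUCTION  # buffer space per graph instruction

FIELD_BITS = 8  # assumed width of each of the four instruction fields


class GraphInstr(NamedTuple):
    instr_id: int
    opcode: int
    dest0: int
    dest1: int


def encode(instr):
    """Pack instruction ID + opcode + dest0 + dest1 into one word, ID in the top field."""
    word = 0
    for value in instr:
        if not 0 <= value < (1 << FIELD_BITS):
            raise ValueError("field value does not fit in FIELD_BITS")
        word = (word << FIELD_BITS) | value
    return word


def decode(word):
    """Unpack a word produced by encode() back into its four fields."""
    mask = (1 << FIELD_BITS) - 1
    return GraphInstr(*[(word >> (FIELD_BITS * i)) & mask for i in reversed(range(4))])


word = encode(GraphInstr(instr_id=1, opcode=2, dest0=3, dest1=4))
```

The format has room for exactly two output addresses, which is why a third output address needs to be placed elsewhere.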
For a graph instruction having two data inputs or a graph instruction having two output addresses, a processor may process the corresponding graph instruction based on an existing configuration. For a graph instruction having at least three data inputs, the at least three data inputs usually include one or two dynamic data inputs and one or more static data inputs. In this embodiment of this application, based on the existing configuration of the data input buffer, the one or more static data inputs are stored in a register, to implement processing of the graph instruction. For a graph instruction having at least three output addresses, in this embodiment of this application, based on the existing configuration of the instruction format, one or more output addresses are stored in a register, to implement processing of the graph instruction. Detailed processes are described as follows.
S301: Detect whether a first input and a second input of a first graph instruction are in a ready-to-complete state, where the first input and/or the second input are or is a dynamic data input or dynamic data inputs of the first graph instruction.
A graph instruction may be a graph instruction in an instruction set of a directed graph computing architecture, the instruction set may include a plurality of graph instructions, and the first graph instruction may be a graph instruction having at least three inputs in the instruction set. For example, the first graph instruction may be a composite graph instruction obtained by compressing a plurality of graph instructions. The at least three inputs of the first graph instruction may include a dynamic data input and a static data input. The dynamic data input may be a data input that dynamically changes, and the data input that dynamically changes may also be referred to as a variable. The static data input may be a data input that does not change or a data input that does not change in a period of time. The data that does not change may also be referred to as a constant, and the data that does not change in the period of time may also be a temporary constant.
In addition, the first graph instruction may correspond to one or more ready bitfields and one or more valid bitfields. The valid bitfield of the first graph instruction may be used to indicate whether the first graph instruction requires input data flow information, and the data flow information may include two inputs in the at least three inputs of the first graph instruction. The two inputs may be respectively the first input and the second input. The first input and the second input each may be a dynamic data input, or one of the first input and the second input may be a dynamic data input and the other may be a static data input (for example, the first graph instruction have only one dynamic data input). The first input and the second input may also be respectively referred to as a left input and a right input. The ready bitfield of the first graph instruction may be used to indicate whether the data flow information of the first graph instruction is ready, for example, indicate whether the first input and the second input in the data flow information are ready. For example, a graph instruction of an AND operation “a+b+2” is used as an example. A valid bitfield of the graph instruction may be used to indicate that data flow information required by the graph instruction includes a first input a and a second input b, and a ready bitfield of the graph instruction may be used to indicate whether a and b are ready.
Further, in this embodiment of this application, whether the first input and the second input of the first graph instruction are in the ready-to-complete state may mean that the first input and the second input are both ready (which may also be referred to as that both arrive). For example, the ready bitfield of the first graph instruction indicates that the first input and the second input of the first graph instruction both arrive. In other words, the first input and the second input of the first graph instruction are both in the ready-to-complete state.
Specifically, the processor may detect whether the first input and the second input of the first graph instruction are ready (which is also referred to as whether the first input and the second input of the first graph instruction arrive). When the first input and the second input are both ready (for example, the ready bitfield of the first graph instruction indicates that a left input and a right input of the first graph instruction both arrive), the processor may determine that the first input and the second input of the first graph instruction are both in the ready-to-complete state. If neither the first input nor the second input of the first graph instruction is ready or one of the first input and the second input is not ready, the processor may determine that neither the first input nor the second input of the first graph instruction is in the ready-to-complete state.
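The detection in S301 can be sketched as a simple bitmask check. The encoding below (bit 0 for the left input, bit 1 for the right input) and the function names are assumptions for illustration only; the actual layout of the valid bitfield and the ready bitfield is not specified in this application.

```python
LEFT = 0b01   # hypothetical bit for the first (left) input
RIGHT = 0b10  # hypothetical bit for the second (right) input


def mark_arrival(ready, port):
    # Set the ready bit of the input that has just arrived.
    return ready | port


def is_ready_to_complete(valid, ready):
    # Every input required by the valid bitfield must be marked in the ready bitfield.
    return (ready & valid) == valid


# Example for "a+b+2": a and b are required; the constant 2 is not tracked here.
valid = LEFT | RIGHT
ready = 0
ready = mark_arrival(ready, LEFT)    # a arrives
a_only = is_ready_to_complete(valid, ready)
ready = mark_arrival(ready, RIGHT)   # b arrives
both = is_ready_to_complete(valid, ready)
```

With this encoding, `a_only` is false and `both` is true, matching the rule that the ready-to-complete state is reached only when the first input and the second input have both arrived.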
S302: Obtain static data input information of the first graph instruction from a first register when the first input and the second input are both in the ready-to-complete state, where the static data input information is used to indicate at least one input.
The first register may be a register configured to store the static data input of the first graph instruction, and the static data input of the first graph instruction may be specifically stored in the first register in a graph building stage. The at least one input may be an input other than the first input and the second input in the at least three inputs of the first graph instruction.
In addition, the first graph instruction may be correspondingly provided with an input pointer bitfield, the input pointer bitfield may include a first pointer, and the first pointer may be used to indicate the first register. For example, when the processor stores the at least one input in the first register in the graph building stage, the processor may set the input pointer bitfield of the first graph instruction to the first pointer of the first register.
Specifically, when the first input and the second input are both in the ready-to-complete state, the processor may determine the first register based on the first pointer, and obtain the static data input information of the first graph instruction from the first register. When the at least one input indicated by the static data input information includes only one input, the processor may directly obtain the one input from the first register. When the at least one input indicated by the static data input information includes a plurality of inputs (namely, two or more inputs), the processor may further decode the static data input information based on a preset decoding policy associated with an operator of the first graph instruction, to obtain the plurality of inputs. Herein, the preset decoding policy associated with the operator of the first graph instruction may be used to indicate a dependency relationship among a plurality of operators included in the first graph instruction, related indication information of a data input corresponding to each operator, or the like.
For example, as shown in
S303: Process the first graph instruction based on the first input, the second input, and the at least one input, to obtain a first processing result.
Specifically, when the processor obtains the at least one input, the processor may process the first graph instruction based on the first input, the second input, and the at least one input. To be specific, the processor may sequentially complete related computing of the plurality of operators based on the dependency relationship between the plurality of operators of the first graph instruction and a data input corresponding to each operator, to obtain the first processing result.
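S302 and S303 can be illustrated together with the following minimal sketch. The register numbers, the packing of two static inputs into 8-bit fields, the opcode name `add-add-sll`, and the tables `DECODE_POLICY` and `EXECUTE` are all assumptions made for illustration; this application does not fix a particular decoding policy.

```python
# Hypothetical decoding policies: opcode -> function that unpacks the raw
# content of the first register into a list of static data inputs.
DECODE_POLICY = {
    "add-sll": lambda raw: [raw],                                # one input, used directly
    "add-add-sll": lambda raw: [raw & 0xFF, (raw >> 8) & 0xFF],  # two packed 8-bit inputs
}

# Hypothetical operator chains, applied in dependency order.
EXECUTE = {
    "add-sll": lambda left, right, s: (left + right) << s[0],
    "add-add-sll": lambda left, right, s: (left + right + s[0]) << s[1],
}


def process_instruction(opcode, left, right, input_pointer, register_file):
    # S302: locate the first register through the input pointer and decode its content.
    raw = register_file[input_pointer]
    static_inputs = DECODE_POLICY[opcode](raw)
    # S303: compute the operators in order, using the dynamic and static inputs.
    return EXECUTE[opcode](left, right, static_inputs)


registers = {7: 3, 8: (2 << 8) | 5}
single = process_instruction("add-sll", 1, 2, 7, registers)      # (1 + 2) << 3
packed = process_instruction("add-add-sll", 1, 2, 8, registers)  # (1 + 2 + 5) << 2
```

Only the two dynamic inputs travel through the input ports in this sketch; the static inputs are read from the register at the moment the instruction is processed.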
In this embodiment of this application, for the first graph instruction having at least three inputs, the first input and the second input of the first graph instruction may be input through an input port, the first input and the second input may be dynamic data inputs, and the at least one other input is stored in the first register. Subsequently, the processor may detect whether the first input and the second input are in the ready-to-complete state, and when the first input and the second input are in the ready-to-complete state, obtain the at least one input from the first register, to process the first graph instruction based on the first input, the second input, and the at least one input. This avoids interference of intermediate-state data of the first graph instruction with another data flow, reduces transfer operations of the static data input of the first graph instruction, and increases reuse efficiency of the static data input.
Further, the first graph instruction may have at least three output addresses, an output address bitfield of the first graph instruction may include one or two output addresses in the at least three output addresses, and an output address other than the one or two output addresses in the at least three output addresses is stored in the second register. Correspondingly, as shown in
S304: Send the first processing result to the one or two output addresses.
The first graph instruction may be correspondingly provided with the output address bitfield, and a bit width of the output address bitfield may be fixed. For example, the bit width of the output address bitfield may be 32 bits. The output address bitfield may include one output address or two output addresses. When the output address bitfield includes one output address and the output address is a first output address, the processor may send the first processing result to the first output address. When the output address bitfield includes two output addresses and the two output addresses are a first output address and a second output address, the processor may respectively send the first processing result to the first output address and the second output address.
For example, when the processor includes a plurality of operation units and an operation unit that processes the first graph instruction is a first operation unit, the first operation unit may specifically send the first processing result to the one or two output addresses.
S305: Obtain output address information of the first graph instruction from the second register, and send the first processing result to an output address indicated by the output address information.
The second register may be used to store the output address other than the one or two output addresses in the at least three output addresses. For example, in the graph building stage, the processor may store the output address other than the one or two output addresses in the at least three output addresses in the second register.
In addition, when the output address bitfield includes one output address, the output address bitfield may further include a second pointer, and the second pointer may be used to indicate the second register. When the output address bitfield includes two output addresses, the first graph instruction may be further provided with an output pointer bitfield, the output pointer bitfield may include a second pointer, and the second pointer may be used to indicate the second register.
Specifically, the processor may determine the second register based on the second pointer, and obtain the output address information of the first graph instruction from the second register. The processor may further decode the output address information, to obtain one or more output addresses. The one or more output addresses are the output address other than the one or two output addresses in the at least three output addresses. The processor may send the first processing result to each of the one or more output addresses.
Optionally, when the processor includes a plurality of operation units and an operation unit that processes the first graph instruction is a first operation unit, the first operation unit may specifically send the first processing result to a second operation unit in the plurality of operation units, and the second operation unit may send the first processing result to each of the one or more output addresses. The second operation unit may make full use of a bus bandwidth, and send the first processing result to the one or more output addresses by using a bus resource in an idle state without disturbing another data flow, to realize efficient use of a bus communications load. Herein, the first operation unit and the second operation unit may be specifically a graph computing flow unit in the foregoing provided processor architecture.
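The division of work between S304 and S305 can be sketched as follows. The class `OperationUnit`, the function `dispatch_result`, and the representation of the second register as a tuple of output addresses are hypothetical; the sketch also does not model bus arbitration or the use of idle bus resources.

```python
from collections import defaultdict


class OperationUnit:
    def __init__(self, name, inbox):
        self.name = name
        self.inbox = inbox  # shared map: output address -> list of delivered results
        self.sent_to = []   # output addresses this unit has written to

    def send(self, result, addresses):
        for address in addresses:
            self.inbox[address].append(result)
            self.sent_to.append(address)


def dispatch_result(result, inline_addresses, output_pointer, register_file,
                    first_unit, second_unit):
    # S304: the first operation unit sends the result to the one or two
    # output addresses carried in the output address bitfield.
    first_unit.send(result, inline_addresses)
    # S305: the remaining output addresses are obtained from the second
    # register, and the second operation unit forwards the result to them.
    if output_pointer is not None:
        extra_addresses = register_file[output_pointer]
        second_unit.send(result, extra_addresses)


inbox = defaultdict(list)
unit1 = OperationUnit("first", inbox)
unit2 = OperationUnit("second", inbox)
dispatch_result(24, [4, 5], 9, {9: (6, 7, 8)}, unit1, unit2)
```

After the call, the result 24 has been delivered to addresses 4 and 5 by the first unit and to addresses 6, 7, and 8 by the second unit.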
As shown in
Further, when the first graph instruction has a plurality of inputs and a plurality of output addresses, the first register and second register may be a same register, so that the first pointer and second pointer may be a same pointer. For example, as shown in
In this embodiment of this application, for the first graph instruction having at least three data inputs, the at least three data inputs usually include one or two dynamic data inputs and one or more static data inputs, and the one or more static data inputs may be stored in the first register. Subsequently, the processor may process the first graph instruction by detecting whether the first input and the second input of the first graph instruction are in the ready-to-complete state, and obtaining the static data input of the first graph instruction from the first register when the first input and the second input are in the ready-to-complete state, to avoid interference of intermediate-state data of the first graph instruction to another data flow, reduce a transfer operation of the static data input of the first graph instruction, increase reuse efficiency of the static data input, and improve performance of a directed graphflow computing architecture.
Further, when the first graph instruction has at least three output addresses, one or more output addresses in the at least three output addresses may be stored in the second register, so that it can be ensured that a graph instruction having at least three output addresses in an instruction set is configured in a directed graphflow computing architecture in a regular format. Subsequently, when the first processing result of the first graph instruction is output, one or more output addresses of the first graph instruction may be obtained from the second register, and the first processing result may be sent to each output address, to implement result processing of a graph instruction having at least three output addresses. In addition, compared with the conventional technology, in the method, instruction space is not wasted, and large power consumption is not generated, to ensure that a computing architecture has better performance.
S401: Determine a first processing result of a first graph instruction, where the first graph instruction has at least three output addresses.
The first graph instruction may be any graph instruction in an instruction set of a directed graph computing architecture. For example, the first graph instruction may be a graph instruction having two inputs, or may be a graph instruction having at least three inputs. Specifically, when the first graph instruction is a graph instruction having two inputs, the processor may determine the first processing result of the first graph instruction in a method in the conventional technology; or when the first graph instruction is a graph instruction having at least three inputs, the processor may determine the first processing result of the first graph instruction in the graph instruction processing method for a graph instruction having a plurality of inputs provided in this application.
In addition, the first graph instruction may be correspondingly provided with an output address bitfield, and a bit width of the output address bitfield may be fixed. For example, the bit width of the output address bitfield may be 32 bits. For the at least three output addresses of the first graph instruction, the output address bitfield of the first graph instruction may be used to indicate one or two output addresses in the at least three output addresses, and an output address other than the one or two output addresses in the at least three output addresses may be stored in a second register. For example, when the output address bitfield includes one output address and the output address is a first output address, an output address other than the first output address in the at least three output addresses may be stored in the second register. When the output address bitfield includes two output addresses and the two output addresses are a first output address and a second output address, an output address other than the first output address and the second output address in the at least three output addresses may be stored in the second register.
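The split of output addresses between the output address bitfield and the second register can be sketched as follows. The function names, the dictionary used to represent an instruction, and the choice of register number are hypothetical and serve only to illustrate the two cases described above.

```python
def split_output_addresses(addresses, inline_slots=2):
    # inline_slots is the number of addresses kept in the output address bitfield.
    if inline_slots not in (1, 2):
        raise ValueError("the output address bitfield holds one or two addresses")
    if len(addresses) < 3:
        raise ValueError("this sketch expects at least three output addresses")
    inline = list(addresses[:inline_slots])
    spilled = list(addresses[inline_slots:])
    return inline, spilled


def build_instruction(instr_id, opcode, addresses, register_file, free_register,
                      inline_slots=2):
    # Graph building stage: keep one or two addresses in the instruction and
    # store the remaining addresses in the second register.
    inline, spilled = split_output_addresses(addresses, inline_slots)
    register_file[free_register] = tuple(spilled)
    return {
        "id": instr_id,
        "opcode": opcode,
        "dests": inline,           # content of the output address bitfield
        "out_ptr": free_register,  # second pointer, indicating the second register
    }


registers = {}
two_inline = build_instruction(1, "add", [10, 11, 12, 13], registers, 5)
one_inline = build_instruction(2, "add", [20, 21, 22], registers, 6, inline_slots=1)
```

In both cases the instruction keeps the regular format, and only the pointer to the second register is needed to recover the remaining output addresses.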
S402: Send the first processing result to the one or two output addresses included in the output address bitfield.
Specifically, when the output address bitfield of the first graph instruction includes the first output address, the processor may send the first processing result to the first output address; or when the output address bitfield of the first graph instruction includes the first output address and the second output address, the processor may respectively send the first processing result to the first output address and the second output address.
For example, when the processor includes a plurality of operation units and an operation unit that processes the first graph instruction is a first operation unit, the first operation unit may specifically send the first processing result to the first output address, or the first output address and the second output address.
S403: Obtain output address information of the first graph instruction from the second register, and send the first processing result to an output address indicated by the output address information.
In a graph building stage, the processor may store the output address other than the one or two output addresses in the at least three output addresses in the second register. When the output address bitfield includes one output address, the output address bitfield may further include a second pointer, and the second pointer may be used to indicate the second register. When the output address bitfield includes two output addresses, the first graph instruction may be further provided with an output pointer bitfield, the output pointer bitfield may include a second pointer, and the second pointer may be used to indicate the second register.
Specifically, the processor may determine the second register based on the second pointer, and obtain the output address information of the first graph instruction from the second register. The processor may further decode the output address information, to obtain one or more output addresses. The one or more output addresses are the output address other than the one or two output addresses in the at least three output addresses. The processor may send the first processing result to each of the one or more output addresses.
Optionally, when the processor includes a plurality of operation units and an operation unit that processes the first graph instruction is a first operation unit, the first operation unit may specifically send the first processing result to a second operation unit in the plurality of operation units, and the second operation unit may send the first processing result to each of the one or more output addresses. The second operation unit may make full use of a bus bandwidth, and send the first processing result to the one or more output addresses by using a bus resource in an idle state without disturbing another data flow, to realize efficient use of a bus communications load. Herein, the first operation unit and the second operation unit may be specifically a graph computing flow unit in the foregoing provided processor architecture.
For example, it is assumed that a plurality of graph instructions included in an instruction set of a directed graph are shown in
In this embodiment of this application, when the first graph instruction has at least three output addresses, one or more output addresses in the at least three output addresses may be stored in the second register, so that it can be ensured that a graph instruction having at least three output addresses in an instruction set is configured in a directed graphflow computing architecture in a regular format. Subsequently, when the first processing result of the first graph instruction is output, one or more output addresses of the first graph instruction may be obtained from the second register, and the first processing result may be sent to each output address, to implement result processing of a graph instruction having at least three output addresses. In addition, compared with the conventional technology, in the method, instruction space is not wasted, and large power consumption is not generated, to ensure that a computing architecture has better performance.
In embodiments of this application, the graph instruction processing apparatus may be divided into functional modules based on the foregoing method examples. For example, each functional module may be obtained through division for a corresponding function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in a form of hardware, or may be implemented in a form of a software functional module. It should be noted that, in embodiments of this application, division into the modules is an example and is merely logical function division, and may be other division in an actual implementation.
When each functional module is obtained through division for each corresponding function,
In a possible embodiment, the detection unit 501 is configured to support the apparatus to perform S301 in the method embodiments, the first operation unit 502 is configured to support the apparatus to perform S302, S303, and S304 in the method embodiments, and the second operation unit 503 is configured to support the apparatus to perform S305 in the method embodiments. In another possible embodiment, the first operation unit 502 is configured to support the apparatus to perform S401 and S402 in the method embodiments, and the second operation unit 503 is configured to support the apparatus to perform S403 in the method embodiments.
It should be noted that, all related content of steps in the foregoing method embodiments may be cited in function descriptions of corresponding functional modules. Details are not described herein again.
Based on a hardware implementation, the detection unit 501 in this application may be an instruction scheduling unit, and the first operation unit 502 and the second operation unit 503 may be a graph computing flow unit.
An embodiment of this application further provides a graph instruction processing apparatus. A specific structure of the apparatus is shown in
In a possible embodiment, the instruction scheduling unit 1011 is configured to support the apparatus to perform S301 in the method embodiments; there may be two graph computing flow units 1012, one graph computing flow unit 1012 is configured to support the apparatus to perform S302, S303, and S304 in the method embodiments, and the other graph computing flow unit 1012 is configured to support the apparatus to perform S305 in the method embodiments. In another possible embodiment, there may be two graph computing flow units 1012, one graph computing flow unit 1012 is configured to support the apparatus to perform S401 and S402 in the method embodiments, and the other graph computing flow unit 1012 is configured to support the apparatus to perform S403 in the method embodiments.
In this embodiment of this application, for a first graph instruction having at least three data inputs, the at least three data inputs usually include one or two dynamic data inputs and one or more static data inputs, and the one or more static data inputs may be stored in a first register. Subsequently, a processor may process the first graph instruction by detecting whether a first input and a second input of the first graph instruction are in a ready-to-complete state, and obtaining a static data input of the first graph instruction from the first register when the first input and the second input are in the ready-to-complete state, to avoid interference of intermediate-state data of the first graph instruction to another data flow, reduce a transfer operation of the static data input of the first graph instruction, increase reuse efficiency of the static data input, and improve performance of a directed graphflow computing architecture.
Further, when the first graph instruction has at least three output addresses, one or more output addresses in the at least three output addresses may be stored in a second register, so that it can be ensured that a graph instruction having at least three output addresses in an instruction set is configured in a directed graphflow computing architecture in a regular format. Subsequently, when a first processing result of the first graph instruction is output, one or more output addresses of the first graph instruction may be obtained from the second register, and the first processing result may be sent to each output address, to implement result processing of a graph instruction having at least three output addresses. In addition, compared with the conventional technology, in the method, instruction space is not wasted, and large power consumption is not generated, to ensure that a computing architecture has better performance.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used to implement embodiments, the foregoing embodiments may be implemented partially in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to embodiments of this application are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
Based on this, an embodiment of this application further provides a readable storage medium. The readable storage medium stores instructions. When the readable storage medium runs on a device, the device is enabled to perform one or more steps in the method embodiment in
An embodiment of this application further provides a readable storage medium. The readable storage medium stores instructions. When the readable storage medium runs on a device, the device is enabled to perform one or more steps in the method embodiment in
An embodiment of this application further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform one or more steps in the method embodiment in
Another aspect of this application provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform one or more steps in the method embodiment in
In conclusion, the foregoing descriptions are merely specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
This application is a continuation of International Application No. PCT/CN2020/110868, filed on Aug. 24, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2020/110868 | Aug 2020 | US |
| Child | 18172492 | | US |