The present application is a U.S. National Stage of International Application No. PCT/CN2023/107827, filed on Jul. 18, 2023, which claims the benefit of priority to Chinese Application No. 202211148079.3, filed on Sep. 21, 2022, the contents of all of which are incorporated by reference herein in their entireties for all purposes.
TECHNICAL FIELD
The present disclosure relates to the field of deep learning, and in particular, to deep learning image classification oriented to heterogeneous computing devices.
In recent years, deep learning models have been widely used in various scenarios, including object detection, speech recognition, machine translation, etc. In these applications, researchers improve the accuracy and generalization ability of deep learning models by increasing the number of trainable parameters. For example, the state-of-the-art language model Megatron-NLG has 530 billion parameters, and its accuracy on the next-word prediction task of LAMBADA is 87.15%. Reasoning based on large-scale deep learning models requires a large amount of memory to store parameters and intermediate variables. However, the memory size of each device is limited and usually cannot hold a large-scale deep learning model. For example, a GPT-3 model with 175 billion parameters requires 350 GB of GPU memory, which is far beyond the memory size of any commercial off-the-shelf GPU.
In addition, with the popularity of the Internet of Things (IoT), large-scale deep learning model based reasoning using multiple IoT devices, such as cell phones and smart sensors, has been proposed to meet the privacy, latency, and budget requirements of IoT applications. Therefore, it is usually desirable to divide a large-scale deep learning model into multiple sub-models (each sub-model includes at least one operator) and distribute the multiple sub-models to multiple computing devices for running, so as to meet the latency requirements of large-scale deep learning model based reasoning. Generally speaking, the end-to-end reasoning latency of the large-scale deep learning model should be as small as possible, and the assignment of operators should take into account the computing time of each operator on each computing device and the network conditions between devices.
This process is modeled as an integer linear programming (ILP) model by an existing method, but this method has the following problems.
Firstly, this method cannot be extended to large-scale computing devices. Typically, the existing modeling can only adapt to three computing devices, which is not applicable to cases that use a large number of IoT devices for reasoning.
Secondly, this method does not take into account that the computing time of an operator is different on different devices. However, there are differences in computing power, memory sizes and network transmission capacity (determined by bandwidths) of different devices. If the computing time of an operator is assumed to be the same on every computing device, the resulting operator assignment cannot guarantee the optimal end-to-end reasoning latency.
In order to realize collaborative reasoning of a deep learning model on large-scale heterogeneous computing devices and optimal end-to-end image classification latency, the present disclosure adopts the following technical solution.
A deep learning image classification method oriented to heterogeneous computing devices includes: modeling a deep learning model as an original directed acyclic graph; generating a new directed acyclic graph by replacing each directed edge of the original directed acyclic graph with a new node representing a communication task; constructing a plurality of parameters for a plurality of computing devices; setting assignment decision parameters, communication decision parameters and time decision parameters; assigning operators of the deep learning model to the plurality of computing devices for execution under one or more constraint conditions with a goal of minimizing reasoning completion time of the deep learning model; and inputting an image into one or more of the plurality of computing devices to classify the image,
where the one or more constraint conditions include the following.
Further, the original directed acyclic graph can be represented as:
G=(V,E),
Setting a serial number of each of the computing devices to k, a set of the serial numbers of the plurality of computing devices is K, and maxi∈VCi represents completion time of a last operator in the deep learning model.
Further, in the one or more constraint conditions, the completion time Ci of the ith computing task or communication task is less than or equal to start time Sj of a direct or indirect immediately following computing task or communication task j, and can be represented as:
Ci≤Sj, ∀(i,j)∈Ē.
Further, in the one or more constraint condition, when the ith operator is assigned to the kth computing device for execution, the completion time Ci of the ith operator is computing start time Si of the ith operator plus time pik required for the kth computing device to execute the ith operator, and can be represented as:
Ci=Si+Σk∈Kpikxik, ∀i∈V.
Further, in the one or more constraint conditions, one of the operators is calculated by only one of the computing devices, and is not interrupted in a computing process, so a sum of values of the assignment decision parameters xik over the K computing devices is 1, and can be represented as:
Σk∈Kxik=1, ∀i∈V.
Further, in the one or more constraint conditions, a memory size mi occupied by operators on each of the computing devices k cannot exceed a memory size Memk of the computing device k, and can be represented as:
Σi∈Vmixik≤Memk, ∀k∈K.
Further, in the one or more constraint condition, when two operators i and j without a sequential relationship in the original directed acyclic graph are assigned on a computing device k for execution, one operator is executed by the computing device k at a time, and can be represented as:
Si≥Cj−Msδij−Ml(2−xik−xjk), and
Sj≥Ci−Ms(1−δij)−Ml(2−xik−xjk), ∀k∈K,
where δij is a 0-1 variable representing an execution order of the operators i and j on the computing device k, and Ms and Ml are sufficiently large constants.
Further, in the one or more constraint conditions, when two operators that transmit data to each other are assigned to a same computing device for execution, transmission latency of the communication task q between these two operators can be ignored. When two operators i and j that transmit data to each other are assigned to different computing devices k′ and k″, the communication task q between the operators selects at most one transmission channel k′→k″, that is, xik′=xjk″=1, and the transmission latency between start time Sq and end time Cq of the communication task q cannot be ignored, that is, there is a data transmission latency, which can be represented as:
zq≤2−xik−xjk, ∀q∈E, ∀k∈K,
zq≥xik−xjk, ∀q∈E, ∀k∈K,
zq≥xjk−xik, ∀q∈E, ∀k∈K,
uqk′k″≥xik′+xjk″−1, ∀q∈E, ∀k′,k″∈K,
zq=Σk′∈KΣk″∈Kuqk′k″, ∀q∈E, and
Cq=Sq+zqpqk′k″comm, ∀q∈E.
Further, in the one or more constraint condition, when there are a plurality of communication tasks q and r between two of the computing devices, one communication task q or r is executed at a time, and can be represented as:
Sq≥Cr−Msδqr−Ml(2−zq−zr)+Mr(xak+xck−xbk−xdk−2),
Sr≥Cq−Ms(1−δqr)−Ml(2−zq−zr)+Mr(xak+xck−xbk−xdk−2),
Sq≥Cr−Msδqr−Ml(2−zq−zr)+Mr(xbk+xdk−xak−xck−2),
Sr≥Cq−Ms(1−δqr)−Ml(2−zq−zr)+Mr(xbk+xdk−xak−xck−2),
where the communication task q is a communication task between operators a and b, the communication task r is a communication task between operators c and d, δqr is a 0-1 variable representing an execution order of the communication tasks q and r, and Mr is a sufficiently large constant.
A deep learning image classification apparatus oriented to heterogeneous computing devices, including one or more memories and one or more processors, where executable codes are stored in the one or more memories, and when the executable codes are executed by the one or more processors, the deep learning image classification method oriented to heterogeneous computing devices is realized.
According to the deep learning image classification methods and apparatuses oriented to heterogeneous computing devices, a new directed acyclic graph is established to match the original directed acyclic graph modeled from a deep learning model, and parameters based on the computing devices, the computing tasks and the communication tasks, together with corresponding constraint conditions, are constructed, so that a plurality of operators in the deep learning model are reasonably assigned to a plurality of computing devices with the goal of minimizing the reasoning completion time of the deep learning model, thereby effectively improving the efficiency of executing the deep learning model for image classification on a plurality of computing devices.
Specific implementations of the present disclosure are described in detail below in conjunction with the accompanying drawings. It should be understood that the specific implementations described herein are only used to illustrate and explain the present disclosure, and are not used to limit the present disclosure.
As shown in the accompanying drawings, the deep learning image classification method oriented to heterogeneous computing devices includes the following steps.
At step S1, a deep learning model is modeled as an original directed acyclic graph, and the computing time of each operator in the deep learning model to be executed on each of a plurality of computing devices, as well as the data transmission time of data transferred between operators located on two of the computing devices, is acquired. Nodes of the original directed acyclic graph represent the operators of the deep learning model (which can also be called computing tasks), and directed edges of the original directed acyclic graph represent data transmission between the operators.
Specifically, taking a deep learning model containing four operators as an example, the deep learning model is modeled as an original directed acyclic graph
G=(V,E).
Nodes V={η1,η2,η3,η4} in the original directed acyclic graph G represent the operators (computing tasks) of the deep learning model.
Directed edges E={l1,l2,l3,l4} in the original directed acyclic graph G represent data transmission between the operators of the deep learning model.
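For illustration only, an original directed acyclic graph of this four-operator example can be built with the networkx library as in the following Python sketch; because the drawings are not reproduced here, the diamond-shaped edge topology (η1→η2, η1→η3, η2→η4, η3→η4) is an assumption made for the sketch, not necessarily the topology of the actual figure.

import networkx as nx

# Original directed acyclic graph G = (V, E): nodes are operators
# (computing tasks) and directed edges are data transmissions.
# The diamond topology below is only an assumed example.
G = nx.DiGraph()
G.add_nodes_from(["eta1", "eta2", "eta3", "eta4"])
G.add_edge("eta1", "eta2", name="l1")
G.add_edge("eta1", "eta3", name="l2")
G.add_edge("eta2", "eta4", name="l3")
G.add_edge("eta3", "eta4", name="l4")

# A topological sort recovers a valid execution order of the operators.
print(list(nx.topological_sort(G)))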
At step S2, a new directed acyclic graph is generated by replacing each directed edge in the original directed acyclic graph with a new node to represent a communication task between two of the computing tasks, and adding new directed edges between the new nodes and original nodes to maintain a topology of the original directed acyclic graph.
Correspondingly, for the above example, each of the directed edges l1, l2, l3 and l4 in the original directed acyclic graph G is replaced with a new node representing a communication task between the two computing tasks that the edge connects.
Nodes in the new directed acyclic graph therefore include the original nodes η1, η2, η3 and η4, which represent the computing tasks, and the new nodes l1, l2, l3 and l4, which represent the communication tasks.
Directed edges Ē in the new directed acyclic graph connect each computing task with its adjacent communication tasks, so as to maintain the topology of the original directed acyclic graph.
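A minimal sketch of the edge-to-node transformation of step S2 is given below; it assumes a networkx graph like the one in the previous sketch and is only one possible way to implement the replacement.

import networkx as nx

def to_new_dag(original: nx.DiGraph) -> nx.DiGraph:
    """Replace each directed edge (i, j) with a communication-task node q
    and two new directed edges (i, q) and (q, j), preserving the topology."""
    new_graph = nx.DiGraph()
    new_graph.add_nodes_from(original.nodes, kind="compute")
    for i, j, attrs in original.edges(data=True):
        q = attrs.get("name", f"comm_{i}_{j}")   # communication-task node
        new_graph.add_node(q, kind="comm")
        new_graph.add_edge(i, q)
        new_graph.add_edge(q, j)
    return new_graph

# For the four-operator example above, the new graph contains the nodes
# eta1..eta4 (computing tasks) and l1..l4 (communication tasks).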
At step S3, a plurality of parameters are constructed for the plurality of computing devices based on the original directed acyclic graph and the new directed acyclic graph. The plurality of parameters include processing time and a memory overhead for each of the computing devices to run each of the computing tasks, transmission latency for each of the communication tasks, one or more immediately following tasks for each of the computing tasks based on the original directed acyclic graph, and one or more immediately following tasks for each of the computing tasks or each of the communication tasks based on the new directed acyclic graph.
For example, the deep learning model Inceptionv4 Block C may be placed on a plurality of heterogeneous computing devices for running, and the above parameters are acquired for its operators and communication tasks on the plurality of computing devices. For another example, the above parameters may similarly be acquired for the deep learning model with four operators described above.
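As a concrete, purely hypothetical illustration of the parameters of step S3, the following Python sketch stores them in plain dictionaries; every set and numeric value (operator names, device count, times, memory sizes, latencies) is a placeholder, not a measured profile of Inceptionv4 Block C or any other model.

# Hypothetical sets and profiled values for 4 operators and 2 devices.
V = ["eta1", "eta2", "eta3", "eta4"]            # computing tasks
Q = ["l1", "l2", "l3", "l4"]                    # communication tasks (one per original edge)
K = [1, 2]                                      # device serial numbers

# p[(i, k)]: time for device k to execute operator i (e.g. in ms).
p = {("eta1", 1): 2.0, ("eta1", 2): 3.5,
     ("eta2", 1): 4.0, ("eta2", 2): 2.5,
     ("eta3", 1): 1.5, ("eta3", 2): 1.0,
     ("eta4", 1): 3.0, ("eta4", 2): 2.0}
# m[i]: memory occupied by operator i; Mem[k]: memory size of device k (e.g. in MB).
m = {"eta1": 50, "eta2": 120, "eta3": 80, "eta4": 60}
Mem = {1: 200, 2: 256}
# p_comm[(q, k1, k2)]: transmission latency of communication task q over channel k1 -> k2.
p_comm = {(q, k1, k2): (0.0 if k1 == k2 else 1.2)
          for q in Q for k1 in K for k2 in K}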
At step S4, assignment decision parameters, communication decision parameters and time decision parameters are set, where the assignment decision parameters represent assignment of the computing tasks to corresponding computing devices for execution, the communication decision parameters represent communication time of the communication tasks, and the time decision parameters represent start time of the computing tasks.
Specifically, for an assignment decision parameter xik∈{0,1}, xik=1 represents that a task i is assigned to a kth device for execution, while xik=0 indicates that the task i is not assigned to the kth device for execution. For example, the assignment decision parameter xi1=1 represents that the task i is assigned to a 1st device for execution.
For communication decision parameters uqk′k″∈{0,1} and zq∈{0,1}, uqk′k″=1 represents that a communication task q existing between the task i and a task j, where (i,q),(q,j)∈Ē, selects a transmission channel k′→k″, otherwise uqk′k″=0; and zq=Σk′∈KΣk″∈Kuqk′k″, where zq=1 represents that the communication task q existing between the task i and the task j, where (i,q),(q,j)∈Ē, incurs communication time.
A time decision parameter Si∈ℝ+ represents start time of the task i. Generally speaking, the time decision parameters affect the scheduling result oriented to heterogeneous computing devices to some extent.
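One way to declare these decision parameters is with a mixed-integer programming library such as PuLP (not named in the present disclosure); the sketch below uses the hypothetical sets from the previous sketch and is not the only possible encoding.

import pulp

V = ["eta1", "eta2", "eta3", "eta4"]   # computing tasks
Q = ["l1", "l2", "l3", "l4"]           # communication tasks
K = [1, 2]                             # devices

# x[i][k] = 1 if computing task i is assigned to device k.
x = pulp.LpVariable.dicts("x", (V, K), cat="Binary")
# u[q][k1][k2] = 1 if communication task q uses transmission channel k1 -> k2.
u = pulp.LpVariable.dicts("u", (Q, K, K), cat="Binary")
# z[q] = 1 if communication task q incurs transmission latency.
z = pulp.LpVariable.dicts("z", Q, cat="Binary")
# S[t] / C[t]: start and completion times of computing or communication task t.
S = pulp.LpVariable.dicts("S", V + Q, lowBound=0)
C = pulp.LpVariable.dicts("C", V + Q, lowBound=0)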
At step S5, one or more constraint conditions are constructed based on the parameters set in the step S3 and the step S4, and the operators of the deep learning model are assigned to the plurality of computing devices for execution with a goal of minimizing reasoning completion time of the deep learning model.
For the computing task i∈V, completion time of an ith operator is represented by Ci. Reasoning completion time of a deep learning model is determined by completion time of its last operator, so maxi∈VCi can be used to represent the reasoning completion time of the deep learning model. Correspondingly, the goal is to minimize this reasoning completion time, which is represented by min maxi∈VCi.
Specifically, for the original directed acyclic graph in the above example, the last operator is η4, so the reasoning completion time is maxi∈VCi=Cη4.
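Because a linear model cannot minimize a max(·) term directly, a standard linearization introduces an auxiliary variable bounded below by every Ci and minimizes that variable; the following PuLP fragment sketches this idea under the same hypothetical operator set.

import pulp

V = ["eta1", "eta2", "eta3", "eta4"]
prob = pulp.LpProblem("operator_placement", pulp.LpMinimize)
C = pulp.LpVariable.dicts("C", V, lowBound=0)

# Auxiliary variable standing for max_i C_i, i.e. the reasoning completion time.
C_max = pulp.LpVariable("C_max", lowBound=0)
for i in V:
    prob += C_max >= C[i]      # C_max is at least the completion time of every operator
prob += C_max                  # objective: minimize the reasoning completion time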
Correspondingly, constraint conditions can include the following.
For the ith computing or communication task, the end time Ci must be less than or equal to start time Sj of its direct or indirect immediately following task j, which can be represented as:
Ci≤Sj, ∀(i,j)∈Ē.
When the ith operator is assigned to the kth computing device, its completion time can be represented as calculation start time plus time required for calculation, that is:
Ci=Si+Σk∈Kpikxik, ∀i∈V.
In addition, an operator is calculated by only one computing device, and cannot be interrupted in a computing process. Therefore, a sum of the assignment decision parameters xik over all K computing devices is 1, and can be represented as:
Σk∈Kxik=1, ∀i∈V.
Memory size occupied by operators on each device cannot exceed memory size of the device, and can be represented as:
Σi∈Vmixik≤Memk, ∀k∈K.
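These two constraint families translate directly into linear constraints; the fragment below is a sketch using the same hypothetical sets and placeholder memory values as before.

import pulp

V = ["eta1", "eta2", "eta3", "eta4"]
K = [1, 2]
m = {"eta1": 50, "eta2": 120, "eta3": 80, "eta4": 60}
Mem = {1: 200, 2: 256}

prob = pulp.LpProblem("operator_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (V, K), cat="Binary")

# Each operator is assigned to exactly one device and is never split.
for i in V:
    prob += pulp.lpSum(x[i][k] for k in K) == 1
# The operators placed on a device must fit into its memory.
for k in K:
    prob += pulp.lpSum(m[i] * x[i][k] for i in V) <= Mem[k]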
Only one operator can be executed by a computing device at a time. Therefore, for two operators i and j without a sequential relationship in the directed acyclic graph, when they are assigned to a same device for execution, their execution time cannot overlap, and this relationship can be established by the following model:
Si≥Cj−Msδij−Ml(2−xik−xjk), and
Sj≥Ci−Ms(1−δij)−Ml(2−xik−xjk), ∀k∈K,
where δij is a 0-1 variable representing an execution order of the operators i and j on the computing device k, and Ms and Ml are sufficiently large constants.
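A sketch of how this pair of big-M constraints can be generated is given below; it assumes δij is declared as a 0-1 ordering variable, takes Ms and Ml as arbitrary large constants, and uses the assumed diamond topology in which η2 and η3 are the only unordered pair.

import pulp

V = ["eta1", "eta2", "eta3", "eta4"]
K = [1, 2]
Ms = Ml = 10000.0          # big-M constants larger than any feasible completion time

prob = pulp.LpProblem("operator_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (V, K), cat="Binary")
S = pulp.LpVariable.dicts("S", V, lowBound=0)
C = pulp.LpVariable.dicts("C", V, lowBound=0)

# Operator pairs with no precedence relation, e.g. eta2 and eta3 in the
# assumed diamond-shaped example graph.
unordered_pairs = [("eta2", "eta3")]
delta = pulp.LpVariable.dicts("delta", unordered_pairs, cat="Binary")
for i, j in unordered_pairs:
    for k in K:
        # If i and j are both placed on device k, one must finish before the other starts.
        prob += S[i] >= C[j] - Ms * delta[(i, j)] - Ml * (2 - x[i][k] - x[j][k])
        prob += S[j] >= C[i] - Ms * (1 - delta[(i, j)]) - Ml * (2 - x[i][k] - x[j][k])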
When two operators that transmit data to each other are assigned to a same device for execution, data transmission time between these two operators can be ignored. However, when two operators i and j that transmit data to each other are assigned to different computing devices k′ and k″ for execution, the communication task q between the operators i and j selects at most one transmission channel k′→k″, that is, xik′=xjk″=1, and the transmission latency between start time Sq and end time Cq of the communication task q cannot be ignored, that is, there is data transmission latency. Therefore, for the data communication task q, q∈E, the following constraints are established:
zq≤2−xik−xjk, ∀q∈E, ∀k∈K,
zq≥xik−xjk, ∀q∈E, ∀k∈K,
zq≥xjk−xik, ∀q∈E, ∀k∈K,
uqk′k″≥xik′+xjk″−1, ∀q∈E, ∀k′,k″∈K,
zq=Σk′∈KΣk″∈Kuqk′k″, ∀q∈E, and
Cq=Sq+zqpqk′k″comm, ∀q∈E.
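The channel-selection constraints can be generated in the same style; in the sketch below, each communication task is identified with the original edge it replaced, channels are restricted to pairs of distinct devices, and the latency term is written through the channel variables uqk′k″ so that the model stays linear. All names and numeric values remain hypothetical placeholders.

import pulp

V = ["eta1", "eta2", "eta3", "eta4"]
K = [1, 2]
# Each communication task q is the original edge (source, destination) it replaced.
edges = {"l1": ("eta1", "eta2"), "l2": ("eta1", "eta3"),
         "l3": ("eta2", "eta4"), "l4": ("eta3", "eta4")}
Q = list(edges)
p_comm = {(q, k1, k2): 1.2 for q in Q for k1 in K for k2 in K if k1 != k2}

prob = pulp.LpProblem("operator_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (V, K), cat="Binary")
u = pulp.LpVariable.dicts("u", (Q, K, K), cat="Binary")
z = pulp.LpVariable.dicts("z", Q, cat="Binary")
S = pulp.LpVariable.dicts("S", Q, lowBound=0)
C = pulp.LpVariable.dicts("C", Q, lowBound=0)

for q, (i, j) in edges.items():
    for k in K:
        prob += z[q] <= 2 - x[i][k] - x[j][k]   # same device: no transmission needed
        prob += z[q] >= x[i][k] - x[j][k]       # different devices: transmission needed
        prob += z[q] >= x[j][k] - x[i][k]
    # Channel selection over pairs of distinct devices only.
    for k1 in K:
        for k2 in K:
            if k1 != k2:
                prob += u[q][k1][k2] >= x[i][k1] + x[j][k2] - 1
    prob += z[q] == pulp.lpSum(u[q][k1][k2] for k1 in K for k2 in K if k1 != k2)
    # Completion time of the communication task includes the latency of the chosen channel.
    prob += C[q] == S[q] + pulp.lpSum(
        u[q][k1][k2] * p_comm[(q, k1, k2)] for k1 in K for k2 in K if k1 != k2)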
In addition, when there are a plurality of communication tasks between two computing devices, only one communication task can be executed at a time, so this relationship is established by the following model:
Sq≥Cr−Msδqr−Ml(2−zq−zr)+Mr(xak+xck−xbk−xdk−2),
Sr≥Cq−Ms(1−δqr)−Ml(2−zq−zr)+Mr(xak+xck−xbk−xdk−2),
Sq≥Cr−Msδqr−Ml(2−zq−zr)+Mr(xbk+xdk−xak−xck−2), and
Sr≥Cq−Ms(1−δqr)−Ml(2−zq−zr)+Mr(xbk+xdk−xak−xck−2).
Sq and Cq represent the start time and the completion time of the communication task q, respectively; Sr and Cr represent the start time and the completion time of the communication task r, respectively; the communication task q is a communication task between operators a and b; the communication task r is a communication task between operators c and d; δqr is a 0-1 variable representing an execution order of the communication tasks q and r; and Mr is a sufficiently large constant.
Specifically, for the original and new directed acyclic graphs in the above example, the operators of the deep learning model are assigned to the plurality of computing devices according to a solution of the above model that satisfies all of the constraint conditions and minimizes the reasoning completion time.
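Once the objective and all constraint families are added, such a model can be handed to an off-the-shelf solver; the fragment below only sketches the solve-and-read-back mechanics with the CBC solver bundled with PuLP, with most of the constraints above omitted for brevity.

import pulp

V = ["eta1", "eta2", "eta3", "eta4"]
K = [1, 2]

prob = pulp.LpProblem("operator_placement", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (V, K), cat="Binary")
C_max = pulp.LpVariable("C_max", lowBound=0)
prob += C_max                                   # objective: reasoning completion time
for i in V:                                     # placeholder: only the assignment constraint
    prob += pulp.lpSum(x[i][k] for k in K) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))            # CBC solver bundled with PuLP
print("status:", pulp.LpStatus[prob.status])
assignment = {i: k for i in V for k in K if x[i][k].value() > 0.5}
print("operator -> device:", assignment)
print("minimized completion time:", C_max.value())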
At step S6, an image is input into one or more of the plurality of computing devices to classify the image based on the deep learning model that minimizes the reasoning completion time, where the one or more computing devices are assigned to execute one or more operators that are executed first in the deep learning model.
In an embodiment of the present disclosure, for an image classification task based on deep learning, an image of size 3*224*224 may be processed using the segmented deep learning model. Taking the deep learning model described above as an example, the image is input into the one or more computing devices assigned with the one or more operators executed first, the plurality of computing devices execute their respective assigned operators and transmit intermediate data between each other according to the above assignment, and the computing device assigned with the last operator outputs a classification result of the image.
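As a toy, framework-free illustration of how the assigned operators cooperate at inference time, the sketch below walks the assumed diamond-shaped graph in topological order and lets each hypothetical "device" execute only the operators assigned to it; the operator bodies and the assignment are stand-ins for real sub-models and a real optimization result.

import numpy as np
import networkx as nx

# Assumed diamond-shaped operator graph and a hypothetical assignment result.
G = nx.DiGraph([("eta1", "eta2"), ("eta1", "eta3"),
                ("eta2", "eta4"), ("eta3", "eta4")])
assignment = {"eta1": 1, "eta2": 1, "eta3": 2, "eta4": 2}

# Placeholder operator bodies; a real deployment would run the corresponding
# sub-model of the deep learning network on the assigned computing device.
ops = {
    "eta1": lambda inputs: inputs[0] * 0.5,
    "eta2": lambda inputs: inputs[0] + 1.0,
    "eta3": lambda inputs: inputs[0] - 1.0,
    "eta4": lambda inputs: sum(inputs),
}

image = np.random.rand(3, 224, 224).astype(np.float32)   # input image of size 3*224*224
outputs = {}
for node in nx.topological_sort(G):
    inputs = [outputs[p] for p in G.predecessors(node)] or [image]
    # A cross-device edge (different assignment values) would incur a
    # network transfer here before the operator is executed.
    outputs[node] = ops[node](inputs)

result = outputs["eta4"]
print("classification output produced on device", assignment["eta4"], "with shape", result.shape)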
Corresponding to the aforementioned embodiment of the deep learning image classification method oriented to heterogeneous computing devices, the present disclosure further provides an embodiment of a deep learning image classification apparatus oriented to heterogeneous computing devices.
Referring to the accompanying drawings, a deep learning image classification apparatus oriented to heterogeneous computing devices provided by an embodiment of the present disclosure includes one or more memories and one or more processors, where executable codes are stored in the one or more memories, and when the executable codes are executed by the one or more processors, the deep learning image classification method oriented to heterogeneous computing devices in the above embodiments is realized.
Embodiments of the deep learning image classification apparatus oriented to heterogeneous computing devices in the present disclosure can be applied to any device with data processing capability, which can be a device or apparatus such as a computer. The apparatus embodiments can be realized by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, an apparatus in a logical sense is formed by the processor of the device with data processing capability in which it is located reading corresponding computer program instructions from a non-volatile memory into a memory and running them. In terms of hardware, in addition to the processor and the memory, the device with data processing capability in which the apparatus is located may further include other hardware according to the actual functions of the device, which will not be repeated here.
The process of realizing the functions and roles of each unit in the above apparatus is detailed in the process of realizing the corresponding steps in the above method and will not be repeated here.
For the apparatus embodiment, because it basically corresponds to the method embodiment, it is only necessary to refer to the method embodiment for the relevant part of the description. The apparatus embodiments described above are only schematic, in which the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present disclosure. It can be understood and implemented by a person of ordinary skill in the art without creative labor.
Embodiments of the present disclosure further provide a computer-readable storage medium, on which a program is stored, and the program, when executed by a processor, realizes the deep learning image classification method oriented to heterogeneous computing devices in the above embodiments.
The computer-readable storage medium can be an internal storage unit of any device with data processing capability described in any of the previous embodiments, such as a hard disk or a memory. The computer-readable storage medium can also be an external storage device of any device with data processing capability, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc. provided on the device. Further, the computer-readable storage medium can further include both internal storage units and external storage devices of any device with data processing capability. The computer-readable storage medium is configured to store the computer program and other programs and data required by any equipment with data processing capability, and can further be configured to temporarily store data that has been output or will be output.
The above embodiments are only used to illustrate the technical solution of the present disclosure, but not to limit it. Although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments can still be modified, or some or all of its technical features can be replaced by equivalents. However, these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiment of the present disclosure.