METHOD AND APPARATUS FOR LEARNING ACTIVATED NEURONS RESPONSES TRANSFER USING SPARSE ACTIVATION MAPS IN KNOWLEDGE DISTILLATION

Information

  • Patent Application
  • Publication Number
    20240242086
  • Date Filed
    December 14, 2023
  • Date Published
    July 18, 2024
  • CPC
    • G06N3/096
    • G06N3/045
    • G06N3/048
  • International Classifications
    • G06N3/096
    • G06N3/045
    • G06N3/048
Abstract
The method for learning activated neurons responses transfer using sparse activation maps (SAMs) in knowledge distillation according to an embodiment is performed on a computing device including one or more processors and a memory that stores one or more programs executed by the one or more processors. The method includes extracting teacher SAMs by extracting a feature map from a learning model of the teacher network based on input data and filtering the extracted feature map, extracting student SAMs by extracting a feature map from a learning model of the student network based on the input data and filtering the extracted feature map, computing a loss function by comparing the extracted teacher SAMs with the extracted student SAMs, and updating the learning model of the student network based on the computed loss function.
Description
CROSS-REFERENCE TO RELATED APPLICATION AND CLAIM OF PRIORITY

This application claims the benefit under 35 USC § 119 of Korean Patent Application Nos. 10-2023-0004776, filed on Jan. 12, 2023, and 10-2023-0039977, filed on Mar. 27, 2023, in the Korean Intellectual Property Office, the entire disclosures of which are incorporated herein by reference for all purposes.


BACKGROUND
1. Field

Embodiments of the present disclosure relate to a technique for learning activated neurons responses transfer using sparse activation maps (SAMs) in knowledge distillation.


2. Description of Related Art

A heavy and complex deep learning architecture consumes high computational resources and a lot of energy. Such an architecture is not suitable for deployment in environments with limited computational resources or for real-time applications (e.g., 6G networks, mobile, IoT, AI-enabled edge, and cloud and fog computing infrastructure). A lightweight artificial intelligence (AI) system is therefore needed to support real-time services on low-performance devices.


However, a lightweight deep learning architecture performs worse than the heavy and complex deep learning architecture because it lacks generalization ability (the ability to perform well on real data rather than only on training data).


In order to solve this problem, a lightweight deep learning architecture is needed that achieves high performance on devices with limited computational resources.


SUMMARY

Embodiments of the present disclosure are intended to provide a learning method for transferring activated neuron responses so that a student network has the same neuron activation boundaries as a teacher network based on a loss function computed by comparing a learning model of the teacher network and a learning model of the student network.


According to an exemplary embodiment of the present disclosure, there is provided a method for learning activated neurons responses transfer using sparse activation maps (SAMs) in knowledge distillation performed on a computing device including one or more processors and a memory that stores one or more programs executed by the one or more processors, the method including extracting teacher sparse activation maps (SAMs) by extracting a feature map from a learning model of the teacher network based on input data and filtering the extracted feature map, extracting student sparse activation maps (SAMs) by extracting a feature map from a learning model of the student network based on the input data and filtering the extracted feature map, computing a loss function by comparing the extracted teacher sparse activation maps with the extracted student sparse activation maps, and updating the learning model of the student network based on the computed loss function.


The extracting of the teacher sparse activation maps may further include extracting the feature map using a convolution layer from the learning model of the teacher network based on the input data, extracting an activation map by filtering the extracted feature map using an activation function, and extracting the teacher sparse activation maps by filtering the extracted activation map using a filter function.


The extracting of the teacher sparse activation maps may further include normalizing the activation map extracted from the learning model of the teacher network to a preset range, and re-filtering the normalized activation map through the activation function.


The extracting of the student sparse activation maps may further include extracting the feature map using a convolution layer from the learning model of the student network based on the input data, extracting an activation map by filtering the extracted feature map using an activation function, and extracting the student sparse activation map by filtering the extracted activation map using a filter function.


The extracting of the student sparse activation maps may further include normalizing the activation map extracted from the learning model of the student network to a preset range, and re-filtering the normalized activation map through the activation function.


The activation function may be a rectified linear unit (ReLU) function, and the filter function may be a function obtained by moving the coordinates of the ReLU function and adjusting a passband.


The loss function may be a loss function obtained by comparing losses due to differences in distance between the extracted teacher sparse activation map and the extracted student sparse activation map.


The loss function may be calculated by Equation 2 below.












$$\mathcal{L}_{res}\left(\mathcal{S}^{S},\, \mathcal{S}^{T}\right) = \sum_{i=1}^{L} \left\lVert \mathcal{S}_{i}^{S} - \mathcal{S}_{i}^{T} \right\rVert_{2}^{2} \qquad [\text{Equation 2}]$$







(where L_res(⋅) is the loss function, S^S is the sparse activation maps of the learning model of the student network, S^T is the sparse activation maps of the learning model of the teacher network, and L is the number of SAMs).





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram for illustratively describing a computing environment including a computing device suitable for use in exemplary embodiments.



FIG. 2 is a block diagram illustrating a learning model of a general knowledge distillation-based teacher-student network.



FIG. 3 is a flowchart illustrating a method for learning activated neurons responses transfer using SAMs in knowledge distillation according to an embodiment of the present disclosure.



FIGS. 4A and 4B are diagrams for describing an activation function and a filter function according to an embodiment of the present disclosure.



FIG. 5 is a diagram illustrating a process of extracting sparse activation maps (SAMs) according to an embodiment of the present disclosure.



FIG. 6 is a diagram illustrating a process of computing a loss function using the sparse activation maps (SAMs) of a learning model extracted from the teacher network and sparse activation maps of a learning model extracted from the student network according to an embodiment of the present disclosure.





DETAILED DESCRIPTION

Hereinafter, a specific embodiment of the present disclosure will be described with reference to the drawings. The following detailed description is provided to aid in a comprehensive understanding of the methods, apparatus and/or systems described herein. However, this is illustrative only, and the present disclosure is not limited thereto.


In describing the embodiments of the present disclosure, when it is determined that a detailed description of related known technologies may unnecessarily obscure the subject matter of the present disclosure, the detailed description thereof will be omitted. In addition, terms to be described later are terms defined in consideration of functions in the present disclosure, and they may vary according to the intention or custom of users or operators. Therefore, their definitions should be made based on the contents throughout this specification. The terms used in the detailed description are only for describing embodiments of the present disclosure and should not be construed as limiting. Unless explicitly used otherwise, expressions in the singular form include the meaning of the plural form. In this description, expressions such as “comprising” or “including” are intended to refer to certain features, numbers, steps, actions, elements, or some or a combination thereof, and are not to be construed to exclude the presence or possibility of one or more other features, numbers, steps, actions, elements, or some or combinations thereof, other than those described.


In the following description, “transfer,” “communication,” “transmission,” “reception,” of a signal or information and other terms of similar meaning include not only direct transmission of a signal or information from one component to another component, but also transmission of the signal or information through another component. In particular, “transferring” or “transmitting” a signal or information to a component indicates a final destination of the signal or information and does not mean a direct destination. This is the same for “receiving” the signal or information. In addition, in this specification, the fact that two or more pieces of data or information are “related” means that if one data (or information) is acquired, at least part of the other data (or information) can be acquired based on it.


Meanwhile, embodiments of the present disclosure may include a program for performing the methods described in this specification on a computer, and a computer-readable recording medium containing the program. The computer-readable recording medium may include program instructions, local data files, local data structures, etc., singly or in combination. The media may be those specifically designed and constructed for the present disclosure, or may be those commonly available in the computer software field. Examples of the computer-readable recording medium include a hard disk, a magnetic medium such as a floppy disk and a magnetic tape, an optical recording medium such as CD-ROM and DVD, and a hardware device specially configured to store and perform program instructions, such as ROM, RAM, flash memory, etc. Examples of the program may include not only machine language code such as that generated by a compiler, but also high-level language code that can be executed by the computer using an interpreter, etc.



FIG. 1 is a block diagram for illustratively describing a computing environment 10 including a computing device suitable for use in exemplary embodiments. In the illustrated embodiment, respective components may have functions and capabilities other than those described below, and additional components other than those described below may be included.


The illustrated computing environment 10 includes a computing device 12. In an embodiment, the computing device 12 may be a device for performing learning of activated neuron responses transfer using SAMs in knowledge distillation according to an embodiment of the present disclosure.


The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored on the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions which, when executed by the processor 14, cause the computing device 12 to perform operations according to the exemplary embodiments.


The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. A program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, the computer-readable storage medium 16 may be a memory (volatile memory such as random access memory, non-volatile memory, or any suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that are accessible by the computing device 12 and capable of storing desired information, or any suitable combination thereof.


The communication bus 18 interconnects various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.


The computing device 12 may also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 may be connected to other components of the computing device 12 through the input/output interface 22. The exemplary input/output device 24 may include a pointing device (such as a mouse or trackpad), a keyboard, a touch input device (such as a touch pad or touch screen), a speech or sound input device, input devices such as various types of sensor devices and/or photographing devices, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The exemplary input/output device 24 may be included inside the computing device 12 as a component configuring the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.



FIG. 2 is a block diagram illustrating a learning model of a general knowledge distillation-based teacher-student network.


As illustrated in FIG. 2, the learning model of a knowledge distillation-based teacher-student network includes a teacher network and a student network. Each network is a neural network in which artificial neurons are connected to one another. In this knowledge distillation-based teacher-student network, each block of the teacher network is related to a corresponding block of the student network, so knowledge transfer occurs for all related blocks, and a loss function is also determined over all blocks.


Typically, the teacher network refers to a network that has already undergone similar prior learning or a more extensive network, and is larger than the student network. Here, a network having a larger size means a network having more layers, kernels, and nodes, or a combination thereof. In addition, the student network follows the learning model extracted from the teacher network, and has network components similar to, but smaller in size than, those of the teacher network.
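As an illustration of this size difference, the following is a minimal sketch assuming PyTorch; the architectures, layer widths, and class count are hypothetical examples and are not taken from the disclosure.

```python
# Minimal sketch (assumes PyTorch): a teacher CNN with more layers/kernels and
# a structurally similar but smaller student CNN. All layer widths here are
# hypothetical illustration values, not taken from the disclosure.
import torch.nn as nn


class TeacherNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


class StudentNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```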


One of the important technical features of the present disclosure is that the student network learns activation boundaries of the teacher network. The present disclosure can provide a learning method for transferring activated neuron responses so that a student network has the same neuron activation boundaries as a teacher network based on a loss function computed by comparing a learning model of the teacher network and a learning model of the student network. That is, the present disclosure can improve the nonlinearity of the student network, enhance generalization ability of the student network, and generate a learning model with performance close to that of the teacher network, through the learning method for transferring activated neuron responses.



FIG. 3 is a flowchart illustrating a method for learning activated neurons responses transfer using SAMs in knowledge distillation according to an embodiment of the present disclosure. As described above, the method for learning activated neurons responses transfer using SAMs in knowledge distillation according to an embodiment of the present disclosure may be performed in a computing device 12 including one or more processors and a memory that stores one or more programs to be executed by the one or more processors. To this end, the method for learning activated neurons responses transfer using SAMs in knowledge distillation may be implemented in the form of a program or software including one or more computer-executable instructions and stored on the memory.


In addition, although the method is described in the illustrated flowchart as being divided into a plurality of steps, at least some of the steps may be performed in a different order, performed together in combination with other steps, omitted, divided into detailed sub-steps, or performed with one or more additional steps (not illustrated).


At step 302, the computing device 12 extracts a feature map from the learning model of the teacher network based on input data, filters the extracted feature map, and extracts teacher sparse activation maps (sparse activation maps (SAMs) extracted from the learning model of the teacher network).


In an exemplary embodiment, the computing device 12 may extract the feature map using a convolution layer in the learning model of the teacher network based on the input data. In addition, the computing device 12 may extract an activation map from the extracted feature map using an activation function and extract sparse activation maps from the extracted activation map using a filter function.


In this case, the activation function may be a rectified linear unit (ReLU) function. The ReLU function is a function that outputs (filters) 0 if the input value is less than 0, and outputs the input value as is if the input value is greater than 0. In the present disclosure, the activation map, which is a matrix in which values less than 0 are set to 0 (deactivated) and values greater than 0 are left as is (activated), may be obtained from the feature map.


In addition, the filter function is a function obtained by moving and resizing the ReLU function. In the present disclosure, the sparse activation map, which is a matrix in which values (low activation values) smaller than a preset value are filtered out and only values (higher activation values) larger than the preset value are left, may be obtained from the activation map.
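A minimal sketch of these two functions, assuming PyTorch: the activation function is ReLU, and the filter function is modeled here as a thresholded ReLU whose cut-off value `tau` is a hypothetical illustration parameter (the disclosure only states that low activation values are filtered out).

```python
# Minimal sketch (assumes PyTorch). The activation function is ReLU; the
# filter function is modeled as a shifted/thresholded ReLU that zeroes
# activation values at or below `tau` and keeps the higher values.
# `tau` is a hypothetical parameter used only for illustration.
import torch
import torch.nn.functional as F


def activation_fn(feature_map: torch.Tensor) -> torch.Tensor:
    # Values < 0 are deactivated (set to 0); values > 0 pass through unchanged.
    return F.relu(feature_map)


def filter_fn(activation_map: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # Keep only highly activated neurons: values <= tau are filtered out to 0.
    return torch.where(activation_map > tau,
                       activation_map,
                       torch.zeros_like(activation_map))
```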



FIGS. 4A and 4B are diagrams for describing the activation function and the filter function according to an embodiment of the present disclosure, and FIG. 5 is a diagram illustrating a process of extracting the sparse activation maps (SAMs) according to an embodiment of the present disclosure.


Referring to FIGS. 4A and 4B, the activation function (FIG. 4A) is the ReLU function, and the filter function (FIG. 4B) is a function obtained by moving the coordinates of the ReLU function and adjusting its range (passband).


As illustrated in FIG. 5, in the present disclosure, the activation map is extracted by filtering the feature map through the activation function, the extracted activation map is normalized to a range [−ξ, ξ], and the normalized activation map is re-filtered through the activation function, so that the sparse activation maps (SAMs) can be extracted. Here, ξ may be a temperature for adjusting the range of a separating hyperplane. For example, the best performance may be achieved when ξ=20.


That is, it is possible to generate an activation map containing only activated neurons (whose input values are greater than or equal to 0) by filtering out deactivated neurons (whose input values are less than 0) from the feature map through the activation function. In this case, the activation map includes neurons activated with a low value (an activation value lower than the preset value) and neurons activated with a high value (an activation value higher than the preset value). Accordingly, the sparse activation maps containing only the neurons activated with the high value can be generated by filtering out the neurons activated with the low value from the activation map through the filter function. Through this, the nonlinearity of the student network can be improved by transferring highly activated neuron responses.


In this case, the sparse activation maps can be defined as Equation 1 below.












$$S_{i,j,k} = \varphi\left(A_{i,j,k}\right) = \phi\!\left(\frac{A_{i,j,k}}{\xi \times \lVert A \rVert}\right) \Bigg|_{i=0,\, j=0,\, k=0}^{i=C,\, j=H,\, k=W} \qquad [\text{Equation 1}]$$







(where S is the sparse activation maps, A is the activation map, ξ is a range adjustment value, ϕ(⋅) is the activation function, φ(⋅) is the filter function, the subscripts i, j, k index the three-dimensional array (the feature map and the activation map are three-dimensional matrices), W is the width, H is the height, and C is the number of channels)
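The SAM extraction of Equation 1 can be sketched as follows, assuming PyTorch; the norm ‖A‖ is taken here as the maximum absolute activation value and the filter threshold `tau` is a hypothetical parameter, since neither choice is fixed above.

```python
# Minimal sketch of the Equation 1 pipeline (assumes PyTorch). The term ||A||
# is assumed to be the maximum absolute value of the activation map, and
# `tau` is a hypothetical threshold standing in for the filter function.
import torch
import torch.nn.functional as F


def extract_sam(feature_map: torch.Tensor, xi: float = 20.0,
                tau: float = 0.01) -> torch.Tensor:
    # Activation function: keep activated neurons, zero out deactivated ones.
    A = F.relu(feature_map)                               # activation map (C x H x W)
    # Normalize the activation map using the temperature xi and ||A||.
    A_norm = A / (xi * A.abs().amax().clamp_min(1e-12))
    # Re-filter through the activation function, then keep only highly
    # activated neurons in the sparse activation map.
    A_norm = F.relu(A_norm)
    return torch.where(A_norm > tau, A_norm, torch.zeros_like(A_norm))
```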


At step 304, the computing device 12 extracts the feature map from the learning model of the student network based on the input data, filters the extracted feature map, and extracts the student sparse activation maps (sparse activation maps (SAMs) extracted from the learning model of the student network). Here, the process of extracting the student sparse activation maps from the learning model of the student network is the same as the process of extracting the teacher sparse activation maps from the learning model of the teacher network.


At step 306, the computing device 12 computes a loss function (L_res(S^S, S^T)) by comparing the extracted teacher sparse activation maps and the extracted student sparse activation maps. Here, S^T may refer to the sparse activation maps of the learning model of the teacher network, S^S may refer to the sparse activation maps of the learning model of the student network, and L_res(S^S, S^T) may refer to the loss function obtained by comparing the two sparse activation maps.


In an exemplary embodiment, the loss function may be computed by using a distance difference between the sparse activation map of the trained model in the teacher network and the sparse activation map of the trained model in the student network.


The loss function may be calculated by Equation 2 below.












$$\mathcal{L}_{res}\left(\mathcal{S}^{S},\, \mathcal{S}^{T}\right) = \sum_{i=1}^{L} \left\lVert \mathcal{S}_{i}^{S} - \mathcal{S}_{i}^{T} \right\rVert_{2}^{2} \qquad [\text{Equation 2}]$$







(where L_res(⋅) is the loss function, S^S is the sparse activation maps of the learning model of the student network, S^T is the sparse activation maps of the learning model of the teacher network, and L is the number of SAMs)


Here, the loss function calculates the distance using the L2-norm, and can be computed by summing, for i from 1 to L, the squares of the differences (distances) between the teacher sparse activation map and the student sparse activation map.
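A minimal sketch of this summation, assuming PyTorch; the lists are assumed to hold one SAM tensor per block, with the teacher and student SAMs of each block having matching shapes (any projection needed to match shapes is not shown).

```python
# Minimal sketch of the Equation 2 loss (assumes PyTorch): the squared L2
# distance between the student SAM and the teacher SAM is summed over the
# L compared blocks. Matching tensor shapes per block are assumed.
def residual_sam_loss(student_sams, teacher_sams):
    # student_sams / teacher_sams: lists of SAM tensors, one per block (length L).
    return sum(((s_i - t_i) ** 2).sum()
               for s_i, t_i in zip(student_sams, teacher_sams))
```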



FIG. 6 is a diagram illustrating a process of computing a loss function using the sparse activation maps (SAMs) of the learning model extracted from the teacher network and the sparse activation maps of the learning model extracted from the student network according to an embodiment of the present disclosure.


As illustrated in FIG. 6, the loss function of the entire network can be calculated using the loss function obtained by comparing the sparse activation maps for each block (in this case, the number of blocks is L). That is, the computing device 12 may extract the activation maps (A_L^T, A_L^S) from the feature maps (F_L^T, F_L^S) output for each block of the learning models of the teacher network and the student network through the activation function ϕ(⋅), and may extract the sparse activation maps (S_L^T, S_L^S) from the extracted activation maps through the filter function φ(⋅). In addition, the computing device 12 may calculate the loss function L_res(S^S, S^T) by comparing the sparse activation maps S_L^T extracted from the learning model of the teacher network and the sparse activation maps S_L^S extracted from the learning model of the student network.
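The per-block extraction in FIG. 6 could be sketched with forward hooks, assuming PyTorch and the `extract_sam` and `residual_sam_loss` helpers sketched above; treating each Conv2d output as a block's feature map is an assumption made only for illustration.

```python
# Minimal sketch (assumes PyTorch and the helpers sketched above). Forward
# hooks collect one feature map per block; hooking each Conv2d output is an
# illustrative assumption about where the blocks end.
import torch
import torch.nn as nn


def collect_block_feature_maps(model: nn.Module, x: torch.Tensor):
    feats, handles = [], []
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            handles.append(module.register_forward_hook(
                lambda mod, inp, out: feats.append(out)))
    model(x)
    for h in handles:
        h.remove()
    return feats  # [F_1, ..., F_L] for this network
```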


At step 308, the computing device 12 updates the learning model of the student network based on the computed loss function.


That is, the computing device 12 can update the learning model of the student network with weights that minimize the computed loss function. In this way, training can be repeated until the learning model of the student network has the same neuron activation boundaries as the learning model of the teacher network.
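A minimal training-step sketch under the same assumptions as the helpers above (PyTorch, hooked Conv2d blocks, matching block shapes): the teacher is kept frozen and the student's weights are updated in the direction that minimizes the computed loss.

```python
# Minimal sketch of step 308 (assumes PyTorch and the helpers sketched above):
# the student's weights are updated to minimize the SAM transfer loss.
import torch


def train_step(teacher, student, optimizer, x, xi: float = 20.0):
    teacher.eval()
    with torch.no_grad():
        teacher_feats = collect_block_feature_maps(teacher, x)
    student_feats = collect_block_feature_maps(student, x)

    teacher_sams = [extract_sam(f, xi) for f in teacher_feats]
    student_sams = [extract_sam(f, xi) for f in student_feats]

    loss = residual_sam_loss(student_sams, teacher_sams)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```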


Therefore, the method for learning activated neurons responses transfer using SAMs in knowledge distillation according to an embodiment of the present disclosure can improve the nonlinearity of the student network, enhance its generalization ability, and generate a learning model with performance close to that of the teacher network, through the learning method for transferring the activated neuron responses.


Meanwhile, in the method for learning activated neurons responses transfer using sparse activation maps in knowledge distillation according to an embodiment of the present disclosure, the learning model of the student network can be trained using not only the input data but also augmented data. Here, the augmented data can be generated by transforming the input data. In addition, the method for learning activated neurons responses transfer using sparse activation maps in knowledge distillation may further include a teacher assistant network and a student assistant network for learning the augmented data.
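A minimal sketch of generating augmented data by transforming input data, assuming torchvision; the specific transforms and their parameters are hypothetical illustrations, not choices from the disclosure.

```python
# Minimal sketch (assumes torchvision): augmented data produced by transforming
# the input data. The transforms and their parameters are illustrative assumptions.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])
# augmented_image = augment(input_image)  # fed to the assistant networks
```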


The computing device 12 can train the learning model of the teacher network based on the input data and train the learning model of the teacher assistant network based on the augmented data.


In addition, the computing device 12 can train the learning model of the student network based on the input data and train the learning model of the student assistant network based on the augmented data.


In addition, the computing device 12 can update the student network using the loss of knowledge distillation between the teacher network and the student network, the cross-entropy loss between predicted values and actual values, the loss of knowledge distillation between the teacher assistant network and the student assistant network, and the loss between the sparse activation maps of the teacher network and the sparse activation maps of the student network.


That is, the overall loss function can be calculated by Equation 3 below.











$$\mathcal{L}_{\mathrm{NeuRes}} = \beta_{1}\,\mathcal{L}_{kd} + \beta_{2}\,\mathcal{L}_{ce} + \mathcal{L}_{ax\text{-}st} + \mathcal{L}_{res} \qquad [\text{Equation 3}]$$







Here, L_kd is the loss of knowledge distillation between the teacher network and the student network, L_ce is the cross-entropy loss between predicted and actual values, L_ax-st is the loss of knowledge distillation between the teacher assistant network and the student assistant network, L_res is the loss between the sparse activation maps of the teacher network and the sparse activation maps of the student network, and β1 and β2 are balancing hyperparameters.
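Assuming the four loss terms are already computed as scalar tensors, the combination in Equation 3 reduces to the following sketch; the default values of β1 and β2 shown here are placeholders, not values from the disclosure.

```python
# Minimal sketch of Equation 3 (assumes the four loss terms are PyTorch scalars).
# beta1 and beta2 defaults are placeholders, not values from the disclosure.
def neures_loss(l_kd, l_ce, l_ax_st, l_res,
                beta1: float = 1.0, beta2: float = 1.0):
    return beta1 * l_kd + beta2 * l_ce + l_ax_st + l_res
```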


According to embodiments of the present disclosure, through the learning method for transferring activated neuron responses, it is possible to improve the nonlinearity of the student network, enhance generalization ability of the student network, and generate a learning model with performance close to that of the teacher network.


Although representative embodiments of the present disclosure have been described in detail, a person skilled in the art to which the present disclosure pertains will understand that various modifications may be made thereto within the limits that do not depart from the scope of the present disclosure. Therefore, the scope of rights of the present disclosure should not be limited to the described embodiments, but should be defined not only by claims set forth below but also by equivalents to the claims.

Claims
  • 1. A method for learning activated neurons responses transfer using sparse activation maps (SAMs) in knowledge distillation performed on a computing device including one or more processors and a memory that stores one or more programs executed by the one or more processors, the method comprising: extracting teacher sparse activation maps (SAMs) by extracting a feature map from a learning model of the teacher network based on input data and filtering the extracted feature map;extracting student sparse activation maps (SAMs) by extracting a feature map from a learning model of the student network based on the input data and filtering the extracted feature map;computing a loss function by comparing the extracted teacher sparse activation maps with the extracted student sparse activation maps; andupdating the learning model of the student network based on the computed loss function.
  • 2. The method of claim 1, wherein the extracting of the teacher sparse activation maps further comprises: extracting the feature map using a convolution layer from the learning model of the teacher network based on the input data;extracting an activation map by filtering the extracted feature map using an activation function; andextracting the teacher sparse activation maps by filtering the extracted activation map using a filter function.
  • 3. The method of claim 2, wherein the extracting of the teacher sparse activation maps further comprises: normalizing the activation map extracted from the learning model of the teacher network to a preset range; andre-filtering the normalized activation map through the activation function.
  • 4. The method of claim 1, wherein the extracting of the student sparse activation maps further comprises: extracting the feature map using a convolution layer in the learning model of the student network based on the input data;extracting an activation map by filtering the extracted feature map using an activation function; andextracting the student sparse activation map by filtering the extracted activation map using a filter function.
  • 5. The method of claim 4, wherein the extracting of the student sparse activation maps further comprises: normalizing the activation map extracted from the learning model of the student network to a preset range; andre-filtering the normalized activation map through the activation function.
  • 6. The method of claim 3, wherein the activation function is a rectified linear unit (ReLU) function, and the filter function may be a function obtained by moving the coordinates of the ReLU function and adjusting a passband.
  • 7. The method of claim 1, wherein the loss function is a loss function obtained by comparing losses due to differences in distance between the extracted teacher sparse activation map and the extracted student sparse activation map.
  • 8. The method of claim 7, wherein the loss function is calculated by Equation 2: $\mathcal{L}_{res}(\mathcal{S}^{S}, \mathcal{S}^{T}) = \sum_{i=1}^{L} \lVert \mathcal{S}_{i}^{S} - \mathcal{S}_{i}^{T} \rVert_{2}^{2}$.
  • 9. A computing device comprising: one or more processors;a memory; andone or more programs stored in the memory, the one or more programs configured to be executed by the one or more processors, the one or more programs include:an instruction for extracting teacher sparse activation maps (SAMs) by extracting a feature map from a learning model of the teacher network based on input data and filtering the extracted feature map;an instruction for extracting student sparse activation maps (SAMs) by extracting a feature map from a learning model of the student network based on the input data and filtering the extracted feature map;an instruction for computing a loss function by comparing the extracted teacher sparse activation maps with the extracted student sparse activation maps; andan instruction for updating the learning model of the student network based on the computed loss function.
  • 10. The computing device of claim 9, wherein the instruction for extracting of the teacher sparse activation maps further comprises: an instruction for extracting the feature map using a convolution layer from the learning model of the teacher network based on the input data;an instruction for extracting an activation map by filtering the extracted feature map using an activation function; andan instruction for extracting the teacher sparse activation maps by filtering the extracted activation map using a filter function.
  • 11. The computing device of claim 10, wherein the instruction for extracting of the teacher sparse activation maps further comprises: an instruction for normalizing the activation map extracted from the learning model of the teacher network to a preset range; andan instruction for re-filtering the normalized activation map through the activation function.
  • 12. The computing device of claim 9, wherein the instruction for extracting of the student sparse activation maps further comprises: an instruction for extracting the feature map using a convolution layer in the learning model of the student network based on the input data;an instruction for extracting an activation map by filtering the extracted feature map using an activation function; andan instruction for extracting the student sparse activation map by filtering the extracted activation map using a filter function.
  • 13. The computing device of claim 12, wherein the instruction for extracting of the student sparse activation maps further comprises: an instruction for normalizing the activation map extracted from the learning model of the student network to a preset range; andan instruction for re-filtering the normalized activation map through the activation function.
  • 14. The computing device of claim 11, wherein the activation function is a rectified linear unit (ReLU) function; and the filter function may be a function obtained by moving the coordinates of the ReLU function and adjusting a passband.
  • 15. The computing device of claim 9, wherein the loss function is a loss function obtained by comparing losses due to differences in distance loss between the extracted teacher sparse activation map and the extracted student sparse activation map.
  • 16. The computing device of claim 15, wherein the loss function is calculated by Equation 2: $\mathcal{L}_{res}(\mathcal{S}^{S}, \mathcal{S}^{T}) = \sum_{i=1}^{L} \lVert \mathcal{S}_{i}^{S} - \mathcal{S}_{i}^{T} \rVert_{2}^{2}$.
Priority Claims (2)
Number Date Country Kind
10-2023-0004776 Jan 2023 KR national
10-2023-0039977 Mar 2023 KR national