The present invention relates generally to models of neural networks, for implementing a machine-learning based function. More specifically, the present invention relates to training machine-learning models, and/or implementing a machine-learning based function on machine-learning models.
Training of neural networks is a computationally intensive task. The significance of understanding and modelling the training dynamics is growing as increasingly larger networks are being trained.
A neural network (NN) or an artificial neural network (ANN), e.g., a neural network implementing a machine learning (ML) or artificial intelligence (AI) function, may refer to an information processing paradigm that may include nodes, referred to as neurons, organized into layers, with links between the neurons. The links may transfer signals between neurons and may be associated with weights. A NN may be configured or trained for a specific task, e.g., pattern recognition or classification. Training a NN for the specific task may involve adjusting these weights based on examples. Each neuron of an intermediate or last layer may receive an input signal, e.g., a weighted sum of output signals from other neurons, and may process the input signal using a linear or nonlinear function (e.g., an activation function). The results of the input and intermediate layers may be transferred to other neurons, and the results of the output layer may be provided as the output of the NN. Typically, the neurons and links within a NN are represented by mathematical constructs, such as activation functions and matrices of data elements and weights. A processor, e.g., a central processing unit (CPU) or a graphics processing unit (GPU), or a dedicated hardware device, may perform the relevant calculations.
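As a minimal illustration of the neuron computation described above (a weighted sum of input signals passed through an activation function), consider the following sketch; the values and the choice of ReLU as the activation are illustrative only:

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """Weighted sum of input signals followed by a nonlinear activation (ReLU)."""
    pre_activation = np.dot(weights, inputs) + bias
    return np.maximum(pre_activation, 0.0)  # ReLU activation

# Hypothetical example: three input signals feeding a single neuron.
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.3, 0.8, -0.1])
print(neuron_output(x, w, bias=0.05))
```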
Embodiments of the invention may include an algorithm and model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality. This algorithm and model may be referred to herein as Correlation Mode Decomposition (CMD). The algorithm is adapted to split the parameter space into groups of parameters, also referred to herein as “modes”, which behave in a highly correlated manner through the training epochs. The inventors have achieved a remarkable dimensionality reduction with this approach, where a network of 11M parameters, such as a ResNet-18 network, can be modelled well using just a few modes. The inventors have observed that the typical time profile of each mode is spread throughout the network, across all layers. Moreover, retraining the network using the dimensionality-reduced model of the present invention may induce a regularization which may yield better generalization capacity on the test set. Such a representation can facilitate better future training acceleration techniques.
The inventors have observed that while the network parameters may behave non-smoothly in the training process, many of them are highly correlated and can be grouped into “modes”, characterized by their correlation to one common evolution profile. The present invention may thus include an algorithm, referred to herein as “Correlation Mode Decomposition” (CMD).
The CMD algorithm may model the network's dynamics in an efficient way in terms of dimensionality and computation time, facilitating significant reduction of dimensionality.
Experimental results have shown applicability of this approach to several popular architectures in computer vision (e.g., ResNet18). However, it may be appreciated by a person skilled in the art that application of the CMD algorithm should not be limited in any way to any specific NN architecture or ML application.
Embodiments of the invention may include analysis of time-profiles, which in the neural-network setting is equivalent to examining the behavior of the network parameters, as they evolve through epochs of gradient descent.
Previous studies have shown that time-profiles and correlation analysis are beneficial in modeling nonlinear physical phenomena. These studies aimed to decompose the dynamics into orthogonal components, both in space and in time. Imposing orthogonality in space and time, however, may be too strong of a constraint, leading to a limited solution space.
More recent studies in variational image-processing have shown that gradient descent with respect to homogeneous functionals (of various degrees) induces typical time profiles. For instance, total-variation flow can be modelled by piecewise linear time profiles. These profiles stem from the behavior of basic elements with respect to the gradient operator, referred to as nonlinear eigenfunctions. The theory developed there shows that the time profiles are generally not orthogonal. Orthogonality was shown in certain settings for the spatial structures (“spectral components”).
As elaborated herein, embodiments of the invention may generalize these concepts for the neural network case. A principal difference in the modelling is that unlike the variational case, here there is no guaranteed homogeneity, and the system is too complex to be modelled analytically. Embodiments of the invention may thus resort to data-driven time profiles, which change with network architectures and learning tasks.
Embodiments of the invention may include a method of training a NN model, by at least one processor. Embodiments of the method may include providing a NN model that includes a plurality of NN parameters, and training the NN model based on a training dataset over a plurality of training epochs, to implement a predefined ML function.
According to some embodiments, one or more (e.g., each) training epoch may include adjusting a value of at least one NN parameter based on gradient descent calculation; calculating a profile vector, representing evolution of the at least one NN parameter through the plurality of training epochs; calculating an approximated value of the at least one NN parameter, based on the profile vector; and replacing the at least one NN parameter value with the approximated value, to obtain an approximated version of the NN model.
Embodiments of the invention may include a system for implementing a machine-learning (ML)-based function. Embodiments of the system may include: a non-transitory memory device, where modules of instruction code are stored, and at least one processor associated with the memory device and configured to execute the modules of instruction code. Upon execution of said modules of instruction code, the at least one processor may be configured to provide a NN model that includes a plurality of NN parameters, and train the NN model based on a training dataset over a plurality of training epochs, to implement a predefined ML function. One or more (e.g., each) training epoch may include adjusting a value of at least one NN parameter based on gradient descent calculation; calculating a profile vector, representing evolution of the at least one NN parameter through the plurality of training epochs; calculating an approximated value of the at least one NN parameter, based on the profile vector; and replacing the at least one NN parameter value with the approximated value, to obtain an approximated version of the NN model.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
One skilled in the art will realize the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing embodiments are therefore to be considered in all respects illustrative rather than limiting of the invention described herein. Scope of the invention is thus indicated by the appended claims, rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention. Some features or elements described with respect to one embodiment may be combined with features or elements described with respect to other embodiments. For the sake of clarity, discussion of same or similar features or elements may not be repeated.
Although embodiments of the invention are not limited in this regard, discussions utilizing terms such as, for example, “processing,” “computing,” “calculating,” “determining,” “establishing”, “analyzing”, “checking”, or the like, may refer to operation(s) and/or process(es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that may store instructions to perform operations and/or processes.
Although embodiments of the invention are not limited in this regard, the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more”. The terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. The term “set” when used herein may include one or more items.
Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
Reference is now made to
Computing device 1 may include a processor or controller 2 that may be, for example, a central processing unit (CPU) processor, a chip or any suitable computing or computational device, an operating system 3, a memory 4, executable code 5, a storage system 6, input devices 7 and output devices 8. Processor 2 (or one or more controllers or processors, possibly across multiple units or devices) may be configured to carry out methods described herein, and/or to execute or act as the various modules, units, etc. More than one computing device 1 may be included in, and one or more computing devices 1 may act as the components of, a system according to embodiments of the invention.
Operating system 3 may be or may include any code segment (e.g., one similar to executable code 5 described herein) designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 1, for example, scheduling execution of software programs or tasks or enabling software programs or other modules or units to communicate. Operating system 3 may be a commercial operating system. It will be noted that an operating system 3 may be an optional component, e.g., in some embodiments, a system may include a computing device that does not require or include an operating system 3.
Memory 4 may be or may include, for example, a Random-Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Memory 4 may be or may include a plurality of possibly different memory units. Memory 4 may be a computer or processor non-transitory readable medium, or a computer non-transitory storage medium, e.g., a RAM. In one embodiment, a non-transitory storage medium such as memory 4, a hard disk drive, another storage device, etc. may store instructions or code which when executed by a processor may cause the processor to carry out methods as described herein.
Executable code 5 may be any executable code, e.g., an application, a program, a process, task, or script. Executable code 5 may be executed by processor or controller 2 possibly under control of operating system 3. For example, executable code 5 may be an application that may train an ML model, and/or implement a ML-based function as further described herein. Although, for the sake of clarity, a single item of executable code 5 is shown in
Storage system 6 may be or may include, for example, a flash memory as known in the art, a memory that is internal to, or embedded in, a micro controller or chip as known in the art, a hard disk drive, a CD-Recordable (CD-R) drive, a Blu-ray disk (BD), a universal serial bus (USB) device or other suitable removable and/or fixed storage unit. Data pertaining to a ML model to be trained may be stored in storage system 6, and may be loaded from storage system 6 into memory 4 where it may be processed by processor or controller 2. In some embodiments, some of the components shown in
Input devices 7 may be or may include any suitable input devices, components, or systems, e.g., a detachable keyboard or keypad, a mouse and the like. Output devices 8 may include one or more (possibly detachable) displays or monitors, speakers and/or any other suitable output devices. Any applicable input/output (I/O) devices may be connected to Computing device 1 as shown by blocks 7 and 8. For example, a wired or wireless network interface card (NIC), a universal serial bus (USB) device or external hard drive may be included in input devices 7 and/or output devices 8. It will be recognized that any suitable number of input devices 7 and output device 8 may be operatively connected to Computing device 1 as shown by blocks 7 and 8.
A system according to some embodiments of the invention may include components such as, but not limited to, a plurality of central processing units (CPU) or any other suitable multi-purpose or specific processors or controllers (e.g., similar to element 2), a plurality of input units, a plurality of output units, a plurality of memory units, and a plurality of storage units.
Reference is now made to
As shown in
As shown in
As elaborated herein, system 10 may produce an approximated version 200 of ML model 100. The terms ML model 200 and NN model 200 may thus also be used herein interchangeably.
According to some embodiments, system 10 may include, or may be associated with a training module 110. Training module 110 may be configured to train NN model 100 to implement an underlying ML-based function (e.g., a Natural Language Processing (NLP) function, an image analysis function, and the like), based on a plurality of data samples 20 (e.g., training data samples 20A). The training process may, for example, be a supervised training algorithm, that may employ Stochastic Gradient Descent (SGD) to modify NN parameters 100P over a plurality of training epochs, as known in the art.
According to some embodiments, system 10 may analyze evolution of NN parameters 100P during the training process, and subsequently produce an approximated version, or approximated model 200 of the trained NN model 100, based on the analysis. System 10 may subsequently implement the ML-based function by inferring approximated model 200 on data samples 20 (e.g., test data samples 20B).
Embodiments of the invention may include a practical application by improving functionality of a computing system: As elaborated herein, by inferring the approximated version of NN model 200 on incoming data samples 20 (e.g., rather than inferring the trained NN model 100 on data samples 20), system 10 may improve implementation of the underlying ML function. This improvement is manifested, for example by improved metrics of accuracy, as elaborated herein.
According to some embodiments, system 10 may include a monitoring module 120, configured to monitor evolution of one or more (e.g., each) NN parameter 100P during the training process (e.g., over the plurality of training epochs). For example, monitoring module 120 may calculate, for one or more NN parameters 100P of the plurality of NN parameters 100P a profile vector 120PV that represents evolution or change of a value of the NN parameter over time (e.g., throughout the plurality of training epochs).
For example, a specific profile vector 120PV may be, or may include a vector of numerical values, representing values of a specific, respective NN parameter 100P at different points in time during the training process, e.g., following each training epoch.
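For illustration, profile vectors 120PV could be accumulated during training roughly as in the following sketch. It assumes a PyTorch model and an externally supplied train_one_epoch callback; both are assumptions for the example, not requirements of the embodiments described herein:

```python
import torch
from torch.nn.utils import parameters_to_vector

def record_profiles(model, train_one_epoch, num_epochs):
    """Train for num_epochs and return a (num_params x num_epochs) matrix whose
    rows are the profile vectors 120PV (one value per epoch per parameter)."""
    snapshots = []
    for epoch in range(num_epochs):
        train_one_epoch(model)   # one pass of (stochastic) gradient descent; supplied by the caller
        with torch.no_grad():
            snapshots.append(parameters_to_vector(model.parameters()).clone())
    # Column k holds all parameter values after epoch k; row i is the profile of parameter i.
    return torch.stack(snapshots, dim=1)
```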
According to some embodiments, system 10 may include a clustering module 130, configured to analyze profile vectors 120PV, to determine disjoint sets of NN parameters 100P.
Clustering module 130 may thus group, or cluster the NN parameters 100P such that each group or cluster may have, or may be characterized by a different prototypical profile vector 130PPV. These groups may be referred to herein interchangeably as “clusters” or “modes” 130M. The prototypical profile vectors 130PPV of each cluster 130M may be calculated as best representing the plurality of profile vectors 120PV of member parameters 100P according to a predetermined distance metric 130DM. For example, a prototypical profile vector 130PPV of a mode 130M may include point-mean values of corresponding entries of member profile vectors 120PV.
According to some embodiments, clustering module 130 may group or cluster the plurality of NN parameters 100P into a plurality of modes 130M, based on their respective profile vectors 120PV.
For example, clustering module 130 may calculate a distance metric (e.g., a Euclidean distance) 130DM between pairs of profile vectors 120PV, each representing a specific NN parameter 100P. Clustering module 130 may subsequently cluster the NN parameters 100P into multidimensional clusters, or modes 130M, based on the calculated distance metric 130DM.
As elaborated herein, each NN parameter 100P may correspond to, or be represented by, a specific prototypical profile vector 130PPV. Therefore, each cluster or mode 130M may be regarded herein as clustering both its member NN parameters 100P and their respective member profile vectors 120PV.
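The sketch below illustrates, under simple assumptions, how a prototypical profile vector 130PPV could be computed as the point-wise mean of member profile vectors, and how a Euclidean distance could serve as the distance metric 130DM; both choices are examples rather than requirements:

```python
import numpy as np

def prototypical_profiles(profiles, labels):
    """Given profile vectors 120PV (rows of `profiles`) and a mode label per parameter,
    return the prototypical profile vector 130PPV of each mode 130M as the point-wise
    (entry-wise) mean of its member profile vectors."""
    profiles, labels = np.asarray(profiles), np.asarray(labels)
    return {m: profiles[labels == m].mean(axis=0) for m in np.unique(labels)}

def profile_distance(u, v):
    """One possible distance metric 130DM between two profile vectors:
    the Euclidean distance along the epoch axis."""
    return np.linalg.norm(np.asarray(u) - np.asarray(v))
```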
For example, NN model 100 may implement a binary classification function between images of cats and dogs. In this non-limiting example, NN model 100 may be a Convolutional Neural Network (CNN) model, as known in the art. The inventors have experimentally implemented this CNN model 100 by a model referred to herein as “SimpleNet2”. The SimpleNet2 model used in this example was a NN model that included several convolution layers, followed by max-pooling layers, fully-connected (FC) layers, and Rectified Linear Unit (ReLU) activation layers, culminating in a total of 94,000 NN parameters 100P.
The term NN parameter 100P may be used herein to refer to any element of a NN model that may be adapted during training to perform an underlying function. For example, NN parameters of SimpleNet2 may include NN weight values that may be changed during a training stage by an SGD algorithm, to facilitate the underlying function of classifying images of cats and dogs.
Reference is also made to
When examining the evolution of NN parameters 100P (e.g., NN weights) during the training process, it has been observed that the general characteristics of the profile vectors 120PV are very similar throughout NN model 100. Following normalization of the mean, variance, and sign, there are essentially very few characteristic profile vectors 120PV which represent the entire dynamics. Moreover, these profiles 120PV are spread throughout the NN model 100, and can be extracted by uniform sampling of a small subset of the entire network parameters 100P. To illustrate this, the inventors have sampled 1000 weights of the NN parameters 100P (e.g., approximately 1% of SimpleNet2's NN parameters) and clustered them into 3 main modes, based on the correlations between the weights in this subset.
Panel (c) of
As known in the art, Principal Component Analysis (PCA) is a statistical technique for reducing the dimensionality of a dataset, thereby increasing interpretability of data while preserving the maximum amount of information, and enabling visualization of multidimensional data. Panel (a) of
As known in the art, t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. Panel (b) of
The PCA representation of panel (a) shows clear separation of the modes 130M among the sampled parameters 100P. The t-SNE plot of panel (b) shows that the two smaller modes (M1, M2) are represented as concentrated clusters, whereas the main mode (M0) is more spread out.
According to some embodiments, clustering module 130 may be configured to group or cluster NN parameters 100P into clusters or modes 130M based on a metric of correlation between NN parameters 100P.
For example, for one or more pairs of NN parameters 100P, clustering module 130 may calculate a distance metric 130DM such as a correlation value, representing correlation between (i) a profile vector 120PV of a first NN parameter 100P of the pair and (ii) a profile vector 120PV of a second NN parameter 100P of the pair. Clustering module 130 may subsequently group at least a portion of the plurality of NN parameters 100P as members of the plurality of modes or clusters 130M, based on the calculated correlation values 130DM. For example, clustering module 130 may assign NN parameters 100P corresponding to respective, highly correlated profile vectors 120PV as members of the same mode or cluster 130M.
Reference is now made to
In the example of
The inventors have hypothesized that the dynamics of parameters of NN model 100 can be clustered into very few, highly correlated groups or modes 130M.
Let N be the number of network parameters 100P, and M be the number of modes or clusters 130M, also denoted {C_1, …, C_M}. It may be appreciated that the number of clusters 130M, M, may be much smaller than the number of network parameters 100P, N, e.g., by several orders of magnitude. For example, N may be in the order of thousands, whereas M may be in the order of a dozen.
A correlation between two profile vectors 120PV may be defined according to equation Eq. 1A, below:

corr(u, v) = ⟨ū, v̄⟩ / (‖ū‖·‖v̄‖)   Eq. 1A

where u and v represent the profile vectors 120PV, ‖·‖ denotes the Euclidean norm, and ū denotes u with its mean over the training epochs removed, as in Eq. 1B below:

ū = u − (1/(T+1))·Σ_{k=0}^{T} u_k   Eq. 1B

In Eq. 1B, T is the number of training epochs, and the entries of the profile vectors 120PV are indexed by epoch, k = 0, …, T.

In Eq. 1A, ⟨⋅,⋅⟩ represents the Euclidean inner product over the indices of profile vectors 120PV (e.g., the epoch axis), as in Eq. 1C below:

⟨u, v⟩ = Σ_{k=0}^{T} u_k·v_k   Eq. 1C
A small threshold parameter ϵ, 0 < ϵ << 1 (e.g., ϵ = 0.01), may be defined. Based on these definitions, a pair of profile vectors 120PV (denoted w_i, w_j) may be determined as correlated, and thereby clustered as members of a mode 130M, when Eq. 1D below is satisfied:

|corr(w_i, w_j)| ≥ 1 − ϵ, ∀ w_i, w_j ∈ C_m, m = 1, …, M.   Eq. 1D
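For concreteness, Eqs. 1A-1D may be rendered numerically as in the following sketch (assuming the profile vectors are NumPy arrays of equal length; the function names are illustrative):

```python
import numpy as np

def corr(u, v):
    """Correlation between two profile vectors per Eqs. 1A-1C: remove the mean of
    each vector, then take the normalized Euclidean inner product."""
    u_bar = u - u.mean()
    v_bar = v - v.mean()
    return np.dot(u_bar, v_bar) / (np.linalg.norm(u_bar) * np.linalg.norm(v_bar))

def same_mode(w_i, w_j, eps=0.01):
    """Eq. 1D: two parameters may belong to the same mode when |corr| >= 1 - eps."""
    return abs(corr(w_i, w_j)) >= 1.0 - eps
```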
Thus, any two parameters 100P (e.g., w^1, w^2) of a specific mode m, denoted w_m^1 and w_m^2, which are perfectly correlated (or anti-correlated), yielding |corr(w_m^1, w_m^2)| = 1, can be expressed as an affine transformation of each other as in Eq. 2A, below:

w_m^1 = a·w_m^2 + b, where a, b ∈ ℝ.   Eq. 2A
This leads to the approximation of the dynamics, as elaborated in equation Eq. 2B, below:
w_m^i ≈ a_i·w_m^r + b_i, ∀ w^i, w^r ∈ C_m, m ∈ {1, …, M}   Eq. 2B
In Eq. 2B, w_m^r may represent a reference NN parameter (e.g., weight) 140RP corresponding to a reference profile vector 140RV in the m-th cluster or mode 130M. Additionally, a_i and b_i may represent affine coefficients 140AC corresponding to the i-th NN parameter (e.g., weight) 100P, or to the i-th profile vector 120PV, in the m-th cluster or mode 130M. Additionally, w_m^i may represent a reconstructed version, or approximated version 200P, of the i-th NN parameter 100P (e.g., weight 100P) of NN model 100.
In other words, as shown by Eq. 2B, system 10 may represent, or reconstruct an approximation of one or more (e.g., each) NN parameter or weight 100P w_m^i, based on (i) a mode-specific reference NN parameter 100P w_m^r and (ii) affine coefficients 140AC a_i, b_i that are specific to the i-th NN parameter.
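A minimal sketch of this reconstruction step, assuming the reference profile and the per-parameter affine coefficients 140AC are already available (all names are illustrative):

```python
import numpy as np

def reconstruct_parameter(a_i, b_i, w_ref):
    """Eq. 2B for a single parameter: approximate parameter i of mode m from the
    mode's reference profile w_ref and the parameter-specific coefficients a_i, b_i."""
    return a_i * np.asarray(w_ref) + b_i

def reconstruct_mode(a, b, w_ref):
    """Vectorized form: a, b are arrays of shape (|C_m|,); w_ref has shape (T+1,).
    Returns a (|C_m|, T+1) matrix of approximated profile vectors."""
    return np.outer(a, w_ref) + np.asarray(b)[:, None]
```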
Embodiments of the invention may include several options for choosing the number of modes M. For example, clustering module 130 may find a minimal threshold such that the cophenetic distance between any two original observations in the same cluster does not exceed that threshold, while forming no more than M clusters.
In another example, clustering module 130 may form clusters such that the original observations in each cluster have a cophenetic distance no greater than a desired threshold.
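Both options map naturally onto standard hierarchical-clustering utilities. The sketch below uses SciPy as one possible illustration of the two stopping criteria; the sampled profiles and threshold values are placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
sampled_profiles = rng.standard_normal((1000, 50))   # placeholder: 1000 sampled profile vectors, 50 epochs
M, distance_threshold = 10, 0.02                      # illustrative values

# Agglomerative clustering of profile vectors with a correlation-based distance.
Z = linkage(sampled_profiles, method='average', metric='correlation')

# Option 1: form no more than M clusters (the cut threshold is chosen implicitly).
labels_by_count = fcluster(Z, t=M, criterion='maxclust')

# Option 2: cut so that the cophenetic distance within each cluster does not
# exceed the desired threshold.
labels_by_distance = fcluster(Z, t=distance_threshold, criterion='distance')
```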
For example, to find affine coefficients 140AC a and b for all member parameters of a mode, embodiments of the invention may solve the least-squares problem of equation Eq. 3, below:

min_{A,B} ‖W_m − A·w_{r,m} − B·1‖_F   Eq. 3

where W_m ∈ ℝ^{|C_m|×(T+1)} is a matrix whose rows are the profile vectors 120PV of the member parameters of mode C_m, w_{r,m} ∈ ℝ^{T+1} is the reference profile vector of mode m (treated as a row vector), 1 is a row vector of ones, and A, B ∈ ℝ^{|C_m|} are column vectors holding the per-parameter affine coefficients a_i and b_i.

By defining the matrix Ã 140AC as [A B], and w̃_{r,m} as the 2×(T+1) matrix obtained by stacking w_{r,m} on top of the row vector of ones, the relation of Eq. 4 may be achieved:

Ã = argmin_Ã ‖W_m − Ã·w̃_{r,m}‖_F   Eq. 4

where ‖·‖_F denotes the Frobenius norm. This yields the solution of:

Ã = W_m·w̃_{r,m}^T·(w̃_{r,m}·w̃_{r,m}^T)^{−1}   Eq. 5
thereby calculating affine coefficients 140AC.
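A possible numerical rendering of Eqs. 4-5 is sketched below, assuming the profile vectors of one mode are stacked as rows of a matrix (the function and variable names are illustrative):

```python
import numpy as np

def fit_affine_coefficients(W_m, w_ref):
    """Solve min_{A,B} ||W_m - A*w_ref - B*1||_F via the closed form of Eq. 5.
    W_m: (|C_m|, T+1) array of profile vectors of the mode; w_ref: (T+1,) reference profile.
    Returns a, b of shape (|C_m|,) such that W_m ≈ outer(a, w_ref) + b[:, None]."""
    w_tilde = np.vstack([w_ref, np.ones_like(w_ref)])          # shape (2, T+1)
    # Eq. 5: A_tilde = W_m @ w_tilde.T @ inv(w_tilde @ w_tilde.T)
    A_tilde = W_m @ w_tilde.T @ np.linalg.inv(w_tilde @ w_tilde.T)
    return A_tilde[:, 0], A_tilde[:, 1]                        # columns: a (slope), b (offset)
```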
Reference is now made to
As shown in
As elaborated herein, reference NN parameter 140RP may, for example, be an NN parameter 100P that corresponds to a central member reference profile vector 140RV. The term “central” may be used in this context to indicate a specific member NN parameter 100P that is located nearest the center of the multidimensional space defined by the respective cluster 130M. Alternatively, the central member NN parameter 100P may be defined as one having a minimal distance metric value from all other member NN parameters 100P of that cluster 130M (e.g., as shown in
Additionally, analysis module 140 may calculate specific affine coefficients 140AC (e.g., ai, bi of Eq. 2B) for one or more (e.g., each) member (e.g., ith member) NN parameter 100P.
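One possible way to select such a central member (a medoid of the mode) is sketched below; the use of Euclidean distance is an illustrative assumption:

```python
import numpy as np

def select_reference_profile(mode_profiles):
    """mode_profiles: (|C_m|, T+1) profile vectors of one mode.
    Returns the index of the member with minimal summed Euclidean distance
    to all other members of the mode (the 'central' member)."""
    diffs = mode_profiles[:, None, :] - mode_profiles[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)     # pairwise distance matrix
    return int(np.argmin(dists.sum(axis=1)))
```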
Reference is also made to
Estimating the correlation between N variables typically requires on the order of N² computations (every variable with every other variable). This can be problematic for large networks, where N can be in the order of millions, or even billions. However, as summarized in Eqs. 1A-1D and Eqs. 2A-2B, embodiments of the invention (e.g., clustering module 130) may perform clustering of NN parameters 100P without computing the full correlation matrix. Instead, clustering module 130 may perform clustering of NN parameters 100P with computational complexity in the order of N·M·T, where M is the number of modes and T is the number of epochs.
For example, instead of computing the entire correlation matrix, clustering module 130 may compute the correlations only between the network weights 100P and the reference weights 140RP of each mode, which were found earlier in a sampling phase. The estimation procedure is described in Algorithm 1. The complexity is approximated as K²·T + (N−K)·M·T ≈ N·M·T, where the number of sampled parameters K can be in the order of 30×M to provide sufficient statistics. In their experiments, the inventors have used the value of K=1000, under the assumption that the K sampled weights may represent all essential modes 130M of NN model 100.
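Algorithm 1 itself is not reproduced here; the following sketch only illustrates the general two-phase idea described above, i.e., assigning each parameter to a mode by its correlation with the M reference profiles rather than computing a full N×N correlation matrix (helper names are illustrative):

```python
import numpy as np

def corr_matrix(X, Y):
    """Correlation (per Eq. 1A) between every row of X and every row of Y."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)
    Yc /= np.linalg.norm(Yc, axis=1, keepdims=True)
    return Xc @ Yc.T

def assign_modes(all_profiles, ref_profiles):
    """Phase 2: assign each of the N parameters to the mode whose reference
    profile it is most correlated with -- roughly O(N*M*T) instead of O(N^2*T).
    all_profiles: (N, T+1); ref_profiles: (M, T+1), found earlier from K sampled weights."""
    c = corr_matrix(all_profiles, ref_profiles)   # shape (N, M)
    return np.argmax(np.abs(c), axis=1)           # mode index per parameter
```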
According to some embodiments, and as elaborated herein (e.g., in relation to Eqs. 2A-2B), system 10 may calculate an approximated value 200P of at least one NN parameter 100P based on the grouping of NN parameters 100P into modes.
In other words, system 10 may represent, or reconstruct an approximation of one or more (e.g., each) NN parameter or weight 100P w_m^i, based on (i) a mode-specific reference NN parameter 100P w_m^r and (ii) parameter-specific affine coefficients 140AC (a_i, b_i) of each member NN parameter 100P of that mode.
Additionally, or alternatively, for at least one (e.g., each) NN parameter 100P w_m^i of the plurality of NN parameters 100P, system 10 may calculate an approximated value 200P of the at least one NN parameter w_m^i, based on the corresponding profile vector 120PV.
In other words, for at least one (e.g., each) mode 130M, analysis module 140 may select a first NN parameter 100P, associated with the at least one mode, as a reference NN parameter 140RP w_m^r. Analysis module 140 may subsequently calculate a value of one or more affine function coefficients 140AC (e.g., A and B of Algorithm 2, or a_i, b_i of Eq. 2B), representing a transform between reference NN parameter 140RP w_m^r and at least one corresponding second NN parameter 100P w_m^i, associated with the at least one mode 130M. System 10 may subsequently calculate an approximated value 200P of the at least one second NN parameter 100P w_m^i based on: (i) the reference NN parameter 140RP w_m^r, and (ii) the one or more corresponding affine function coefficient values 140AC (a_i, b_i of Eq. 2B), as elaborated herein (e.g., in relation to Eq. 2B).
Additionally, or alternatively, system 10 may be configured to replace at least one NN parameter 100P w_m^i value in the trained NN model 100 with a respective calculated, approximated NN parameter value 200P. System 10 may thus produce, or obtain an approximated version 200 of trained NN model 100.
Reference is now made to
It may be evident from
As elaborated herein, NN model 100 may be trained to implement a specific, underlying ML function. In other words, NN model 100 may be inferred on incoming data samples 20B (e.g., images of cats and dogs) to apply the specific, underlying ML function (e.g., classify, or distinguish between types of depicted animals), based on the training. In this example, an output of the ML function (30 of
According to some embodiments, system 10 may utilize approximated NN model 200, instead of NN model 100 to implement the underlying ML function. In other words, at an inference stage, or a testing stage, system 10 may receive at least one input data sample 20B (e.g., image of an animal), and may infer the approximated version 200 of NN model 100 on input data sample 20B, to implement the ML function (e.g., to classify the depicted animal) on the input data sample 20B.
Reference is now made to
Panel (a) of
It may be observed that CMD may follow GD well during the training process. Additionally, for the testing, or validation set, CMD is more stable, and may surpass GD for both quality criteria.
Reference is now made to
As shown in step S1005, the at least one processor may receive (e.g., via input 7 of
As shown in step S1010, and as elaborated herein (e.g., in relation to
As shown in step S1015, and as elaborated herein (e.g., in relation to
As shown in step S1020, processor 2 may replace at least one NN parameter 100P value in the trained NN model 100 with a respective calculated approximated value 200P, to obtain an approximated version 200 of the trained NN model 100.
Reference is now made to
As shown in step S2005, the at least one processor may receive (e.g., via input 7 of
As shown in step S2010, and as elaborated herein (e.g., in relation to
As a non-limiting example, training dataset 20A may be a set of animal pictures, annotated by the animals' types, and the predefined ML function may include identification of cats from dogs in new, incoming data samples 20B of images.
Processor 2 may train NN model 100 continuously, or repeatedly over time, as shown by the arrow connecting step S2030 to step S2015. Each epoch of the training process may include at least one operation as described herein in steps S2015-S2030.
Additionally, as elaborated herein, the training process may be performed in at least two stages:
At a preliminary stage, NN model may be initially trained such that NN weights 100P of model 100 are adjusted, e.g., based on Gradient Descent (GD) calculation. The NN parameters or weights 100P of the NN model may subsequently be grouped or clustered into modes (e.g., 130 of
At a subsequent stage, NN model may be trained such that NN weight 100P values are gradually replaced with approximated values (e.g., 200P of
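A high-level sketch of this two-stage procedure is given below. The callables passed into the function stand in for the modules described above and are placeholders, not part of any specific implementation:

```python
def train_with_cmd(model, train_one_epoch, snapshot_parameters,
                   build_modes, approximate_and_load,
                   preliminary_epochs, total_epochs):
    """Two-stage training sketch. `train_one_epoch` performs one gradient-descent
    epoch; `snapshot_parameters` returns the current flat parameter vector;
    `build_modes` clusters the recorded profile vectors into modes; and
    `approximate_and_load` computes approximated values and writes them back
    into the model. All of these callables are illustrative placeholders."""
    profiles = []
    # Preliminary stage: plain gradient-descent epochs, recording profile vectors.
    for _ in range(preliminary_epochs):
        train_one_epoch(model)
        profiles.append(snapshot_parameters(model))
    modes = build_modes(profiles)
    # Subsequent stage: keep training, but replace parameter values with their
    # CMD approximations after each epoch.
    for _ in range(preliminary_epochs, total_epochs):
        train_one_epoch(model)
        profiles.append(snapshot_parameters(model))
        approximate_and_load(model, modes, profiles)
    return model
```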
As shown in step S2015, and as elaborated herein (e.g., in relation to
As shown in step S2020, the at least one processor 2 may employ a monitoring module (120 of
As shown in steps S2025 and S2030, and as elaborated herein (e.g., in relation to Eq. 2B), the at least one processor 2 may calculate an approximated value 200P of the at least one NN parameter 100P, based on the profile vector 120PV. The at least one processor 2 may subsequently replace the at least one NN parameter 100P value with the approximated value 200P, to obtain an approximated version 200 of the NN model 100.
As elaborated herein, the approximated version 200 of the NN model 100 may present several benefits for implementing ML functions:
During training or testing phases, where weights 100P are gradually replaced by their respective approximated values 200P, the required calculation of GD for adjusting weights in NN 100 diminishes over time, thereby saving processing time and resources.
Additionally, the substitute NN model, which may be based upon approximated values 200P, may be significantly smaller than brute-force trained NN models, as typically produced in the art, allowing ease of deployment, storage, and application of the underlying ML function.
For example, and as shown in steps S2035 and S2040, during inference of the substitute, approximated NN model 200, the at least one processor 2 may receive an input data sample (e.g., 20B of
It may be appreciated that NN model 100 may be implemented as a separate software and/or hardware module from the approximated version 200, as depicted in the non-limiting example of
As elaborated herein, the process of training NN 100 (and creating approximated NN version 200) includes a preliminary stage, during which processor 2 may utilize training module 110 to train NN model 100 based on training dataset 20A, over a first bulk of training epochs. During, or subsequent to, this preliminary training, preliminary profile vectors 120PV are formed, as elaborated herein.
Processor 2 may employ a clustering module (130 of
As elaborated herein, processor 2 may proceed to calculate the approximated values 200P of member NN parameters 100P based on the grouping into modes.
For example, and as elaborated herein (e.g., in relation to Eqs. 1A-1D), for one or more pairs of NN parameters 100P, clustering module 130 may calculate a correlation value representing a correlation between (i) a profile vector 120PV of a first NN parameter 100P of the pair and (ii) a profile vector 120PV of a second NN parameter 100P of the pair. Clustering module 130 may then group or cluster the plurality of NN parameters as members of modes 130M based on the calculated correlation values, e.g., by grouping together NN parameters 100P that have highly correlated (e.g., beyond a predefined threshold) profile vectors 120PV.
As elaborated herein, for one or more (e.g., each) mode 130M, processor 2 may employ an analysis module 140 to obtain or select a NN parameter 100P, member of mode 130M, as a reference parameter (140RP of
For example, analysis module 140 may identify a central member NN parameter 100P of the mode 130M, as one located nearest a center of a multidimensional space defined by the respective mode 130M, and subsequently select the profile vector 120PV of the central member NN parameter 100P as the reference profile vector 140RV.
In another example, analysis module 140 may identify a central member NN parameter 100P of the mode 130M as one having a minimal distance value from other member NN parameters of the mode 130M, in the multidimensional space defined by mode 130M, according to a predetermined distance metric (e.g., a Euclidean distance). Processor 2 may select the profile vector 120PV of the central member NN parameter 100P as the reference profile vector 140RV.
As elaborated herein (e.g., in relation to Eqs. 3-5), for one or more (e.g., each) NN parameter 100P, analysis module 140 may calculate a value of one or more affine function coefficients 140AC (e.g., ai, bi of Eq. 2B). Affine function coefficients values 140AC may represent a transform between the reference NN parameter 140RP of a mode 130M and at least one second NN parameter 100P, member of the same mode 130M.
As elaborated herein (e.g., in relation to Eq. 2B) analysis module 140 may calculate the approximated value 200P of the at least one second NN parameter based on: (i) the reference NN parameter value 140RP (e.g., wmr of Eq. 2B), and (ii) the one or more corresponding affine function coefficient values 140AC (e.g., ai, bi of Eq. 2B).
Additionally, or alternatively, analysis module 140 may, for at least one mode 130M of the plurality of modes, obtain a reference profile vector, characterizing evolution of NN parameters of the mode through the plurality of training epochs, as elaborated herein, e.g., in relation to
Analysis module 140 may calculate a value of one or more affine function coefficients 140AC (e.g., ai, bi of Eq. 2B), associated with one or more specific NN parameters 100P of the same mode 130M. As elaborated herein, the affine function coefficients 140AC may represent a transform between (i) the profile vectors 120PV of the one or more specific NN parameters 100P and (ii) the reference profile vector 140RV.
Analysis module 140 may subsequently calculate the approximated value of the one or more specific NN parameters based on: (i) the reference profile vector 140RV, and (ii) the one or more affine function coefficient values 140AC, as elaborated herein (e.g., in relation to Eq. 2B).
According to some embodiments, during training of NN model 100, and subsequent generation of substitute NN version 200, monitoring module 120 may optimize utilization of processing resources (e.g., CPU cycles and/or memory).
For example, for at least one NN parameter 100P of the plurality of NN parameters, analysis module 140 may recalculate the associated affine function coefficient values 140AC between training epochs. It may be appreciated that during early stages of training, affine function coefficient values 140AC may be jittery, and may become more stable as the training of NN 100 gradually converges.
Monitoring module 120 may monitor coefficient values 140AC, to determine a status of stability 120ST of the affine function coefficient values among consecutive training epochs, according to a predetermined stability metric. For example, stability status 120ST may be a numerical value, representing the percentage or portion of jitter in the value of a coefficient 140AC between two consecutive epochs. Other values of stability status 120ST may also be possible.
According to some embodiments, training module 110 may train, or freeze, specific weights or parameters 100P of NN 100, based on the stability status 120ST.
For example, when stability status 120ST of a specific parameter 100P surpasses a predetermined threshold, that parameter 100P may be deemed stable. Training module 110 may then refrain from calculating gradient descent of the at least one NN parameter 100P, thereby reducing system complexity and processing power. In other words, training module 110 may proceed to calculate gradient descent, and adjust weights 100P, only for weights 100P whose stability status 120ST has not surpassed the predetermined threshold.
In some embodiments, training module 110 may then replace the value of that stable NN parameter 100P: rather than being calculated between epochs, e.g., by gradient descent, it may be set to the approximated value 200P, which is related to the value of reference parameter 140RP, e.g., via Eq. 2B.
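The following sketch illustrates one possible stability measure and the corresponding freezing decision; the particular score, the threshold, and the function names are illustrative assumptions:

```python
import numpy as np

def stability_status(prev_coeffs, curr_coeffs):
    """One possible stability status: a score that grows as the relative change
    (jitter) of the affine coefficients between consecutive epochs shrinks."""
    prev, curr = np.asarray(prev_coeffs), np.asarray(curr_coeffs)
    jitter = np.abs(curr - prev) / (np.abs(prev) + 1e-12)
    return 1.0 / (1.0 + jitter)

def frozen_parameter_indices(status, threshold=0.99):
    """Parameters whose stability status surpasses the threshold are deemed stable:
    gradient descent may be skipped for them, with Eq. 2B supplying their values."""
    return np.where(np.asarray(status) > threshold)[0]
```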
As elaborated herein, system 10 may provide a practical application, by providing several improvements to the functionality of computer-based systems configured to implement ML-based functions.
For example, by inferring the approximated version of NN model 200 on incoming data samples 20 (e.g., rather than inferring the trained NN model 100 on data samples 20), system 10 may provide an improvement in implementation of the underlying ML function. This improvement is manifested, for example, by improved metrics of accuracy, as elaborated herein. In other words, by using the approximated model (as in
Additionally, embodiments of the invention may use the CMD modelling of approximated ML model 200 to accelerate, and/or improve training of an ML model.
For example, the CMD algorithm may be used to efficiently retrain an ML model following some unexpected change, e.g., in a loss function or dataset. In other words, ML model 100 (or ML model 200) may be initially trained, based on a given dataset and loss function. Subsequently, in a condition that a parameter in the loss function should be changed, or when there is a change or drift in the dataset, the previously trained model (100/200) should be retrained. Embodiments of the invention may expedite this retraining procedure by (a) only training or amending reference weights 140RP (e.g., using a gradient-descent algorithm), and (b) applying the required changes to the rest of NN parameters 100P by using the affine function coefficients 140AC, as elaborated herein (e.g., in relation to Eq. 2B).
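A minimal sketch of the parameter-refresh step of such retraining is given below, assuming the mode assignments, reference-parameter indices, and affine coefficients 140AC were retained from the original training (all names are illustrative):

```python
import numpy as np

def apply_cmd_update(param_values, mode_of, ref_index_of_mode, a, b):
    """Refresh all parameters from their mode's reference value via Eq. 2B.
    param_values: flat array of all N parameter values (reference weights already
    updated by gradient descent). mode_of[i]: mode index of parameter i.
    ref_index_of_mode[m]: index of mode m's reference weight.
    a, b: per-parameter affine coefficients."""
    refs = param_values[ref_index_of_mode]    # one reference value per mode
    return a * refs[mode_of] + b              # Eq. 2B: w_i = a_i * w_ref + b_i
```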
In another example, system 10 may expedite the training process by employing an iterative training algorithm. In each iteration of this iterative training algorithm, system 10 may simultaneously infer the CMD model on the fly, while using the approximated model to deduce values of the network parameters.
Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Furthermore, all formulas described herein are intended as examples only and other or different formulas may be used. Additionally, some of the described method embodiments or elements thereof may occur or be performed at the same point in time.
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents may occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
Various embodiments have been presented. Each of these embodiments may of course include features from other embodiments presented, and embodiments not specifically described may include various features described herein.
This application claims the benefit of priority of U.S. Patent Application No. 63/396,658, filed Aug. 10, 2022, entitled “SYSTEM AND METHOD OF TRAINING A NEURAL NETWORK MODEL” which is hereby incorporated by reference in its entirety.