The present invention relates generally to the field of neural networks, and more particularly, to a method and system for optimizing activation functions to improve performance of neural networks.
Activation functions are an important choice in neural network design. They determine how the outputs of neurons are transformed into inputs for subsequent layers. Activation functions can have a significant impact on the performance of neural networks, and researchers have designed a variety of activation functions with different properties.
Well-constructed activation functions can markedly enhance neural network performance across diverse machine learning tasks. However, it is difficult for humans to construct activation functions that are optimal for every machine learning task. Automated methods can evaluate thousands of unique functions and, as a result, often discover better activation functions than those designed by humans. However, such automated approaches have their own limitations: they typically rely on computationally inefficient ad hoc algorithms that may not scale to large models and datasets. Further, current activation function search algorithms are expensive, making it difficult to optimize activation functions for new tasks.
In light of the aforementioned drawbacks, there is a need for a system and a method which provide for optimization of activation functions to improve the performance of neural networks.
In various embodiments of the present invention, a method for optimizing activation functions to improve efficiency of neural networks is provided. The method is implemented by a processor executing instructions stored in a memory. The method comprises creating activation function benchmark datasets by training one or more neural network architectures with a plurality of activation functions. The activation function benchmark datasets comprise results obtained through training of the neural network architectures for various tasks and are indicative of performance of the activation functions on specific tasks. The method further comprises identifying one or more features indicative of performance of the activation functions in the activation function benchmark datasets. Further, the method comprises creating a metric space by analyzing the identified features of the activation functions. The metric space is a low-dimensional representation of the activation functions, is representative of informative features associated with the activation functions and has a predefined distance metric. The method further comprises optimizing the activation functions based on the metric space by employing dimensionality reduction techniques to visualize the metric space. The visualization of the metric space shows clusters of high-performing activation functions segregated from poorly performing activation functions. The method further comprises determining characteristics of the optimized activation functions which are predictive of their performance. Finally, the method comprises selecting activation functions from the optimized activation functions based on the determination, wherein the selected activation functions are applied to improve real-world tasks carried out by neural networks.
In various embodiments of the present invention, a system for optimizing activation functions to improve efficiency of neural networks is provided. The system comprises a memory storing program instructions, a processor executing instructions stored in the memory and an activation function optimization engine executed by the processor and configured to create activation function benchmark datasets by training one or more neural network architectures with a plurality of activation functions. The activation function optimization engine identifies one or more features indicative of performance of the activation functions in the activation function benchmark datasets. The activation function optimization engine creates a metric space by analyzing the identified features of the activation functions in the benchmark datasets. The metric space is a low-dimensional representation of the activation functions, is representative of informative features associated with the activation functions and has a predefined distance metric. The activation function optimization engine further optimizes the activation functions based on the metric space by employing dimensionality reduction techniques to visualize the metric space. The visualization shows clusters of high-performing activation functions segregated from poorly performing activation functions. The activation function optimization engine determines characteristics of the optimized activation functions which are predictive of their performance. The activation function optimization engine selects activation functions from the optimized activation functions based on the determination. The selected activation functions are applied to improve real-world tasks carried out by neural networks.
In various embodiments of the present invention, a computer program product is provided. The computer program product comprises a non-transitory computer-readable medium having computer-readable program code stored thereon, the computer-readable program code comprising instructions that, when executed by a processor, cause the processor to create activation function benchmark datasets by training one or more neural network architectures with a plurality of activation functions. The processor identifies one or more features indicative of performance of the activation functions in the activation function benchmark datasets. Further, the processor creates a metric space by analyzing the identified features of the activation functions in the benchmark datasets. The metric space is a low-dimensional representation of the activation functions, is representative of informative features associated with the activation functions and has a predefined distance metric. Furthermore, the processor optimizes the activation functions based on the metric space by employing dimensionality reduction techniques to visualize the metric space, wherein clusters of high-performing activation functions are segregated from poorly performing activation functions. Yet further, the processor determines characteristics of the optimized activation functions which are predictive of their performance in each of the activation function benchmark datasets. Finally, the processor selects activation functions from the optimized activation functions based on the determination. The selected activation functions are applied to improve real-world tasks carried out by neural networks.
The present invention is described by way of embodiments illustrated in the accompanying drawings.
The present invention discloses a system and a method which provide for optimization of activation functions to improve efficiency of neural networks. The present invention optimizes activation functions by combining data-driven insights with precomputed results, and streamlines activation function optimization through systematic benchmark dataset generation, feature extraction, metric space construction, and selection of suitable functions, thereby providing a comprehensive approach to enhancing neural network performance.
In various embodiments of the present invention, an existing machine learning system that is used on a real-world task is employed, and a search algorithm is applied to discover a better activation function that improves performance on the task. The optimized activation function then replaces the old activation function, and the result is a machine learning system with improved accuracy on the task.
The disclosure is provided in order to enable a person having ordinary skill in the art to practice the invention. Exemplary embodiments herein are provided only for illustrative purposes and various modifications will be readily apparent to persons skilled in the art. The general principles defined herein may be applied to other embodiments and applications without departing from the scope of the invention. The terminology and phraseology used herein is for the purpose of describing exemplary embodiments and should not be considered limiting. Thus, the present invention is to be accorded the widest scope encompassing numerous alternatives, modifications, and equivalents consistent with the principles and features disclosed herein. For purposes of clarity, details relating to technical material that is known in the technical fields related to the invention have been briefly described or omitted so as not to unnecessarily obscure the present invention.
The present invention will now be discussed in the context of embodiments as illustrated in the accompanying drawings.
At step 102, activation function benchmark datasets are created. In an embodiment of the present invention, the activation function benchmark datasets are created by training one or more neural network architectures with a plurality of systematically generated activation functions. The activation function benchmark datasets facilitate analysis of activation function properties at a large scale in order to determine the better-performing activation functions. In an exemplary embodiment of the present invention, the activation function benchmark datasets are created by training convolutional, residual and vision transformer architectures with a plurality of activation functions for various tasks. In an example, the activation function benchmark dataset Act-Bench-CNN contains training results for a plurality of activation functions when paired with a convolutional neural network for tasks associated with image datasets such as CIFAR-10. In another example, the activation function benchmark dataset Act-Bench-ResNet contains training results for a plurality of activation functions when paired with a residual neural network for tasks associated with CIFAR-10. In yet another example, the activation function benchmark dataset Act-Bench-ViT contains training results for a plurality of activation functions when paired with the MobileViTv2-0.5 neural network for tasks associated with Imagenette. The training results indicate the performance of the activation functions on specific tasks carried out by the neural networks mentioned above.
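For example, the following minimal sketch illustrates how one benchmark entry might be produced. The small convolutional architecture, the hyperparameters, and the candidate set are hypothetical placeholders for illustration only; the actual benchmark datasets are built by fully training the architectures described above with each candidate activation function.

```python
import tensorflow as tf

def benchmark_entry(act_fn, epochs=10):
    """Train a small CNN on CIFAR-10 with a candidate activation function
    and record the best validation accuracy as one benchmark result."""
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar10.load_data()
    x_tr, x_te = x_tr / 255.0, x_te / 255.0
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation=act_fn,
                               input_shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(64, 3, activation=act_fn),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(
                      from_logits=True),
                  metrics=["accuracy"])
    history = model.fit(x_tr, y_tr, epochs=epochs,
                        validation_data=(x_te, y_te), verbose=0)
    return max(history.history["val_accuracy"])

# One benchmark row per candidate activation function
candidates = {"relu": tf.nn.relu, "swish": tf.nn.swish, "tanh": tf.nn.tanh}
benchmark = {name: benchmark_entry(fn) for name, fn in candidates.items()}
```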
At step 104, features of activation functions which are highly indicative of performance of the activation functions are identified. In an embodiment of the present invention, various features hypothesized to be predictive of performance of the activation functions in the created benchmark datasets are evaluated by the Uniform Manifold Approximation and Projection (UMAP) technique. From the various features hypothesized to be predictive of performance, the features which are highly indicative of an activation function's performance are identified. In an exemplary embodiment of the present invention, two activation function features are identified which are highly indicative of an activation function's performance, i.e., (1) the spectrum of the Fisher information matrix (FIM) associated with the model's predictive distribution at initialization, and (2) the activation function's output distribution. The FIM helps in understanding how a neural network works, how well the network can learn and make predictions, and how stable the network can be. Different activation functions can lead to different FIM eigenvalues for the same network. Further, the activation function's output distribution is indicative of the shape of an activation function. The shape defines how the activation function reacts to different input values. It can be either linear or nonlinear, and this impacts the neural network's ability to handle complex patterns and make predictions.
At step 106, a metric space is created by analyzing the identified features of the activation functions. In an embodiment of the present invention, FIM eigenvalues are calculated and used to filter out the poorly performing activation functions. A distance metric between FIM eigenvalue spectra measures how similar two activation functions are based on their FIM eigenvalues. With this distance metric, the FIM eigenvalues can be used to create a metric space for activation functions. However, FIM eigenvalues aggregate multiple sources of information, which introduces noise into the eigenvalue-based comparison. Advantageously, the present invention combines the FIM eigenvalues with the activation function output distribution to overcome this limitation. The combined features can accurately predict the performance of activation functions.
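One way such eigenvalues might be estimated is sketched below. The sketch assumes a Keras classifier and a small batch of inputs, samples labels from the model's own predictive distribution at initialization, and uses the Gram-matrix identity (the nonzero eigenvalues of (1/N) J^T J equal those of the small N x N matrix (1/N) J J^T) to keep the eigendecomposition tractable. It is an illustrative estimator, not the only possible one.

```python
import tensorflow as tf

def fim_eigenvalues(model, inputs):
    """Estimate the nonzero eigenvalues of the empirical Fisher information
    matrix F = (1/N) J^T J at initialization, where row i of J is the
    gradient of the log-likelihood of a label sampled from the model's
    predictive distribution for input i."""
    per_sample_grads = []
    for x in inputs:
        with tf.GradientTape() as tape:
            logits = model(x[None, ...])
            y = tf.random.categorical(logits, num_samples=1)  # y ~ p(y | x)
            nll = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=tf.reshape(y, [-1]), logits=logits)
        grads = tape.gradient(nll, model.trainable_variables)
        per_sample_grads.append(
            tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0))
    J = tf.stack(per_sample_grads)                       # shape (N, P)
    gram = tf.matmul(J, J, transpose_b=True) / J.shape[0]
    return tf.linalg.eigvalsh(gram)                      # ascending eigenvalues
```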
In an embodiment of the present invention, to understand the shape of an activation function, the function's outputs are sampled across a range of input values. Proper network initialization ensures that input values follow a standard normal distribution (mean = 0, standard deviation = 1). Consequently, a vector of expected activation function outputs is obtained by randomly sampling inputs from this distribution. The Euclidean distance between the output vectors of two activation functions is then used to measure the dissimilarity of their shapes. This provides a reliable and cost-effective means of comparing activation functions. The activation function output values, along with the FIM eigenvalues, form a metric space which serves as a potent surrogate search space, enabling efficient identification of optimal activation functions.
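A minimal sketch of this shape feature is given below. Sorting the sampled outputs (one plausible reading of "a vector of expected outputs") makes the vector approximate the quantiles of the output distribution, so the Euclidean distance compares the functions' output distributions rather than unmatched raw samples.

```python
import numpy as np

def output_feature(act_fn, n=1024, seed=0):
    """Sample activation outputs on standard-normal inputs (the distribution
    that proper initialization induces) and sort them, so the vector
    approximates the quantiles of the function's output distribution."""
    rng = np.random.default_rng(seed)
    return np.sort(act_fn(rng.standard_normal(n)))

def shape_distance(fn_a, fn_b):
    """Euclidean distance between output vectors: a cheap dissimilarity
    measure between two activation function shapes."""
    return float(np.linalg.norm(output_feature(fn_a) - output_feature(fn_b)))

relu = lambda x: np.maximum(x, 0.0)
swish = lambda x: x / (1.0 + np.exp(-x))
print(shape_distance(relu, swish))    # smaller: similar shapes
print(shape_distance(relu, np.tanh))  # larger: dissimilar shapes
```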
In order to effectively optimize activation functions, the metric space (surrogate space) needs to be low-dimensional, represent informative features associated with the activation functions, and have a predefined distance metric. At step 108, a dimensionality reduction technique is employed to visualize the metric space. In an exemplary embodiment of the present invention, the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction technique is employed to visualize the features, i.e., the FIM eigenvalues and activation function output distributions, of the benchmark datasets. This visualization leads to a combined metric space that is used to optimize activation functions for improving neural networks. UMAP helps in analyzing the patterns and groupings of activation functions that perform similarly. UMAP creates a low-dimensional embedding that shows how different activation functions are related to each other.
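The sketch below shows how such a visualization might be produced with the umap-learn package. The feature matrices and accuracies are random placeholders standing in for the FIM eigenvalues, output vectors, and benchmark results computed above.

```python
import numpy as np
import umap                      # from the umap-learn package
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_functions = 200
# Placeholders: in practice each row concatenates an activation function's
# FIM eigenvalues with its sorted output samples, and accs holds the
# pre-computed validation accuracies from the benchmark dataset.
fim_eigs = rng.standard_normal((n_functions, 32))
output_vecs = rng.standard_normal((n_functions, 256))
accs = rng.uniform(0.5, 0.95, n_functions)

features = np.hstack([fim_eigs, output_vecs])
embedding = umap.UMAP(n_components=2, metric="euclidean",
                      random_state=42).fit_transform(features)

# Color by accuracy: high-performing functions should form visible clusters
plt.scatter(embedding[:, 0], embedding[:, 1], c=accs, cmap="viridis", s=10)
plt.colorbar(label="validation accuracy")
plt.title("UMAP embedding of activation function features")
plt.show()
```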
At step 110, the visualizations are further analyzed to determine activation function characteristics which are predictive of performance in each dataset. In an embodiment of the present invention, it is determined that activation functions with similar shapes lead to similar performances, while those with different shapes often produce different results.
At step 112, activation functions are selected based on analysis of the visualizations produced by the UMAP technique. Traditionally, selection of activation functions involves training a neural network to assess the performance of each potential activation function. However, in an embodiment of the present invention, a significant advantage is realized by employing pre-computed results from the metric space associated with the benchmark datasets. This approach enables experimentation with diverse search algorithms and the execution of repeated trials, thereby facilitating a comprehensive understanding of the significance of the results.
Conventional activation function search processes typically necessitate complete training of a neural network for the comprehensive evaluation of each candidate function, a procedure often characterized by significant computational expenses. However, employing benchmark datasets in the present embodiment offers a distinct advantage. As these datasets already contain pre-computed results, they afford the opportunity to explore diverse search algorithms and conduct multiple iterations for the purpose of establishing statistical significance of the outcomes.
In an example embodiment of the present invention, search algorithms for searching and selecting activation functions were evaluated on the three created benchmark datasets. Firstly, a random search was included as a baseline reference that did not employ the FIM eigenvalues or activation function outputs. Then, three algorithms, viz. weighted k-nearest regression with k=3 (KNR), random forest regression (RFR), and support vector regression (SVR), were evaluated. These algorithms were employed in their default configurations as provided by the scikit-learn package, without hyperparameter tuning. Each algorithm was supplied with distinct activation function features, aimed at assessing their potential to forecast performance. The search algorithms initiate their evaluation process by assessing eight activation functions: the rectified linear unit relu(x) and seven other activation functions, viz. the exponential linear unit elu(x), the scaled exponential linear unit selu(x), sigmoid(x), softplus(x), softsign(x), swish(x) and tanh(x), which form a starting point in the metric space. Typically, using existing methods, such evaluations involve the arduous task of training from the ground up; however, the created benchmark datasets of the present invention streamline the process by allowing retrieval of pre-computed results from the metric space. Subsequently, the algorithms utilized the validation accuracy of the eight activation functions to forecast the performance of all unexamined functions within the dataset. The activation function projected to have the highest predicted accuracy was then evaluated. The performance of this assessed activation function was subsequently incorporated into the roster of established results. This iterative procedure continued until 100 distinct activation functions had been evaluated. In an embodiment of the present invention, each experiment, involving a distinct search algorithm, activation function feature set, and benchmark dataset, was repeated 100 times for comprehensive evaluation.
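A minimal sketch of this search loop with the KNR surrogate appears below. It assumes features holds the metric-space representation of every function in a benchmark dataset and accuracy holds the pre-computed validation accuracies, so that "evaluating" a candidate reduces to an array lookup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knr_search(features, accuracy, start, budget=100):
    """Surrogate search over a pre-computed benchmark: fit weighted 3-NN
    regression on the functions evaluated so far, then evaluate the
    unexamined function with the highest predicted accuracy."""
    evaluated = list(start)                   # indices of the eight seeds
    while len(evaluated) < budget:
        surrogate = KNeighborsRegressor(n_neighbors=3, weights="distance")
        surrogate.fit(features[evaluated], accuracy[evaluated])
        remaining = np.setdiff1d(np.arange(len(features)), evaluated)
        preds = surrogate.predict(features[remaining])
        evaluated.append(int(remaining[np.argmax(preds)]))  # lookup, not training
    return max(evaluated, key=lambda i: accuracy[i])        # best function found

# Hypothetical usage: the eight seed functions occupy indices 0-7
# best = knr_search(features, accuracy, start=range(8))
```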
In an embodiment of the present invention, the selected activation functions are applied to new datasets and search spaces to demonstrate their effectiveness. The experiments were expanded in two significant dimensions. Firstly, while the neural network architectures remained consistent, the experiments involved larger and more challenging datasets, namely, All-CNN-C on CIFAR-100, ResNet-56 on CIFAR-100, and MobileViTv2-0.5 on ImageNet. Secondly, a considerably larger activation function search space was explored, comprising 425,896 distinct activation functions rooted in four-node computation graphs. This extensive and diverse space, unlike the precomputed benchmark datasets, was not predetermined, thereby subjecting the findings of the benchmark experiments to validation in a real-world production environment. Building upon the benchmark results, weighted k-nearest regression (KNR) with k=3 was employed as the chosen search algorithm. The searches commenced by evaluating eight established activation functions, including ELU, ReLU, SELU, sigmoid, Softplus, Softsign, Swish, and tanh. Subsequently, eight parallel workers evaluated the activation functions predicted to exhibit superior performance.
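The operator vocabulary of these four-node graphs is not enumerated here; the sketch below illustrates one plausible construction in which a candidate function is assembled from two unary nodes feeding a binary node followed by a final unary node. The operator sets are hypothetical and far smaller than those needed to reach 425,896 candidates.

```python
import numpy as np
from itertools import product

# Hypothetical operator vocabularies for illustration only
UNARY = {"identity": lambda x: x, "tanh": np.tanh, "abs": np.abs,
         "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x))}
BINARY = {"add": np.add, "mul": np.multiply, "max": np.maximum}

def make_activation(u1, u2, b, u3):
    """Candidate from a four-node graph: f(x) = u3( b( u1(x), u2(x) ) )."""
    return lambda x: UNARY[u3](BINARY[b](UNARY[u1](x), UNARY[u2](x)))

# Enumerate every graph in this toy space (4 * 4 * 3 * 4 = 192 candidates)
space = [make_activation(u1, u2, b, u3)
         for u1, u2, b, u3 in product(UNARY, UNARY, BINARY, UNARY)]

# Example graph: x * sigmoid(x), i.e., the Swish function
swish_like = make_activation("identity", "sigmoid", "mul", "identity")
```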
Advantageously, each experiment consistently led to the discovery of new activation functions that surpassed the performance of established baseline functions. The searches also demonstrated high efficiency, requiring only a limited number of evaluations to achieve performance improvements. In the search involving ResNet-56 on CIFAR-100, an activation function outperforming all baselines was identified by only the second evaluation. Tables 1 to 3 display accuracy results with various activation functions, in accordance with an embodiment of the present invention. In the CIFAR-100 experiments, the results represent the median test accuracy from three runs, whereas in the ImageNet experiments, the results reflect the validation accuracy from a single run.
In an embodiment of the present invention, the engine 902 comprises a processor 904 and a memory 906. In various embodiments of the present invention, the engine 902 is configured to optimize activation functions for improving neural networks. The various units of the engine 902 are operated via the processor 904 specifically programmed to execute instructions stored in the memory 906 for executing respective functionalities of the units of the engine 902 in accordance with various embodiments of the present invention.
In an embodiment of the present invention, the engine 902 comprises a benchmark dataset generation unit 910, a characterization unit 912, a metric space construction unit 914, a dimension reduction unit 916 and an activation function evaluation unit 918.
In an embodiment of the present invention, the benchmark dataset generation unit 910 is configured to receive inputs from the input unit 908 for generating activation function benchmark datasets. In an exemplary embodiment of the present invention, the inputs are associated with the neural network and the datasets for which the activation functions are to be optimized and the search space that is to be explored for the activation functions. In an embodiment of the present invention, the benchmark dataset generation unit 910 creates activation function benchmark datasets by training one or more neural network architectures with a plurality of systematically generated activation functions. The activation function benchmark datasets facilitate analysis of activation function properties at a large scale in order to determine the better-performing activation functions. In an embodiment of the present invention, the activation function benchmark datasets are created by training convolutional, residual and vision transformer neural network architectures with a plurality of activation functions for various tasks. In an exemplary embodiment of the present invention, the activation function benchmark dataset Act-Bench-CNN contains training results for a plurality of activation functions when paired with a convolutional neural network for tasks associated with image datasets such as CIFAR-10. In another exemplary embodiment of the present invention, the activation function benchmark dataset Act-Bench-ResNet contains training results for a plurality of activation functions when paired with a residual neural network for tasks associated with CIFAR-10. In yet another exemplary embodiment of the present invention, the activation function benchmark dataset Act-Bench-ViT contains training results for a plurality of activation functions when paired with the MobileViTv2-0.5 neural network for tasks associated with Imagenette. The training results indicate the performance of the activation functions on specific tasks carried out by the neural networks mentioned above.
In an embodiment of the present invention, the characterization unit 912 is configured to extract features from the created benchmark datasets and perform characterization of the extracted features to identify those features of the activation functions in the benchmark datasets that are highly indicative of an activation function's performance. In an exemplary embodiment of the present invention, two activation function features are identified which are highly indicative of an activation function's performance, i.e., (1) the spectrum of the FIM associated with the model's predictive distribution at initialization, and (2) the activation function's output distribution. Each feature contributes unique information about an activation function. The FIM helps in understanding how a neural network works, how well the network can learn and make predictions, and how stable the network can be. Different activation functions can lead to different FIM eigenvalues for the same network. Further, the activation function's output distribution is indicative of the shape of an activation function. The shape defines how the activation function reacts to different input values. It can be either linear or nonlinear, and this impacts the neural network's ability to handle complex patterns and make predictions.
In an embodiment of the present invention, the metric space construction unit 914 is configured to create a metric space by analyzing the identified features of the activation functions. The metric space construction unit 914 combines the FIM eigenvalue and activation function output distribution features of the activation functions to create a metric space having a low-dimensional representation of the activation functions. In an embodiment of the present invention, the metric space construction unit 914 creates the metric space for computing distances between activation functions. The low-dimensional representation of the FIM eigenvalues and the activation function's output distribution renders the metric space a practical surrogate for optimizing activation functions.
In order to effectively optimize activation functions, the surrogate space needs to be low-dimensional, represent informative features, and have a predefined distance metric. In an embodiment of the present invention, the dimension reduction unit 916 is configured to visualize the metric space by employing a dimensionality reduction technique. In an embodiment of the present invention, the dimension reduction unit 916 employs the UMAP dimensionality reduction technique to visualize the features, i.e., the FIM eigenvalues and activation function output distributions, of the created benchmark datasets. This visualization leads to a combined surrogate space that is used to optimize activation functions for improving neural networks. UMAP helps in analyzing the patterns and groupings of activation functions that perform similarly.
In an embodiment of the present invention, the activation function evaluation unit 918 is configured to determine activation function characteristics which are predictive of performance in each dataset. In an embodiment of the present invention, the activation function evaluation unit 918 determines the activation function characteristics after analysis of the visualization. Activation functions with similar shapes lead to similar performances, and those with different shapes often produce different results. The activation function evaluation unit 918 subsequently selects activation functions based on analysis of the visualization. Typically, selection of activation functions comprises searching for activation functions by training a neural network from scratch in order to evaluate each candidate activation function. In an embodiment of the present invention, advantageously, with the benchmark datasets, all of the results are already pre-computed in the metric space. This information makes it possible to experiment with different search algorithms and conduct repeated trials to understand the significance of the results.
In various embodiments, the present invention can be applied to any real-world task where image recognition systems are used, such as: (1) medical imaging, including, but not limited to, diagnosing diseases and interpreting x-rays, MRIs, and CT scans, (2) autonomous vehicles, especially identifying pedestrians, other vehicles, and road signs, (3) retail, where products need to be identified to manage inventory, (4) agriculture, in order to monitor crop growth, diseases, and optimal harvest time, and (5) face recognition, for example in airports and in mobile phones.
The communication channel(s) 1008 allow communication over a communication medium to various other computing entities. The communication medium conveys information such as program instructions or other data. Communication media include, but are not limited to, wired or wireless methodologies implemented with electrical, optical, RF, infrared, acoustic, microwave, Bluetooth or other transmission media.
The input device(s) 1010 may include, but are not limited to, a keyboard, mouse, pen, joystick, trackball, voice device, scanning device, touch screen or any other device that is capable of providing input to the computer system 1002. In an embodiment of the present invention, the input device(s) 1010 may be a sound card or similar device that accepts audio input in analog or digital form. The output device(s) 1012 may include, but are not limited to, a user interface on a CRT or LCD, printer, speaker, CD/DVD writer, or any other device that provides output from the computer system 1002.
The storage 1014 may include, but is not limited to, magnetic disks, magnetic tapes, CD-ROMs, CD-RWs, DVDs, flash drives or any other medium which can be used to store information and can be accessed by the computer system 1002. In various embodiments of the present invention, the storage 1014 contains program instructions for implementing the described embodiments.
The present invention may suitably be embodied as a computer program product for use with the computer system 1002. The method described herein is typically implemented as a computer program product, comprising a set of program instructions which is executed by the computer system 1002 or any other similar device. The set of program instructions may be a series of computer readable codes stored on a tangible medium, such as a computer readable storage medium (storage 1014), for example, diskette, CD-ROM, ROM, flash drives or hard disk, or transmittable to the computer system 1002, via a modem or other interface device, over either a tangible medium, including but not limited to optical or analogue communications channel(s) 1008. The implementation of the invention as a computer program product may be in an intangible form using wireless techniques, including but not limited to microwave, infrared, Bluetooth or other transmission techniques. These instructions can be preloaded into a system or recorded on a storage medium such as a CD-ROM, or made available for downloading over a network such as the internet or a mobile telephone network. The series of computer readable instructions may embody all or part of the functionality previously described herein.
The present invention may be implemented in numerous ways including as a system, a method, or a computer program product such as a computer readable storage medium or a computer network wherein programming instructions are communicated from a remote location.
While the exemplary embodiments of the present invention are described and illustrated herein, it will be appreciated that they are merely illustrative. It will be understood by those skilled in the art that various modifications in form and detail may be made therein without departing from or offending the scope of the invention.