SYSTEMS OF NEURAL NETWORKS COMPRESSION AND METHODS THEREOF

Information

  • Patent Application
  • Publication Number: 20240135180
  • Date Filed: November 21, 2023
  • Date Published: April 25, 2024
Abstract
Systems and methods provide improved neural network compression by training, based on training data and an optimization problem, a deep neural network to produce a trained deep neural network by iteratively updating a weight matrix of the deep neural network according to, at each iteration, minimizing a rank value of the weight matrix until a memory capacity metric is satisfied, minimizing a loss function based on the training data and the weight matrix and updating the weight matrix, and terminating the iterations upon the loss function being minimized within the memory capacity metric. Tensor decomposition is used to compress the trained deep neural network based on the rank value and the weight matrix to obtain a trained tensor decomposition format deep neural network. The trained tensor decomposition format deep neural network is retrained with the training data to obtain a fine-tuned trained tensor decomposition format deep neural network.
Description
FIELD OF THE TECHNOLOGY

The present disclosure generally relates to computer-based platforms and systems configured for neural network compression including tensor decomposition-based compression techniques for training and inferencing with tensorized neural networks and methods thereof.


BACKGROUND

Deep Neural Networks (DNNs) have widespread applications in many tasks, such as image classification, video recognition, object detection, and image captioning. For most embedded and Internet-of-Things (IoT) systems, the sizes of DNN models are too large, thereby causing high storage and computational demands and severely hindering the practical deployment of DNNs.


SUMMARY OF THE DISCLOSURE

This invention proposes a systematic framework for tensor decomposition-based model compression by applying an optimization technique including a dual problem approach, such as, e.g., the Alternating Direction Method of Multipliers (ADMM). By formulating TT decomposition-based model compression as an optimization problem with constraints on tensor ranks, the framework leverages the ADMM technique to systematically solve the optimization problem in an iterative way. During this procedure, the entire DNN model is trained in the original structure instead of the tensor decomposed format, but gradually enjoys the desired low tensor rank characteristics. The model may then be decomposed into the tensor decomposed format and fine-tuned to finally obtain a high-accuracy tensor decomposed format DNN model.


In some aspects, the techniques described herein relate to a method including: receiving, by at least one processor, at least: i) a memory capacity metric defining a maximum memory size available; ii) a model identifier that identifies a deep neural network, the deep neural network including at least one weight matrix that is arranged in at least one layer and includes a plurality of weights; iii) a training data including a plurality of input data records and a plurality of ground truth data records; wherein each input data record is associated with a corresponding ground truth data record; wherein the corresponding ground truth data record defines a ground truth output; training, by the at least one processor, based on the training data and an optimization problem, the deep neural network to produce a trained deep neural network by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the plurality of weights of the at least one weight matrix upon decomposition to the tensor of the rank value; wherein the optimization problem is configured to: minimize the rank value at each iteration until the memory capacity metric is satisfied, minimize a loss function based at least in part on the training data and the plurality of weights, backpropagate the loss function to the plurality of weights, and stop iteratively updating the at least one layer upon the memory capacity metric being satisfied and the loss function being minimized within the memory capacity metric; utilizing, by the at least one processor, a tensor-based decomposition to compress the trained deep neural network based at least in part on the rank value and the plurality of weights to obtain a trained tensor decomposition format deep neural network; training, by the at least one processor, the trained tensor decomposition format deep neural network based at least in part on the training data to obtain a fine-tuned trained tensor decomposition format deep neural network; and deploying, by the at least one processor, the fine-tuned trained tensor decomposition format deep neural network to at least one hardware device that satisfies the memory capacity metric.


In some aspects, the techniques described herein relate to a method, wherein the tensor-based decomposition includes at least one of: tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.


In some aspects, the techniques described herein relate to a method, wherein the optimization problem includes a dual problem optimization method.


In some aspects, the techniques described herein relate to a method, wherein the dual problem optimization method includes dual ascent method (DAM).


In some aspects, the techniques described herein relate to a method, wherein the dual problem optimization method includes dual decomposition.


In some aspects, the techniques described herein relate to a method, wherein the dual problem optimization method includes Alternating Direction Method of Multipliers (ADMM).


In some aspects, the techniques described herein relate to a method, further including: determining, by the at least one processor, a minimum tensor rank based at least in part on the memory capacity metric; and terminating, by the at least one processor, the iteratively updating of the at least one layer of the deep neural network upon the rank value reaching the minimum tensor rank.


In some aspects, the techniques described herein relate to a method, wherein the deep neural network includes at least one of: a transformer, a multi-layer perceptron, a convolutional neural network, or a recurrent neural network.


In some aspects, the techniques described herein relate to a method, wherein the transformer includes at least one vision transformer.


In some aspects, the techniques described herein relate to a method, wherein the at least one hardware device includes an embedded internet-of-things (IoT) device.


In some aspects, the techniques described herein relate to a system including: at least one processor; and at least one non-transitory computer readable medium having software instructions stored thereon, wherein, upon execution of the software instructions, the at least one processor is configured to: receive at least: i) a memory capacity metric defining a maximum memory size available; ii) a model identifier that identifies a deep neural network, the deep neural network including at least one weight matrix that is arranged in at least one layer and includes a plurality of weights; iii) a training data including a plurality of input data records and a plurality of ground truth data records; wherein each input data record is associated with a corresponding ground truth data record; wherein the corresponding ground truth data record defines a ground truth output; train, based on the training data and an optimization problem, the deep neural network to produce a trained deep neural network by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the plurality of weights of the at least one weight matrix upon decomposition to the tensor of the rank value; wherein the optimization problem is configured to: minimize the rank value at each iteration until the memory capacity metric is satisfied, minimize a loss function based at least in part on the training data and the plurality of weights, backpropagate the loss function to the plurality of weights, and stop iteratively updating the at least one layer upon the memory capacity metric being satisfied and the loss function being minimized within the memory capacity metric; utilize a tensor-based decomposition to compress the trained deep neural network based at least in part on the rank value and the plurality of weights to obtain a trained tensor decomposition format deep neural network; train the trained tensor decomposition format deep neural network based at least in part on the training data to obtain a fine-tuned trained tensor decomposition format deep neural network; and deploy the fine-tuned trained tensor decomposition format deep neural network to at least one hardware device that satisfies the memory capacity metric.


In some aspects, the techniques described herein relate to a system, wherein the tensor-based decomposition includes at least one of: tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.


In some aspects, the techniques described herein relate to a system, wherein the optimization problem includes a dual problem optimization system.


In some aspects, the techniques described herein relate to a system, wherein the dual problem optimization system includes dual ascent method (DAM).


In some aspects, the techniques described herein relate to a system, wherein the dual problem optimization system includes dual decomposition.


In some aspects, the techniques described herein relate to a system, wherein the dual problem optimization system includes Alternating Direction Method of Multipliers (ADMM).


In some aspects, the techniques described herein relate to a system, wherein, upon execution of the software instructions, the at least one processor is further configured to: determine a minimum tensor rank based at least in part on the memory capacity metric; and terminate the iteratively updating of the at least one layer of the deep neural network upon the rank value reaching the minimum tensor rank.


In some aspects, the techniques described herein relate to a system, wherein the deep neural network includes at least one of: a transformer, a multi-layer perceptron, a convolutional neural network, or a recurrent neural network.


In some aspects, the techniques described herein relate to a system, wherein the transformer includes at least one vision transformer.


In some aspects, the techniques described herein relate to a system, wherein the at least one hardware device includes an embedded internet-of-things (IoT) device.





BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the present disclosure may be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the present disclosure. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ one or more illustrative embodiments.



FIG. 1 illustrates a system for implementing an efficient dual problem optimization-based tensor decomposed format model in a resource constrained environment according to one or more embodiments of the present disclosure.



FIG. 2A illustrates representing a matrix via the TT decomposition of its reshaped tensor in accordance with some embodiments of the present disclosure.



FIG. 2B illustrates a tensor decomposed inference scheme in accordance with some embodiments of the present disclosure.



FIG. 3 illustrates converting a computation on a CONV layer to matrix multiplication in accordance with some embodiments of the present disclosure. Here H′=H−f+1 and W′=W−f+1.



FIG. 4 illustrates an example of redundant computations in conventional tensor decomposed inference scheme in accordance with some embodiments of the present disclosure.



FIG. 5 illustrates partially paralleling the inputs in Stage-3 in accordance with some embodiments of the present disclosure. Redundant computations are partially reduced.



FIGS. 6A and 6B illustrate an example of compact tensor decomposed inference scheme in accordance with some embodiments of the present disclosure.



FIG. 7 illustrates an example 2-PE processing scheme with 3 MAC units in each PE in accordance with some embodiments of the present disclosure.



FIG. 8 illustrates a convolution using an input 3D tensor, a d-dimensional tensor kernel, and an output 3D tensor in accordance with some embodiments of the present disclosure.



FIG. 9 illustrates a TT convolutional computation in accordance with some embodiments of the present disclosure.



FIG. 10 illustrates the compact TT convolutional computation in accordance with some embodiments of the present disclosure.



FIG. 11 illustrates a general case diagram for the reduced TT convolutional computation in accordance with some embodiments of the present disclosure.



FIG. 12 illustrates an overall architecture of TIE in accordance with some embodiments of the present disclosure.



FIG. 13 illustrates a data allocation in weight SRAM in accordance with some embodiments of the present disclosure. In some embodiments, the SRAM may be defined by a data-path and weight SRAM.



FIG. 14 illustrates performing an on-the-fly transform using a well-designed working SRAM read access scheme in accordance with some embodiments of the present disclosure.



FIG. 15 illustrates a layout and performance metrics of TIE in accordance with some embodiments of the present disclosure.



FIG. 16 illustrates performance comparison between EIE and TIE on different benchmarks in accordance with some embodiments of the present disclosure.



FIG. 17 illustrates a flexibility of TIE on different decomposition ranks in accordance with some embodiments of the present disclosure.



FIG. 18 depicts an illustration of an example Tensor Train Decomposition with four Tensor Train (TT) cores in accordance with one or more embodiments of the present disclosure.



FIG. 19 illustrates the steps of the framework of embodiments of the present disclosure.



FIG. 20A, FIG. 20B, and FIG. 20C illustrate training loss, Frobenius norm and test accuracy in dual optimization problem-regularized training procedure with different p in accordance with one or more embodiments of the present disclosure.



FIG. 21A, FIG. 21B, and FIG. 21C illustrate a change of (a) training loss and (b) Top-1 test accuracy, and (c) number of parameters during the training process for ResNet-18 on ImageNet dataset in accordance with aspects of embodiments of the present disclosure.



FIGS. 22A and 22B illustrate (a) rank variations for two example component convolutional layers during training, and (b) the final rank distribution using embodiments of the present method of ResNet-18 on the ImageNet dataset in accordance with aspects of embodiments of the present disclosure.



FIG. 23 illustrates a block diagram of an exemplary computer-based system and platform.



FIG. 24 illustrates a block diagram of another exemplary computer-based system and platform.



FIGS. 25 and 26 illustrate schematics of exemplary implementations of the cloud computing/architecture(s) in which the exemplary inventive computer-based systems/platforms, the exemplary inventive computer-based devices, and/or the exemplary inventive computer-based components of the present disclosure may be specifically configured to operate.





DETAILED DESCRIPTION

The present disclosure describes systems and methods to enable deployment of Deep Neural Networks (DNNs) to chip sets having constrained resources. DNNs may be employed in widespread applications for machine learning-based recognition, prediction and segmentation, such as, e.g., in many computer vision tasks including image classification, video recognition, object detection, and image captioning, prediction tasks, and other modelling, prediction, segmentation, classification, regression and other tasks or any combination thereof. Such applications may advantageously be deployed to resource constrained environments, such as embedded and Internet-of-Things (IoT) systems (e.g., smart home devices and security systems, wearable health monitors, ultra-high speed wireless internet, etc.), portable computing devices (e.g., smartphones, tablets, laptop computers, wearable devices, etc.), or other power, energy, memory, and/or processing constrained environments or any combination thereof.


Providing DNN models that have storage and processing requirements within the bounds of such resource-constrained environments is a difficult challenge due to the many parameters and nodes of DNNs. Compression of DNNs provides a potential avenue to producing DNNs with resource requirements within resource-constrained environments. Current implementations of compression may include:

    • Sparsification of trained models, which is the most popular but may have uneven effects on hardware;
    • Quantization of trained models, which is inherently hardware friendly but has limited compression ratios; and
    • Tensor decomposition of trained models, which has very high compression ratios, but often has poor performance, particularly with convolutional neural networks (CNNs).


In some embodiments, the compression, inferencing and training techniques of the present disclosure solve the problems of existing methods by applying an optimization technique to the tensor decomposition compression approach.


Tensor decomposition, uniquely, may provide an ultra-high compression ratio, especially for recurrent neural network (RNN) models. Advanced tensor decomposition approaches, such as tensor train (TT) and tensor ring (TR) decomposition, and including dual problem singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, block term decomposition, among others or any combination thereof, may bring more than 1,000 times parameter reduction to the input-to-hidden layers of RNN models. Tensor-decomposed models are also well suited for hardware-based acceleration. However, current TT-based approaches have significant accuracy losses. The technique of the present disclosure has much less loss for all neural network models and especially for CNN models.


In some embodiments, a systematic framework is described for tensor decomposition-based model compression by applying an optimization technique such as ADMM. By formulating TT decomposition-based model compression as an optimization problem with constraints on tensor ranks, the framework leverages the ADMM technique to systematically solve this optimization problem in an iterative way. During optimization, the DNN model is trained in the original structure instead of the tensor decomposed format, while gradually acquiring the desired low tensor rank characteristics. In some embodiments, the uncompressed model may then be decomposed into the tensor decomposed format and fine-tuned to obtain a final high-accuracy tensor decomposed format DNN model.


This framework is general in applicability, and therefore works for both CNNs and RNNs, among other neural networks, and may be modified to fit other tensor decomposition approaches. The present disclosure provides the framework for different DNN models for image classification and video recognition tasks as examples, though any suitable neural network task may be employed. Experimental results show that dual problem optimization-based tensor decomposed format models demonstrate very high compression performance with high accuracy.



FIG. 1 illustrates a system for implementing an efficient dual problem optimization-based tensor decomposed format model in a resource constrained environment according to one or more embodiments of the present disclosure.


In some embodiments, a model deployment system 110 may be configured to train, compress and deploy machine learning models, such as DNNs, to a resource constrained environment 100. In some embodiments, the model deployment system 110 may communicate with the resource constrained environment 100 via a network 102, or by any other suitable direct or indirect communication (e.g., a suitable wired and/or wireless hardware interface and/or via portable storage devices such as flash drives or USB storage drives, etc.). In some embodiments, the network 102 may include any suitable wired or wireless network with any suitable hardware and/or software configurations, such as, e.g., WiFi, Local Area Network (LAN), telecommunications network, the Internet, Bluetooth, among other networking hardware and/or protocols or any combination thereof.


In some embodiments, the resource constrained environment 100 may include any suitable device and/or system of devices having constraints on resources, such as, e.g., energy constraints, storage constraints, memory constraints, processor constraints, or any other suitable constraints that limit the performance of the resource constrained environment 100 in inferencing tasks using a DNN model. For example, the resource constrained environment 100 may include, e.g., a user computing device (a laptop computer or desktop computer), a mobile computing device (smartphone, tablet, wearable device, augmented reality device, virtual reality device, etc.), an Internet-of-Things (IoT) device (e.g., security camera, smart assistant device, smart TV, smart lights, smart thermostat, smart appliance, etc.), networking equipment, or any other suitable resource constrained device or system of devices or any combination thereof


In some embodiments, the model deployment system 110 may include hardware components such as a processor 111, which may include local or remote processing components. In some embodiments, the processor 111 may include any type of data processing capacity, such as a hardware logic circuit, for example an application specific integrated circuit (ASIC) or programmable logic, or such as a computing device, for example, a microcomputer or microcontroller that includes a programmable microprocessor. In some embodiments, the processor 111 may include data-processing capacity provided by the microprocessor. In some embodiments, the microprocessor may include memory, processing, interface resources, controllers, and counters. In some embodiments, the microprocessor may also include one or more programs stored in memory.


Similarly, the model deployment system 110 may include storage 112, such as one or more local and/or remote data storage solutions such as, e.g., local hard-drive, solid-state drive, flash drive, database or other local data storage solutions or any combination thereof, and/or remote data storage solutions such as a server, mainframe, database or cloud services, distributed database or other suitable data storage solutions or any combination thereof. In some embodiments, the storage 112 may include, e.g., a suitable non-transient computer readable medium such as, e.g., random access memory (RAM), read only memory (ROM), one or more buffers and/or caches, among other memory devices or any combination thereof.


In some embodiments, the model deployment system 110 may implement a dual problem training engine 120 configured for training a DNN, a tensor decomposition engine 130 configured for compressing a trained DNN, and a trained compressed DNN fine-tuning engine 140 configured for fine-tuning the compressed trained DNN before deploying a dual optimization problem-based tensor decomposed format model 106. In some embodiments, the terms “computer engine” and “engine” identify at least one software component and/or a combination of at least one software component and at least one hardware component which are designed/programmed/configured to manage/control other software and/or hardware components (such as the libraries, software development kits (SDKs), objects, etc.).


Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGA), logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. In some embodiments, the one or more processors may be implemented as Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors; x86 instruction set compatible processors; multi-core processors; or any other microprocessor or central processing unit (CPU). In various implementations, the one or more processors may be dual-core processor(s), dual-core mobile processor(s), and so forth.


Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.


In some embodiments, the model deployment system 110 may receive a request 104 for a trained DNN for deployment to the resource constrained environment 100. In some embodiments, the request 104 may include parameters for the trained DNN such as, e.g., a model identifier identifying a DNN architecture or DNN type or DNN task or any combination thereof, a resource capacity metric identifying a maximum amount of resources for a local DNN model, a training data set and/or training data set identifier associated with input data records and ground truth data records for training the DNN model to perform the DNN task, among other parameters or any suitable combination thereof. In some embodiments, the resource capacity metric may include, e.g., a memory capacity metric defining a maximum memory size available, a processing capacity metric defining a maximum processing performance available, an energy capacity metric defining a maximum energy use available, among others or any combination thereof


In some embodiments, the DNN model identified by the model identifier may include a suitable deep neural network having at least one weight matrix that is arranged in at least one layer. Each layer of the model may have weights that may be trained to correlate an input to an output.


In some embodiments and, optionally, in combination of any embodiment described above or below, an exemplary implementation of a deep neural network may be executed as follows:

    • a. define Neural Network architecture/model,
    • b. transfer the input data to the exemplary neural network model,
    • c. train the exemplary model incrementally,
    • d. determine the accuracy for a specific number of timesteps,
    • e. apply the exemplary trained model to process the newly-received input data,
    • f. optionally and in parallel, continue to train the exemplary trained model with a predetermined periodicity.


In some embodiments and, optionally, in combination of any embodiment described above or below, the exemplary trained neural network model may specify a neural network by at least a neural network topology, a series of activation functions, and connection weights. For example, the topology of a neural network may include a configuration of nodes of the neural network and connections between such nodes. In some embodiments and, optionally, in combination of any embodiment described above or below, the exemplary trained neural network model may also be specified to include other parameters, including but not limited to, bias values/functions and/or aggregation functions. For example, an activation function of a node may be a step function, sine function, continuous or piecewise linear function, sigmoid function, hyperbolic tangent function, or other type of mathematical function that represents a threshold at which the node is activated. In some embodiments and, optionally, in combination of any embodiment described above or below, the exemplary aggregation function may be a mathematical function that combines (e.g., sum, product, etc.) input signals to the node. In some embodiments and, optionally, in combination of any embodiment described above or below, an output of the exemplary aggregation function may be used as input to the exemplary activation function. In some embodiments and, optionally, in combination of any embodiment described above or below, the bias may be a constant value or function that may be used by the aggregation function and/or the activation function to make the node more or less likely to be activated.
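
As a minimal, illustrative sketch only (not drawn from the disclosure), the node computation described above, an aggregation function combined with a bias and passed through an activation function, may be written as follows; the function name node_output and the sigmoid choice are hypothetical.

    import math

    def node_output(inputs, weights, bias):
        # Aggregation function: weighted sum of the incoming signals plus a bias.
        aggregated = sum(x * w for x, w in zip(inputs, weights)) + bias
        # Activation function: a sigmoid threshold applied to the aggregate.
        return 1.0 / (1.0 + math.exp(-aggregated))

    # Example node with three inputs; the weights and bias values are arbitrary.
    print(node_output([0.5, -1.0, 2.0], [0.3, 0.8, -0.1], bias=0.05))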


In some embodiments, the model deployment system 110 may instantiate the dual problem training engine 120 to train a DNN according to the request 104. In some embodiments, based on the model identifier, the dual problem training engine 120 may access a DNN model library 115 in the storage 112. In some embodiments, the DNN model library 115 may include uninitialized DNN models having one or more architectures. For example, the DNNs in the DNN model library 115 may include, e.g., one or more architectural designs of a support vector machine, transformer, a multi-layer perceptron, autoencoder, a convolutional neural network (CNN), or a recurrent neural network (RNN), among others or any combination thereof. Thus, based on the DNN architecture, type, and/or model identified by the model identifier of the request 104, the dual problem training engine 120 may query and retrieve the associated DNN model in the DNN model library 115. Alternatively, the request 104 may include the DNN model itself rather than or in addition to the model identifier. Thus, the dual problem training engine 120 may retrieve the DNN model directly from the request 104.


In some embodiments, the Dual problem training engine 120 may also load a training data set based on the training data parameter of the request 104. In some embodiments, the storage 112 may include a library of training data 114, e.g., organized by task, training data set identifier, or any other suitable catalog of training data set. Thus, based on the request 104, the Dual problem training engine 120 may query the storage 112 and retrieve the training data set associated with the request 104 for training the DNN model. Alternatively, the request 104 may include the training data set itself. Thus, the Dual problem training engine 120 may retrieve the training data set directly from the request 104.


In some embodiments, the training data set may include a set of input data records and a set of ground truth, or target or output, data records, where each pair of input to ground truth records defines a known input and output. Accordingly, the Dual problem training engine 120 may initialize the DNN model and use the training data set to train the DNN model.


In some embodiments, to overcome the current limitations of tensor decomposition and fully unlock its potential for model compression, the dual problem training engine 120 may employ a dual problem optimization method, such as, e.g., the dual ascent method (DAM), dual decomposition, the Alternating Direction Method of Multipliers (ADMM), among other suitable dual problem optimization methods or any combination thereof. By formulating tensor decomposition-based model compression as an optimization problem with constraints on tensor ranks, the dual problem optimization method may be leveraged to systematically solve the optimization problem in an iterative way. During this procedure, the entire DNN model is trained in the original structure instead of the tensor decomposed structure, but gradually enjoys the desired low tensor rank characteristics. The trained uncompressed model may then be decomposed into the tensor decomposed format, and fine-tuned to finally obtain a high-accuracy trained tensor decomposed format DNN model.


In some embodiments, the systematic framework may include formulating and solving the tensor decomposition-based model compression problem. By formulating this problem as a constrained non-convex optimization problem, embodiments of the present framework gradually restrict the DNN model to the target tensor ranks without explicitly training in the tensor decomposed format, thereby maintaining the model capacity as well as minimizing approximation error and avoiding increased network depth.
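
One way to write such a constrained formulation (a sketch only; the layer index l, weights W_l, auxiliary variables Z_l, and target TT ranks r_l* are notation introduced here for illustration, not taken from the disclosure) is:

    \min_{\{W_l\}} \; \mathcal{L}\big(\{W_l\}\big)
    \quad \text{s.t.} \quad \operatorname{rank}_{TT}\!\big(\mathrm{reshape}(W_l)\big) \le r_l^{*},
    \qquad l = 1,\dots,L

    \min_{\{W_l\},\{Z_l\}} \; \mathcal{L}\big(\{W_l\}\big) + \sum_{l=1}^{L} g(Z_l)
    \quad \text{s.t.} \quad W_l = Z_l, \qquad l = 1,\dots,L

Here g(Z_l) is an indicator function that is zero when the TT-rank constraint on Z_l holds and +∞ otherwise; the second, equivalent form separates the loss term from the rank constraint, which is the shape a dual method such as ADMM alternates between.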


In some embodiments, the systematic framework employs a dual optimization problem such as ADMM to efficiently solve the reformulated optimization problem via separately solving two sub-problems. In some embodiments, a first sub-problem may be to directly optimize the loss function with a regularization of the DNN, e.g., by stochastic gradient descent or other suitable optimization method. In some embodiments, the second sub-problem may use the introduced projection to constrain the tensor ranks analytically.


Thus, in some embodiments, the dual problem training engine 120 may produce a trained DNN model by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the weights of the weight matrix upon decomposition to the tensor of the rank value. To do so, the dual problem training engine 120 may, at each training iteration, minimize the rank value, decompose the weight matrix to a tensor having the rank value, minimize a loss function based on the training data and the weights, backpropagate a loss to the weights, and stop iteratively updating the at least one layer upon the resource capacity metric being satisfied and the loss function being minimized within the resource capacity metric.
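
A minimal PyTorch-style sketch of one such iterative procedure follows. It is an illustrative assumption rather than the claimed implementation: the helpers rank_project and low_rank_param_count, the penalty rho, and the schedule that lowers the rank until the memory budget is met are all hypothetical, and the projection is shown as a plain matrix SVD truncation standing in for a TT-rank projection.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def rank_project(weight, rank):
        # Truncated SVD projection onto matrices of rank <= `rank`.
        # (Shown for a 2D weight; a TT-SVD on the reshaped tensor would play
        # the same role for higher-order tensor decompositions.)
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

    def low_rank_param_count(shape, rank):
        # Parameters needed to store a rank-`rank` factorization of `shape`.
        m, n = shape
        return rank * (m + n)

    def admm_train(layer, loader, loss_fn, memory_budget, rho=1e-3, lr=1e-2):
        W = layer.weight
        rank = min(W.shape)                      # start from the full rank
        Z = W.detach().clone()                   # auxiliary low-rank variable
        U = torch.zeros_like(W)                  # scaled dual variable
        opt = torch.optim.SGD(layer.parameters(), lr=lr)
        while True:
            for x, y in loader:                  # loss sub-problem, solved by SGD
                opt.zero_grad()
                loss = loss_fn(layer(x), y) + (rho / 2) * torch.norm(W - Z + U) ** 2
                loss.backward()                  # backpropagate to the weights
                opt.step()
            Z = rank_project(W.detach() + U, rank)   # projection sub-problem
            U = U + W.detach() - Z                   # dual update
            if low_rank_param_count(W.shape, rank) <= memory_budget or rank == 1:
                break                            # memory capacity metric satisfied
            rank -= 1                            # keep shrinking the rank target
        return layer, rank

    # Toy usage on a random regression problem (illustrative only).
    layer = nn.Linear(64, 32)
    loader = DataLoader(TensorDataset(torch.randn(128, 64), torch.randn(128, 32)),
                        batch_size=32)
    _, final_rank = admm_train(layer, loader, nn.functional.mse_loss, memory_budget=1500)

For brevity the sketch stops as soon as the parameter budget is met; a closer reading of the procedure above would continue iterating at the final rank until the penalized loss also converges.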


In some embodiments, the resource capacity metric may define a maximum memory available for the DNN model. Thus, the dual problem training engine 120 may perform training iterations until the tensor rank is minimized to below the resource capacity metric such that the DNN model being decomposed to have one or more tensors of the tensor rank satisfies the maximum memory capacity. Accordingly, the dual problem training engine 120 may determine the minimum tensor rank associated with the available memory capacity and terminate training iterations where the minimization of the loss converges after the tensor rank is minimized to the minimum tensor rank associated with the available memory capacity.
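
For example, the minimum admissible tensor rank for a given memory budget could be found by counting the parameters a TT representation would need at each candidate rank; the uniform interior rank and the 4-byte-per-weight assumption below are illustrative choices, not taken from the disclosure.

    from math import prod

    def tt_params(dims, rank):
        # Sum over cores of r_{k-1} * n_k * r_k, with boundary ranks r_0 = r_d = 1
        # and all interior ranks set to the same candidate value `rank`.
        ranks = [1] + [rank] * (len(dims) - 1) + [1]
        return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))

    def rank_for_budget(dims, budget_bytes, bytes_per_weight=4):
        """Largest uniform rank whose TT parameter count fits the memory budget,
        i.e. the rank to which training must reduce the tensor rank."""
        budget = budget_bytes // bytes_per_weight
        best = None
        for r in range(1, prod(dims)):
            if tt_params(dims, r) > budget:
                break
            best = r
        return best   # None means even rank 1 exceeds the budget

    # Example: a layer whose weights are reshaped to an 8 x 8 x 8 x 8 tensor.
    print(rank_for_budget((8, 8, 8, 8), budget_bytes=16 * 1024))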


In some embodiments, upon training the DNN model, the trained DNN model may be provided to a tensor decomposition engine 130 to compress the trained DNN model via tensor decomposition to obtain a trained tensor decomposition format DNN. Because the optimization procedure has already imposed the desired low tensor decomposed rank structure on the uncompressed model, such direct decomposition may avoid significant approximation error.


Thus, in some embodiments, the tensor decomposition engine 130 may utilize the tensor rank determined during training to decompose the weight matrix of the DNN model to a tensor having the tensor rank. In some embodiments, the tensor decomposition technique employed by the tensor decomposition engine 130 may include, e.g., tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.
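
One common way to carry out such a decomposition is a TT-SVD style procedure, sketched below with NumPy; the reshape dimensions, the uniform rank cap, and the name tt_svd are assumptions for illustration, standing in for whichever tensor decomposition the tensor decomposition engine 130 actually applies.

    import numpy as np

    def tt_svd(weight, dims, max_rank):
        """Decompose `weight` (reshaped to `dims`) into TT cores of shape
        (r_{k-1}, n_k, r_k) by sequential truncated SVDs."""
        tensor = weight.reshape(dims)
        cores, r_prev = [], 1
        unfolding = tensor.reshape(r_prev * dims[0], -1)
        for k, n_k in enumerate(dims[:-1]):
            U, S, Vt = np.linalg.svd(unfolding, full_matrices=False)
            r_k = min(max_rank, len(S))
            cores.append(U[:, :r_k].reshape(r_prev, n_k, r_k))
            unfolding = np.diag(S[:r_k]) @ Vt[:r_k, :]
            r_prev = r_k
            if k + 1 < len(dims) - 1:
                unfolding = unfolding.reshape(r_prev * dims[k + 1], -1)
        cores.append(unfolding.reshape(r_prev, dims[-1], 1))
        return cores

    # Example: a 5 x 12 weight matrix reshaped to a 3 x 4 x 5 tensor (d = 3).
    W = np.random.randn(5, 12)
    cores = tt_svd(W, dims=(3, 4, 5), max_rank=2)
    print([c.shape for c in cores])

Reconstructing the cores and comparing against the original weight would quantify the approximation error; because the dual problem training has already driven the weights toward the target ranks, that error is expected to be small.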


In some embodiments, upon compressing the trained DNN model to a trained tensor decomposition format DNN model, the fine-tuning engine 140 may refine the training of the trained tensor decomposition format DNN model to recover any losses due to the compression process. Thus, the fine-tuning engine 140 may employ the training data set to train the trained tensor decomposition format DNN model, e.g., using a suitable optimization technique such as, e.g., stochastic gradient descent, backtracking line search, coordinate descent, stochastic hill climbing, stochastic variance reduction, among others or any combination thereof. In some embodiments, the fine-tuning phase may be very fast relative to the initial training, e.g., requiring only a few iterations. In some embodiments, the speed of fine-tuning may be because the trained tensor decomposition format DNN model at the starting point of the fine-tuning phase may benefit from decreased accuracy loss relative to the original uncompressed DNN model.
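
A brief sketch of one way such fine-tuning could look, assuming the TT cores are simply registered as trainable parameters and contracted back into a weight matrix on each forward pass; the class name TTLinear, the toy data, and the mean-squared-error loss are placeholders.

    import torch
    import torch.nn as nn

    class TTLinear(nn.Module):
        """Linear layer whose weight is stored as trainable TT cores."""
        def __init__(self, cores, out_features, in_features):
            super().__init__()
            self.cores = nn.ParameterList(nn.Parameter(c.clone()) for c in cores)
            self.out_features, self.in_features = out_features, in_features
            self.bias = nn.Parameter(torch.zeros(out_features))

        def full_weight(self):
            # Contract the cores back into the full tensor, then reshape to a matrix.
            full = self.cores[0]
            for core in self.cores[1:]:
                full = torch.tensordot(full, core, dims=([full.dim() - 1], [0]))
            return full.reshape(self.out_features, self.in_features)

        def forward(self, x):
            return x @ self.full_weight().t() + self.bias

    # Cores for a 5 x 12 weight reshaped to 3 x 4 x 5, e.g. as produced by a TT-SVD.
    cores = [torch.randn(1, 3, 2), torch.randn(2, 4, 2), torch.randn(2, 5, 1)]
    layer = TTLinear(cores, out_features=5, in_features=12)
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
    x, target = torch.randn(8, 12), torch.randn(8, 5)
    for _ in range(100):                      # fine-tuning loop on the training data
        opt.zero_grad()
        loss = nn.functional.mse_loss(layer(x), target)
        loss.backward()
        opt.step()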


In some embodiments, upon fine-tuning, the model deployment system 110 may deploy the fine-tuned trained tensor decomposition format DNN model to the resource constrained environment 100 such that the fine-tuned trained tensor decomposition format DNN model may be implemented in, e.g., the memory constrained environment of the device and/or system of devices associated with the resource constrained environment. In some embodiments, to address the fundamental challenges, the trained tensor decomposition format DNN model and inferencing engine of the present disclosure enable a hardware-friendly inference scheme. In some embodiments, a theoretical limit for the minimum number of multiplications needed for tensor decomposed-format inference may be calculated, and a computation-efficient inference scheme may be developed. The inference scheme may be configured for the tensor decomposition of the fine-tuned trained tensor decomposition format DNN model, which may have two benefits: 1) it is very compact because the required number of multiplications of this scheme is identical to the theoretical limit, thus eliminating all the unnecessary redundant computations; and 2) based on its multi-stage processing style, the computing engine only needs to access one tensor core in each stage, thereby leading to significant savings in memory access.


In some embodiments, an inferencing engine based on the inferencing scheme may be developed to form a tensor decomposed format DNN inference engine (“TIE”), which may include a specialized hardware architecture based on tensor decomposed-DNN. TIE is designed to fully reap the benefits of embodiments of the present hardware-friendly inference scheme and achieves high computation efficiency as well as simple memory access. Also, TIE is flexible and may be adapted to various network types, values of ranks, numbers of tensor dimensions, and combinations of factorization factors, thereby making itself well suited for various application scenarios and tasks.


Example—Energy-Efficient Tensor Train-Based DNN Inferencing

In order to facilitate and promote the widespread deployment of DNNs in a broader scope of application scenarios, both the ML and hardware communities have conducted extensive investigations on compressing DNNs with affordable accuracy loss. Specifically, due to the well-recognized and verified redundancy of DNN models, different compression approaches, such as pruning, clustering, low rank decomposition and low bit-width representation, etc., have been adopted to remove the redundancy in structure, layer, weight or number precision of DNN models. Correspondingly, several compression-oriented DNN accelerators have also been customized for those compression approaches to achieve high hardware performance.


Among various DNN compression techniques, tensor decomposition is unique due to its extremely high compression ratios. Tensor decomposition may include one or more tensor-based compression techniques such as, e.g., tensor train (TT) decomposition, tensor ring (TR) decomposition, dual problem singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, block term decomposition, among others, or any combination thereof.


For instance, experiments show that applying tensor decomposition to the fully-connected (FC) layers of VGG-16 on the ImageNet dataset may bring a record-breaking 50,000 times compression ratio, while other compression approaches typically achieve much less compression on FC layers. Moreover, due to the generality of tensor decomposition, this approach may also be applied to compressing convolutional (CONV) layers via decomposing the weight tensors of CONV layers. In some embodiments, tensor decomposition may be effective in several representative types of DNNs, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).


From the perspective of tensor theory, the impressive compression capability of tensor decomposition such as TT decomposition may come from its unique tensor factorization scheme. As illustrated in FIG. 2A, TT decomposition may decompose a d-dimensional n1×n2×. . . ×nd tensor 𝒜 into d 3-dimensional rk−1×nk×rk tensor cores, where rk is the preset rank value. Thanks to this special representation scheme, only Σk=1dnkrk−1rk parameters need to be stored in the tensor decomposed format, while conventionally Πk=1dnk parameters were required for an explicit representation. Since in practice rk is typically small, the compression ratio, defined as Πk=1dnk/Σk=1dnkrk−1rk, may be very significant and hence brings orders of magnitude reduction in storage cost.
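
As a hedged numerical illustration of this compression ratio (the factorization 1024 = 4·4·8·8 and the uniform interior rank of 4 are arbitrary example choices):

    from math import prod

    def tt_compression_ratio(dims, ranks):
        # `ranks` includes the boundary ranks r_0 = r_d = 1.
        dense = prod(dims)                               # Π n_k dense parameters
        tt = sum(ranks[k] * dims[k] * ranks[k + 1]       # Σ r_{k-1} n_k r_k TT parameters
                 for k in range(len(dims)))
        return dense / tt

    dims, ranks = (4, 4, 8, 8), (1, 4, 4, 4, 1)
    print(tt_compression_ratio(dims, ranks))   # 1024 dense weights vs. 240 TT weights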


Due to the advantages of TT decomposition on model compression, exploiting an efficient DNN hardware architecture based on TT decomposition (referred to as TT-DNN) may provide a solution to the drawbacks of typical DNN compression techniques. Considering the high compression ratios that TT decomposition may bring, such a specialized architecture may execute state-of-the-art CNN and RNN models using smaller memory resources than typical DNNs, including DNNs compressed by typical means, thereby leading to more area and energy efficient solutions for resource-constrained DNN accelerator design.


In some embodiments, realizing a high-performance TT-DNN accelerator may require overcoming the challenge of an inefficient inference scheme based on a tensor decomposed format DNN model. In some embodiments, a tensor decomposed inference scheme may have a large amount of redundant computations, leading to higher computational cost in the inference phase relative to a standard DNN model. Moreover, those inherent redundant computations also incur intensive memory accesses because the tensor cores need to be frequently accessed when calculating each element of the output tensor, thereby causing high energy consumption. As a result, despite the high compression ratios, the inherent inefficiency of a tensor decomposed inference scheme may directly impede the potential deployment of TT-DNN accelerators in energy-constrained applications.


In some embodiments, to address the fundamental challenges, the tensor decomposed DNN and inferencing engine of the present disclosure enable the hardware-friendly inference scheme. In some embodiments, a theoretical limit for the minimum number of multiplications needed for tensor decomposed-format inference may be calculated, and a computation-efficient inference scheme may be developed. The tensor decomposed-format inference scheme has two benefits: 1) it is very compact because the required number of multiplications of this scheme is identical to the theoretical limit, thus eliminating all the unnecessary redundant computations; and 2) based on its multi-stage processing style, the computing engine only needs to access one tensor core in each stage, thereby leading to significant saving in memory access.


In some embodiments, an inferencing engine based on the inferencing scheme may be developed to form a tensor decomposed format DNN inference engine (“TIE”), which may include a specialized hardware architecture based on tensor decomposed-DNN. TIE is designed to fully reap the benefits of embodiments of the present hardware-friendly inference scheme and achieves high computation efficiency as well as simple memory access. Also, TIE is flexible and may be adapted to various network types, values of ranks, numbers of tensor dimensions, and combinations of factorization factors, thereby making itself well suited for various application scenarios and tasks.


In some embodiments, an example of the TIE design may include a prototype TIE design using CMOS 28 nm technology for tensor train format DNNs (TT-DNNs). With 16 processing elements (PEs) operating at 1000 MHz, the TIE accelerator occupies 1.74 mm2 and consumes 154.8 mW. Compared to typical compressed DNN-oriented accelerators using other compression methods, such as sparsification and structured matrices (CIRCNN), the TT decomposition-based TIE exhibits significant advantages in hardware performance. Compared with EIE, TIE achieves 7.22×˜10.66× better area efficiency and 3.03×˜4.48× better energy efficiency on different workloads, respectively. Compared with CIRCNN, TIE achieves 5.96× and 4.56× higher throughput and energy efficiency, respectively.


1 TT-Based DNN Compression
1.1 TT Decomposition & Tensor Decomposed


FIG. 2A illustrates representing a matrix via the TT decomposition of its reshaped tensor in accordance with some embodiments of the present disclosure.



FIG. 2B illustrates a tensor decomposed inference scheme in accordance with some embodiments of the present disclosure.


In some embodiments, TT decomposition is an efficient compression approach to reduce DNN model sizes. In general, TT decomposition may decompose a large-size multidimensional tensor into a set of small-size 3-dimensional tensors.


Specifically, for a d-dimensional n1×n2×. . . ×nd tensor 𝒜, after TT decomposition 𝒜 is stored in the tensor decomposed format using d tensor cores 𝒢k∈ℝrk−1×nk×rk, where k=1, 2, . . . , d, and each element in 𝒜 may be reconstructed as follows:






𝒜(j1, . . . , jd)=𝒢1[j1]×𝒢2[j2]×. . . ×𝒢d[jd],   Equation (1)


where 𝒢k[jk]∈ℝrk−1×rk is the jk-th slice of the k-th tensor core 𝒢k with jk=1, 2, . . . , nk, and rk is the rank of a tensor core. Accordingly, the number of parameters to represent 𝒜 is reduced from Πk=1dnk to Σk=1dnkrk−1rk. In some embodiments, the values of rk may vary since the TT-decomposition of an arbitrary tensor is not unique; r0 and rd are always set as 1 to satisfy the boundary condition.
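
A short NumPy sketch of Equation (1), reconstructing one element of 𝒜 from its cores (the core shapes follow the rk−1×nk×rk convention above; the random cores are placeholders for illustration only):

    import numpy as np

    def tt_element(cores, indices):
        """Evaluate A(j_1, ..., j_d) = G_1[j_1] x G_2[j_2] x ... x G_d[j_d]."""
        result = np.eye(1)                    # boundary rank r_0 = 1
        for core, j in zip(cores, indices):
            result = result @ core[:, j, :]   # j-th slice is an r_{k-1} x r_k matrix
        return result.item()                  # r_d = 1, so the product is 1 x 1

    # Cores for a 3 x 4 x 5 tensor with ranks (1, 2, 2, 1).
    cores = [np.random.randn(1, 3, 2), np.random.randn(2, 4, 2), np.random.randn(2, 5, 1)]
    print(tt_element(cores, (0, 2, 4)))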


In some embodiments, the value of rk may be set as a small value, so that the parameter saving resulting from TT decomposition may be very significant. Consequently, leveraging TT decomposition to perform efficient compression on DNN models is very attractive since the fully-connected (FC) and convolutional (CONV) layers of DNNs are in the format of matrices and tensors, which may be decomposed and represented in the tensor decomposed format. In some embodiments, in order to maintain high test accuracy, the TT decomposition is typically not directly applied to the 2D weight matrix or 4D weight tensor but to their reshaped format. For instance, as illustrated in FIG. 2A, in order to store a 5×12 weight matrix in the tensor decomposed format, the weight matrix is first reshaped to a 3-dimensional tensor with d=3, and then it is decomposed and stored in the three tensor cores (𝒢1˜𝒢3).


1.2 Tensor Decomposed Inference & Training on DNNs

In some embodiments, when a DNN model is stored in the tensor decomposed format, the corresponding inference and training schemes may be re-formulated since the underlying representation for the weight matrices and tensors of FC and CONV layers has been changed.


In some embodiments, inference may be performed on tensor decomposed FC layers as follows. In some embodiments, with weight matrix W∈ℝM×N, input vector x∈ℝN and output vector y∈ℝM, the inference procedure on an FC layer is y=Wx, where the bias is combined with W for simplicity. In the scenario of representing the weight matrix in the tensor decomposed format, such an inference scheme may be re-formulated as follows:











𝒴(i1, . . . , id) = Σj1, . . . , jd 𝒢1[i1, j1] 𝒢2[i2, j2] . . . 𝒢d[id, jd] 𝒳(j1, . . . , jd),   Equation (2)








where 𝒴∈ℝm1×m2×. . . ×md and 𝒳∈ℝn1×n2×. . . ×nd are the reshaped y and x in the tensor format with M=Πk=1dmk and N=Πk=1dnk, respectively. In some embodiments, the weight matrix W may be first reshaped into a d-dimensional tensor 𝒲, and then 𝒲 is decomposed and represented in the tensor decomposed format with d tensor cores 𝒢k, where k=1, 2, . . . , d. Notice that here 𝒢k∈ℝrk−1×mk×nk×rk is different from the representation in Equation (1) in Section 1.1. This is because this 4D representation of tensor cores is better for describing the matrix-vector multiplication in the tensor decomposed format, where 3D-based tensor cores may still be used in the described inference scheme. The drawback is the complicated indexing. The 4D representation may be viewed as folding the original 3D tensor. Accordingly, as illustrated in FIG. 2B, 𝒢k may be viewed as a 2D mk-by-nk array, where each element (𝒢k[ik, jk] in Equation (2)) of this array is an rk−1-by-rk matrix.
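
A direct, deliberately naive NumPy rendering of Equation (2) with 4-D cores of shape rk−1×mk×nk×rk is sketched below as an assumption of how the summation could be evaluated; its nested loops also exhibit the redundant computation discussed in the next section.

    import numpy as np
    from itertools import product

    def tt_fc_inference(cores, x_tensor, m_dims, n_dims):
        """y(i1..id) = sum over (j1..jd) of G1[i1,j1] ... Gd[id,jd] * x(j1..jd)."""
        y = np.zeros(m_dims)
        for i in product(*(range(m) for m in m_dims)):
            for j in product(*(range(n) for n in n_dims)):
                slice_prod = np.eye(1)
                for k, core in enumerate(cores):
                    slice_prod = slice_prod @ core[:, i[k], j[k], :]
                y[i] += slice_prod.item() * x_tensor[j]
        return y

    # Toy sizes: M = 2*3 outputs, N = 2*2 inputs, TT ranks (1, 2, 1).
    m_dims, n_dims = (2, 3), (2, 2)
    cores = [np.random.randn(1, 2, 2, 2), np.random.randn(2, 3, 2, 1)]
    x = np.random.randn(*n_dims)
    print(tt_fc_inference(cores, x, m_dims, n_dims))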



FIG. 3 illustrates converting a computation on a CONV layer to matrix multiplication in accordance with some embodiments of the present disclosure. Here H′=H−f+1 and W′=W−f+1.


In some embodiments, due to its generality for arbitrary tensors, tensor decomposition may also enable efficient inference on a CONV layer that is affiliated with a weight tensor. In some embodiments, there are two methods to represent the conventional 4D weight tensor of a CONV layer in the tensor decomposed format. The first one is to directly apply tensor decomposition to the 4D tensor and obtain the corresponding tensor cores. However, such a method is not very efficient for CONV layers with small kernel sizes (e.g., 1×1 convolution). In some embodiments, another method is to reshape the 4D weight tensor to a 2D matrix, and then use the same procedure as inference on an FC layer to perform inference on the CONV layer. As illustrated in FIG. 3, such a transform is mathematically rigorous since the 2D convolution between the 3D input tensor and 4D weight tensor is equivalent to matrix-by-matrix multiplication. Consequently, both the inference on FC layers and CONV layers may be executed on the same tensor decomposed inference engine.
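
The reshaping described above is the standard im2col transform; a minimal NumPy sketch (valid convolution with stride 1 and no padding, so H′ = H − f + 1 and W′ = W − f + 1) is given below as an assumption of how such a conversion could be done.

    import numpy as np

    def conv_as_matmul(inputs, kernels):
        """inputs: (C, H, W); kernels: (K, C, f, f) -> outputs: (K, H', W')."""
        C, H, W = inputs.shape
        K, _, f, _ = kernels.shape
        H_out, W_out = H - f + 1, W - f + 1
        # im2col: each column holds one C*f*f patch of the input.
        cols = np.empty((C * f * f, H_out * W_out))
        col = 0
        for i in range(H_out):
            for j in range(W_out):
                cols[:, col] = inputs[:, i:i + f, j:j + f].reshape(-1)
                col += 1
        # The CONV layer becomes a (K, C*f*f) x (C*f*f, H'*W') matrix multiplication.
        out = kernels.reshape(K, -1) @ cols
        return out.reshape(K, H_out, W_out)

    out = conv_as_matmul(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
    print(out.shape)   # (4, 6, 6)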


In some embodiments, regarding training tensor decomposed format DNN models, in general, after the sizes of tensor cores 𝒢k have been determined, a DNN model in the tensor decomposed format may be either trained from scratch or obtained from a pretrained non-tensor decomposed format model. In some embodiments, the train-from-scratch strategy assigns initial values for each tensor core and then performs a backward propagation scheme to update them. On the other hand, if converting a non-tensor decomposed trained model to the tensor decomposed format is needed, the standard TT decomposition is first applied to the weight matrix/tensor of the FC/CONV layers of the model to form the initial values of the tensor cores. Then a backward propagation-based fine-tuning process is performed to retain the original high accuracy. In some embodiments, other techniques for training a tensor decomposed format neural network are described further below.


1.3 Compression & Accuracy Performance

In some embodiments, based on the training and inference described above, tensor decomposed format DNN models may be trained and tested. Table 1-Table 3 list the test accuracy and compression ratio (CR) of different types of DNN models (convolutional neural network (CNN) and recurrent neural network (RNN)) on different datasets. Here the CR is measured as the reduction in the number of parameters of the model. Specifically, the experimental settings are shown in Tables 1 through 3 as follows:









TABLE 1
FC-dominated CNN on ImageNet.

    FC-dominated CNN (NIPS′16)    Accuracy (%)    CR for FC layers    CR for overall network
    VGG-16 (baseline)             69.1            1×                  1×
    TT-VGG-16                     67.8            30.9×               7.4×

















TABLE 2
CONV-dominated CNN on CIFAR-10.

    CONV-dominated CNN (NIPS′17)    Accuracy (%)    CR for CONV layers    CR for overall network
    CNN (baseline)                  90.7            1×                    1×
    TT-CNN                          89.3            3.3×                  3.27×

















TABLE 3
RNN on Youtube Celebrities Face Data.

    RNN (ICML′17)      Accuracy (%)    CR for FC layers    CR for overall network
    LSTM (baseline)    33.2            1×                  1×
    TT-LSTM            75.5            15283×              196×
    GRU (baseline)     34.2            1×                  1×
    TT-GRU             80.0            11683×              195×












    • a. FC-dominated CNN: Two FC layers (FC6 and FC7) are in the tensor decomposed format, where d=6, m1˜m6=4, r1˜r5=4. For FC6 and FC7, n1˜n6=[2,7,8,8,7,4] and [4,4,4,4,4,4], respectively.

    • b. CONV-dominated CNN: The 2nd˜6th CONV layers are in the tensor decomposed format, where d=4, m=[3,4,4,4] and [3,4,8,4] for the 2nd and the 3rd˜6th layers, respectively, n=[3,4,4,4] and [3,4,8,4] for the 2nd˜3rd and the 4th˜6th layers, respectively, r1˜r2=[22,20,20], [27,22,22], [23,23,23] for the 2nd, the 3rd and the 4th˜6th layers, respectively.

    • c. LSTM or GRU-based RNN: All the input-to-hidden layers are in the tensor decomposed format, where d=4, m=[4,4,4,4], n=[4,20, 20,36] and r2˜r4=4.





In some embodiments, as shown in Table 1-Table 3, TT decomposition enables significant reduction in the number of parameters of the decomposed layers and of the entire DNN model. Meanwhile, it preserves high task accuracy on different datasets, facilitating practical deployment of DNN models, e.g., in resource constrained environments. However, a drawback of typical tensor decomposition, namely the low computational efficiency in the inference phase elaborated in Section 2.1, impedes wide adoption in practical systems. Indeed, the TT-decomposed models achieve test accuracy similar to the state-of-the-art work having 80.8% accuracy.


2 Efficient Tensor Decomposed Inference: Challenge and Solution
2.1 Challenge of Tensor Decomposed Inference

In some embodiments, as described in Equation (2), the inference on the tensor decomposed layers of DNN models may be performed via multi-dimensional summation of the products of slices of different tensor cores. This implementation, though classical and straightforward, incurs a severe challenge that leads to low computational efficiency.


In general, the low computational efficiency of the tensor decomposed inference scheme comes from its inherent redundant computations. Recall that in Equation (2), calculating a specific element of the output tensor 𝒴(i1, . . . , id) requires consecutive multiplications of 𝒢k[ik, jk] over all jk's. Since each 𝒴(i1, . . . , id) always shares part of its indices ik with many other 𝒴(i1, . . . , id)'s, calculating those index-sharing elements inherently involves repeating the same consecutive multiplications among the same 𝒢k[ik, jk], thereby causing unnecessary computational redundancy.



FIG. 4 illustrates an example of redundant computations in the conventional tensor decomposed inference scheme in accordance with some embodiments of the present disclosure. Thus, FIG. 4 illustrates the existence of such redundancy in the calculation of a 3-dimensional tensor. As shown, the calculation procedures of 𝒴(0,0,0) and 𝒴(1,0,0) have two identical matrix-vector multiplication stages out of all three stages. FIG. 4 only shows the computational redundancy for these two specific output tensor elements. In general, such replicated multiplications in Equation (2) always exist for any pair of tensor elements that share part of their indices.



To further quantify the computational redundancy, an analytic evaluation of the total number of multiplications consumed in Equation (2) and of the minimum required number of multiplications for calculating all 𝒴(i1, …, id) may be performed. For simplicity, only multiplications are counted toward the computational cost.


Analysis on the total number of multiplications in Equation (2): First, the total number of multiplications consumed in Equation (2) may be examined. As indicated in FIG. 4, computing one 𝒴(i1, …, id) needs d stages of matrix-vector multiplication between a length-ri vector and an ri-by-ri−1 matrix. Therefore, the total number of multiplications to calculate all M 𝒴(i1, …, id)'s is:










$\mathrm{MUL}_{\mathrm{naive}} = MN\sum_{i=1}^{d} r_i r_{i-1}.$   Equation (3)

Analysis on the minimum number of multiplications for 𝒴(i1, …, id): Next, the minimum number of multiplications required for calculating all 𝒴(i1, …, id) ∈ 𝒴 may be analyzed. The general procedure is to first determine the computational cost for 𝒴(i1, …, id−1, :) when i1~id−1 are specific. In some embodiments, the ':' in the i-th dimension of a tensor denotes all elements in that dimension. The number of non-redundant multiplications for calculating 𝒴(i1, …, id−2, :, :) when i1~id−2 are specific is then determined based on the computation cost when i1~id−1 are specific. As a result, the computations involved with 𝒢d[id, jd] are not considered again, since such computations have already been included, thereby avoiding counting repeated computation. A similar analysis from the (d−2)-th dimension down to the 1-st dimension of 𝒴 may be performed, and finally the minimum required number of multiplications for calculating all 𝒴(i1, …, id)'s, as 𝒴(:, …, :), may be obtained. In general, such a recursive counting method ensures that all the multiplications involved in the calculation of all 𝒴(i1, …, id)'s are included while no multiplication is counted repeatedly.



FIG. 5 illustrates partially parallelizing the inputs in Stage-3 in accordance with some embodiments of the present disclosure. Redundant computations are partially reduced.


Specifically, the detail of the above analysis procedure is described as follows. First consider the computational cost for 𝒴(i1, …, id−1, :) (referred to as stage-1). Recall that Equation (2) indicates that calculating one 𝒴(i1, …, id−1, id) requires all 𝒳(j1, …, jd−1, jd)'s and d slices 𝒢k[ik, jk], where k=1, 2, …, d. Additionally, as illustrated in FIG. 4, 𝒳(j1, …, jd−1, jd) only shares one common index jk with one slice of tensor core 𝒢k[ik, jk]. Based on these two observations, in order to facilitate the analysis of 𝒴(i1, …, id−1, :), the counting procedure may be partitioned into d steps, where in the k-th step the involvement of 𝒳(j1, …, jd−k, :, …, :) is considered. Accordingly, when considering the number of multiplications involved with 𝒳(j1, …, jd−1, :) and 𝒢d[id, jd] for calculating 𝒴(i1, …, id−1, :), the computational cost is rd−1rdnd. This cost corresponds to parallelizing the d-th dimension of the input 𝒳(j1, …, jd−1, jd) (as shown in stage-1 of FIG. 5). Though such a parallelized-input scheme does not reduce any computational redundancy involved with 𝒢d[id, jd] (stage-1 in FIG. 5), it saves the computation involved with 𝒢d−1[id−1, jd−1] (stage-2 in FIG. 5) because now only one, instead of nd, length-rd−1 vector is multiplied with 𝒢d−1[id−1, jd−1]. Besides, since all the computations involved with 𝒢d[id, jd] have been considered before, the additional computational cost for 𝒴(i1, …, id−1, :) with specific jd−1 is rd−1rdnd+rd−2rd−1. Therefore, the number of multiplications for calculating 𝒴(i1, …, id−1, :) with 𝒳(j1, …, jd−2, :, :) is (rd−1rdnd+rd−2rd−1)nd−1. By recursively applying this analysis, the number of multiplications for calculating 𝒴(i1, …, id−1, :) with 𝒳(:, …, :) may be derived as:











$\mathrm{MUL}_{\mathcal{Y}(i_1,\ldots,i_{d-1},:)} = m_d\sum_{i=1}^{d}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big).$   Equation (4)

Next, the additional computational cost for calculating 𝒴(i1, …, id−2, :, :) (referred to as stage-2) may be determined. Similar to the previous analysis on recursive computation, when computing 𝒴(i1, …, id−2, :, :), the computation involved with 𝒴(i1, …, id−2, id−1, :) has already been considered and need not be re-counted. Therefore, the additional number of multiplications for calculating 𝒴(i1, …, id−2, :, :) with 𝒳(:, …, :) is:










$\mathrm{MUL}_{\text{extra for }\mathcal{Y}(i_1,\ldots,i_{d-2},:,:)} = (m_{d-1}m_d - m_d)\sum_{i=1}^{d-1}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big).$   Equation (5)

By generalizing Equation 5, the additional number of multiplications for calculating 𝒴(i1, …, il, :, …, :) within its corresponding stage may be derived as:














$\mathrm{MUL}_{\text{extra for }\mathcal{Y}(i_1,\ldots,i_l,:,\ldots,:)} = \Big(\prod_{j=l+1}^{d} m_j - \prod_{j=l+2}^{d} m_j\Big)\sum_{i=1}^{l+1}\Big(r_i r_{i-1}\prod_{t=1}^{i} n_t\Big).$   Equation (6)

Consequently, because 𝒴 has d dimensions, the total minimum number of multiplications for calculating all of 𝒴(:, …, :) over all d stages is:














$\mathrm{MUL}_{\min} = \mathrm{MUL}_{\mathcal{Y}(i_1,\ldots,i_{d-1},:)} + \mathrm{MUL}_{\text{extra for }\mathcal{Y}(i_1,\ldots,i_{d-2},:,:)} + \cdots + \mathrm{MUL}_{\text{extra for }\mathcal{Y}(:,\ldots,:)} = \sum_{l=1}^{d}\Big((m_l - 1)\prod_{j=l+1}^{d} m_j\sum_{i=1}^{l}\big(r_i r_{i-1}\prod_{t=1}^{i} n_t\big)\Big).$   Equation (7)

Equation 7 gives the analytical result for the minimum number of multiplications needed to perform tensor decomposed inference. Comparing this theoretical limit with Equation 3, the conventional tensor decomposed scheme contains substantial computational redundancy. For instance, for the FC6 layer in VGG-16 with d=6 and ri=4, the number of multiplications consumed in Equation 3 is 1073 times that in Equation 7. Such redundancy in multiplication results in the reduced computational efficiency of conventional approaches.
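As a quick sanity check, the two counts can be evaluated numerically. The following is a minimal Python sketch that assumes the closed forms of Equations (3) and (7) as reconstructed above; the shapes m, n, and r below are purely illustrative, not a layer from the benchmarks.

import math

def mul_naive(m, n, r):
    # Equation (3): M * N * sum_{i=1}^{d} r_i * r_{i-1}
    return math.prod(m) * math.prod(n) * sum(r[i] * r[i - 1] for i in range(1, len(m) + 1))

def mul_min(m, n, r):
    # Equation (7), as reconstructed above
    d = len(m)
    total = 0
    for l in range(1, d + 1):
        inner = sum(r[i] * r[i - 1] * math.prod(n[:i]) for i in range(1, l + 1))
        total += (m[l - 1] - 1) * math.prod(m[l:]) * inner
    return total

m, n, r = [4, 4, 4], [4, 4, 4], [1, 4, 4, 1]     # illustrative d = 3 setting
print(mul_naive(m, n, r), mul_min(m, n, r))      # the gap quantifies the redundancy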


2.2 Compact Tensor Decomposed Inference Scheme

In some embodiments, to address the challenge of low computational efficiency, a computation-efficient tensor decomposed inference scheme is configured to calculate all the elements of the output tensor 𝒴 in parallel without any redundant computations in a compact inference scheme, thereby improving computational efficiency over the conventional tensor decomposed inference scheme.


In some embodiments, the design of the compact inference scheme leverages the theoretical analysis of the minimum required number of computations in Section 2.1. Recall that in the previous analysis the minimum number of multiplications is counted based on the assumption that none of the computations involved with 𝒢k[ik, jk] is repeated in the future computations involved with 𝒢k−1[ik−1, jk−1]. To achieve this, in embodiments, the computation on different 𝒢k's may be performed one by one. In other words, different from Equation 2, which calculates one output tensor element using d slices 𝒢k[ik, jk] with k=1, 2, …, d, the compact inference scheme may perform computation using all mk·nk slices 𝒢k[ik, jk] for one specific k at a time, and then never reuse those slices in the computations for other k's. Consequently, such a computing arrangement breaks the original data dependency and eliminates the potential computational redundancy.



FIG. 6A and FIG. 6B illustrate an example of compact tensor decomposed inference scheme in accordance with some embodiments of the present disclosure.


In some embodiments, as described in Section 2.1 above and shown in FIG. 5, partially parallelizing the input 𝒳(j1, …, jd−1, jd)'s as 𝒳(j1, …, jd−1, :) reduces redundant computations involved with 𝒢d−1[id−1, jd−1]. To be consistent with this, the computation involved with all the input 𝒳(j1, …, jd−1, jd)'s (as 𝒳(:, …, :)) and all the slices of 𝒢d (see FIG. 6A) may be made fully parallel. As shown in FIGS. 6A and 6B, such parallel computation is associated with a compact matrix-format multiplication that replaces the original summations in Equation 2 over index jd. Different from FIG. 5, the computations in stage-1 of FIG. 6A are performed for every 𝒳(j1, …, jd−1, jd) and every slice of 𝒢d; therefore, the input tensor 𝒳 ∈ ℝ^(n1×n2×…×nd) may be transformed to a new matrix format to ensure functional validity and suitability for matrix multiplication. In some embodiments, such a transform converts a tensor 𝒳 ∈ ℝ^(n1×n2×…×nd) to a matrix X′ ∈ ℝ^(nd×∏_{t=1}^{d−1} nt), and the mapping principle for this transform is as follows:






$\mathcal{X}(j_1, \ldots, j_{d-1}, j_d) \rightarrow X'(p, q),$   Equation (8)


where p = j_d and q = Σ_{l=1}^{d−1} j_l ∏_{i=1}^{l−1} n_i. Then, a compact matrix-format multiplication may be performed as:
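As a concrete illustration, the index mapping of Equation (8) corresponds to a column-major flattening of the first d−1 modes of the input tensor; this ordering assumption is the only addition here, and the snippet below is a minimal NumPy sketch of the transform:

import numpy as np

def to_x_prime(X):
    # X has shape (n_1, ..., n_d); produce X'(p, q) with p = j_d and
    # q = sum_l j_l * prod_{i<l} n_i (column-major over j_1, ..., j_{d-1})
    n_d = X.shape[-1]
    return X.reshape(-1, n_d, order="F").T    # shape (n_d, prod_{t<d} n_t)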






$V_d = \tilde{G}_d X',$   Equation (9)

where V_d is the intermediate matrix to be sent to stage-2, and G̃_d is the matrix format of the unfolded 𝒢_d (see FIG. 6A).


In some embodiments, FIG. 6A depicts the compact matrix-format computation in stage-1, while in stage-2 and stage-3 𝒢2[i2, j2] and 𝒢1[i1, j1] would still be fetched and processed serially, thereby causing redundant computations. As analyzed in Section 2.1, the redundant computations in each computing stage may be avoided by involving the output values from the previous stage (e.g., V3) and the matrix format of the unfolded tensor core (e.g., G̃2). However, as illustrated in FIGS. 6A and 6B, G̃2 and V3, or G̃1 and V2, may not simply be multiplied because 1) their sizes do not fit direct matrix multiplication; and 2) the elements of Vh, as the intermediate values from stage-(d−h+1), and of G̃h−1, as the matrix format of the unfolded tensor core in stage-(d−h+2), may not be in the correct positions to produce correct results even if they could be multiplied. In some embodiments, therefore, transforming Vh before it is multiplied with G̃h−1 resolves such complications. In some embodiments, in stage-(d−h+1) such a transform converts Vh ∈ ℝ^((mh·rh−1)×(∏_{k=1}^{h−1} nk·∏_{k=1}^{d−h} md−k+1)) to V′h ∈ ℝ^((nh−1·rh−1)×(∏_{k=1}^{h−2} nk·∏_{k=1}^{d−h+1} md−k+1)), and the mapping principle for this transform is as follows:






$V_h(p, q) \rightarrow V'_h(p', q'),$
where $p = i_h r_{h-1} + t_{h-1}$, $p' = j_{h-1} r_{h-1} + t_{h-1}$,
$q = \Big(\sum_{l=1}^{h-1} j_l \prod_{i=1}^{l-1} n_i\Big)\prod_{k=1}^{d-h} m_{d-k+1} + \sum_{g=2}^{d-h}\Big(\prod_{k=g}^{d-h} m_{d-k+1}\Big) i_{d-g+2},$
$q' = \Big(\sum_{l=1}^{h-2} j_l \prod_{i=1}^{l-1} n_i\Big)\prod_{k=1}^{d-h+1} m_{d-k+1} + \sum_{g=2}^{d-h+1}\Big(\prod_{k=g}^{d-h+1} m_{d-k+1}\Big) i_{d-g+2}.$   Equation (10)

After performing this transform, a compact matrix-format computation for stage-(d−h+2) may be performed as:






$V_{h-1} = \tilde{G}_{h-1} V'_h,$   Equation (11)

where V_{h−1} may then be sent to stage-(d−h+3) and transformed again.


In some embodiments, as illustrated in FIG. 6B, the entire compact tensor decomposed inference scheme contains d computation stages, where each stage performs a transform of the output Vh from the previous stage followed by a multiplication. In each stage, a matrix multiplication is performed between the transformed input V′h and the corresponding G̃h−1. By using this scheme, all the elements of the final output tensor Y may be obtained simultaneously at the output end of stage-1 without any redundant computations. In some embodiments, for practical implementation, the transformation described in Equation (10) may be equivalently achieved by performing 4-step matrix-wise operations (see Transform in FIG. 6B). Putting it all together, the compact tensor decomposed inference scheme may be as described in Algorithm 1.












Algorithm 1: Compact Tensor Decomposed Inference Scheme

Input: X, G̃_1, …, G̃_d, m = [m_1, …, m_d], n = [n_1, …, n_d], r = [r_0, r_1, …, r_d]
Output: Y
1:  X′ = Reshape(X, [n_d, −1])
2:  V′_{d+1} = X′
3:  for h = d to 1 do
4:    V_h = MatMul(G̃_h, V′_{h+1})
5:    V′_h = Transform(V_h, h)
6:  Function Transform(V, h)
7:    V′ = Transpose(V)
8:    V′ = Reshape(V′, [n_{h−1}, −1])
9:    // split
10:   t′ = new [n_{h−1}, r_{h−1}]
11:   T′ = new [∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k] t′
12:   for j = 1 to ∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k do
13:     T′[j] = V′[:, (j − 1)·r_{h−1} : j·r_{h−1}]
14:     T′[j] = Reshape(T′[j], [n_{h−1}·r_{h−1}])
15:   // assemble
16:   V′ = new [n_{h−1}·r_{h−1}, ∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k]
17:   for j = 1 to ∏_{k=1}^{h−2} n_k · ∏_{k=h}^{d} m_k do
18:     V′[:, j] = T′[j]
19:   Return V′









In some embodiments, at each stage of the compact tensor decomposed inference scheme, the intermediate values may be buffered on-chip for the processing of the next stage. In some embodiments, the storage capacity needed to store the intermediate values from stage-(d−h+1) is max_h(r_{h−1}·∏_{k=1}^{h−1} n_k·∏_{k=h}^{d} m_k), where h = 1 … d. In some embodiments, both the input and output of each stage may be stored, so the overall storage overhead is 2×max_h(r_{h−1}·∏_{k=1}^{h−1} n_k·∏_{k=h}^{d} m_k), where h = 1 … d. In some embodiments, because the activation size is much less than the weight size targeted by other compression techniques, the storage overhead brought by the compact tensor decomposed inference scheme is low enough to be implemented in an embedded or other resource-constrained device.
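For illustration, the following NumPy sketch performs the same sequential, core-by-core contraction that Algorithm 1 realizes for a TT fully-connected layer. It is a functional sketch only: the hardware-specific Transform and buffering steps are abstracted away, and the core layout (r_{k−1}, m_k, n_k, r_k) is an assumption consistent with the notation used herein.

import numpy as np

def tt_fc_forward(x, cores, n):
    # x: input vector of length prod(n), flattened row-major over (j_1, ..., j_d)
    # cores[k]: TT-core of shape (r_{k-1}, m_k, n_k, r_k), with r_0 = r_d = 1
    z = x.reshape(tuple(n)).reshape(1, -1)          # carry the rank index r_0 = 1 in front
    out_modes = []
    for core in cores:
        r_prev, m_k, n_k, r_k = core.shape
        out_modes.append(m_k)
        z = z.reshape(r_prev, n_k, -1)              # expose (r_{k-1}, j_k, rest)
        g = core.transpose(1, 3, 0, 2).reshape(m_k * r_k, r_prev * n_k)
        z = g @ z.reshape(r_prev * n_k, -1)         # contract over (r_{k-1}, j_k)
        # keep the new output index i_k at the back; pass the rank index r_k forward
        z = z.reshape(m_k, r_k, -1).transpose(1, 2, 0).reshape(r_k, -1)
    return z.reshape(out_modes)                     # output tensor indexed by (i_1, ..., i_d)

# tiny illustrative example: d = 3, n = [2, 3, 2], m = [2, 2, 2], ranks [1, 2, 2, 1]
rng = np.random.default_rng(0)
n, m, r = [2, 3, 2], [2, 2, 2], [1, 2, 2, 1]
cores = [rng.standard_normal((r[k], m[k], n[k], r[k + 1])) for k in range(3)]
y = tt_fc_forward(rng.standard_normal(12), cores, n)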



FIG. 7 illustrates an example 2-PE processing scheme with 3 MAC units in each PE in accordance with some embodiments of the present disclosure.



FIG. 8 illustrates a convolution using an input 3D tensor, and d-dimensional tensor kernel and output 3D tensor in accordance with some embodiments of the present disclosure.



FIG. 9 illustrates a TT convolutional computation in accordance with some embodiments of the present disclosure.



FIG. 10 illustrates the compact TT convolutional computation in accordance with some embodiments of the present disclosure.



FIG. 11 illustrates a general case diagram for the reduced TT convolutional computation in accordance with some embodiments of the present disclosure.


A standard convolution includes performing each of Equations 12 and 13 below:






$Y(h', w', o) = \sum_{k_1=1}^{k}\sum_{k_2=1}^{k}\sum_{i=1}^{I} F(o, i, k_1, k_2)\, X(h, w, i),$   Equation (12)

$h = (h'-1)s + k_1 - p, \quad w = (w'-1)s + k_2 - p,$   Equation (13)

where s is the stride, and p is the zero-padding size. As a result, the computational cost of standard convolution is HWOIk².


Where a convolutional network is compressed using TT, the computation of inferencing with the tensor decomposed CNN may be as follows in Equation 14:






$\mathcal{Y}(h', w', o_1, \ldots, o_d) = \sum_{k_1=1}^{k}\sum_{k_2=1}^{k}\sum_{r_1,\ldots,r_d}^{R_1,\ldots,R_d}\sum_{i_1,\ldots,i_d}^{I_1,\ldots,I_d} G_0(k_1, k_2, r_1)\, G_1(r_1, o_1, i_1, r_2)\cdots G_d(r_d, o_d, i_d)\, \mathcal{X}(h, w, i_1, \ldots, i_d),$   Equation (14)

However, this results in a computation cost of H′W′OIk²R₁⋯R_d, which is R₁⋯R_d times as costly as the standard convolution computation.


Accordingly, the compact TT convolution computation for inferencing with a tensor decomposed CNN may be performed in accordance with embodiments described herein using Equation 15 below:






$Z_1(h, w, r_2) = \sum_{i=1}^{I} G_3(r_2, i)\, X(h, w, i),$
$Z_2(h', w', r_1) = \sum_{k=1}^{k^2}\sum_{r_2=1}^{R_2} G_2(r_1, k, r_2)\, Z_1(h, w, r_2),$
$Y(h', w', o) = \sum_{r_1=1}^{R_1} Z_2(h', w', r_1)\, G_1(o, r_1).$   Equation (15)

This results in a computation cost of HWIR₁ + H′W′k²R₁R₂ + H′W′OR₁, which is less costly than the conventional TT convolution scheme.
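For illustration, the three multiplication counts quoted above (standard convolution, conventional TT convolution, and compact TT convolution) can be evaluated side by side. This is a minimal sketch that simply evaluates the stated cost expressions; the layer shape below is purely illustrative.

def conv_costs(H, W, Hp, Wp, O, I, k, R):
    # Multiplication counts quoted in the text (R = [R1, ..., Rd])
    standard = H * W * O * I * k ** 2
    tt_conventional = Hp * Wp * O * I * k ** 2
    for r in R:
        tt_conventional *= r
    R1, R2 = R[0], R[1]
    tt_compact = H * W * I * R1 + Hp * Wp * k ** 2 * R1 * R2 + Hp * Wp * O * R1
    return standard, tt_conventional, tt_compact

# illustrative layer: 32x32 feature map, 3x3 kernel, 64->64 channels, TT-ranks [4, 4]
print(conv_costs(32, 32, 30, 30, 64, 64, 3, [4, 4]))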


Equation 15 has been represented for a particular case of a tensor decomposed CNN. A general case compact TT convolution computation may be performed according to Equation 16 below:












$Z_1(h, w, i_2, \ldots, i_d, r_{d+1}, r_{d+2}) = \sum_{i_1=1}^{I_1} G_{d+2}(r_{d+1}, i_1, r_{d+2})\, X(h, w, i_1, \ldots, i_d),$
$\vdots$
$Z_d(h, w, r_d) = \sum_{i_d=1}^{I_d}\sum_{r_{2d+1}=1}^{R_{2d+1}} G_{2d+1}(r_{2d+1}, i_d)\, Z_{d-1}(h, w, i_d, r_{d+1}, r_{2d+1}),$
$Z_{d+1}(h', w', r_d) = \sum_{k=1}^{k^2}\sum_{r_{d+1}=1}^{R_{d+1}} G_{d+1}(r_d, k, r_{d+1})\, Z_d(h, w, r_{d+1}),$
$\vdots$
$Y(h', w', o_1, \ldots, o_d) = \sum_{r_1=1}^{R_1} Z_{2d+1}(h', w', o_1, \ldots, o_d, r_1)\, G_1(o_1, r_1).$   Equation (16)

3. Hardware Architecture

In some embodiments, based on the efficient tensor decomposed inference scheme in TIE, a specially configured hardware architecture of tensor train-based inference engine may be produced.


3.1 Data Mapping and Processing Scheme


FIGS. 6A and 6B show the overall computing flow of the TIE. In some embodiments, the overall computing flow of the TIE may include two types of operation: reshaping the inputs (x to X′ and Vh to V′h) and multiplying V′h and G̃h−1. Considering that reshaping x to X′ may be prepared offline and reshaping Vh to V′h may be performed by the carefully designed memory access scheme in Section 3.4, the data-path of TIE is mainly responsible for executing matrix multiplication. FIG. 7 illustrates the detailed data mapping and processing scheme of an example 2-PE data-path for the multiplication between a 3×2 G̃h−1 matrix and a 2×4 V′h matrix. In some embodiments, each PE is equipped with 3 multiply-accumulate (MAC) units. In each clock cycle, one column of G̃h−1 is broadcast to all the PEs, where each multiplier of a PE receives one element of the column. Meanwhile, in some embodiments, two elements in the same row of V′h are sent to the two PEs, respectively, and each of these elements of V′h is broadcast to all the multipliers of its corresponding PE. After finishing the computation in the current cycle, in the next cycle the PEs may move on to process the next column of G̃h−1 and the next row of V′h. In some embodiments, with NPE PEs each equipped with NMAC MAC units, the processing scheme may produce an NMAC×NPE-size sub-block of the result matrix Vh−1 = G̃h−1V′h in NGcol cycles, where NGcol is the number of columns of G̃h−1. In some embodiments, when the number of rows of G̃h−1 (NGrow) is larger than NMAC or the number of columns of V′h (NVcol) is larger than NPE, it takes the PEs multiple rounds of NGcol cycles to calculate the entire Vh−1 (see FIG. 7).
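A behavioral Python sketch (not the RTL) of this mapping follows: each PE owns one column block of the result, each MAC one row, and one column of G̃_{h−1} plus one row of V′_h are consumed per cycle, so each N_MAC×N_PE block of the product takes N_Gcol cycles. The function and shapes below are illustrative assumptions.

import numpy as np

def tiled_matmul(G, V, n_mac, n_pe):
    n_grow, n_gcol = G.shape           # rows/cols of G~_{h-1}
    _, n_vcol = V.shape                # V'_h has n_gcol rows
    out = np.zeros((n_grow, n_vcol))
    cycles = 0
    for i0 in range(0, n_grow, n_mac):         # row blocks mapped to MAC units
        for j0 in range(0, n_vcol, n_pe):      # column blocks mapped to PEs
            acc = np.zeros((min(n_mac, n_grow - i0), min(n_pe, n_vcol - j0)))
            for c in range(n_gcol):            # one column of G and one row of V per cycle
                acc += np.outer(G[i0:i0 + acc.shape[0], c], V[c, j0:j0 + acc.shape[1]])
                cycles += 1
            out[i0:i0 + acc.shape[0], j0:j0 + acc.shape[1]] = acc
    return out, cycles

G = np.arange(6, dtype=float).reshape(3, 2)    # 3x2 G~_{h-1}
V = np.arange(8, dtype=float).reshape(2, 4)    # 2x4 V'_h
out, cycles = tiled_matmul(G, V, n_mac=3, n_pe=2)
assert np.allclose(out, G @ V)                 # functionally equal to the matrix product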


3.2 Overall Architecture


FIG. 12 illustrates an overall architecture of TIE in accordance with some embodiments of the present disclosure.


In some embodiments, based on the data mapping and processing scheme described above, the overall architecture of TIE is shown in FIG. 12. For the inference task on one layer, the data-path of TIE performs d stages of matrix multiplications between the matrix V′h read from the working SRAM and G̃h−1 read from the weight SRAM. During the computation in each stage, as indicated in Section 3.1, part of the result matrix Vh−1 = G̃h−1V′h may already be calculated in the PEs; these sub-blocks of Vh−1, once available, may be written to the other working SRAM. In some embodiments, according to this scheme, two working SRAMs may be used to avoid potential read/write conflicts. In some embodiments, after some or all of Vh−1 is written to one working SRAM, that SRAM may then output V′h−1, as the reshaped Vh−1, to the data-path via a specifically designed memory read scheme (described in Section 3.4). Therefore, the two working SRAMs act as source and destination memories, respectively, and exchange their roles every stage. In some embodiments, during the last stage of computation, for V1, the calculated elements of V1 may be sent to the activation units first and then written to the working SRAM.


3.3 Data-Path & Weight SRAM


FIG. 13 illustrates a data allocation in weight SRAM in accordance with some embodiments of the present disclosure. In some embodiments, the SRAM may be defined by a data-path and weight SRAM.


In some embodiments, with respect to data-path, as shown in FIG. 12, TIE may include an array of NPE PEs that perform matrix multiplication. In some embodiments, each PE contains NMAC MAC units and NMAC activation units. In some embodiments, by using the processing scheme in Section 3.1, the NMACNPE elements of result matrix are available simultaneous in the registers of all PEs every NGcol cycles, and then they may be written to working SRAM in parallel.


In some embodiments, with respect to weight SRAM, the weight SRAM of TIE may store the weight parameters of tensor cores {tilde over (G)}h's. In some embodiments, though each layer of tensor decomposed format DNN models is affiliated with d tensor cores, the sequential access to different {tilde over (G)}h's in different computation stages, which is described in Section 2.2, enables the simple storing strategy of locating all {tilde over (G)}h's in the same weight SRAM sequentially from h=1 to d. However, different from such sequential placement for the consecutive {tilde over (G)}h's, the data allocation within the same {tilde over (G)}h may not always be sequentially in the weight SRAM. For instance, as illustrated in FIG. 13, when the number of rows of {tilde over (G)}h's is larger than the number of PEs, in order to be consistent with processing scheme described in Section 3.1, the elements in the same column of {tilde over (G)}h need to be stored in the different row of weight SRAM via an interleaved way. Thus, in some embodiments, the entire data allocation of weight SRAM is sequential at the inter-{tilde over (G)}h level and interleaved at the intra-{tilde over (G)}h level.


3.4 Working SRAM

In some embodiments, as indicated in Algorithm 1, a transform from Vh to V′h may be used in each stage of computation to ensure the functional correctness of the inference scheme. Conventionally, such a transform, including matrix reshape and transpose, demands extra memory resources to implement those matrix operations, thereby degrading the hardware performance of the entire design in both area efficiency and power efficiency.


In some embodiments, to address such problems, efficient read and write schemes may be designed for the working SRAMs to achieve zero-cost matrix transform. In some embodiments, the general methodology of the present disclosure is to ensure that the data-path reads the elements of V′h from the working SRAM that stores Vh, thereby enabling an on-the-fly matrix transform for Vh. In some embodiments, to achieve the on-the-fly transform, the working SRAM may be partitioned into multiple groups with a well-designed data selection mechanism.


In some embodiments, the writing scheme may be consistent with the computing scheme described above. As described in Section 3.1, each PE calculates NMAC elements in the same column of Vh every NGcol cycles. In some embodiments, to make the data allocation in the working SRAM consistent with the corresponding matrix format of Vh, different PEs assemble the calculated elements at the same positions of their MAC units together and write them to one row of the component SRAMs. In some embodiments, during the writing phase the calculated elements in the i-th MAC units among different PEs form one row of data to be written to memory. In some embodiments, as mentioned before, each of the two working SRAMs is partitioned into multiple groups, where each group contains multiple component SRAMs. Based on this type of memory organization, multiple columns of Vh may be written to multiple component SRAMs concurrently without access conflict.


In some embodiments, a reading scheme may be designed for the on-the-fly transform of Vh. As described above, the matrix transform operation on Vh may be performed during the reading phase using a partitioned, group-based data selection mechanism. In some embodiments, Algorithm 2 describes an example of the mechanism in detail. In some embodiments, the transform mechanism may utilize the indices of SRAM groups, component SRAMs, and elements to locate and read the targeted element of V′h in a mathematically equivalent manner.












Algorithm 2: Data Read Scheme for Working SRAM

Input: Memory[N_g, N_r, M]. The N_r component SRAMs are divided into N_g groups; each group contains N_r/N_g component SRAMs, and each component SRAM contains M elements. N_PE denotes the number of PEs.
Output: Data
1:  // Number of total cycles for reading data
2:  N_c = N_g·N_r·M / N_PE
3:  while (k < N_c) do
4:    for j = 1 to M·N_PE/N_g do
5:      for i_r = 1 to N_r/N_g do
6:        Data = Read(Memory(:, i_r, [j·M·N_PE/N_g : (j+1)·M·N_PE/N_g]))
7:      Data = ReArrange(Data)
8:    k++
9:  Function ReArrange(Data)
10:   Data′ = new [N_g × M·N_PE/N_g]
11:   for j = 1 to M·N_PE/N_g do
12:     for i_g = 1 to N_g do
13:       Data′[j·N_g + i_g] = Data(i_g, j)
14:   Return Data′
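For illustration only, the ReArrange step in Algorithm 2 is a column-major flatten of the two-dimensional block read in one cycle. A minimal NumPy equivalent, assuming `data` is that (N_g × M·N_PE/N_g) block, is:

import numpy as np

def rearrange(data):
    # Data'[j * N_g + i_g] = Data(i_g, j)  ->  column-major (Fortran-order) flatten
    return data.flatten(order="F")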










FIG. 14 illustrates performing an on-the-fly transform using a well-designed working SRAM read access scheme in accordance with some embodiments of the present disclosure.



FIG. 14 illustrates the working SRAM reading scheme based on the data selection mechanism. As shown, in each cycle the elements of V′h may be located and read from the rows of different component SRAMs of a memory group. As shown in FIG. 14, after being assembled, these data form the row vector of the targeted V′h, and they may then be distributed to their corresponding PEs for calculating Vh−1 = G̃h−1V′h.


In some embodiments, besides the transform from Vh to V′h, the inference on entire DNN models also requires a transform from V1 of the current layer to X′ of the next layer. In some embodiments, before being transformed, V1 needs to be processed by the activation units in the PEs first. Interestingly, the mathematical analysis of embodiments of the present disclosure shows that such an inter-layer transform is identical to the intra-layer transform described before. Therefore, when the TIE is performing the computation between two consecutive layers, it may still utilize the same working SRAM read scheme.


4 Evaluation
4.1 Experimental Methodology

In some embodiments, the high-level functional behavior of TIE may be modeled by a bit-accurate, cycle-accurate simulator. Based on the model, an RTL model may be developed using Verilog, and the functional validity of the RTL model may be verified. In some embodiments, the verified RTL model may be synthesized using Synopsys Design Compiler with a CMOS 28 nm library. Here the gate-level netlist may be annotated with toggle rates obtained from the switching activity extracted during simulation. After that, Synopsys IC Compiler may be used to perform place and route and generate the layout (see FIG. 15). Then, Synopsys PrimeTime PX may be used to estimate power consumption. Notice that the area and power of the memory part were reported by CACTI.



FIG. 15 illustrates a layout and performance metrics of TIE in accordance with some embodiments of the present disclosure.


Benchmarks. To evaluate the performance of TIE on different tasks, we choose several workloads from two models used in image classification and video classification tasks, respectively. Here the same-size layers with different TT-decomposition setting are viewed as different workloads. Table 4 lists the information of four benchmark layers, including the size, TT-decomposition settings (d, n, m, and r) and compression ratio.









TABLE 4

Information of evaluated benchmarks.

Layer          Size            d    n                   m                   r                      Compression Ratio    Tasks
VGG-FC6        (4096, 25088)   6    [2, 7, 8, 8, 7, 4]  [4, 4, 4, 4, 4, 4]  [1, 4, 4, 4, 4, 4, 1]  50972×               CNN model for image classification
VGG-FC7        (4096, 4096)    6    [4, 4, 4, 4, 4, 4]  [4, 4, 4, 4, 4, 4]  [1, 4, 4, 4, 4, 4, 1]  14564×               CNN model for image classification
LSTM-UCF       (57600, 256)    4    [8, 20, 20, 18]     [4, 4, 4, 4]        [1, 4, 4, 4, 1]        4954×                RNN model for video classification
LSTM-Youtube   (57600, 256)    4    [4, 20, 20, 36]     [4, 4, 4, 4]        [1, 4, 4, 4, 1]        4608×                RNN model for video classification


4.2 Hardware Performance

Design Configuration. Table 5 shows the configuration information of the TIE hardware. The entire design consists of 16 PEs with 16-bit quantization. Each PE is equipped with 16 MACs and 16 activation units, where each MAC contains one 16-bit-wide multiplier and one 24-bit-wide accumulator. Regarding the memory, a 16 KB weight SRAM is used to store up to 8192 16-bit weights on the chip. According to the analysis above, such a budgeted memory capacity for the weight SRAM is sufficient for most TT-DNN models. The working SRAM contains two copies acting as a ping-pong buffer, where each copy has a capacity of 384 KB. Therefore, the total capacity of the working SRAM is 384×2=768 KB.









TABLE 5

Design configuration information.

PE Parameter        Multiplier       Accumulator
Amount              16               16
Width               16-bit           24-bit

Memory Parameter    Weight SRAM      Working SRAM
Capacity            16 KB            768 KB

TIE Parameter       Amount of PEs    Quantization
Value               16               16-bit


Hardware Resources and Performance. FIG. 15 shows the overall hardware resources and performance of the TIE design. Operating at 1000 MHz, the 16-PE TIE occupies 1.74 mm² of silicon area and consumes 154.8 mW of power. Notice that all the memory used in TIE is on-chip SRAM due to the high compression ratio brought by TT decomposition. The area and power breakdowns are shown in Table 6.









TABLE 6

Power and area breakdowns.

Component        Power (mW)        Area (mm²)
Memory           60.8 (39.28%)     1.29 (73.93%)
Register         10.9 (7.04%)      0.019 (1.11%)
Combinational    54 (34.88%)       0.082 (4.70%)
Clock Network    29.1 (18.80%)     0.0035 (0.02%)
Other            —                 0.35 (20.06%)
Total            154.8             1.744


4.3 Comparison with EIE, CIRCNN, and Eyeriss



FIG. 16 illustrates performance comparison between EIE and TIE on different benchmarks in accordance with some embodiments of the present disclosure.


In this subsection, we compare TIE with two state-of-the-art compressed DNN-targeted accelerators: EIE and CIRCNN. Different from TIE, model compression in EIE and CIRCNN comes from other sources: For EIE, model compression is achieved via network sparsification; for CIRCNN, model compression is from structuring topology. Moreover, to evaluate the performance of TIE on CONV layers, we also compare TIE with representative CONV-oriented work: Eyeriss.


Comparison with EIE. Table 7 summarizes the design parameters and hardware performance of EIE and TIE. Due to the different technology nodes adopted in the two works, the clock frequency, silicon area and power consumption of EIE are also projected under the same 28 nm technology for fair comparison. Such projection is based on the scaling rule used in linear, quadratic and constant scaling for frequency, area and power, respectively.









TABLE 7

Comparisons of EIE and TIE.

Design             EIE (45 nm, reported)                                EIE (28 nm, projected)    TIE (28 nm)
Frequency (MHz)    800                                                  1285                      1000
Memory             SRAM                                                                           SRAM
Quantization       4-bit for weight index, 16-bit for shared weight                               16-bit
Area (mm²)         40.8                                                 15.7                      1.74
Power (mW)         590                                                  590                       154.8


FIG. 16 compares the hardware performance of EIE and TIE on two benchmarks (VGG-FC6 and VGG-FC7) in terms of throughput, area efficiency and energy efficiency. We see that TIE may achieve throughput comparable to EIE. More importantly, thanks to the high compression ratio brought by TT decomposition, TIE achieves 7.22×~10.66× better area efficiency and 3.03×~4.48× better energy efficiency on the different workloads, respectively.


Comparison with CIRCNN. Table 8 compares the hardware performance of CIRCNN and TIE. Notice that here the listed performance metrics of the two designs are obtained from their synthesis reports for fair comparison since CIRCNN reports synthesis results.









TABLE 8

Comparisons of CIRCNN and TIE.

Design                        CIRCNN (45 nm, reported)    CIRCNN (28 nm, projected)    TIE (28 nm)
Freq (MHz)                    200                         320                          1000
Quantization                  16-bit                                                   16-bit
Area (mm²)                    N/A                         N/A                          1.40
Power (mW)                    80                          80                           104.8
Throughput (TOPS)             0.8                         1.28                         7.64 (5.96×)
Energy Efficiency (TOPS/W)    10.0                        16.0                         72.90 (4.56×)

Meanwhile, due to the lack of area information of CIRCNN, we compare the overall throughput (in term of TOPS) and energy efficiency (in term of TOPS/W) of the two designs. After projecting performance of CIRCNN to the same 28 nm technology for fair comparison, it is seen that TIE achieves 5.96× and 4.56× higher throughput and energy efficiency than CIRCNN, respectively.


Comparison with Eyeriss. Table 9 summarizes the design parameters and hardware performance of Eyeriss and TIE on CONV layers of VGG. For fair comparison, the clock frequency, silicon area and power consumption of Eyeriss are also projected under the 28 nm technology. We used core area and processing latency of Eyeriss instead of chip area and total latency for fair comparison with TIE.









TABLE 9

Comparisons of Eyeriss and TIE on VGG CONV layers.

Design                           Eyeriss (65 nm, reported)    Eyeriss (28 nm, projected)    TIE (28 nm)
Freq (MHz)                       200                          464                           1000
Quantization                     16-bit                                                     16-bit
Area (mm²)                       12.25                        2.27                          1.74
Power (mW)                       236                          236                           170
Throughput (Frame/s)             0.8                          1.86                          6.72 (3.61×)
Area Efficiency (Frame/s/mm²)    0.065                        0.82                          3.86 (4.71×)
Energy Efficiency (Frame/s/W)    3.39                         7.89                          39.5 (5.01×)


4.4 Flexibility


FIG. 17 illustrates a flexibility of TIE on different decomposition ranks in accordance with some embodiments of the present disclosure.


TIE is designed to provide sufficient flexibility to support the needs of different TT models having different layer sizes and decomposition settings. As illustrated in FIG. 17, different workloads with different d, m, n and r may be executed on the same TIE accelerator hardware efficiently and flexibly. In addition, we also investigate the throughput with different r's for the same workload (see FIG. 17). Here the change of r is specifically studied since it is an important knob for flexibly controlling the compression and acceleration effect of TT decomposition. From the figure we may see that TIE exhibits great flexibility in supporting this important TT decomposition parameter.


Example—Training of Tensor Decomposition Deep Neural Network

In some embodiments, tensor decomposition is a tool that explores the low tensor rank characteristics of the largescale tensor data. Different from other model compression methods, tensor decomposition, uniquely, may provide ultra-high compression ratio for DNNs, including CNN and RNN models. The advanced tensor decomposition approaches, such as tensor train (TT) and tensor ring (TR), among others, may bring more than one thousand times parameter reduction to the input-to-hidden layers of DNN models, and meanwhile the corresponding classification accuracy in the video recognition task may be even significantly improved.


In some embodiments, typical tensor decomposition approaches, including TT and TR, suffer accuracy loss when compressing a trained DNN. CNN models exhibit the greatest accuracy loss.


The accuracy loss due to tensor decomposition is mainly due to the unique challenges involved in training tensor decomposed format DNN models. In general, there are typically two ways to use tensor decomposition to obtain a compressed model: 1) train from scratch in the decomposed format; or 2) decompose a pre-trained uncompressed model and then retrain. In the former case, when the required tensor decomposition-based model, e.g., a tensor decomposed format model, is directly trained from scratch, because the structure of the model is already pre-set to a low tensor rank format before training, the corresponding model capacity is typically limited compared to the full-rank structure, thereby causing the training process to be very sensitive to initialization and making it more challenging to achieve high accuracy. In the latter scenario, though the pre-trained uncompressed model provides a good initialization position, straightforwardly decomposing the full-rank uncompressed model into a low tensor rank format causes inevitable and non-negligible approximation error, which is very difficult to recover even after a long re-training period. Moreover, in either training approach, tensor decomposition always brings a linear increase in network depth, which implies that training tensor decomposition-format DNNs is typically more prone to the gradient vanishing problem and hence difficult to train well.


In some embodiments, to overcome the current limitations of tensor decomposition and fully unlock its potential for model compression, a systematic framework for tensor decomposition-based model compression is provided that uses a dual optimization method, such as, e.g., the alternating direction method of multipliers (ADMM), the dual ascent method (DAM), or dual decomposition, among other suitable dual optimization methods, or any combination thereof. By formulating TT decomposition-based model compression as an optimization problem with constraints on tensor ranks, the dual optimization method may be leveraged to systematically solve the optimization problem in an iterative way. During this procedure the entire DNN model is trained in the original structure instead of the tensor decomposed format, but gradually acquires the desired low tensor rank characteristics. The trained uncompressed model may then be decomposed into the tensor decomposed format and fine-tuned to finally obtain a high-accuracy trained tensor decomposed format DNN model.


In some embodiments, the systematic framework may include formulating and solving the tensor decomposition-based model compression problem. By formulating this problem as a constrained non-convex optimization problem, embodiments of the present framework gradually restrict the DNN model to the target tensor ranks without explicitly training in the tensor decomposed format, thereby maintaining the model capacity as well as minimizing the approximation error and the increase in network depth.


In some embodiments, the systematic framework employs a dual optimization problem such as ADMM to efficiently solve the reformulated optimization problem via separately solving two sub-problems. In some embodiments, a first sub-problem may be to directly optimize the loss function with a regularization of the DNN, e.g., by stochastic gradient descent or other suitable optimization method. In some embodiments, the second sub-problem may use the introduced projection to constrain the tensor ranks analytically.


In an example implementation of the framework, an evaluation of different example DNN models for image classification and video recognition tasks is described. The example evaluation results show that embodiments of the present dual optimization-based tensor decomposed format models demonstrate high compression performance with high accuracy. For example, on CIFAR-100, with 2.3× and 2.4× compression ratios, embodiments of the present models have 1.96% and 2.21% higher top-1 accuracy than the original ResNet-20 and ResNet-32, respectively. For compressing an example ResNet-18 on ImageNet, embodiments of the present model achieve a 2.47× FLOPs reduction with no accuracy loss.


5 Framework Initialization
5.1. Notation

In some embodiments, 𝒜 ∈ ℝ^(n1×n2×…×nd), X ∈ ℝ^(n1×n2), and x ∈ ℝ^(n1) represent a d-order tensor, a matrix, and a vector, respectively. Also, 𝒜(i1, …, id) and X(i, j) denote single entries of the tensor 𝒜 and the matrix X, respectively.


5.2. Tensor Train (TT) Decomposition

In some embodiments, given a tensor 𝒜 ∈ ℝ^(n1×n2×…×nd), the tensor may be decomposed into a set of 3-order tensors via Tensor Train Decomposition (TTD) as follows:



$\mathcal{A}(i_1, i_2, \ldots, i_d) = \mathcal{G}_1(:, i_1, :)\,\mathcal{G}_2(:, i_2, :)\cdots\mathcal{G}_d(:, i_d, :) = \sum_{\alpha_0, \alpha_1, \ldots, \alpha_d}^{r_0, r_1, \ldots, r_d} \mathcal{G}_1(\alpha_0, i_1, \alpha_1)\,\mathcal{G}_2(\alpha_1, i_2, \alpha_2)\cdots\mathcal{G}_d(\alpha_{d-1}, i_d, \alpha_d),$   Equation (17)

where 𝒢k ∈ ℝ^(rk−1×nk×rk) are called the TT-cores for k = 1, 2, …, d, and r = [r0, r1, …, rd], with r0 = rd = 1, are called the TT-ranks, which determine the storage complexity of the tensor decomposed format tensor. An example is demonstrated in FIG. 18.
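To make the storage complexity concrete, the number of parameters in the TT-cores, Σ_k r_{k−1}·n_k·r_k, can be compared against the ∏_k n_k entries of the dense tensor. The following is a minimal Python sketch with illustrative shapes (not taken from any benchmark in this disclosure):

import math

def dense_params(n):
    return math.prod(n)

def tt_params(n, r):
    # cores G_k of shape (r_{k-1}, n_k, r_k), k = 1..d
    return sum(r[k] * n[k] * r[k + 1] for k in range(len(n)))

n = [4, 8, 8, 4]            # a 4-order tensor with 1024 entries
r = [1, 4, 4, 4, 1]         # TT-ranks with r_0 = r_d = 1
print(dense_params(n), tt_params(n, r))   # 1024 dense entries vs 288 TT-core parameters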


5.3. Tensor Train (TT)-Format DNN

In some embodiments, for a simple fully-connected layer with weight matrix 𝒲 ∈ ℝ^(M×N) and input x ∈ ℝ^N, where M = ∏_{k=1}^{d} mk and N = ∏_{k=1}^{d} nk, the output y ∈ ℝ^M may be obtained by y = 𝒲x. In some embodiments, in order to transform this standard layer into a TT fully-connected (TT-FC) layer, the weight matrix 𝒲 may be tensorized to a weight tensor 𝒲 ∈ ℝ^((m1×n1)×…×(md×nd)) by reshaping and order transposing. Then 𝒲 may be decomposed into the tensor decomposed format:

$\mathcal{W}((i_1, j_1), \ldots, (i_d, j_d)) = \mathcal{G}_1(:, i_1, j_1, :)\cdots\mathcal{G}_d(:, i_d, j_d, :).$   Equation (18)

In some embodiments, each TT-core 𝒢k ∈ ℝ^(rk−1×mk×nk×rk) is a 4-order tensor, which has one dimension more than the standard TT-core since the output and input dimensions of 𝒲 are divided separately. Hence, the forward propagation on the TT-FC layer may be expressed in tensor format as follows:



$\mathcal{Y}(i_1, \ldots, i_d) = \sum_{j_1, \ldots, j_d} \mathcal{G}_1(:, i_1, j_1, :)\cdots\mathcal{G}_d(:, i_d, j_d, :)\,\mathcal{X}(j_1, \ldots, j_d),$   Equation (19)

where 𝒴 ∈ ℝ^(m1×…×md) and 𝒳 ∈ ℝ^(n1×…×nd) are the tensorized output and input corresponding to y and x, respectively. In some embodiments, additional details about a TT-FC layer may be found in Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang, "Dsa: More efficient memory budgeted pruning via differentiable sparsity allocation," in the European Conference on Computer Vision (ECCV), 2020, which is herein incorporated by reference in its entirety.


In some embodiments, for a conventional convolutional layer, the forward computation performs a convolution between a 3-order input tensor 𝒳 ∈ ℝ^(W×H×N) and a 4-order weight tensor 𝒲 ∈ ℝ^(K×K×M×N) to produce the 3-order output tensor 𝒴 ∈ ℝ^((W−K+1)×(H−K+1)×M). In some embodiments, in a TT convolutional (TT-CONV) layer, the input tensor 𝒳 is reshaped to a tensor of shape W×H×n1×…×nd, while the weight tensor 𝒲 is reshaped and transposed to a tensor of shape (K×K)×(m1×n1)×…×(md×nd) and then decomposed into the tensor decomposed format:

$\mathcal{W}((k_1, k_2), (i_1, j_1), \ldots, (i_d, j_d)) = \mathcal{G}_0(k_1, k_2)\,\mathcal{G}_1(:, i_1, j_1, :)\cdots\mathcal{G}_d(:, i_d, j_d, :),$   Equation (20)

where M = ∏_{k=1}^{d} mk and N = ∏_{k=1}^{d} nk. Similar to the TT-FC layer, here each 𝒢k ∈ ℝ^(rk−1×mk×nk×rk) is a 4-order tensor, except 𝒢0 ∈ ℝ^(K×K). In some embodiments, the new output tensor 𝒴 ∈ ℝ^((W−K+1)×(H−K+1)×m1×…×md) may then be obtained by:


$\mathcal{Y}(w, h, i_1, \ldots, i_d) = \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\sum_{j_1, \ldots, j_d} \mathcal{X}(k_1 + w - 1, k_2 + h - 1, j_1, \ldots, j_d)\,\mathcal{G}_0(k_1, k_2)\,\mathcal{G}_1(:, i_1, j_1, :)\cdots\mathcal{G}_d(:, i_d, j_d, :).$   Equation (21)

In some embodiments, additional description of a TT-CONV layer may be found in Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. “Ultimate tensorization: compressing convolutional and fc layers alike.” arXiv preprint arXiv:1611.03214, 2016, which is herein incorporated by reference in its entirety.


In some embodiments, once the tensor decomposed format layers, such as the TT-FC layer and the TT-CONV layer, and the corresponding forward propagation schemes are formulated, a suitable optimization method, such as the standard stochastic gradient descent (SGD) algorithm, may be used to update the TT-cores with the rank set r, which determines the target compression ratio. The initialization of the TT-cores may be either randomly set or obtained by directly TT-decomposing a pre-trained uncompressed model.


6 Systematic Compression Framework

In some embodiments, as described above, a tensor decomposed format DNN is currently either 1) trained from scratch with randomly initialized tensor cores; or 2) trained from a direct decomposition of a pre-trained model. For the first strategy, information related to the high-accuracy uncompressed model is unused and thus lost. For the second strategy, though the knowledge of the pre-trained model is indeed utilized, because the pre-trained model generally lacks the low TT-rank property, after direct low-rank tensor decomposition the approximation error is too significant to be properly recovered even with long re-training. Such inherent limitations of the existing training strategies consequently cause significant accuracy loss for the compressed tensor decomposed format DNN models.


In some embodiments, to overcome the above-described limitations, i.e., to maximally retain the knowledge contained in the uncompressed model, or in other words, to minimize the approximation error after tensor decomposition with given target tensor ranks, an optimization problem is formulated to minimize the loss function of the uncompressed model with low tensor rank constraints. With a properly regularized training procedure based on an advanced optimization technique (e.g., a dual optimization method such as ADMM, DAM, dual decomposition, etc.), the uncompressed DNN models may gradually exhibit low tensor rank properties. After the regularized training phase, the approximation error brought by the explicit low-rank tensor decomposition becomes negligible and may be easily recovered by fine-tuning (e.g., with stochastic gradient descent or another suitable optimization method). FIG. 19 illustrates the steps of the framework of embodiments of the present disclosure.


6.1. Problem Formulation

In some embodiments, as described above, the first phase of the present framework may be to iteratively impose low tensor rank characteristics onto a high-accuracy uncompressed DNN model. Mathematically, this goal may be formulated as an optimization problem to minimize the loss function of the object model with constraints on TT-ranks of each layer (convolutional or fully-connected):



$\min_{\mathcal{W}} \ \mathcal{L}(\mathcal{W}), \quad \text{s.t.} \ \mathrm{rank}(\mathcal{W}) \le r^{*},$   Equation (22)

where ℒ is the loss function of the DNN, rank(⋅) is a function that returns the TT-ranks r = [r0, …, rd] of the weight tensor cores, and r* = [r*0, …, r*d] are the desired TT-ranks for the layer. To simplify the notation, here r ≤ r* means ri ≤ r*i for each ri in r, i = 0, …, d.


6.2. Optimization Using ADMM

In some embodiments, solving equation (22) is generally difficult via normal optimization algorithms since rank(⋅) is non-differentiable. In some embodiments, to overcome this challenge, Equation 22 may be rewritten as:


$\min_{\mathcal{W}} \ \mathcal{L}(\mathcal{W}), \quad \text{s.t.} \ \mathcal{W} \in \mathcal{S},$   Equation (23)

where 𝒮 = {𝒲 | rank(𝒲) ≤ r*}. Hence, the objective form (23) is a classic non-convex optimization problem with constraints, which may be properly solved by the dual optimization method. Specifically, an auxiliary variable 𝒵 and an indicator function g(⋅) of 𝒮 may first be introduced, i.e.,


$g(\mathcal{W}) = \begin{cases} 0 & \mathcal{W} \in \mathcal{S}, \\ +\infty & \text{otherwise}. \end{cases}$   Equation (24)

And then the equation (23) is equivalent to the following form:


$\min_{\mathcal{W}, \mathcal{Z}} \ \mathcal{L}(\mathcal{W}) + g(\mathcal{Z}), \quad \text{s.t.} \ \mathcal{W} = \mathcal{Z}.$   Equation (25)

In some embodiments, to ensure convergence without assumptions such as strict convexity or finiteness of ℒ, instead of the Lagrangian, the corresponding augmented Lagrangian in the scaled dual form of the above equation is given by:




$L_{\rho}(\mathcal{W}, \mathcal{Z}, \mathcal{U}) = \mathcal{L}(\mathcal{W}) + g(\mathcal{Z}) + \frac{\rho}{2}\lVert \mathcal{W} - \mathcal{Z} + \mathcal{U} \rVert_F^2 - \frac{\rho}{2}\lVert \mathcal{U} \rVert_F^2,$   Equation (26)

where 𝒰 is the dual multiplier, and ρ > 0 is the penalty parameter. Thus, the iterative ADMM scheme may be explicitly performed as:


$\mathcal{W}^{t+1} = \arg\min_{\mathcal{W}} \ L_{\rho}(\mathcal{W}, \mathcal{Z}^{t}, \mathcal{U}^{t}),$   Equation (27)

$\mathcal{Z}^{t+1} = \arg\min_{\mathcal{Z}} \ L_{\rho}(\mathcal{W}^{t+1}, \mathcal{Z}, \mathcal{U}^{t}),$   Equation (28)

$\mathcal{U}^{t+1} = \mathcal{U}^{t} + \mathcal{W}^{t+1} - \mathcal{Z}^{t+1},$   Equation (29)

where t is the iteration step. In some embodiments, the original equation (25) may be separated into the two sub-problems (27) and (28), which may be solved individually. In some embodiments, each sub-problem may be solved at each training iteration.


In some embodiments, with regard to the 𝒲-sub-problem (27), it may be reformulated as follows:




$\min_{\mathcal{W}} \ \mathcal{L}(\mathcal{W}) + \frac{\rho}{2}\lVert \mathcal{W} - \mathcal{Z}^{t} + \mathcal{U}^{t} \rVert_F^2,$   Equation (30)

where the first term is the loss function of the DNN model, e.g., the cross-entropy loss in classification tasks, and the second term is an L2-regularization. In some embodiments, sub-problem (30) may be directly solved by stochastic gradient descent since both terms are differentiable. Correspondingly, the partial derivative of (30) with respect to 𝒲 is calculated as:

$\frac{\partial L_{\rho}(\mathcal{W}, \mathcal{Z}^{t}, \mathcal{U}^{t})}{\partial \mathcal{W}} = \frac{\partial \mathcal{L}(\mathcal{W})}{\partial \mathcal{W}} + \rho(\mathcal{W} - \mathcal{Z}^{t} + \mathcal{U}^{t}).$   Equation (31)

And hence 𝒲 may be updated by:


$\mathcal{W}^{t+1} = \mathcal{W}^{t} - \eta\,\frac{\partial L_{\rho}(\mathcal{W}, \mathcal{Z}^{t}, \mathcal{U}^{t})}{\partial \mathcal{W}},$   Equation (32)

where η is the learning rate.


In some embodiments, with regard to the 𝒵-sub-problem (28), it may be explicitly formulated as follows:



$\min_{\mathcal{Z}} \ g(\mathcal{Z}) + \frac{\rho}{2}\lVert \mathcal{W}^{t+1} - \mathcal{Z} + \mathcal{U}^{t} \rVert_F^2,$   Equation (33)

where the indicator function g(⋅) of the non-convex set 𝒮 is non-differentiable. In this form, updating 𝒵 may be performed as:


$\mathcal{Z}^{t+1} = \Pi_{\mathcal{S}}(\mathcal{W}^{t+1} + \mathcal{U}^{t}),$   Equation (34)

where Π_𝒮(⋅) is the projection of singular values onto 𝒮, by which the TT-ranks of (𝒲^(t+1) + 𝒰^t) are truncated to the target ranks r*. Algorithm 3 describes an example of a specific procedure for this projection in the tensor decomposed format scenario.


In some embodiments, in each dual optimization iteration, upon the updates of 𝒲 and 𝒵, the dual multiplier 𝒰 is updated by (29). Overall, to solve (25), the entire dual optimization-regularized training procedure is performed in an iterative way until convergence or until reaching the pre-set maximum iteration number. The overall procedure is summarized in Algorithm 4.
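The per-iteration updates (27)-(29), with the 𝒲-step realized by the gradient of (31)-(32) and the 𝒵-step by the projection of (34), can be sketched in Python as follows. This is a minimal sketch only; `grad_loss` and `project_tt` are hypothetical placeholders for the task-loss gradient and the TT-rank truncation of Algorithm 3.

def admm_step(W, Z, U, grad_loss, project_tt, rho, lr):
    # W-update (Equations (30)-(32)): one gradient step on L(W) + (rho/2)*||W - Z + U||_F^2
    W = W - lr * (grad_loss(W) + rho * (W - Z + U))
    # Z-update (Equations (33)-(34)): project W + U onto the set S of tensors with TT-ranks <= r*
    Z = project_tt(W + U)
    # dual update (Equation (29))
    U = U + W - Z
    return W, Z, U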












Algorithm 3: TT-SVD-Based Projection for Solving Equation (34)

Input: d-order tensor 𝒜 ∈ ℝ^(n1×…×nd), target TT-ranks r*.
Output: Π_𝒮(𝒜).
 1: Temporary tensor 𝒯 := 𝒜;
 2: for k = 1 to d − 1 do
 3:   𝒯 := reshape(𝒯, [r*_{k−1}·n_k, −1]);
 4:   Compute the matrix SVD: U, S, V := SVD(𝒯);
 5:   U := U(:, 1:r*_k);
 6:   S := S(1:r*_k, 1:r*_k);
 7:   V := V(:, 1:r*_k);
 8:   𝒢_k := reshape(U, [r*_{k−1}, n_k, r*_k]);
 9:   𝒯 := S·Vᵀ;
10: 𝒢_d := 𝒯; 𝒯 := 𝒢_1;
11: for k = 1 to d − 1 do
12:   T1 := reshape(𝒯, [−1, r*_k]);
13:   T2 := reshape(𝒢_{k+1}, [r*_k, −1]);
14:   𝒯 := T1·T2;
15: Π_𝒮(𝒜) := reshape(𝒯, [n1, …, nd]).
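A compact NumPy sketch of this TT-SVD projection (sequential truncated SVDs followed by re-assembly) is given below. It mirrors the structure of Algorithm 3 for illustration; it is not the training or hardware implementation, and the core layout (r_{k−1}, n_k, r_k) is assumed as defined above.

import numpy as np

def tt_svd_project(A, ranks):
    # Truncate a d-way tensor A to TT-ranks `ranks` = [r0, r1, ..., rd] (r0 = rd = 1)
    # and re-assemble the projected full tensor.
    n = A.shape
    d = len(n)
    cores = []
    r_prev = 1
    T = A.reshape(r_prev * n[0], -1)
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        r = min(ranks[k + 1], len(S))               # truncate to the target TT-rank
        cores.append(U[:, :r].reshape(r_prev, n[k], r))
        T = (np.diag(S[:r]) @ Vt[:r, :]).reshape(r * n[k + 1], -1)
        r_prev = r
    cores.append(T.reshape(r_prev, n[d - 1], 1))    # last core absorbs S V^T
    # contract the cores back into a full tensor of the original shape
    out = cores[0].reshape(-1, cores[0].shape[-1])
    for k in range(1, d):
        out = out @ cores[k].reshape(cores[k].shape[0], -1)
        out = out.reshape(-1, cores[k].shape[-1])
    return out.reshape(n)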























Algorithm 4: ADMM-Regularized Training Procedure

Input: Weight tensor 𝒲, target TT-ranks r*, penalty parameter ρ, feasibility tolerance ε, maximum iterations T.
Output: Optimized 𝒲.
 1: Randomly initialize 𝒲;
 2: 𝒵 := 𝒲, 𝒰 := 0;
 3: while ‖𝒲^t − 𝒵^t‖ > ε and t ≤ T do
 4:   Update 𝒲 via Equation (32);
 5:   Update 𝒵 via Equation (34) (Algorithm 3);
 6:   Update 𝒰 via Equation (29);
 7: end



6.3. Fine-Tuning

In some embodiments, upon completion of the dual optimization-regularized training, the trained uncompressed DNN model may be decomposed into the tensor decomposed format. Here the decomposition may be performed with the target TT-ranks r* for the tensor cores. Because the optimization procedure has already imposed the desired low TT-rank structure on the uncompressed model, such direct decomposition, unlike its counterpart in existing tensor decomposed format DNN training, may not bring significant approximation error (more detail is provided below in Section 7.1). In some embodiments, the decomposed tensor decomposed format model may then be fine-tuned using standard stochastic gradient descent or another suitable optimization method. In some embodiments, in the fine-tuning phase the loss function ℒ({𝒢i}) may be formulated without the additional regularization term that would be introduced by the dual optimization method. In some embodiments, the fine-tuning phase may be very fast relative to the initial training, e.g., requiring only a few iterations. In some embodiments, the speed of fine-tuning may be because the decomposed TT model at the starting point of the fine-tuning phase already has little accuracy loss relative to the original uncompressed model.


7 Experiments

To demonstrate the effectiveness and generality of the compression framework, examples of different DNN models in different computer vision tasks may be evaluated. For image classification tasks, multiple CNN models may be evaluated on MNIST, CIFAR-10, CIFAR-100 and ImageNet datasets. For video classification tasks, different LSTM models may be evaluated on UCF11 and HMDB51 datasets. To simplify selection procedure, some, all or more than half of the ranks in the same layer may be set to equal.



FIGS. 20A through 20C illustrate training loss, Frobenius norm and test accuracy in dual optimization problem-regularized training procedure with different ρ in accordance with one or more embodiments of the present disclosure.


7.1. Convergence and Sensitivity Analysis

In some embodiments, as shown in (26), ρ is the additional hyperparameter introduced in the dual optimization problem-regularized training phase. To study the effect of ρ on the performance as well as to facilitate hyperparameter selection, the convergence and sensitivity of the ADMM-regularized training may be studied for the ResNet-32 model with different ρ settings on the CIFAR-10 dataset.


In some embodiments, FIG. 20A shows the loss curves in the dual optimization problem-regularized training phase. As shown in FIG. 20A, curves with very different ρ values (e.g., 0.001 vs. 0.02) may exhibit comparable convergence speed. This phenomenon demonstrates that ρ has little impact on the convergence of dual optimization problem-regularized training.


In some embodiments, because similar convergence behavior does not necessarily mean that different ρ values bring similar accuracy, the performance sensitivity of dual optimization problem-regularized training with respect to ρ may be analyzed. In some embodiments, after dual optimization problem-regularized training, 𝒲, in the uncompressed format, may exhibit strong low TT-rank characteristics and meanwhile enjoy high accuracy. Once 𝒲 meets these two criteria concurrently, the TT-cores {𝒢_i}, whose initialization is decomposed from 𝒲, may have high accuracy even before fine-tuning.


In some embodiments, to examine the required low TT-rank behavior of 𝒲, ∥𝒲 − 𝒵∥_F², which measures the similarity between 𝒲 and 𝒵, may be observed during the dual optimization problem-regularized training (see FIG. 20B). Since according to (33) 𝒵 is always updated with low TT-rank constraints, the curves shown in FIG. 20B reveal that 𝒲 indeed quickly exhibits low TT-rank characteristics during the training, except when ρ=0.001. This phenomenon indicates that, to ensure the weight tensors are well regularized to the target TT-ranks by the dual optimization problem, ρ may not be too small (e.g., less than 0.001). On the other hand, FIG. 20C shows the test accuracy of 𝒲 as training progresses. Here it is seen that a smaller ρ tends to bring better performance. Based on these observations, ρ=0.005 may be an appropriate choice to let the trained 𝒲 meet the aforementioned two criteria.


7.2. Image Classification

Table 10 shows the experimental results of the LeNet-5 model on the MNIST dataset. Embodiments of the present dual optimization problem-based tensor decomposed format model may be compared with the uncompressed model as well as typical TT/TR-format models. It is seen that embodiments of the present dual optimization problem-based compression may achieve the highest compression ratio and the best accuracy.









TABLE 10
LeNet-5 on MNIST dataset using different TT/TR-format compression approaches.

Model           Comp. Method   Top-1 (%)   Comp. Ratio
Uncompressed                   99.21        1.0×
Standard TR     TR             99.10       10.5×
PSTRN-M                        99.43       16.5×
PSTRN-S                        99.51        6.5×
Standard TT     TT             99.07       17.9×
Ours                           99.48       17.9×
Ours                           99.51        8.3×










Table 11 compares embodiments of the present dual optimization problem-based tensor decomposed format ResNet-20 and ResNet-32 models with the typical TT/TR-format on the CIFAR-10 dataset. For ResNet-20, it is seen that standard training on TT/TR-format models causes greater accuracy loss. Even for the typical designs using advanced techniques, such as heuristic rank selection (PSTRN-M/S) and reinforcement learning (TR-RL), the performance degradation is still larger than for the dual optimization problem-regularization framework of the present disclosure, especially at the high compression ratio of 6.8 times. On the other hand, with the same high compression ratio, embodiments of the present dual optimization problem-based tensor decomposed format model have only a 0.22% accuracy drop, which is 2.53% higher than the typical PSTRN-M. Furthermore, with a moderate compression ratio of 4.5 times, embodiments of the present method may even outperform the uncompressed model with a 0.22% accuracy increase.









TABLE 11
ResNet-20 and ResNet-32 on CIFAR-10 dataset using different TT/TR-format compression approaches.

Model           Comp. Method   Top-1 (%)   Comp. Ratio

ResNet-20:
Uncompressed                   91.25        1.0×
Standard TR     TR             87.5         5.4×
TR-RL                          88.3         6.8×
PSTRN-M                        88.50        6.8×
PSTRN-S                        90.80        2.5×
Standard TT     TT             86.7         5.4×
Ours                           91.03        6.8×
Ours                           91.47        4.5×

ResNet-32:
Uncompressed                   92.49        1.0×
Standard TR     TR             90.6         5.1×
PSTRN-M                        90.6         5.8×
PSTRN-S                        91.44        2.7×
Standard TT     TT             88.3         4.8×
Ours                           91.96        5.8×
Ours                           92.87        4.8×









For ResNet-32, again, standard training on compressed models using TT or TR decomposition causes larger performance degradation than other techniques. The typical PSTRN-S/M indeed brings performance improvement, but the test accuracy is still not satisfactory. Instead, embodiments of the present highly compressed (5.8 times) dual optimization problem-regularized tensor decomposed format model have only a 0.53% accuracy loss, which is 1.36% higher accuracy than PSTRN-M at the same compression ratio. More importantly, when the compression ratio is relaxed to 4.8 times, embodiments of the present dual optimization problem-based tensor decomposed format model achieve 92.87%, which is even 0.38% higher than the uncompressed model.


Table 12 shows the experimental results on the CIFAR-100 dataset. Again, embodiments of the present dual optimization problem-based tensor decomposed format model outperform the typical techniques. For ResNet-20, with an even higher compression ratio (5.6 times for the dual optimization problem-based tensor decomposed format model versus 4.7 times for PSTRN-M), embodiments of the present model achieve a 1.3% accuracy increase. With a 2.3 times compression ratio, embodiments of the present model achieve 67.36% Top-1 accuracy, which is 1.96% higher than the uncompressed model. For ResNet-32, with the same 5.2 times compression ratio, embodiments of the present approach bring a 0.4% accuracy increase over the typical PSTRN-M. With the same 2.4 times compression ratio, embodiments of the present approach have 2.26% higher accuracy than PSTRN-S and even outperform the uncompressed model with a 2.21% accuracy increase.









TABLE 12
ResNet-20 and ResNet-32 on CIFAR-100 dataset using different TT/TR-format compression approaches.

Model           Comp. Method   Top-1 (%)   Comp. Ratio

ResNet-20:
Uncompressed                   65.4         1.0×
Standard TR     TR             63.55        4.7×
PSTRN-M                        63.62        4.7×
PSTRN-S                        66.13        2.3×
Standard TT     TT             61.64        5.6×
Ours                           64.92        5.6×
Ours                           67.36        2.3×

ResNet-32:
Uncompressed                   68.10        1.0×
Standard TR     TR             66.70        4.8×
PSTRN-M                        66.77        5.2×
PSTRN-S                        68.05        2.4×
Standard TT     TT             62.90        4.6×
Ours                           67.17        5.2×
Ours                           70.31        2.4×









Table 13 shows the results of compressing ResNet-18 on the ImageNet dataset. Because no prior TT/TR compression works report results on this dataset, standard TT- and TR-based training may be used for comparison. Embodiments of the present approach may also be compared with other compression methods, including pruning and matrix SVD. Since those works report FLOPs reduction instead of compression ratio, the FLOPs reduction brought by tensor decomposition according to embodiments of the present dual optimization problem-based tensor decomposed format model may be compared. It is shown that, with a similar FLOPs reduction ratio (4.62 times), embodiments of the present dual optimization problem-based tensor decomposed format model have 1.83% and 1.18% higher accuracy than standard TT and TR, respectively. Compared with other compression approaches with non-negligible accuracy loss, embodiments of the present dual optimization problem-based tensor decomposed format model achieve better accuracy with more FLOPs reduction. In particular, with 2.47 times FLOPs reduction, embodiments of the present model have the same accuracy as the uncompressed baseline model.









TABLE 13
ResNet-18 on ImageNet dataset using compression approaches. The uncompressed baseline model is from Torchvision. Note that the reported Top-5 accuracies of FBS and FPGM in this table are obtained from pruning baselines with higher accuracy.

Model           Comp. Method   Top-5 (%)   FLOPs Reduction

ResNet-18:
Uncompressed                   89.08        1.00×
Standard TR     TR             86.29        4.28×
TRP             Matrix SVD     86.74        2.60×
TRP + Nu        Matrix SVD     86.61        3.18×
DACP            Pruning        87.60        1.89×
FBS                            88.22        1.98×
FPGM                           88.53        1.72×
DSA                            88.35        1.72×
Standard TT     TT             85.64        4.62×
Ours                           87.47        4.62×
Ours                           89.08        2.47×









7.3. Video Recognition

Table 14 compares embodiments of the present dual optimization problem-based tensor decomposed format LSTM with an uncompressed LSTM model and the typical TT-LSTM and TR-LSTM. Note that the performance of PSTRN-M/S is not reported on the UCF11 dataset.


From Table 14, it is seen that both TT-LSTM and TR-LSTM provide accuracy improvement and compression ratio improvement relative to the uncompressed LSTM, due to the feature extraction capabilities of TT/TR-format LSTM models on the ultra-high-dimensional inputs. In comparison, embodiments of the present dual optimization problem-based tensor decomposed format LSTM achieve even greater performance. With fewer parameters, embodiments of the present dual optimization problem-based tensor decomposed LSTM provide 2.1% higher top-1 accuracy than the typical TR-LSTM.









TABLE 14
LSTM on UCF11 dataset using different TT/TR-format compression approaches.

Model           Comp. Method   Top-1 (%)   # Para.   Comp. Ratio
Uncompressed                   69.7        59M        1.0×
TR-LSTM         TR             86.9        1,725     34.2K×
TT-LSTM         TT             79.6        3,360     17.6K×
Ours                           89.0        1,656     35.6K×









For the HMDB51 dataset, an Inception-V3 model may be used as the front-end pre-trained CNN, and a back-end uncompressed LSTM model may be compared with a dual optimization problem-based tensor decomposed version.


Table 15 summarizes the experimental results. It is seen that, compared with the typical TT/TR-format designs, embodiments of the present dual optimization problem-based tensor decomposed format model deliver greater performance. With the highest compression ratio (84.0 times), embodiments of the present model achieve 64.09% top-1 accuracy. Compared with the typical TR-LSTM, embodiments of the present model provide a 3.35 times higher compression ratio with an additional 0.29% accuracy increase.









TABLE 15
LSTM on HMDB51 dataset using different TT/TR-format compression approaches.

Model           Comp. Method   Top-1 (%)   # Para.   Comp. Ratio
Uncompressed                   62.9        16.8M      1.0×
TR-LSTM         TR             63.8        0.67M     25.0×
PSTRN-M                        59.67       0.36M     46.7×
PSTRN-S                        60.04       0.48M     34.7×
TT-LSTM         TT             62.24       0.67M     25.0×
Ours                           64.09       0.20M     84.0×









7.4. Discussion on Tensor Format and Generality

As described above, the dual optimization problem-based tensor decomposed format models consistently outperform the existing TT/TR-format models with higher accuracy and higher compression ratio over various datasets, thereby comprehensively demonstrating the huge benefits brought by embodiments of the present framework.


In some embodiments, the present disclosure may use tensor train-format DNN models. However, because dual optimization problem techniques such as ADMM are general optimization techniques, embodiments of the present framework may be readily applied to model compression using other tensor decomposition approaches, such as Tensor Ring (TR), Block-term (BT), Tucker, Hierarchical Tucker, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), etc. To adapt to another tensor decomposition scenario, the main modification to embodiments of the present framework is to modify the Euclidean projection (Algorithm 3) so that the truncating method is compatible with the corresponding tensor decomposition method, as illustrated by the sketch below.
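To make the adaptation concrete, the sketch below isolates the only decomposition-specific piece of the framework, the projection routine; project_fn is an assumed placeholder that would be supplied for the chosen tensor format (e.g., the TT projection sketched with Algorithm 3, or a Tucker/TR truncation).

def admm_step(W, Z, U, grad_loss, project_fn, rho=0.005, lr=0.01):
    # One ADMM iteration with a format-agnostic low-rank projection.
    W = W - lr * (grad_loss(W) + rho * (W - Z + U))   # task-loss + penalty gradient step
    Z = project_fn(W + U)                             # format-specific truncation
    U = U + W - Z                                     # dual update
    return W, Z, U

# For TT-format compression, project_fn could be, e.g.:
#   project_fn = lambda T: tt_svd_project(T, ranks=[2, 2, 2])
# For Tucker-format compression, it would instead truncate the mode-k
# matricizations, as discussed in Section 8.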


8 Example—Rank Optimization via Tensor Core Hyperparameter Training

In some embodiments, as described above, training of a tensor decomposed format DNN may be further improved by leveraging the tensor rank as a hyperparameter during training. Selecting the optimal rank is very challenging. Different from the rank selection in matrix decomposition, where the rank is a scalar, the rank selection in tensor decomposition needs to identify proper vector-format tensor ranks. Exact determination of tensor ranks for the linear tensor problem is theoretically NP-hard. Even worse, since a modern DNN model typically consists of tens or even hundreds of layers, the overall search space for determining the ranks of the decomposed DNN model is extremely large.


In some embodiments, to systematically overcome the rank selection challenge and obtain high-performance compressed DNN models within the desired memory budget (e.g., model size or computational cost), an optimization-based framework may be configured to automatically select the rank configurations for the tensor decomposed format DNN models. Embodiments of the present framework may integrate the rank selection procedure into the training procedure and let the models automatically learn a suitable rank setting from the data. For example, as described above, the original complicated NP-hard rank selection problem may be relaxed to a tensor nuclear norm-regularized optimization problem with a constraint on model size for DNN training. After such reformulation, a suitable dual optimization problem technique may be used during the training procedure to solve this optimization problem by solving two sub-problems in an iterative way. Upon the end of this iterative solving, suitable tensor ranks are automatically learned, thereby yielding highly-compressed, highly-accurate tensor decomposed format DNN models with the target compression ratio. Herein, boldface calligraphic letters denote tensors, e.g., 𝒳; a 2-dimensional tensor is a matrix, and a 1-dimensional tensor is a vector, which are represented by boldface capital letters and boldface lower-case letters, respectively, e.g., A and a. Also, non-boldface letters with indices, 𝒳(i_1, . . . , i_d), A(i, j) and a(i), denote the entry of the d-dimensional tensor 𝒳, matrix A and vector a, respectively.


Let A = UΣVᵀ be the singular value decomposition (SVD) of matrix A. The shrinkage operation is defined as

𝒮_τ(A) = UΣ_τVᵀ,   Equation (36)

where Σ_τ = diag(max(σ_i − τ, 0)) and σ_i denotes the i-th largest singular value.
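A direct NumPy rendering of the shrinkage operator of Equation (36) may look as follows; the function name shrink is illustrative.

import numpy as np

def shrink(A, tau):
    # S_tau(A) = U * diag(max(sigma_i - tau, 0)) * V^T, per Equation (36).
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - tau, 0.0)) @ Vt

# Example: all singular values below 0.5 are zeroed, the rest are reduced by 0.5.
A_shrunk = shrink(np.random.randn(6, 4), 0.5)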


Given a tensor 𝒜 ∈ ℝ^{n_1×⋯×n_d}, the mode-k matricization (also called unfolding) of 𝒜 is denoted as A_(k) ∈ ℝ^{n_k×(n_1⋯n_{k−1}n_{k+1}⋯n_d)}. The entry (i_1, . . . , i_d) of the given tensor is mapped to the entry (i_k, j) of the unfolded matrix, e.g., 𝒜(i_1, . . . , i_d) = A_(k)(i_k, j), where:

j = 1 + Σ_{p=1, p≠k}^{d} (i_p − 1) J_p with J_p = Π_{q=1, q≠k}^{p−1} n_q.   Equation (37)


Based on the matricization, the following two operations may be defined as:





unfold_k(𝒜) = A_(k),

fold_k(A_(k)) = 𝒜.   Equation (38)
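For concreteness, a NumPy sketch of the mode-k unfolding and folding of Equations (37)-(38) is given below (0-indexed modes). The column ordering produced by np.moveaxis with C-order reshaping may differ from the index map in Equation (37) by a fixed permutation; this implementation choice does not affect the matrix ranks or nuclear norms used later.

import numpy as np

def unfold(T, k):
    # Mode-k matricization: bring mode k to the front and flatten the remaining modes.
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def fold(M, k, shape):
    # Inverse of unfold: restore the moved mode to its original position.
    moved = (shape[k],) + tuple(s for i, s in enumerate(shape) if i != k)
    return np.moveaxis(M.reshape(moved), 0, k)

# Round trip: folding an unfolded tensor recovers the original tensor.
T = np.random.randn(3, 4, 5)
assert np.allclose(fold(unfold(T, 1), 1, T.shape), T)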


The Tucker rank of a d-dimensional tensor 𝒜 is denoted as n-rank(𝒜), which is the tuple of the ranks of the mode-k matricizations:

n-rank(𝒜) = [R_1, . . . , R_d],   Equation (39)

where R_k = rank(A_(k)), and k = 1, . . . , d.


The Tucker decomposition of a tensor 𝒜 ∈ ℝ^{n_1×⋯×n_d} with n-rank(𝒜) = [R_1, . . . , R_d] may be represented as:

𝒜(i_1, . . . , i_d) = Σ_{r_1, . . . , r_d}^{R_1, . . . , R_d} 𝒞(r_1, . . . , r_d) B_1(r_1, i_1) ⋯ B_d(r_d, i_d),   Equation (40)

where 𝒞 ∈ ℝ^{R_1×⋯×R_d} is the core tensor and B_1 ∈ ℝ^{R_1×n_1}, . . . , B_d ∈ ℝ^{R_d×n_d} are the factor matrices.


Considering a convolutional layer with kernel tensor 𝒲 ∈ ℝ^{S×C×K×K}, the input tensor 𝒳 ∈ ℝ^{W×H×C} is transformed into the output tensor 𝒴 ∈ ℝ^{W′×H′×S} by convolving with the kernel:

𝒴(w′, h′, s) = Σ_{i=1}^{K} Σ_{j=1}^{K} Σ_{c=1}^{C} 𝒲(s, c, i, j) 𝒳(w, h, c) with w = w′ + i − 1 and h = h′ + j − 1,   Equation (41)

where W′ = W − K + 1 and H′ = H − K + 1.


In some embodiments, in a Tucker-format convolutional layer, in order to retain the spatial information, the modes associated with the kernel size are not decomposed. This decomposition approach is called Tucker-2 decomposition, by which the kernel tensor may be represented as:











𝒲(s, c, i, j) = Σ_{r_1=1}^{R_1} Σ_{r_2=1}^{R_2} 𝒞(r_1, r_2, i, j) B_1(r_1, s) B_2(r_2, c),   Equation (42)








where 𝒞 ∈ ℝ^{R_1×R_2×K×K}, B_1 ∈ ℝ^{R_1×S} and B_2 ∈ ℝ^{R_2×C}. Correspondingly, the Tucker rank of the decomposed kernel 𝒲 is n-rank(𝒲) = [R_1, R_2], with R_1 = rank(W_(1)) and R_2 = rank(W_(2)).


With the Tucker-2 format, the original full convolution layer is decomposed into a sequence of small convolutions, thus the output tensor is obtained by:












𝒯_1(w, h, r_2) = Σ_{c=1}^{C} B_2(r_2, c) 𝒳(w, h, c),   Equation (43)

𝒯_2(w′, h′, r_1) = Σ_{i=1}^{K} Σ_{j=1}^{K} Σ_{r_2=1}^{R_2} 𝒞(r_1, r_2, i, j) 𝒯_1(w, h, r_2),   Equation (44)

𝒴(w′, h′, s) = Σ_{r_1=1}^{R_1} B_1(r_1, s) 𝒯_2(w′, h′, r_1),   Equation (45)

where 𝒯_1 ∈ ℝ^{W×H×R_2} and 𝒯_2 ∈ ℝ^{W′×H′×R_1}.
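The three-stage computation of Equations (43)-(45) may be sketched in NumPy with einsum as below; the "valid" convolution indexing (no padding, stride 1) and the channel-last layout are assumptions chosen to match the equations.

import numpy as np

def tucker2_conv(X, C, B1, B2):
    # X: input (W, H, Cin); C: core (R1, R2, K, K); B1: (R1, S); B2: (R2, Cin).
    W, H, Cin = X.shape
    R1, R2, K, _ = C.shape
    Wp, Hp = W - K + 1, H - K + 1
    # Eq. (43): 1x1-style projection of the Cin input channels onto R2 channels.
    T1 = np.einsum('rc,whc->whr', B2, X)                       # (W, H, R2)
    # Eq. (44): K x K convolution with the small core tensor.
    T2 = np.zeros((Wp, Hp, R1))
    for i in range(K):
        for j in range(K):
            patch = T1[i:i + Wp, j:j + Hp, :]                  # shifted window
            T2 += np.einsum('pr,whr->whp', C[:, :, i, j], patch)
    # Eq. (45): 1x1-style expansion from R1 channels to the S output channels.
    return np.einsum('rs,whr->whs', B1, T2)                    # (W', H', S)

# Example shapes: 8x8 input with 3 channels, 3x3 core, R1 = R2 = 2, S = 4.
Y = tucker2_conv(np.random.randn(8, 8, 3), np.random.randn(2, 2, 3, 3),
                 np.random.randn(2, 4), np.random.randn(2, 3))   # -> (6, 6, 4)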








8.1 Problem Analysis & Formulation

Without loss of generality, the aim is to compress an L-layer convolutional neural network (CNN) using Tucker decomposition. While Tucker decomposition is used here, it is merely an illustration of a particular form of tensor decomposition in which the tensor ranks are treated as hyperparameters to be learned during the training process. Any suitable tensor decomposition may be used according to similar processes described below. To compress other types of DNNs, e.g., a recurrent neural network (RNN), or to use other tensor decompositions, e.g., tensor train or tensor ring, a similar solution as described above may also be derived under the optimization framework. To simplify the notation, 𝒲 may be used without a subscript to represent the weights of all layers ({𝒲_i}_{i=1}^{L}, where each 𝒲_i is a 4-dimensional kernel tensor), and the bias parameters are omitted. The loss function is denoted by ℒ(𝒲) (e.g., cross-entropy loss for a classification task).


With such notation, embodiments of the present optimization framework aim to minimize the loss function and the Tucker rank of each layer simultaneously, e.g., min_𝒲 ℒ(𝒲) and min_𝒲 n-rank(𝒲), with memory budget constraints to satisfy the target compression ratio. In some embodiments, decreasing the tensor ranks may lower the capacity of the CNN model, thus making it more difficult for the loss function to reach the optimal point. Therefore, the problem may be solved by keeping the loss within an acceptable range while the tensor ranks also satisfy the compression demand. Since exact determination of tensor ranks for the linear tensor problem is NP-hard, it is computationally intractable to obtain the optimal tensor ranks via direct searching approaches.


In some embodiments, to overcome this challenge, based on the fact that the nuclear norm (or trace norm) ∥⋅∥_* is the tightest convex surrogate for rank(⋅), the second objective may be relaxed and the original problem may be approximated by the following reformulation:












min_𝒲 ℒ(𝒲) + ∥𝒲∥_*,   s.t. 𝒫(n-rank(𝒲)) ≤ P*,   Equation (46)








where ∥𝒲∥_* is the tensor nuclear norm of 𝒲, 𝒫(⋅) returns the overall model size with the current ranks, and P*, as the “memory budget”, is the desired model size after compression. The tensor nuclear norm of a d-dimensional tensor 𝒳 is the convex combination of the nuclear norms of all unfolded matrices along each mode:













∥𝒳∥_* = Σ_{i=1}^{d} α_i ∥X_(i)∥_*,   Equation (47)








where {α_i}_{i=1}^{d} are constants that satisfy α_i ≥ 0 and Σ_{i=1}^{d} α_i = 1. Recall that for the Tucker-2 convolution, only the first mode and the second mode are decomposed in order to retain the spatial information. Therefore, the tensor nuclear norm of the kernel tensor 𝒲 is the sum of the nuclear norms of the mode-1 and mode-2 matricizations:





∥𝒲∥_* = α_1∥W_(1)∥_* + α_2∥W_(2)∥_*.   Equation (48)


In some embodiments, the dimensions of the input channel and the output channel may have the same influence on the kernel tensor; thus α_1 and α_2 may be set to be equal so that they may be eliminated from the above equation. Hence, the formulated optimization Equation (46) may be rewritten as:












min_𝒲 ℒ(𝒲) + ∥W_(1)∥_* + ∥W_(2)∥_*,   s.t. 𝒫(R_1, R_2) ≤ P*.   Equation (49)








8.2 Automatic Compression: Learn the Tensor Ranks via Dual Optimization Problem-Based Optimization

Minimizing an objective that contains two entries-shared non-smooth nuclear norm terms (e.g., on W_(1) and W_(2)), which is exactly the format of Equation (49), is challenging. Therefore, in some embodiments, a dual optimization problem technique may provide an efficient way to handle such non-smooth terms. Thus, in some embodiments, the objective may be transformed into a format that fits the dual optimization problem framework. For example, separate variables may be attributed to the mode-1 and mode-2 matricizations of 𝒲, e.g., by introducing auxiliary variables 𝒵_1 and 𝒵_2 whose shapes are identical to 𝒲 such that Z_{1(1)} = W_(1) and Z_{2(2)} = W_(2). Then, with the introduced variables, the optimization Equation (49) may be reformulated as:












min_{𝒲, 𝒵_1, 𝒵_2} ℒ(𝒲) + ∥Z_{1(1)}∥_* + ∥Z_{2(2)}∥_*,   s.t. 𝒲 = 𝒵_1, 𝒲 = 𝒵_2, 𝒫(R_1, R_2) ≤ P*.   Equation (50)








Hence, the corresponding augmented Lagrangian with dual multipliers of (50) may be defined as:













ℒ_ρ(𝒲, 𝒵_1, 𝒵_2, ℳ_1, ℳ_2) = ℒ(𝒲) + ∥Z_{1(1)}∥_* + ∥Z_{2(2)}∥_* + ⟨𝒲 − 𝒵_1, ℳ_1⟩ + ⟨𝒲 − 𝒵_2, ℳ_2⟩ + (ρ/2)∥𝒲 − 𝒵_1∥_F² + (ρ/2)∥𝒲 − 𝒵_2∥_F²,   Equation (51)








where ℳ_1 and ℳ_2 are the dual multipliers.
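As a sketch, the scalar value of the augmented Lagrangian in Equation (51) may be evaluated as follows; loss_value stands for ℒ(𝒲) computed elsewhere, and unfold() is the mode-k matricization sketched earlier in this section (mode-1 and mode-2 map to axes 0 and 1).

import numpy as np

def nuclear_norm(M):
    # Sum of the singular values of a matrix.
    return np.linalg.svd(M, compute_uv=False).sum()

def augmented_lagrangian(loss_value, W, Z1, Z2, M1, M2, rho):
    # Equation (51): task loss + nuclear norms + linear and quadratic penalty terms.
    return (loss_value
            + nuclear_norm(unfold(Z1, 0)) + nuclear_norm(unfold(Z2, 1))
            + np.sum((W - Z1) * M1) + np.sum((W - Z2) * M2)
            + 0.5 * rho * np.linalg.norm(W - Z1) ** 2
            + 0.5 * rho * np.linalg.norm(W - Z2) ** 2)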


In some embodiments, the dual optimization problem technique may be directly applied to the augmented Lagrangian function to solve the problem iteratively.


8.2.1 Updating Step for 𝒵_1, 𝒵_2-Variables


In some embodiments, considering that the terms in Equation (51) contain either 𝒵_1 or 𝒵_2 and are independent non-negative functions, ℒ_ρ may be minimized with respect to 𝒵_1 and 𝒵_2 separately. Taking the sub-problem for 𝒵_1 as a detailed example, the matrix form of 𝒵_1 along mode-1 is updated by:














Z_{1(1)}^{k+1} = argmin_{Z_{1(1)}} ∥Z_{1(1)}∥_* + ⟨W_(1)^k − Z_{1(1)}, M_{1(1)}^k⟩ + (ρ/2)∥W_(1)^k − Z_{1(1)}∥_F²
             = argmin_{Z_{1(1)}} ∥Z_{1(1)}∥_* + (ρ/2)∥W_(1)^k − Z_{1(1)} + M_{1(1)}^k/ρ∥_F²,   Equation (52)








where k is the iteration step. Equation (52) has the closed-form solution 𝒮_{τ_1}(W_(1)^k + M_{1(1)}^k/ρ). Hence, 𝒵_1 may be updated by:






𝒵_1^{k+1} = fold_1(𝒮_{τ_1}(W_(1)^k + M_{1(1)}^k/ρ)).   Equation (53)


For updating 𝒵_2, since this procedure is independent of the 𝒵_1-update, 𝒵_2 may be updated in parallel in a similar way as follows:






𝒵_2^{k+1} = fold_2(𝒮_{τ_2}(W_(2)^k + M_{2(2)}^k/ρ)).   Equation (54)
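A minimal sketch of the updates in Equations (53)-(54), reusing the shrink(), unfold() and fold() helpers sketched earlier in this section (mode-1 and mode-2 map to axes 0 and 1), is:

def update_Z(W, M1, M2, tau1, tau2, rho):
    # Z1 and Z2 are updated independently and may be computed in parallel.
    shape = W.shape
    Z1 = fold(shrink(unfold(W + M1 / rho, 0), tau1), 0, shape)   # Equation (53)
    Z2 = fold(shrink(unfold(W + M2 / rho, 1), tau2), 1, shape)   # Equation (54)
    return Z1, Z2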


In such updating steps, the automatic determination of the Tucker ranks R_1 and R_2 may be performed via the shrinkage operation 𝒮(⋅) applied in Equations (53) and (54) with τ_1, τ_2 for the current iteration. In some embodiments, the singular value decompositions U_1 diag(σ_1) V_1ᵀ = W_(1)^k + M_{1(1)}^k/ρ and U_2 diag(σ_2) V_2ᵀ = W_(2)^k + M_{2(2)}^k/ρ may be computed, where the singular values σ_{1i}, σ_{2i} in σ_1, σ_2 are sorted in decreasing order, respectively.


In some embodiments, τ1 and τ2 may then be calculated by solving the following 0-1 knapsack problem:












max_{s_1, s_2 ∈ {0, 1}} ⟨σ_1, s_1⟩ + ⟨σ_2, s_2⟩,   s.t. 𝒫(R_1, R_2) ≤ P*,   Equation (55)








where s_1 and s_2 are binary vectors. In some embodiments, the lengths of s_1 and s_2 may be identical to those of σ_1 and σ_2, respectively, and the numbers of non-zero elements in s_1 and s_2 may be equal to R_1 and R_2, respectively. Thus, once τ_1 and τ_2 are obtained, R_1 and R_2 are also correspondingly determined. In such a knapsack problem, σ_1 and σ_2 are the “profit” and 𝒫(⋅) is the “weight”. In some embodiments, to determine the optimal ranks, the selected “profit” may be maximized while the overall “weight” is prevented from exceeding the memory budget. Although perfectly solving such a 0-1 knapsack problem is very challenging, an efficient greedy algorithm may obtain a sub-optimal solution with very low computational complexity. Thus, in some embodiments, the largest remaining value in σ_1 and σ_2 may be selected at each step. When the “weight” reaches the memory budget P*, the algorithm terminates. Hence, τ_1 and τ_2 may be set according to the currently selected values in σ_1 and σ_2. Correspondingly, R_1 and R_2 may be naturally obtained by:






R_1 = max{i | σ_{1i} > τ_1, i = 1, 2, . . . },   Equation (56)

R_2 = max{i | σ_{2i} > τ_2, i = 1, 2, . . . }.   Equation (57)


Upon finding the solution, the singular values that are smaller than τ1 and τ2 are truncated.
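A simple greedy solver for the budgeted selection of Equation (55) may be sketched as follows. The cost model param_count, standing for 𝒫(R_1, R_2), and the interpretation of τ_1, τ_2 as the largest unselected singular values (one reading consistent with Equations (56)-(57)) are assumptions of this sketch.

import numpy as np

def greedy_rank_selection(sigma1, sigma2, param_count, budget):
    # Greedily pick the largest remaining singular value ("profit") from either
    # list until adding one more would exceed the memory budget ("weight").
    R1, R2 = 0, 0
    while True:
        next1 = sigma1[R1] if R1 < len(sigma1) else -np.inf
        next2 = sigma2[R2] if R2 < len(sigma2) else -np.inf
        cand = (R1 + 1, R2) if next1 >= next2 else (R1, R2 + 1)
        if max(next1, next2) == -np.inf or param_count(*cand) > budget:
            break
        R1, R2 = cand
    # Thresholds: the largest unselected singular values, so that exactly the
    # selected values satisfy sigma > tau as in Equations (56)-(57).
    tau1 = sigma1[R1] if R1 < len(sigma1) else 0.0
    tau2 = sigma2[R2] if R2 < len(sigma2) else 0.0
    return R1, R2, tau1, tau2

# Illustrative Tucker-2 cost model for a layer with S = 64, C = 32, K = 3:
# P(R1, R2) = R1*S + R2*C + R1*R2*K*K parameters.
param_count = lambda R1, R2: R1 * 64 + R2 * 32 + R1 * R2 * 9
R1, R2, tau1, tau2 = greedy_rank_selection(np.sort(np.random.rand(8))[::-1],
                                           np.sort(np.random.rand(8))[::-1],
                                           param_count, budget=500)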


8.2.2 Updating Step for 𝒲-Variable

In some embodiments, after determining the tensor ranks and updating the 𝒵_1, 𝒵_2-variables at the current iteration, 𝒲 may be updated by minimizing ℒ_ρ over 𝒲. In some embodiments, fixing all variables except 𝒲, the minimization of ℒ_ρ reduces to minimizing the loss term plus quadratic penalty terms:












min_𝒲 ℒ(𝒲) + ⟨𝒲 − 𝒵_1^k, ℳ_1^k⟩ + ⟨𝒲 − 𝒵_2^k, ℳ_2^k⟩ + (ρ/2)∥𝒲 − 𝒵_1^k∥_F² + (ρ/2)∥𝒲 − 𝒵_2^k∥_F²,   Equation (58)








which is differentiable. Hence, the gradient of ℒ_ρ(𝒲) over 𝒲 is given by:















∂ℒ_ρ(𝒲)/∂𝒲 = ∂ℒ(𝒲)/∂𝒲 + (ρ/2)(𝒲 − 𝒵_1^k + ℳ_1^k/ρ) + (ρ/2)(𝒲 − 𝒵_2^k + ℳ_2^k/ρ).   Equation (59)








Then, in some embodiments, custom-character may be updated by an optimization algorithm such as standard stochastic gradient descent as:











𝒲^{k+1} = 𝒲^k − β ∂ℒ_ρ(𝒲)/∂𝒲,   Equation (60)








where β is the learning rate for DNN training.
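A minimal sketch of the 𝒲-update of Equations (59)-(60), together with the dual-variable updates used in lines 12-13 of Algorithm 5, is shown below; grad_loss(W) stands for the backpropagated gradient ∂ℒ(𝒲)/∂𝒲 and is an assumed callable operating on NumPy-style arrays.

def update_W(W, Z1, Z2, M1, M2, grad_loss, rho, beta):
    # One gradient step on the augmented Lagrangian with respect to W, per Eqs. (59)-(60).
    grad = (grad_loss(W)
            + 0.5 * rho * (W - Z1 + M1 / rho)
            + 0.5 * rho * (W - Z2 + M2 / rho))
    return W - beta * grad

def update_duals(W, Z1, Z2, M1, M2, rho):
    # Dual ascent on the multipliers (cf. Algorithm 5, lines 12-13).
    return M1 + rho * (W - Z1), M2 + rho * (W - Z2)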


In some embodiments, once the 𝒲-variable is updated, another round of 𝒵_1, 𝒵_2-variable updates may be performed based on the latest 𝒲-variable. Such iterative updates may continue until the end of the training procedure. After that, the desired tensor decomposed weight tensors are well trained and obtained. Meanwhile, the finally selected tensor ranks are automatically determined as the R_1 and R_2 calculated by Equations (56) and (57) at the last iteration. Thus, in some embodiments, the overall procedure of the dual optimization problem-regularized training with automatic rank selection may be summarized in Algorithm 5.












Algorithm 5: Dual Optimization Problem-Regularized Training Algorithm with Automatic Tensor Rank Selection

 1: Inputs: Weight tensor 𝒲, learning rate β, penalty parameter ρ, number of epochs T, target compressed model size (“memory budget”) P*.
 2: Output: Optimized weight tensor 𝒲, selected ranks R_1 and R_2.
 3: 𝒵_1 := 𝒲, 𝒵_2 := 𝒲;
 4: ℳ_1 := zeros_like(𝒲);
 5: ℳ_2 := zeros_like(𝒲);
 6: while k < T do
 7:   Obtain τ_1 and τ_2 by solving Equation (55);
 8:   Determine R_1 and R_2 using Equations (56) and (57);
 9:   Update 𝒵_1^{k+1} using Equation (53);
10:   Update 𝒵_2^{k+1} using Equation (54);
11:   Update 𝒲^{k+1} using Equation (60);
12:   ℳ_1^{k+1} := ℳ_1^k + ρ(𝒲^{k+1} − 𝒵_1^{k+1});
13:   ℳ_2^{k+1} := ℳ_2^k + ρ(𝒲^{k+1} − 𝒵_2^{k+1});
14: end while










FIGS. 21A through 21C illustrate changes in (a) training loss, (b) Top-1 test accuracy, and (c) the number of parameters during the training process for ResNet-18 on the ImageNet dataset in accordance with aspects of embodiments of the present disclosure. The compression memory budget is set as 5.058×10^6 parameters.



FIGS. 22A and 22B illustrate (a) rank variations for two example component convolutional layers during training, and (b) the final rank distribution using embodiments of the present method for ResNet-18 on the ImageNet dataset in accordance with aspects of embodiments of the present disclosure.



FIG. 23 depicts a block diagram of an exemplary computer-based system and platform 2300 in accordance with one or more embodiments of the present disclosure. However, not all of these components may be required to practice one or more embodiments, and variations in the arrangement and type of the components may be made without departing from the spirit or scope of various embodiments of the present disclosure. In some embodiments, the illustrative computing devices and the illustrative computing components of the exemplary computer-based system and platform 2300 may be configured to manage a large number of members and concurrent transactions, as detailed herein. In some embodiments, the exemplary computer-based system and platform 2300 may be based on a scalable computer and network architecture that incorporates various strategies for assessing the data, caching, searching, and/or database connection pooling. An example of the scalable architecture is an architecture that is capable of operating multiple servers.


In some embodiments, referring to FIG. 23, member computing device 2302, member computing device 2303 through member computing device 2304 (e.g., clients) of the exemplary computer-based system and platform 2300 may include virtually any computing device capable of receiving and sending a message over a network (e.g., cloud network), such as network 2305, to and from another computing device, such as servers 2306 and 2307, each other, and the like. In some embodiments, the member devices 2302-2304 may be personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, and the like. In some embodiments, one or more member devices within member devices 2302-2304 may include computing devices that typically connect using a wireless communications medium such as cell phones, smart phones, pagers, walkie talkies, radio frequency (RF) devices, infrared (IR) devices, citizens band radio, integrated devices combining one or more of the preceding devices, or virtually any mobile computing device, and the like. In some embodiments, one or more member devices within member devices 2302-2304 may be devices that are capable of connecting using a wired or wireless communication medium such as a PDA, POCKET PC, wearable computer, a laptop, tablet, desktop computer, a netbook, a video game device, a pager, a smart phone, an ultra-mobile personal computer (UMPC), and/or any other device that is equipped to communicate over a wired and/or wireless communication medium (e.g., NFC, RFID, NBIOT, 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, OFDM, OFDMA, LTE, satellite, ZigBee, etc.). In some embodiments, one or more member devices within member devices 2302-2304 may run one or more applications, such as Internet browsers, mobile applications, voice calls, video games, videoconferencing, and email, among others. In some embodiments, one or more member devices within member devices 2302-2304 may be configured to receive and to send web pages, and the like. In some embodiments, an exemplary specifically programmed browser application of the present disclosure may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any web based language, including, but not limited to Standard Generalized Markup Language (SGML), such as HyperText Markup Language (HTML), a wireless application protocol (WAP), a Handheld Device Markup Language (HDML), such as Wireless Markup Language (WML), WMLScript, XML, JavaScript, and the like. In some embodiments, a member device within member devices 2302-2304 may be specifically programmed by either Java, .Net, QT, C, C++, Python, PHP and/or other suitable programming language. In some embodiments of the device software, device control may be distributed between multiple standalone applications. In some embodiments, software components/applications may be updated and redeployed remotely as individual units or as a full software suite. In some embodiments, a member device may periodically report status or send alerts over text or email. In some embodiments, a member device may contain a data recorder which is remotely downloadable by the user using network protocols such as FTP, SSH, or other file transfer mechanisms. In some embodiments, a member device may provide several levels of user interface, for example, advanced user, standard user. 
In some embodiments, one or more member devices within member devices 2302-2304 may be specifically programmed to include or execute an application to perform a variety of possible tasks, such as, without limitation, messaging functionality, browsing, searching, playing, streaming or displaying various forms of content, including locally stored or uploaded messages, images and/or video, and/or games.


In some embodiments, the exemplary network 2305 may provide network access, data transport and/or other services to any computing device coupled to it. In some embodiments, the exemplary network 2305 may include and implement at least one specialized network architecture that may be based at least in part on one or more standards set by, for example, without limitation, Global System for Mobile communication (GSM) Association, the Internet Engineering Task Force (IETF), and the Worldwide Interoperability for Microwave Access (WiMAX) forum. In some embodiments, the exemplary network 2305 may implement one or more of a GSM architecture, a General Packet Radio Service (GPRS) architecture, a Universal Mobile Telecommunications System (UMTS) architecture, and an evolution of UMTS referred to as Long Term Evolution (LTE). In some embodiments, the exemplary network 2305 may include and implement, as an alternative or in conjunction with one or more of the above, a WiMAX architecture defined by the WiMAX forum. In some embodiments and, optionally, in combination of any embodiment described above or below, the exemplary network 2305 may also include, for instance, at least one of a local area network (LAN), a wide area network (WAN), the Internet, a virtual LAN (VLAN), an enterprise LAN, a layer 3 virtual private network (VPN), an enterprise IP network, or any combination thereof. In some embodiments and, optionally, in combination of any embodiment described above or below, at least one computer network communication over the exemplary network 2305 may be transmitted based at least in part on one of more communication modes such as but not limited to: NFC, RFID, Narrow Band Internet of Things (NBIOT), ZigBee, 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, OFDM, OFDMA, LTE, satellite and any combination thereof. In some embodiments, the exemplary network 2305 may also include mass storage, such as network attached storage (NAS), a storage area network (SAN), a content delivery network (CDN) or other forms of computer or machine readable media.


In some embodiments, the exemplary server 2306 or the exemplary server 2307 may be a web server (or a series of servers) running a network operating system, examples of which may include but are not limited to Apache on Linux or Microsoft IIS (Internet Information Services). In some embodiments, the exemplary server 2306 or the exemplary server 2307 may be used for and/or provide cloud and/or network computing. Although not shown in FIG. 23, in some embodiments, the exemplary server 2306 or the exemplary server 2307 may have connections to external systems like email, SMS messaging, text messaging, ad content providers, etc. Any of the features of the exemplary server 2306 may be also implemented in the exemplary server 2307 and vice versa.


In some embodiments, one or more of the exemplary servers 2306 and 2307 may be specifically programmed to perform, in non-limiting example, as authentication servers, search servers, email servers, social networking services servers, Short Message Service (SMS) servers, Instant Messaging (IM) servers, Multimedia Messaging Service (MMS) servers, exchange servers, photo-sharing services servers, advertisement providing servers, financial/banking-related services servers, travel services servers, or any similarly suitable service-based servers for users of the member computing devices 2301-2304.


In some embodiments and, optionally, in combination of any embodiment described above or below, for example, one or more exemplary computing member devices 2302-2304, the exemplary server 2306, and/or the exemplary server 2307 may include a specifically programmed software module that may be configured to send, process, and receive information using a scripting language, a remote procedure call, an email, a tweet, Short Message Service (SMS), Multimedia Message Service (MMS), instant messaging (IM), an application programming interface, Simple Object Access Protocol (SOAP) methods, Common Object Request Broker Architecture (CORBA), HTTP (Hypertext Transfer Protocol), REST (Representational State Transfer), SOAP (Simple Object Access Protocol), MLLP (Minimum Lower Layer Protocol), or any combination thereof.



FIG. 24 depicts a block diagram of another exemplary computer-based system and platform 2400 in accordance with one or more embodiments of the present disclosure. However, not all of these components may be required to practice one or more embodiments, and variations in the arrangement and type of the components may be made without departing from the spirit or scope of various embodiments of the present disclosure. In some embodiments, the member computing device 2402a, member computing device 2402b through member computing device 2402n shown each at least includes a computer-readable medium, such as a random-access memory (RAM) 2408 coupled to a processor 2410 or FLASH memory. In some embodiments, the processor 2410 may execute computer-executable program instructions stored in memory 2408. In some embodiments, the processor 2410 may include a microprocessor, an ASIC, and/or a state machine. In some embodiments, the processor 2410 may include, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor 2410, may cause the processor 2410 to perform one or more steps described herein. In some embodiments, examples of computer-readable media may include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 2410 of client 2402a, with computer-readable instructions. In some embodiments, other examples of suitable media may include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor may read instructions. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. In some embodiments, the instructions may comprise code from any computer-programming language, including, for example, C, C++, Visual Basic, Java, Python, Perl, JavaScript, etc.


In some embodiments, member computing devices 2402a through 2402n may also comprise a number of external or internal devices such as a mouse, a CD-ROM, DVD, a physical or virtual keyboard, a display, or other input or output devices. In some embodiments, examples of member computing devices 2402a through 2402n (e.g., clients) may be any type of processor-based platforms that are connected to a network 2406 such as, without limitation, personal computers, digital assistants, personal digital assistants, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices. In some embodiments, member computing devices 2402a through 2402n may be specifically programmed with one or more application programs in accordance with one or more principles/methodologies detailed herein. In some embodiments, member computing devices 2402a through 2402n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft™, Windows™, and/or Linux. In some embodiments, member computing devices 2402a through 2402n shown may include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer™, Apple Computer, Inc.'s Safari™, Mozilla Firefox, and/or Opera. In some embodiments, through the member computing client devices 2402a through 2402n, user 2412a, user 2412b through user 2412n, may communicate over the exemplary network 2406 with each other and/or with other systems and/or devices coupled to the network 2406. As shown in FIG. 24, exemplary server devices 2404 and 2413 may include processor 2405 and processor 2414, respectively, as well as memory 2417 and memory 2416, respectively. In some embodiments, the server devices 2404 and 2413 may be also coupled to the network 2406. In some embodiments, one or more member computing devices 2402a through 2402n may be mobile clients.


In some embodiments, at least one database of exemplary databases 2407 and 2415 may be any type of database, including a database managed by a database management system (DBMS). In some embodiments, an exemplary DBMS-managed database may be specifically programmed as an engine that controls organization, storage, management, and/or retrieval of data in the respective database. In some embodiments, the exemplary DBMS-managed database may be specifically programmed to provide the ability to query, backup and replicate, enforce rules, provide security, compute, perform change and access logging, and/or automate optimization. In some embodiments, the exemplary DBMS-managed database may be chosen from Oracle database, IBM DB2, Adaptive Server Enterprise, FileMaker, Microsoft Access, Microsoft SQL Server, MySQL, PostgreSQL, and a NoSQL implementation. In some embodiments, the exemplary DBMS-managed database may be specifically programmed to define each respective schema of each database in the exemplary DBMS, according to a particular database model of the present disclosure which may include a hierarchical model, network model, relational model, object model, or some other suitable organization that may result in one or more applicable data structures that may include fields, records, files, and/or objects. In some embodiments, the exemplary DBMS-managed database may be specifically programmed to include metadata about the data that is stored.


In some embodiments, the exemplary inventive computer-based systems/platforms, the exemplary inventive computer-based devices, and/or the exemplary inventive computer-based components of the present disclosure may be specifically configured to operate in a cloud computing/architecture 2425 such as, but not limited to: infrastructure as a service (IaaS) 2610, platform as a service (PaaS) 2608, and/or software as a service (SaaS) 2606 using a web browser, mobile app, thin client, terminal emulator or other endpoint 2604. FIGS. 25 and 26 illustrate schematics of exemplary implementations of the cloud computing/architecture(s) in which the exemplary inventive computer-based systems/platforms, the exemplary inventive computer-based devices, and/or the exemplary inventive computer-based components of the present disclosure may be specifically configured to operate.


It is understood that at least one aspect/functionality of various embodiments described herein may be performed in real-time and/or dynamically. As used herein, the term “real-time” is directed to an event/action that may occur instantaneously or almost instantaneously in time when another event/action has occurred. For example, the “real-time processing,” “real-time computation,” and “real-time execution” all pertain to the performance of a computation during the actual time that the related physical process (e.g., a user interacting with an application on a mobile device) occurs, in order that results of the computation may be used in guiding the physical process.


As used herein, the term “dynamically” and term “automatically,” and their logical and/or linguistic relatives and/or derivatives, mean that certain events and/or actions may be triggered and/or occur without any human intervention. In some embodiments, events and/or actions in accordance with the present disclosure may be in real-time and/or based on a predetermined periodicity of at least one of: nanosecond, several nanoseconds, millisecond, several milliseconds, second, several seconds, minute, several minutes, hourly, several hours, daily, several days, weekly, monthly, etc.


As used herein, the term “runtime” corresponds to any behavior that is dynamically determined during an execution of a software application or at least a portion of software application.


In some embodiments, exemplary inventive, specially programmed computing systems and platforms with associated devices are configured to operate in the distributed network environment, communicating with one another over one or more suitable data communication networks (e.g., the Internet, satellite, etc.) and utilizing one or more suitable data communication protocols/modes such as, without limitation, IPX/SPX, X.25, AX.25, AppleTalk™, TCP/IP (e.g., HTTP), near-field wireless communication (NFC), RFID, Narrow Band Internet of Things (NBIOT), 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, satellite, ZigBee, and other suitable communication modes.


In some embodiments, the NFC may represent a short-range wireless communications technology in which NFC-enabled devices are “swiped,” “bumped,” “tapped,” or otherwise moved in close proximity to communicate. In some embodiments, the NFC could include a set of short-range wireless technologies, typically requiring a distance of 10 cm or less. In some embodiments, the NFC may operate at 13.56 MHz on ISO/IEC 18000-3 air interface and at rates ranging from 106 kbit/s to 424 kbit/s. In some embodiments, the NFC may involve an initiator and a target; the initiator actively generates an RF field that may power a passive target. In some embodiments, this may enable NFC targets to take very simple form factors such as tags, stickers, key fobs, or cards that do not require batteries. In some embodiments, the NFC's peer-to-peer communication may be conducted when a plurality of NFC-enabled devices (e.g., smartphones) are within close proximity of each other.


The material disclosed herein may be implemented in software or firmware or a combination of them or as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.


As used herein, the terms “computer engine” and “engine” identify at least one software component and/or a combination of at least one software component and at least one hardware component which are designed/programmed/configured to manage/control other software and/or hardware components (such as the libraries, software development kits (SDKs), objects, etc.).


Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGA), logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. In some embodiments, the one or more processors may be implemented as Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors, x86 instruction set compatible processors, multi-core processors, or any other microprocessor or central processing unit (CPU). In various implementations, the one or more processors may be dual-core processor(s), dual-core mobile processor(s), and so forth.


Computer-related systems, computer systems, and systems, as used herein, include any combination of hardware and software. Examples of software may include software components, programs, applications, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computer code, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle memory budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.


One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor. Of note, various embodiments described herein may, of course, be implemented using any appropriate hardware and/or computing software languages (e.g., C++, Objective-C, Swift, Java, JavaScript, Python, Perl, QT, etc.).


In some embodiments, one or more of illustrative computer-based systems or platforms of the present disclosure may include or be incorporated, partially or entirely into at least one personal computer (PC), laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television), mobile internet device (MID), messaging device, data communication device, and so forth.


As used herein, term “server” may be understood to refer to a service point which provides processing, database, and communication facilities. By way of example, and not limitation, the term “server” may refer to a single, physical processor with associated communications and data storage and database facilities, or it may refer to a networked or clustered complex of processors and associated network and storage devices, as well as operating software and one or more database systems and application software that support the services provided by the server. Cloud servers are examples.


In some embodiments, as detailed herein, one or more of the computer-based systems of the present disclosure may obtain, manipulate, transfer, store, transform, generate, and/or output any digital object and/or data unit (e.g., from inside and/or outside of a particular application) that may be in any suitable form such as, without limitation, a file, a contact, a task, an email, a message, a map, an entire application (e.g., a calculator), data points, and other suitable data. In some embodiments, as detailed herein, one or more of the computer-based systems of the present disclosure may be implemented across one or more of various computer platforms such as, but not limited to: (1) FreeBSD, NetBSD, OpenBSD; (2) Linux; (3) Microsoft Windows™; (4) OpenVMS™; (5) OS X (MacOS™); (6) UNIX™; (7) Android; (8) iOS™; (9) Embedded Linux; (10) Tizen™; (11) WebOS™; (12) Adobe AIR™; (13) Binary Runtime Environment for Wireless (BREW™); (14) Cocoa™ (API); (15) Cocoa™ Touch; (16) Java™ Platforms; (17) JavaFX™; (18) QNX™; (19) Mono; (20) Google Blink; (21) Apple WebKit; (22) Mozilla Gecko™; (23) Mozilla XUL; (24) .NET Framework; (25) Silverlight™; (26) Open Web Platform; (27) Oracle Database; (28) Qt™; (29) SAP NetWeaver™; (30) Smartface™; (31) Vexi™; (32) Kubernetes™ and (33) Windows Runtime (WinRT™) or other suitable computer platforms or any combination thereof. In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to utilize hardwired circuitry that may be used in place of or in combination with software instructions to implement features consistent with principles of the disclosure. Thus, implementations consistent with principles of the disclosure are not limited to any specific combination of hardware circuitry and software. For example, various embodiments may be embodied in many different ways as a software component such as, without limitation, a stand-alone software package, a combination of software packages, or it may be a software package incorporated as a “tool” in a larger software product.


For example, exemplary software specifically programmed in accordance with one or more principles of the present disclosure may be downloadable from a network, for example, a website, as a stand-alone product or as an add-in package for installation in an existing software application. For example, exemplary software specifically programmed in accordance with one or more principles of the present disclosure may also be available as a client-server software application, or as a web-enabled software application. For example, exemplary software specifically programmed in accordance with one or more principles of the present disclosure may also be embodied as a software package installed on a hardware device.


In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to handle numerous concurrent users, which may be, but are not limited to, at least 100 (e.g., but not limited to, 100-999), at least 1,000 (e.g., but not limited to, 1,000-9,999), at least 10,000 (e.g., but not limited to, 10,000-99,999), at least 100,000 (e.g., but not limited to, 100,000-999,999), at least 1,000,000 (e.g., but not limited to, 1,000,000-9,999,999), at least 10,000,000 (e.g., but not limited to, 10,000,000-99,999,999), at least 100,000,000 (e.g., but not limited to, 100,000,000-999,999,999), at least 1,000,000,000 (e.g., but not limited to, 1,000,000,000-999,999,999,999), and so on.


In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to output to distinct, specifically programmed graphical user interface implementations of the present disclosure (e.g., a desktop, a web app, etc.). In various implementations of the present disclosure, a final output may be displayed on a display screen, which may be, without limitation, a screen of a computer, a screen of a mobile device, or the like. In various implementations, the display may be a holographic display. In various implementations, the display may be a transparent surface that may receive a visual projection. Such projections may convey various forms of information, images, or objects. For example, such projections may be a visual overlay for a mobile augmented reality (MAR) application.


In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to be utilized in various applications which may include, but are not limited to, gaming, mobile-device games, video chats, video conferences, live video streaming, video streaming and/or augmented reality applications, mobile-device messenger applications, and other similarly suitable computer-device applications.


As used herein, the term “mobile electronic device,” or the like, may refer to any portable electronic device that may or may not be enabled with location tracking functionality (e.g., MAC address, Internet Protocol (IP) address, or the like). For example, a mobile electronic device may include, but is not limited to, a mobile phone, Personal Digital Assistant (PDA), Blackberry™, Pager, Smartphone, or any other reasonable mobile electronic device.


As used herein, terms “proximity detection,” “locating,” “location data,” “location information,” and “location tracking” refer to any form of location tracking technology or locating method that may be used to provide a location of, for example, a particular computing device, system or platform of the present disclosure and any associated computing devices, based at least in part on one or more of the following techniques and devices, without limitation: accelerometer(s), gyroscope(s), Global Positioning Systems (GPS); GPS accessed using Bluetooth™; GPS accessed using any reasonable form of wireless and non-wireless communication; WiFi™ server location data; Bluetooth™ based location data; triangulation such as, but not limited to, network based triangulation, WiFi™ server information based triangulation, Bluetooth™ server information based triangulation, Cell Identification based triangulation, Enhanced Cell Identification based triangulation, Uplink-Time difference of arrival (U-TDOA) based triangulation, Time of arrival (TOA) based triangulation, Angle of arrival (AOA) based triangulation; techniques and systems using a geographic coordinate system such as, but not limited to, longitudinal and latitudinal based, geodesic height based, Cartesian coordinates based; Radio Frequency Identification such as, but not limited to, Long range RFID, Short range RFID; using any form of RFID tag such as, but not limited to, active RFID tags, passive RFID tags, battery assisted passive RFID tags; or any other reasonable way to determine location. For ease, at times the above variations are not listed or are only partially listed; this is in no way meant to be a limitation.
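

By way of a minimal, non-limiting illustration of the latitude- and longitude-based techniques listed above, the following Python sketch estimates the great-circle distance between two coordinate pairs using the haversine formula; the function name, coordinates, and Earth-radius constant are illustrative assumptions and form no part of any embodiment:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance, in kilometers, between two (latitude, longitude)
    # points given in decimal degrees.
    earth_radius_km = 6371.0  # mean Earth radius (an assumed constant)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Example: approximate distance between two hypothetical device locations.
print(haversine_km(40.7128, -74.0060, 40.7306, -73.9352))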


As used herein, terms “cloud,” “Internet cloud,” “cloud computing,” “cloud architecture,” and similar terms correspond to at least one of the following: (1) a large number of computers connected through a real-time communication network (e.g., Internet); (2) providing the ability to run a program or application on many connected computers (e.g., physical machines, virtual machines (VMs)) at the same time; (3) network-based services, which appear to be provided by real server hardware, and are in fact served up by virtual hardware (e.g., virtual servers), simulated by software running on one or more real machines (e.g., allowing them to be moved around and scaled up (or down) on the fly without affecting the end user).


In some embodiments, the illustrative computer-based systems or platforms of the present disclosure may be configured to securely store and/or transmit data by utilizing one or more encryption techniques (e.g., private/public key pair, Triple Data Encryption Standard (3DES), block cipher algorithms (e.g., IDEA, RC2, RC5, CAST and Skipjack), cryptographic hash algorithms (e.g., MD5, RIPEMD-160, RTRO, SHA-1, SHA-2, Tiger (TTH), WHIRLPOOL), and random number generators (RNGs)).
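

As a minimal, non-limiting sketch of one such technique, the following Python fragment uses the standard-library hashlib module to compute a SHA-2 (SHA-256) digest that might, for example, be stored alongside a record so that its integrity can later be verified; the function and variable names are illustrative assumptions only:

import hashlib

def sha256_digest(data: bytes) -> str:
    # Hexadecimal SHA-2 (SHA-256) digest of the given bytes, usable as an
    # integrity check for stored or transmitted records.
    return hashlib.sha256(data).hexdigest()

payload = b"example record to be stored or transmitted"
print(sha256_digest(payload))  # prints a 64-character hexadecimal digest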


As used herein, the term “user” shall have a meaning of at least one user. In some embodiments, the terms “user”, “subscriber”, “consumer”, or “customer” may be understood to refer to a user of an application or applications as described herein and/or a consumer of data supplied by a data provider. By way of example, and not limitation, the terms “user” or “subscriber” may refer to a person who receives data provided by the data or service provider over the Internet in a browser session or may refer to an automated software application which receives the data and stores or processes the data.


The aforementioned examples are, of course, illustrative and not restrictive.


Publications cited throughout this document are hereby incorporated by reference in their entirety. While one or more embodiments of the present disclosure have been described, it is understood that these embodiments are illustrative only, and not restrictive, and that many modifications may become apparent to those of ordinary skill in the art, including that various embodiments of the inventive methodologies, the illustrative systems and platforms, and the illustrative devices described herein may be utilized in any combination with each other. Further still, the various steps may be carried out in any desired order (and any desired steps may be added and/or any desired steps may be eliminated).

Claims
  • 1. A method comprising: receiving, by at least one processor, at least: i) a memory capacity metric defining a maximum memory size available; ii) a model identifier that identifies a deep neural network, the deep neural network comprising at least one weight matrix that is arranged in at least one layer and comprises a plurality of weights; and iii) a training data comprising a plurality of input data records and a plurality of ground truth data records; wherein each input data record is associated with a corresponding ground truth data record; wherein the corresponding ground truth data record defines a ground truth output; training, by the at least one processor, based on the training data and an optimization problem, the deep neural network to produce a trained deep neural network by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the plurality of weights of the at least one weight matrix upon decomposition to the tensor of the rank value; wherein the optimization problem is configured to: minimize the rank value at each iteration until the memory capacity metric is satisfied, minimize a loss function based at least in part on the training data and the plurality of weights, backpropagate the loss function to the plurality of weights, and stop iteratively updating the at least one layer upon the memory capacity metric being satisfied and the loss function being minimized within the memory capacity metric; utilizing, by the at least one processor, a tensor-based decomposition to compress the trained deep neural network based at least in part on the rank value and the plurality of weights to obtain a trained tensor decomposition format deep neural network; training, by the at least one processor, the trained tensor decomposition format deep neural network based at least in part on the training data to obtain a fine-tuned trained tensor decomposition format deep neural network; and deploying, by the at least one processor, the fine-tuned trained tensor decomposition format deep neural network to at least one hardware device that satisfies the memory capacity metric.
  • 2. The method of claim 1, wherein the tensor-based decomposition comprises at least one of: tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.
  • 3. The method of claim 1, wherein the optimization problem comprises a dual problem optimization method.
  • 4. The method of claim 3, wherein the dual problem optimization method comprises dual ascent method (DAM).
  • 5. The method of claim 3, wherein the dual problem optimization method comprises dual decomposition.
  • 6. The method of claim 3, wherein the dual problem optimization method comprises Alternating Direction Method of Multipliers (ADMM).
  • 7. The method of claim 1, further comprising: determining, by the at least one processor, a minimum tensor rank based at least in part on the memory capacity metric; and terminating, by the at least one processor, the iteratively updating of the at least one layer of the deep neural network upon the rank value reaching the minimum tensor rank.
  • 8. The method of claim 1, wherein the deep neural network comprises at least one of: a transformer, a multi-layer perceptron, a convolutional neural network, or a recurrent neural network.
  • 9. The method of claim 8, wherein the transformer comprises at least one vision transformer.
  • 10. The method of claim 1, wherein the at least one hardware device comprises an embedded internet-of-things (IoT) device.
  • 11. A system comprising: at least one processor; and at least one non-transitory computer readable medium having software instructions stored thereon, wherein, upon execution of the software instructions, the at least one processor is configured to: receive at least: i) a memory capacity metric defining a maximum memory size available; ii) a model identifier that identifies a deep neural network, the deep neural network comprising at least one weight matrix that is arranged in at least one layer and comprises a plurality of weights; and iii) a training data comprising a plurality of input data records and a plurality of ground truth data records; wherein each input data record is associated with a corresponding ground truth data record; wherein the corresponding ground truth data record defines a ground truth output; train, based on the training data and an optimization problem, the deep neural network to produce a trained deep neural network by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the plurality of weights of the at least one weight matrix upon decomposition to the tensor of the rank value; wherein the optimization problem is configured to: minimize the rank value at each iteration until the memory capacity metric is satisfied, minimize a loss function based at least in part on the training data and the plurality of weights, backpropagate the loss function to the plurality of weights, and stop iteratively updating the at least one layer upon the memory capacity metric being satisfied and the loss function being minimized within the memory capacity metric; utilize a tensor-based decomposition to compress the trained deep neural network based at least in part on the rank value and the plurality of weights to obtain a trained tensor decomposition format deep neural network; train the trained tensor decomposition format deep neural network based at least in part on the training data to obtain a fine-tuned trained tensor decomposition format deep neural network; and deploy the fine-tuned trained tensor decomposition format deep neural network to at least one hardware device that satisfies the memory capacity metric.
  • 12. The system of claim 11, wherein the tensor-based decomposition comprises at least one of: tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.
  • 13. The system of claim 11, wherein the optimization problem comprises a dual problem optimization method.
  • 14. The system of claim 13, wherein the dual problem optimization method comprises dual ascent method (DAM).
  • 15. The system of claim 13, wherein the dual problem optimization method comprises dual decomposition.
  • 16. The system of claim 13, wherein the dual problem optimization method comprises Alternating Direction Method of Multipliers (ADMM).
  • 17. The system of claim 11, wherein, upon execution of the software instructions, the at least one processor is further configured to: determine a minimum tensor rank based at least in part on the memory capacity metric; and terminate the iteratively updating of the at least one layer of the deep neural network upon the rank value reaching the minimum tensor rank.
  • 18. The system of claim 11, wherein the deep neural network comprises at least one of: a transformer, a multi-layer perceptron, a convolutional neural network, or a recurrent neural network.
  • 19. The system of claim 18, wherein the transformer comprises at least one vision transformer.
  • 20. The system of claim 11, wherein the at least one hardware device comprises an embedded internet-of-things (IoT) device.
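
The following Python sketch is offered purely as a non-limiting illustration of the overall flow recited in claims 1 and 11: it trains a single stand-in linear layer while gradually pulling its weight matrix toward a rank that fits an assumed memory budget, compresses the trained layer via truncated singular value decomposition (one of the decompositions enumerated in claims 2 and 12), and then fine-tunes the resulting factors on the same data. A plain SVD-based low-rank projection is used here in place of the dual-problem/ADMM machinery of claims 3 through 6, and all names, shapes, and hyperparameters are assumptions rather than elements of any claim.

# Illustrative sketch only (PyTorch assumed); not an implementation of any claim.
import torch

def rank_for_budget(rows, cols, budget_params):
    # Smallest factorization rank whose two factors (rows x r and r x cols)
    # fit a memory capacity metric expressed, for simplicity, as a parameter budget.
    r = min(rows, cols)
    while r > 1 and r * (rows + cols) > budget_params:
        r -= 1
    return r

def project_low_rank(weight, rank):
    # Best rank-"rank" approximation of a weight matrix via truncated SVD.
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vh[:rank, :]

torch.manual_seed(0)
x = torch.randn(256, 64)                     # stand-in input data records
y = torch.randn(256, 10)                     # stand-in ground truth records
model = torch.nn.Linear(64, 10, bias=False)  # single-layer stand-in for the DNN
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

budget = 300                                 # assumed memory capacity metric (parameters)
target_rank = rank_for_budget(10, 64, budget)

# Training: minimize the loss, then pull the weight matrix toward the target rank.
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        low_rank = project_low_rank(model.weight, target_rank)
        model.weight.mul_(0.9).add_(0.1 * low_rank)  # gradual pull toward low rank

# Compression: replace the trained weight matrix by its rank-r factors, then fine-tune.
with torch.no_grad():
    u, s, vh = torch.linalg.svd(model.weight, full_matrices=False)
    factor_a = torch.nn.Parameter(u[:, :target_rank] * s[:target_rank])  # 10 x r
    factor_b = torch.nn.Parameter(vh[:target_rank, :])                   # r x 64

fine_tune_opt = torch.optim.SGD([factor_a, factor_b], lr=1e-2)
for step in range(100):
    fine_tune_opt.zero_grad()
    loss = loss_fn(x @ factor_b.t() @ factor_a.t(), y)
    loss.backward()
    fine_tune_opt.step()

In a full embodiment, the projection step would be replaced by the dual-problem (e.g., ADMM) updates and the single linear layer by the tensor-decomposed layers of the deep neural network.
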
CLAIM TO PRIORITY

This application is a Continuation Application relating to and claiming the benefit of commonly-owned, co-pending PCT International Application No. PCT/US2022/030864, filed May 25, 2022, which claims priority to and the benefit of commonly-owned U.S. Provisional Application 63/193,692 filed on May 27, 2021, which is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under grant number 1955909 awarded by the National Science Foundation. The government has certain rights in the invention.

Provisional Applications (1)
Number Date Country
63193692 May 2021 US
Continuations (1)
Number Date Country
Parent PCT/US22/30864 May 2022 US
Child 18516213 US