The present disclosure generally relates to computer-based platforms and systems configured for neural network compression including tensor decomposition-based compression techniques for training and inferencing with tensorized neural networks and methods thereof.
Deep Neural Networks (DNNs) have widespread applications in many tasks, such as image classification, video recognition, object detection, and image captioning. For most embedded and Internet-of-Things (IoT) systems, the sizes of DNN models are too large, thereby causing high storage and computational demands and severely hindering the practical deployment of DNNs.
This invention proposes a systematic framework for tensor decomposition-based model compression by applying an optimization technique including a dual problem approach, such as, e.g., the Alternating Direction Method of Multipliers (ADMM). By formulating tensor train (TT) decomposition-based model compression as an optimization problem with constraints on tensor ranks, the framework leverages the ADMM technique to systematically solve the optimization problem in an iterative way. During this procedure, the entire DNN model is trained in the original structure rather than the tensor decomposed structure, but gradually acquires the desired low tensor rank characteristics. The model may then be decomposed to the tensor decomposed format and fine-tuned to finally obtain a high-accuracy tensor decomposed format DNN model.
In some aspects, the techniques described herein relate to a method including: receiving, by at least one processor, at least: i) a memory capacity metric defining a maximum memory size available; ii) a model identifier that identifies a deep neural network, the deep neural network including at least one weight matrix that is arranged in at least one layer and includes a plurality of weights; iii) a training data including a plurality of input data records and a plurality of ground truth data records; wherein each input data record is associated with a corresponding ground truth data record; wherein the corresponding ground truth data record defines a ground truth output; training, by the at least one processor, based on the training data and an optimization problem, the deep neural network to produce a trained deep neural network by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the plurality of weights of the at least one weight matrix upon decomposition to the tensor of the rank value; wherein the optimization problem is configured to: minimize the rank value at each iteration until the memory capacity metric is satisfied, minimize a loss function based at least in part on the training data and the plurality of weights, backpropagate the loss function to the plurality of weights, and stop iteratively updating the at least one layer upon the memory capacity metric being satisfied and the loss function being minimized within the memory capacity metric; utilizing, by the at least one processor, a tensor-based decomposition to compress the trained deep neural network based at least in part on the rank value and the plurality of weights to obtain a trained tensor decomposition format deep neural network; training, by the at least one processor, the trained tensor decomposition format deep neural network based at least in part on the
training data to obtain a fine-tuned trained tensor decomposition format deep neural network; and deploying, by the at least one processor, the fine-tuned trained tensor decomposition format deep neural network to at least one hardware device that satisfies the memory capacity metric.
In some aspects, the techniques described herein relate to a method, wherein the tensor-based decomposition includes at least one of: tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.
In some aspects, the techniques described herein relate to a method, wherein the optimization problem includes a dual problem optimization method.
In some aspects, the techniques described herein relate to a method, wherein the dual problem optimization method includes dual ascent method (DAM).
In some aspects, the techniques described herein relate to a method, wherein the dual problem optimization method includes dual decomposition.
In some aspects, the techniques described herein relate to a method, wherein the dual problem optimization method includes Alternating Direction Method of Multipliers (ADMM).
In some aspects, the techniques described herein relate to a method, further including: determining, by the at least one processor, a minimum tensor rank based at least in part on the memory capacity metric; and terminating, by the at least one processor, the iteratively updating of the at least one layer of the deep neural network upon the rank value reaching the minimum tensor rank.
In some aspects, the techniques described herein relate to a method, wherein the deep neural network includes at least one of: a transformer, a multi-layer perceptron, a convolutional neural network, or a recurrent neural network.
In some aspects, the techniques described herein relate to a method, wherein the transformer includes at least one vision transformer.
In some aspects, the techniques described herein relate to a method, wherein the at least one hardware device includes an embedded internet-of-things (IoT) device.
In some aspects, the techniques described herein relate to a system including: at least one processor; and at least one non-transitory computer readable medium having software instructions stored thereon, wherein, upon execution of the software instructions, the at least one processor is configured to: receive at least: i) a memory capacity metric defining a maximum memory size available; ii) a model identifier that identifies a deep neural network, the deep neural network including at least one weight matrix that is arranged in at least one layer and includes a plurality of weights; iii) a training data including a plurality of input data records and a plurality of ground truth data records; wherein each input data record is associated with a corresponding ground truth data record; wherein the corresponding ground truth data record defines a ground truth output; train based on the training data and an optimization problem, the deep neural network to produce a trained deep neural network by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the plurality of weights of the at least one weight matrix upon decomposition to the tensor of the rank value; wherein the optimization problem is configured to: minimize the rank value at each iteration until the memory capacity metric is satisfied, minimize a loss function based at least in part on the training data and the plurality of weights, backpropagate the loss function to the plurality of weights, and stop iteratively updating the at least one layer upon the memory capacity metric being satisfied and the loss function being minimized within the memory capacity metric; utilize a tensor-based decomposition to compress the trained deep neural network based at least in part on the rank value and the plurality of weights to obtain a trained tensor decomposition format deep neural network; train
the trained tensor decomposition format deep neural network based at least in part on the training data to obtain a fine-tuned trained tensor decomposition format deep neural network; and deploy the fine-tuned trained tensor decomposition format deep neural network to at least one hardware device that satisfies the memory capacity metric.
In some aspects, the techniques described herein relate to a system, wherein the tensor-based decomposition includes at least one of: tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.
In some aspects, the techniques described herein relate to a system, wherein the optimization problem includes a dual problem optimization method.
In some aspects, the techniques described herein relate to a system, wherein the dual problem optimization method includes dual ascent method (DAM).
In some aspects, the techniques described herein relate to a system, wherein the dual problem optimization method includes dual decomposition.
In some aspects, the techniques described herein relate to a system, wherein the dual problem optimization method includes Alternating Direction Method of Multipliers (ADMM).
In some aspects, the techniques described herein relate to a system, wherein, upon execution of the software instructions, the at least one processor is further configured to: determine a minimum tensor rank based at least in part on the memory capacity metric; and terminate the iteratively updating of the at least one layer of the deep neural network upon the rank value reaching the minimum tensor rank.
In some aspects, the techniques described herein relate to a system, wherein the deep neural network includes at least one of: a transformer, a multi-layer perceptron, a convolutional neural network, or a recurrent neural network.
In some aspects, the techniques described herein relate to a system, wherein the transformer includes at least one vision transformer.
In some aspects, the techniques described herein relate to a system, wherein the at least one hardware device includes an embedded internet-of-things (IoT) device.
Various embodiments of the present disclosure may be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the present disclosure. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ one or more illustrative embodiments.
The present disclosure describes systems and methods to enable deployment of Deep Neural Networks (DNNs) to chip sets having constrained resources. DNNs may be employed in widespread applications for machine learning-based recognition, prediction and segmentation, such as, e.g., in many computer vision tasks including image classification, video recognition, object detection, and image captioning, as well as prediction tasks and other modelling, prediction, segmentation, classification, regression and other tasks or any combination thereof. Such applications may advantageously be deployed to resource constrained environments, such as embedded and Internet-of-Things (IoT) systems (e.g., smart home devices and security systems, wearable health monitors, ultra-high speed wireless internet, etc.), portable computing devices (e.g., smartphones, tablets, laptop computers, wearable devices, etc.), or other power, energy, memory, and/or processing constrained environments or any combination thereof.
Providing DNN models that have storage and processing requirements within the bounds of such resource-constrained environments is a difficult challenge due to the many parameters and nodes of DNNs. Compression of DNNs provides a potential avenue to producing DNNs whose resource requirements fit within resource-constrained environments. Current implementations of compression include approaches such as pruning, clustering, low rank decomposition and low bit-width quantization, each of which has limitations.
In some embodiments, the compression, inferencing and training techniques of the present disclosure solve the problems of existing methods by applying the optimization technique to the tensor decomposition-based compression approach.
Tensor decomposition, uniquely, may provide an ultra-high compression ratio, especially for recurrent neural network (RNN) models. Advanced tensor decomposition approaches, such as tensor train (TT) and tensor ring (TR) decomposition, as well as dual problem singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, block term decomposition, among others or any combination thereof, may bring more than 1,000 times parameter reduction to the input-to-hidden layers of RNN models. Tensor-decomposed models are also well suited for hardware-based acceleration. However, current TT-based approaches have significant accuracy losses. The presently described technique incurs much less accuracy loss for all neural network models, and especially for CNN models.
In some embodiments, a systematic framework is described for tensor decomposition-based model compression by applying an optimization technique such as ADMM. By formulating TT decomposition-based model compression as an optimization problem with constraints on tensor ranks, the framework leverages the ADMM technique to systematically solve this optimization problem in an iterative way. During optimization, the DNN model is trained in the original structure instead of the tensor decomposed structure, while gradually acquiring the desired low tensor rank characteristics. In some embodiments, the uncompressed model may then be decomposed to the tensor decomposed format and fine-tuned to obtain a final high-accuracy tensor decomposed format DNN model.
This framework is general in applicability, and therefore works for both CNNs and RNNs, among other neural networks, and may be modified to fit other tensor decomposition approaches. The present disclosure provides the framework for different DNN models for image classification and video recognition tasks as examples, though any suitable neural network task may be employed. Experimental results show that dual problem optimization-based tensor decomposed format models demonstrate very high compression performance with high accuracy.
In some embodiments, a model deployment system 110 may be configured to train, compress and deploy machine learning models, such as DNNs, to a resource constrained environment 100. In some embodiments, the model deployment system 110 may communicate with the resource constrained environment 100 via a network 102, or by any other suitable direct or indirect communication (e.g., a suitable wired and/or wireless hardware interface and/or via portable storage devices such as flash drives or USB storage drives, etc.). In some embodiments, the network 102 may include any suitable wired or wireless network with any suitable hardware and/or software configurations, such as, e.g., WiFi, Local Area Network (LAN), telecommunications network, the Internet, Bluetooth, among other networking hardware and/or protocols or any combination thereof.
In some embodiments, the resource constrained environment 100 may include any suitable device and/or system of devices having constraints on resources, such as, e.g., energy constraints, storage constraints, memory constraints, processor constraints, or any other suitable constraints that limit the performance of the resource constrained environment 100 in inferencing tasks using a DNN model. For example, the resource constrained environment 100 may include, e.g., a user computing device (a laptop computer or desktop computer), a mobile computing device (smartphone, tablet, wearable device, augmented reality device, virtual reality device, etc.), an Internet-of-Things (IoT) device (e.g., security camera, smart assistant device, smart TV, smart lights, smart thermostat, smart appliance, etc.), networking equipment, or any other suitable resource constrained device or system of devices or any combination thereof.
In some embodiments, the model deployment system 110 may include hardware components such as a processor 111, which may include local or remote processing components. In some embodiments, the processor 111 may include any type of data processing capacity, such as a hardware logic circuit, for example an application specific integrated circuit (ASIC) and a programmable logic, or such as a computing device, for example, a microcomputer or microcontroller that include a programmable microprocessor. In some embodiments, the processor 111 may include data-processing capacity provided by the microprocessor. In some embodiments, the microprocessor may include memory, processing, interface resources, controllers, and counters. In some embodiments, the microprocessor may also include one or more programs stored in memory.
Similarly, the model deployment system 110 may include storage 112, such as one or more local and/or remote data storage solutions such as, e.g., local hard-drive, solid-state drive, flash drive, database or other local data storage solutions or any combination thereof, and/or remote data storage solutions such as a server, mainframe, database or cloud services, distributed database or other suitable data storage solutions or any combination thereof. In some embodiments, the storage 112 may include, e.g., a suitable non-transient computer readable medium such as, e.g., random access memory (RAM), read only memory (ROM), one or more buffers and/or caches, among other memory devices or any combination thereof.
In some embodiments, the model deployment system 110 may implement a Dual problem training engine 120 configured for training a DNN, a tensor decomposition engine 130 configured for compressing a trained DNN and a trained compressed DNN fine-tuning engine 140 configured for fine-tuning the compressed trained DNN before deploying a dual optimization problem-based tensor decomposed format model 106. In some embodiments, the terms “computer engine” and “engine” identify at least one software component and/or a combination of at least one software component and at least one hardware component which are designed/programmed/configured to manage/control other software and/or hardware components (such as the libraries, software development kits (SDKs), objects, etc.).
Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. In some embodiments, the one or more processors may be implemented as a Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors; x86 instruction set compatible processors, multi- core, or any other microprocessor or central processing unit (CPU). In various implementations, the one or more processors may be dual-core processor(s), dual-core mobile processor(s), and so forth.
Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
In some embodiments, the model deployment system 110 may receive a request 104 for a trained DNN for deployment to the resource constrained environment 100. In some embodiments, the request 104 may include parameters for the trained DNN such as, e.g., a model identifier identifying a DNN architecture or DNN type or DNN task or any combination thereof, a resource capacity metric identifying a maximum amount of resources for a local DNN model, a training data set and/or training data set identifier associated with input data records and ground truth data records for training the DNN model to perform the DNN task, among other parameters or any suitable combination thereof. In some embodiments, the resource capacity metric may include, e.g., a memory capacity metric defining a maximum memory size available, a processing capacity metric defining a maximum processing performance available, an energy capacity metric defining a maximum energy use available, among others or any combination thereof.
In some embodiments, the DNN model identified by the model identifier may include a suitable deep neural network having at least one weight matrix that is arranged in at least one layer. Each layer of the model may have weights that may be trained to correlate an input to an output.
In some embodiments and, optionally, in combination with any embodiment described above or below, an exemplary implementation of a deep neural network may be executed as described below.
In some embodiments and, optionally, in combination with any embodiment described above or below, the exemplary trained neural network model may specify a neural network by at least a neural network topology, a series of activation functions, and connection weights. For example, the topology of a neural network may include a configuration of nodes of the neural network and connections between such nodes. In some embodiments and, optionally, in combination with any embodiment described above or below, the exemplary trained neural network model may also be specified to include other parameters, including but not limited to, bias values/functions and/or aggregation functions. For example, an activation function of a node may be a step function, sine function, continuous or piecewise linear function, sigmoid function, hyperbolic tangent function, or other type of mathematical function that represents a threshold at which the node is activated. In some embodiments and, optionally, in combination with any embodiment described above or below, the exemplary aggregation function may be a mathematical function that combines (e.g., sum, product, etc.) input signals to the node. In some embodiments and, optionally, in combination with any embodiment described above or below, an output of the exemplary aggregation function may be used as input to the exemplary activation function. In some embodiments and, optionally, in combination with any embodiment described above or below, the bias may be a constant value or function that may be used by the aggregation function and/or the activation function to make the node more or less likely to be activated.
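By way of non-limiting illustration, the aggregation function, bias, and activation function described above may be sketched for a single node as follows (the sum aggregation and sigmoid activation are illustrative choices among the options listed above, not requirements):

```python
import math

def node_output(inputs, weights, bias):
    """Compute one node's output: aggregate the weighted input signals
    (sum aggregation), add the bias, and apply a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# With zero weights and zero bias, the node sits at the sigmoid midpoint.
print(node_output([1.0, 2.0], [0.0, 0.0], 0.0))  # 0.5
```

A different activation (e.g., a step or hyperbolic tangent function) would be substituted in the return statement without changing the aggregation step.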
In some embodiments, the model deployment system 110 may instantiate the Dual problem training engine 120 to train a DNN according to the request 104. In some embodiments, based on the model identifier, the Dual problem training engine 120 may access a DNN model library 115 in the storage 112. In some embodiments, the DNN model library 115 may include uninitialized DNN models having one or more architectures. For example, the DNNs in the DNN model library 115 may include, e.g., one or more architectural designs of a support vector machine, transformer, a multi-layer perceptron, autoencoder, a convolutional neural network (CNN), or a recurrent neural network (RNN), among others or any combination thereof. Thus, based on the DNN architecture, type, and/or model identified by the model identifier of the request 104, the Dual problem training engine 120 may query and retrieve the associated DNN model in the DNN model library 115. Alternatively, the request 104 may include the DNN model itself rather than or in addition to the model identifier. Thus, the Dual problem training engine 120 may retrieve the DNN model directly from the request 104.
In some embodiments, the Dual problem training engine 120 may also load a training data set based on the training data parameter of the request 104. In some embodiments, the storage 112 may include a library of training data 114, e.g., organized by task, training data set identifier, or any other suitable catalog of training data set. Thus, based on the request 104, the Dual problem training engine 120 may query the storage 112 and retrieve the training data set associated with the request 104 for training the DNN model. Alternatively, the request 104 may include the training data set itself. Thus, the Dual problem training engine 120 may retrieve the training data set directly from the request 104.
In some embodiments, the training data set may include a set of input data records and a set of ground truth, or target or output, data records, where each pair of input to ground truth records defines a known input and output. Accordingly, the Dual problem training engine 120 may initialize the DNN model and use the training data set to train the DNN model.
In some embodiments, to overcome the current limitations of tensor decomposition and fully unlock its potential for model compression, the dual problem training engine 120 may employ a dual problem optimization technique, such as, e.g., the dual ascent method (DAM), dual decomposition, the Alternating Direction Method of Multipliers (ADMM), among other suitable dual problem optimization techniques or any combination thereof. By formulating tensor decomposition-based model compression as an optimization problem with constraints on tensor ranks, the dual problem optimization technique may be leveraged to systematically solve the optimization problem in an iterative way. During this procedure the entire DNN model is trained in the original structure instead of the tensor decomposed structure, but gradually acquires the desired low tensor rank characteristics. The trained uncompressed model may then be decomposed to the tensor decomposed format, and fine-tuned to finally obtain a high-accuracy trained tensor decomposed format DNN model.
In some embodiments, the systematic framework may include formulating and solving the tensor decomposition-based model compression problem. By formulating this problem as a constrained non-convex optimization problem, embodiments of the present framework gradually restrict the DNN model to the target tensor ranks without explicitly training on the tensor decomposed format, thereby maintaining the model capacity while minimizing approximation error and avoiding increased network depth.
In some embodiments, the systematic framework employs a dual optimization problem such as ADMM to efficiently solve the reformulated optimization problem via separately solving two sub-problems. In some embodiments, a first sub-problem may be to directly optimize the loss function with a regularization of the DNN, e.g., by stochastic gradient descent or other suitable optimization method. In some embodiments, the second sub-problem may use the introduced projection to constrain the tensor ranks analytically.
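By way of non-limiting illustration, the two sub-problems may be sketched on a toy objective as follows (the loss function, step sizes, penalty parameter rho, iteration counts, and matrix sizes are illustrative assumptions: sub-problem 1 is solved by gradient descent and sub-problem 2 by an analytic truncated-SVD rank projection):

```python
import numpy as np

def project_rank(M, r):
    """Sub-problem 2: analytically project a matrix onto the set of
    rank-r matrices via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def admm_low_rank(W, r, rho=1.0, steps=50, lr=0.1):
    """Toy ADMM loop: X is trained in the original (uncompressed)
    structure; Z is the auxiliary low-rank variable; U is the scaled
    dual variable. Here the 'loss' simply keeps X close to W."""
    X = W.copy()
    Z = project_rank(X, r)
    U = np.zeros_like(W)
    for _ in range(steps):
        for _ in range(10):                     # sub-problem 1: gradient descent
            grad = (X - W) + rho * (X - Z + U)  # loss gradient + augmented term
            X -= lr * grad
        Z = project_rank(X + U, r)              # sub-problem 2: rank projection
        U += X - Z                              # dual variable update
    return Z

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
Z = admm_low_rank(W, r=2)
print(np.linalg.matrix_rank(Z) <= 2)  # True
```

In a full DNN training loop, the gradient step of sub-problem 1 would instead come from backpropagating the task loss, while the projection step would constrain the tensor ranks layer by layer.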
Thus, in some embodiments, the dual problem training engine 120 may produce a trained DNN model by iteratively updating the at least one layer of the deep neural network, at each iteration, with: a rank value representing a tensor rank to apply to the at least one weight matrix, and the weights of the weight matrix upon decomposition to the tensor of the rank value. To do so, the dual problem training engine 120 may, at each training iteration, minimize the rank value, decompose the weight matrix to a tensor having the rank value, minimize a loss function based on the training data and the weights, backpropagate a loss to the plurality of weights, and stop iteratively updating the at least one layer upon the resource capacity metric being satisfied and the loss function being minimized within the resource capacity metric.
In some embodiments, the resource capacity metric may define a maximum memory available for the DNN model. Thus, the dual problem training engine 120 may perform training iterations until the tensor rank is minimized to below the resource capacity metric such that the DNN model being decomposed to have one or more tensors of the tensor rank satisfies the maximum memory capacity. Accordingly, the dual problem training engine 120 may determine the minimum tensor rank associated with the available memory capacity and terminate training iterations where the minimization of the loss converges after the tensor rank is minimized to the minimum tensor rank associated with the available memory capacity.
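By way of non-limiting illustration, determining the target tensor rank from a memory capacity metric may be sketched as follows (the mode sizes, a uniform internal rank, 32-bit weights, and the helper names are illustrative assumptions; the rank to which training drives the layer is the largest rank whose decomposed storage still fits the budget):

```python
def tt_param_count(mode_sizes, r):
    """Parameter count of a tensor train with the given mode sizes and a
    uniform internal rank r (boundary ranks are 1)."""
    ranks = [1] + [r] * (len(mode_sizes) - 1) + [1]
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(mode_sizes))

def max_rank_for_budget(mode_sizes, budget_bytes, bytes_per_weight=4):
    """Largest uniform tensor rank whose parameter storage still satisfies
    the memory capacity metric; returns None if even rank 1 exceeds it."""
    r, best = 1, None
    while tt_param_count(mode_sizes, r) * bytes_per_weight <= budget_bytes:
        best, r = r, r + 1
    return best

# A 65,536-element weight matrix reshaped into 8 modes of size 16,
# under a hypothetical 64 KiB memory budget for this layer:
print(max_rank_for_budget([16] * 8, budget_bytes=64 * 1024))  # 12
```

Training iterations would then terminate once the loss converges with all layer ranks at or below their budget-derived targets.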
In some embodiments, upon training the DNN model, the trained DNN model may be provided to a tensor decomposition engine 130 to compress the trained DNN model via tensor decomposition to obtain a trained tensor decomposition format DNN. Because the optimization procedure has already imposed the desired low tensor decomposed rank structure on the uncompressed model, such direct decomposition may avoid significant approximation error.
Thus, in some embodiments, the tensor decomposition engine 130 may utilize the tensor rank determined during training to decompose the weight matrix of the DNN model to a tensor having the tensor rank. In some embodiments, the tensor decomposition technique employed by the tensor decomposition engine 130 may include, e.g., tensor train decomposition, tensor ring decomposition, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, or block term decomposition.
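By way of non-limiting illustration, a tensor train decomposition of a weight tensor via successive truncated SVDs may be sketched as follows (a simplified variant of the classical TT-SVD algorithm in which ranks are capped at a fixed max_rank rather than chosen from an error tolerance; the tensor shape is illustrative):

```python
import numpy as np

def tt_svd(T, max_rank):
    """Factor a d-way tensor into a train of 3-way cores by sweeping
    left to right with truncated SVDs."""
    cores, r = [], 1
    M = T
    for n in T.shape[:-1]:
        M = M.reshape(r * n, -1)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r_new = min(max_rank, len(s))
        cores.append(U[:, :r_new].reshape(r, n, r_new))
        M = s[:r_new, None] * Vt[:r_new, :]   # carry the remainder rightward
        r = r_new
    cores.append(M.reshape(r, T.shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into the full tensor."""
    T = cores[0]
    for C in cores[1:]:
        T = np.tensordot(T, C, axes=([-1], [0]))
    return T.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(1)
T = rng.standard_normal((4, 4, 4, 4))
# With a large enough rank cap the decomposition is exact:
print(np.allclose(tt_reconstruct(tt_svd(T, max_rank=16)), T))  # True
```

Lowering max_rank below the full ranks trades reconstruction accuracy for compression, which is why the preceding optimization drives the weights toward low rank before this step.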
In some embodiments, upon compressing the trained DNN model to a trained tensor decomposition format DNN model, the fine-tuning engine 140 may refine the training of the trained tensor decomposition format DNN model to recover any losses due to the compression process. Thus, the fine-tuning engine 140 may employ the training data set to train the trained tensor decomposition format DNN model, e.g., using a suitable optimization technique such as, e.g., stochastic gradient descent, backtracking line search, coordinate descent, stochastic hill climbing, stochastic variance reduction, among others or any combination thereof. In some embodiments, the fine-tuning phase may be very fast relative to the initial training, e.g., requiring only a few iterations. In some embodiments, fine-tuning may be fast because the trained tensor decomposition format DNN model at the starting point of the fine-tuning phase may already exhibit low accuracy loss relative to the original uncompressed DNN model.
In some embodiments, upon fine-tuning, the model deployment system 110 may deploy the fine-tuned trained tensor decomposition format DNN model to the resource constrained environment 100 such that the fine-tuned trained tensor decomposition format DNN model may be implemented in, e.g., the memory constrained environment of the device and/or system of devices associated with the resource constrained environment. In some embodiments, to address the fundamental challenges, the trained tensor decomposition format DNN model and inferencing engine of the present disclosure enable a hardware-friendly inference scheme. In some embodiments, a theoretical limit for the minimum number of multiplications needed for tensor decomposed-format inference may be calculated, and a computation-efficient inference scheme may be developed. The inference scheme may be configured for the tensor decomposition of the fine-tuned trained tensor decomposition format DNN model, which may have two benefits: 1) it is very compact because the required number of multiplications of this scheme is identical to the theoretical limit, thus eliminating all the unnecessary redundant computations; and 2) based on its multi-stage processing style, the computing engine only needs to access one tensor core in each stage, thereby leading to significant savings in memory access.
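By way of non-limiting illustration, the multi-stage processing style may be sketched as a TT-format matrix-vector product in which each stage contracts the running activation with exactly one tensor core (a didactic sketch, not an optimized hardware kernel; core shapes follow the illustrative TT-matrix convention (r_{k-1}, m_k, n_k, r_k), and the sizes below are hypothetical):

```python
import numpy as np

def tt_matvec(cores, x):
    """Multiply a TT-format matrix by a vector, one core per stage.
    Between stages, v carries the shape (m_1..m_{k-1}, r_{k-1}, n_k..n_d),
    so each stage only needs to read a single tensor core."""
    ns = [G.shape[2] for G in cores]
    v = x.reshape(1, *ns)                      # boundary rank r_0 = 1
    for k, G in enumerate(cores):
        # contract the current rank axis and the n_k axis with core k
        v = np.tensordot(v, G, axes=([k, k + 1], [0, 2]))
        v = np.moveaxis(v, -2, k)              # put the new m_k axis in place
        v = np.moveaxis(v, -1, k + 1)          # put the new rank axis after it
    return v.reshape(-1)                       # boundary rank r_d = 1 collapses

# Two illustrative cores for a 9x16 matrix (m = 3*3, n = 4*4, rank 5):
rng = np.random.default_rng(2)
G0 = rng.standard_normal((1, 3, 4, 5))
G1 = rng.standard_normal((5, 3, 4, 1))
x = rng.standard_normal(16)
W = np.einsum('aijb,bklc->ikjl', G0, G1).reshape(9, 16)  # dense reference
print(np.allclose(tt_matvec([G0, G1], x), W @ x))  # True
```

The stage-by-stage structure is what allows a hardware engine to stream one core at a time from memory rather than materializing the full weight matrix.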
In some embodiments, an inferencing engine based on the inferencing scheme may be developed to form a tensor decomposed format DNN inference engine (“TIE”), which may include a specialized hardware architecture based on tensor decomposed-DNNs. TIE is designed to fully reap the benefits of embodiments of the present hardware-friendly inference scheme and achieves high computation efficiency as well as simple memory access. Also, TIE is flexible and may be adapted to various network types, values of ranks, numbers of tensor dimensions, and combinations of factorization factors, thereby making it well suited for various application scenarios and tasks.
In order to facilitate and promote the widespread deployment of DNNs in a broader scope of application scenarios, both the ML and hardware communities have conducted extensive investigations on compressing DNNs with affordable accuracy loss. Specifically, due to the well-recognized and verified redundancy of DNN models, different compression approaches, such as pruning, clustering, low rank decomposition and low bit-width, etc., have been adopted to remove the redundancy in structure, layer, weight or number precision of DNN models. Correspondingly, several compression-oriented DNN accelerators have also been customized for those compression approaches to achieve high hardware performance.
Among various DNN compression techniques, tensor decomposition is unique due to its extremely high compression ratios. Tensor decomposition may include one or more tensor-based compression techniques such as, e.g., tensor train (TT) decomposition, tensor ring (TR) decomposition, dual problem singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), Tucker decomposition, hierarchical Tucker decomposition, block term decomposition, among others or any combination thereof.
For instance, experiments show that applying tensor decomposition to the fully-connected (FC) layers of VGG-16 on the ImageNet dataset may bring a record-breaking 50000 times compression ratio, while other compression approaches typically achieve much less compression on FC layers. Moreover, due to the generality of tensor decomposition, this approach may also be applied to compressing convolutional (CONV) layers via decomposing the weight tensors of CONV layers. In some embodiments, tensor decomposition may be effective in several representative types of DNNs, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
From the perspective of tensor theory, the impressive compression capability of tensor decomposition such as TT decomposition may come from its unique tensor factorization scheme. As illustrated in
Due to the advantages of TT decomposition on model compression, exploiting an efficient DNN hardware architecture based on TT decomposition (referred to as TT-DNN) may provide a solution to the drawbacks of typical DNN compression techniques. Considering the high compression ratios that TT decomposition may bring, such a specialized architecture may execute state-of-the-art CNN and RNN models using smaller memory resources than typical DNNs, including DNNs compressed by typical means, thereby leading to more area and energy efficient solutions for resource-constrained DNN accelerator design.
In some embodiments, realizing a high-performance TT-DNN accelerator may overcome the challenge on an inefficient inference scheme based on a tensor decomposed format DNN model. In some embodiments, a tensor decomposed inference scheme may have a larger amount of redundant computations, leading to higher computational cost in the inference phase relative to a standard DNN model. Moreover, those inherent redundant computations also incur intensive memory accesses because the tensor cores need to be frequently accessed when calculating each element of output tensor, thereby causing high energy consumption. As a result, despite the high compression ratios, the inherent inefficiency of a tensor decomposed inference scheme may directly impede the potential deployment of TT-DNN accelerator in energy-constrained applications.
In some embodiments, to address the fundamental challenges, the tensor decomposed DNN and inferencing engine of the present disclosure enable a hardware-friendly inference scheme. In some embodiments, a theoretical limit for the minimum number of multiplications needed for tensor decomposed-format inference may be calculated, and a computation-efficient inference scheme may be developed accordingly. The tensor decomposed-format inference scheme has two benefits: 1) it is very compact because the required number of multiplications of this scheme is identical to the theoretical limit, thus eliminating all the unnecessary redundant computations; and 2) based on its multi-stage processing style, the computing engine only needs to access one tensor core in each stage, thereby leading to significant savings in memory access.
In some embodiments, an inferencing engine based on the inferencing scheme may be developed to form a tensor decomposed format DNN inference engine (“TIE”), which may include a specialized hardware architecture based on tensor decomposed-DNNs. TIE is designed to fully reap the benefits of embodiments of the present hardware-friendly inference scheme and achieves high computation efficiency as well as simple memory access. Also, TIE is flexible and may be adapted to various network types, values of ranks, numbers of tensor dimensions, and combinations of factorization factors, thereby making it well suited for various application scenarios and tasks.
In some embodiments, an example of the TIE design may include a prototype TIE design using CMOS 28 nm technology for tensor train format DNNs (TT-DNNs). With 16 processing elements (PEs) operating at 1000 MHz, the TIE accelerator occupies 1.74 mm2 and consumes 154.8 mW. Compared to typical compressed DNN-oriented accelerators using other compression methods, such as sparsification and structured matrices (CIRCNN), the TT decomposition-based TIE exhibits significant advantages in hardware performance. Compared with EIE, TIE achieves 7.22×~10.66× better area efficiency and 3.03×~4.48× better energy efficiency on different workloads, respectively. Compared with CIRCNN, TIE achieves 5.96× and 4.56× higher throughput and energy efficiency, respectively.
In some embodiments, TT decomposition is an efficient compression approach to reduce DNN model sizes. In general, TT decomposition may decompose a large-size multidimensional tensor into a set of small-size 3-dimensional tensors.
Specifically, for a d-dimensional n1×n2×. . . ×nd tensor 𝒜, after TT decomposition 𝒜 is stored in the tensor decomposed format using d tensor cores 𝒢k∈ℝ^(rk−1×nk×rk), such that each element of 𝒜 is represented as:
𝒜(j1, . . . ,jd)=G1[j1]×G2[j2]×. . . ×Gd[jd],  Equation (1)
where Gk[jk]∈ℝ^(rk−1×rk) is the jk-th slice of 𝒢k, and r0=rd=1 so that the product evaluates to a scalar.
In some embodiments, the value of rk may be set as a small value, so that the parameter saving resulting from TT decomposition may be very significant. Consequently, leveraging TT decomposition to perform efficient compression on DNN models is very attractive since the fully-connected (FC) and convolutional (CONV) layers of DNNs are in the format of matrices and tensors, which may be decomposed and represented in the tensor decomposed format. In some embodiments, in order to maintain high test accuracy, the TT decomposition is typically not directly applied to the 2D weight matrix or 4D weight tensor but to their reshaped format. For instance, as illustrated in
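The parameter saving can be illustrated with a short calculation. The shapes and ranks below are illustrative assumptions, not values from the disclosure: a full d-dimensional tensor stores Πnk entries, while its TT format stores only one rk−1×nk×rk core per mode:

```python
import math

def tt_param_counts(shape, ranks):
    """Compare entries in a full tensor vs. its TT format.

    shape: mode sizes (n1, ..., nd); ranks: (r0, r1, ..., rd) with r0 = rd = 1.
    The TT format stores one r_{k-1} x n_k x r_k core per mode.
    """
    full = math.prod(shape)
    tt = sum(r0 * n * r1 for n, r0, r1 in zip(shape, ranks, ranks[1:]))
    return full, tt

# Illustrative 6-dimensional tensor with all internal TT-ranks equal to 4
full, tt = tt_param_counts([8] * 6, [1, 4, 4, 4, 4, 4, 1])
print(full, tt)  # 262144 full entries vs. 576 TT entries (~455x fewer)
```

Small ranks thus shrink storage by orders of magnitude, which is why setting rk small makes the TT format attractive for FC and CONV weights.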
In some embodiments, when a DNN model is stored in the tensor decomposed, the corresponding inference and training schemes may be re-formulated since the underlying representation for weight matrix and tensor of FC and CONV layers have been changed.
In some embodiments, inference may be performed on tensor decomposed FC layers. In some embodiments, with weight matrix W∈ℝ^(M×N), input vector x∈ℝ^N and output vector y∈ℝ^M, the inference procedure on an FC layer is y=Wx, where the bias is combined with W for simplicity. In the scenario of representing the weight matrix in the tensor decomposed format, such an inference scheme may be re-formulated as follows:
𝒴(i1, . . . ,id)=Σj1Σj2 . . . Σjd𝒢1[i1,j1]𝒢2[i2,j2]. . . 𝒢d[id,jd]𝒳(j1, . . . ,jd),  Equation (2)
where 𝒴∈ℝ^(m1×m2×. . . ×md) and 𝒳∈ℝ^(n1×n2×. . . ×nd) are the tensorized output and input, and 𝒢k[ik,jk]∈ℝ^(rk−1×rk) is the (ik,jk)-th slice of tensor core 𝒢k.
In some embodiments, due to its generality for arbitrary tensors, tensor decomposition may also enable efficient inference on a CONV layer that is affiliated with a weight tensor. In some embodiments, there are two methods to represent the conventional 4D weight tensor of a CONV layer in the tensor decomposed format. The first one is to directly apply tensor decomposition to the 4D tensor and obtain the corresponding tensor cores. However, such a method is not very efficient for CONV layers with small kernel sizes (e.g., 1×1 convolution). In some embodiments, another method is to reshape the 4D weight tensor to a 2D matrix, and then use the same procedure as inference on an FC layer to perform inference on the CONV layer. As illustrated in
In some embodiments, regarding training tensor decomposed format DNN models, in general, after the sizes of the tensor cores 𝒢k have been determined, a DNN model in the tensor decomposed format may be either trained from scratch or obtained from a pretrained non-tensor decomposed format model. In some embodiments, the train-from-scratch strategy assigns initial values for each tensor core and then performs backward propagation to update them. On the other hand, if converting a non-tensor decomposed trained model to the tensor decomposed format is needed, the standard TT decomposition is first applied to the weight matrix/tensor of the FC/CONV layer of the model to form the initial values of the tensor cores. Then a backward propagation-based fine-tuning process is performed to retain the original high accuracy. In some embodiments, other techniques for training a tensor decomposed format neural network are described further below.
In some embodiments, based on the training and inference described above, the tensor decomposed format DNN models may be trained and tested. Table 1-Table 3 list the test accuracy and compression ratio (CR) of different types of DNN models (convolutional neural network (CNN) and recurrent neural network (RNN)) on different datasets. Here the CR is measured as the reduction in the number of parameters of the model. Specifically, the experimental settings are shown in Tables 1 through 3 as follows:
In some embodiments, as shown in Table 1-Table 3, TT decomposition enables significant reduction in the number of parameters of the decomposed layers and the entire DNN model sizes. Meanwhile, it preserves high task accuracy on different datasets, facilitating practical deployment of DNN models, e.g., in resource constrained environments. However, a drawback of typical tensor decomposition, namely the low computational efficiency in the inference phase elaborated in Section 3.1, impedes wide adoption in practical systems. Indeed, the TT-decomposed models achieve similar test accuracy to the state-of-the-art work having 80.8% accuracy.
In some embodiments, as described in Equation (2), the inference on the tensor decomposed layers of DNN models may be performed via multi-dimensional summation of the products of slices of different tensor cores. This implementation, though classical and straightforward, incurs severe challenge that leads to low computational efficiency.
In general, the low computational efficiency of the tensor decomposed inference scheme comes from its inherent redundant computations. Recall that in Equation (2), calculating a specific element of the output tensor 𝒴(i1, . . . ,id) requires consecutive multiplications of 𝒢k[ik,jk] over all jk's. Since each 𝒴(i1, . . . ,id) always shares part of its indices ik with many other 𝒴(i1, . . . ,id)'s, calculating those index-sharing elements inherently repeats the same consecutive multiplications among the same 𝒢k[ik,jk]'s, thereby causing unnecessary computational redundancy.
To further quantify the computational redundancy, an analytic evaluation of the total number of multiplications consumed in Equation (2) and the minimum required number of multiplications for calculating all 𝒴(i1, . . . ,id)'s, respectively, may be performed. For simplicity, only multiplications are counted toward computational cost.
Analysis on total number of multiplications in Equation (2): First, the total required number of multiplications consumed in Equation (2) may be examined. As indicated in
Analysis on minimum number of multiplications for 𝒴(i1, . . . ,id): Next the minimum number of required multiplications for calculating all 𝒴(i1, . . . ,id)'s may be analyzed. The general procedure is to first determine the computational cost for 𝒴(i1, . . . ,id−1, :) when i1~id−1 are specific. In some embodiments, the ':' is used in the i-th dimension of a tensor to denote all elements in that dimension. The number of non-redundant multiplications for calculating 𝒴(i1, . . . ,id−2, :, :) when i1~id−2 are specific is then determined based on the computation cost when i1~id−1 are specific. As a result, the computations involved with 𝒢d[id,jd] are not considered again since such computations have been included before, thereby avoiding counting repeated computations. The similar analysis from the (d−2)-th dimension to the 1-st dimension of 𝒴 may be performed, and finally the minimum required number of multiplications for calculating all 𝒴(i1, . . . ,id)'s as 𝒴(:, . . . , :, :) may be obtained. In general, such a recursive counting method ensures that all the multiplications involved with the calculation of all 𝒴(i1, . . . ,id)'s are included while none are counted repeatedly.
Specifically, the detail of the above analysis procedure is described as follows. First consider the computational cost for 𝒴(i1, . . . ,id−1, :) (referred to as stage-1). Recall that Equation (2) indicates that calculating one 𝒴(i1, . . . ,id−1,id) requires all 𝒳(j1, . . . ,jd−1,jd)'s and the d slices 𝒢k[ik,jk] where k=1, 2, . . . ,d. Additionally, as illustrated in
Next, the additional computational cost for calculating 𝒴(i1, . . . ,id−2, :, :) (referred to as stage-2) may be determined. Similar to the previous analysis on recursive computation, when computing 𝒴(i1, . . . ,id−2, :, :), the computation involved with 𝒴(i1, . . . ,id−2,id−1, :) has already been considered and may not be re-counted again. Therefore, the additional number of multiplications for calculating 𝒴(i1, . . . ,id−2, :, :) within 𝒴(:, . . . , :, :) is:
By generalizing Equation 5, the additional number of multiplications for calculating 𝒴(i1, . . . ,il, :, . . . , :) within stage-(d−l) may be derived as:
Consequently, because 𝒴 has d dimensions, the total minimum number of multiplications for calculating all of 𝒴(:, . . . , :) across all d stages is:
Equation 7 gives the analytical result for the minimum number of multiplications needed to perform tensor decomposed inference. Comparing this theoretical limit with Equation 3 shows that the conventional tensor decomposed scheme contains substantial computational redundancy. For instance, for the FC-6 layer in VGG-16 with d=6 and ri=4, the number of multiplications consumed in Equation 3 is 1073 times that in Equation 7. Such redundancy in multiplication results in reduced computational efficiency of conventional approaches.
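The two counts can be tallied in a few lines. The closed forms below are plausible accountings consistent with the discussion above (the naive scheme evaluates a chain of rank-sized slice products per output element per input index tuple; the compact scheme pays one matrix multiplication per core), not the exact Equations 3 and 7, which are not reproduced here:

```python
import math

def naive_mults(n, m, r):
    """Multiplications when Equation (2) is evaluated directly: every output
    element (prod m) times every input index tuple (prod n) times the cost of
    the slice-product chain G1[i1,j1]...Gd[id,jd]X(j1,...,jd)."""
    # +1 accounts for the final multiply with the scalar X(j1,...,jd)
    chain = 1 + sum(r[k] * r[k + 1] for k in range(1, len(r) - 1))
    return math.prod(m) * math.prod(n) * chain

def compact_mults(n, m, r):
    """Multiplications when each core is visited exactly once as a single
    matrix multiplication against the partially transformed intermediate."""
    total = 0
    for h in range(len(n)):
        # modes already consumed contribute n's; modes not yet consumed, m's
        cols = math.prod(n[:h]) * math.prod(m[h + 1:])
        total += r[h] * m[h] * n[h] * r[h + 1] * cols
    return total
```

For an assumed d=6 layer with all mode sizes 4 and all internal ranks 4, the naive count exceeds the compact count by roughly a thousandfold, in line with the redundancy discussed above.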
In some embodiments, to address the challenge of low computational efficiency, a computation-efficient tensor decomposed inference scheme is configured to calculate all the elements of output tensor in parallel without any redundant computations in a compact inference scheme, thereby improving computational efficiency over the conventional tensor decomposed inference scheme.
In some embodiments, the design of the compact inference scheme leverages the theoretical analysis on the minimum required number of computations in Section 3. Recall that in the previous analysis the minimum number of multiplications is counted based on the assumption that the computations involved with 𝒢k[ik,jk] are not repeated in the future computations involved with 𝒢k−1[ik−1,jk−1]. To achieve this, in some embodiments, the computation on different 𝒢k's may be performed one by one. In other words, different from Equation 2 that calculates one output tensor element using the d slices 𝒢k[ik,jk] where k=1, 2, . . . ,d, in some embodiments, the compact inference scheme may perform all the computation involving the mk·nk slices 𝒢k[ik,jk] of one specific 𝒢k at once, and then not revisit 𝒢k for the remaining cores. Consequently, such a computing arrangement breaks the original data dependency and eliminates the potential computational redundancy.
In some embodiments, as described in Section 2.1 above and shown in
𝒳(j1, . . . ,jd−1,jd)→X′(p,q),  Equation (8)
where p=jd and q=Σl=1d−1jlΠi=1l−1ni. Then, a compact matrix format multiplication may be performed as:
Vd={tilde over (G)}dX′,  Equation (9)
where Vd is the intermediate matrix to be sent to stage-2, and {tilde over (G)}d is the matrix format of the unfolded 𝒢d (see
In some embodiments, in each subsequent stage, the intermediate matrix Vh output by the previous stage may be transformed to V′h before being multiplied with the next tensor core:
Vh(p,q)→V′h(p′,q′),
where p=ihrh−1+th−1, p′=jh−1rh−1+th−1,
q=(Σl=1h−1jlΠi=1l−1ni)Πk=1d−hmd−k+1+Σg=2d−h(Πk=gd−hmd−k+1)id−g+2,
q′=(Σl=1h−2jlΠi=1l−1ni)Πk=1d−h+1md−k+1+Σg=2d−h+1(Πk=gd−h+1md−k+1)id−g+2.  Equation (10)
After performing this transform, a compact matrix-format computation for stage-(d−h+2) is as:
Vh−1={tilde over (G)}h−1V′h,  Equation (11)
where Vh−1 may then be sent to stage-(d−h+3) and transformed again.
In some embodiments, as illustrated in
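The multi-stage scheme above can be sketched in a few lines of NumPy. The core memory layout (rk−1, mk, nk, rk) and all helper names are illustrative assumptions rather than the TIE data layout; each loop iteration contracts exactly one tensor core, and the moveaxis re-layout plays the role of the Vh→V′h transform:

```python
import numpy as np

def tt_matvec(cores, x):
    """Stage-by-stage TT inference: y = Wx with W given as TT cores.

    cores[k] has shape (r_k, m_k, n_k, r_{k+1}) with r_0 = r_d = 1.
    Each stage contracts exactly one core, so no slice product is repeated.
    """
    ns = [c.shape[2] for c in cores]
    v = x.reshape([1] + ns)                      # leading axis is rank r_0 = 1
    for k, core in enumerate(cores):
        # Contract the current rank axis (position k) and the n_k axis (k+1);
        # the result gains the output mode m_k and the next rank r_{k+1}.
        v = np.tensordot(v, core, axes=([k, k + 1], [0, 2]))
        # Move (m_k, r_{k+1}) back into place: the Vh -> V'h re-layout.
        v = np.moveaxis(v, (-2, -1), (k, k + 1))
    return v.reshape(-1)                         # trailing rank r_d = 1 folds in

def tt_to_matrix(cores):
    """Recompose the full M x N weight matrix (for checking only)."""
    w = cores[0]
    for c in cores[1:]:
        w = np.tensordot(w, c, axes=([-1], [0]))
    w = np.squeeze(w, axis=(0, w.ndim - 1))      # drop r_0 and r_d axes
    d = len(cores)
    w = w.transpose(list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2)))
    return w.reshape(int(np.prod([c.shape[1] for c in cores])),
                     int(np.prod([c.shape[2] for c in cores])))
```

For random cores, `tt_matvec(cores, x)` agrees with the dense product `tt_to_matrix(cores) @ x` to machine precision, which is a convenient sanity check on the staged scheme.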
In some embodiments, at each stage of the compact tensor decomposed inference scheme, the intermediate values may be buffered on-chip for the processing of the next stage. In some embodiments, the storage capacity needed to store the intermediate values from stage-(d−h+1) is max(rh−1Πk=1h−1nkΠk=hdmk), where h=1 . . . d. In some embodiments, both the input and output of each stage may be stored, so the overall storage overhead is 2×max(rh−1Πk=1h−1nkΠk=hdmk), where h=1 . . . d. In some embodiments, because the activation size is much less than the weight size in other compression techniques, the storage overhead brought by the compact tensor decomposed inference scheme is low enough to be implemented in an embedded or other resource constrained device.
A standard convolution includes performing each of Equations 12 and 13 below:
Y(h′,w′,o)=Σk1Σk2Σi𝒦(k1,k2,i,o)X(h,w,i),  Equation (12)
h=(h′−1)s+k1−p, w=(w′−1)s+k2−p,  Equation (13)
where s is the stride, and p is the zero-padding size. As a result, the computational cost of standard convolution is HWOIk2.
Where a convolutional network is compressed using TT, the computation of inferencing with the tensor decomposed CNN may be as follows in Equation 14:
Y′(h′,w′,o1, . . . ,od)=Σk1Σk2Σi1 . . . Σid𝒢0[k1,k2]𝒢1[o1,i1]. . . 𝒢d[od,id]X(h,w,i1, . . . ,id),  Equation (14)
However, this results in a computation cost of H′W′OIk2R1 . . . Rd, which is roughly R1 . . . Rd times that of a standard convolution computation.
Accordingly, the compact TT convolution computation for inferencing with a tensor decomposed CNN may be performed in accordance with embodiments described herein using Equation 15 below:
Z1(h,w,r2)=Σi=1IG3(r2,i)X(h,w,i),
Z2(h′,w′,r1)=Σk1Σk2Σr2G2(k1,k2,r1,r2)Z1(h,w,r2),
Y(h′,w′,o)=Σr1G1(o,r1)Z2(h′,w′,r1),  Equation (15)
where h and w are given by Equation (13).
This results in a computation cost of HWIR1+H′W′k2R1R2+H′W′OR1, which is less costly than the conventional TT convolution scheme.
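A small NumPy check of the three-stage scheme above (stride 1, no zero-padding; all shapes, rank values, and function names are illustrative assumptions) confirms that it reproduces the direct convolution with the composed kernel:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def direct_conv(x, w):
    """Direct convolution: x is (H, W, I); w is (k, k, I, O); stride 1, no pad."""
    k = w.shape[0]
    win = sliding_window_view(x, (k, k), axis=(0, 1))   # (H', W', I, k, k)
    return np.einsum('hwikl,klio->hwo', win, w)

def staged_tt_conv(x, g1, g2, g3):
    """Three-stage computation: contract input channels, then the spatial
    kernel, then expand output channels."""
    k = g2.shape[0]
    z1 = np.einsum('ri,hwi->hwr', g3, x)                # (H, W, R2)
    win = sliding_window_view(z1, (k, k), axis=(0, 1))  # (H', W', R2, k, k)
    z2 = np.einsum('hwrkl,klsr->hws', win, g2)          # (H', W', R1)
    return np.einsum('os,hws->hwo', g1, z2)             # (H', W', O)

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6, 3))                      # H = W = 6, I = 3
g1 = rng.standard_normal((4, 2))                        # (O, R1)
g2 = rng.standard_normal((3, 3, 2, 2))                  # (k, k, R1, R2)
g3 = rng.standard_normal((2, 3))                        # (R2, I)
w = np.einsum('os,klsr,ri->klio', g1, g2, g3)           # composed (k, k, I, O)
assert np.allclose(staged_tt_conv(x, g1, g2, g3), direct_conv(x, w))
```

Because each stage is one small contraction, the factorized ranks appear only linearly in each term of the cost, rather than multiplying together as in the conventional scheme.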
Equation 15 has been represented for a particular case of a tensor decomposed CNN. A general case compact TT convolution computation may be performed according to Equation 16 below:
In some embodiments, based on the efficient tensor decomposed inference scheme in TIE, a specially configured hardware architecture of tensor train-based inference engine may be produced.
In some embodiments, based on the data mapping and processing scheme described above, the overall architecture of TIE is shown in
In some embodiments, with respect to data-path, as shown in
In some embodiments, with respect to the weight SRAM, the weight SRAM of TIE may store the weight parameters of the tensor cores {tilde over (G)}h's. In some embodiments, though each layer of a tensor decomposed format DNN model is affiliated with d tensor cores, the sequential access to different {tilde over (G)}h's in different computation stages, which is described in Section 2.2, enables the simple storing strategy of locating all {tilde over (G)}h's in the same weight SRAM sequentially from h=1 to d. However, different from such sequential placement for the consecutive {tilde over (G)}h's, the data allocation within the same {tilde over (G)}h may not always be sequential in the weight SRAM. For instance, as illustrated in
In some embodiments, as indicated in Algorithm 1, a transform from Vh to V′h may be used in each stage of computation to ensure the functional correctness of the inference scheme. Conventionally, such a transform includes matrix reshape and transpose operations, which demand extra memory resources, thereby degrading the hardware performance of the entire design in both area efficiency and power efficiency.
In some embodiments, to address such a problem, efficient read and write schemes may be designed for the working SRAMs to achieve zero-cost matrix transform. In some embodiments, the present general methodology is to ensure that the data-path reads the elements of V′h from the working SRAM that stores Vh, thereby enabling on-the-fly matrix transform for Vh. In some embodiments, to achieve the on-the-fly transform, the working SRAM may be partitioned into multiple groups with a well-designed data selection mechanism.
In some embodiments, the writing scheme may be consistent with the computing scheme described above. In the computing scheme (Section 3.1), each PE calculates NMAC elements in the same column of Vh after every NG
In some embodiments, a reading scheme may be designed for on-the-fly transform of Vh. As described above, the matrix transform operation on Vh may be performed during the reading phase using a partitioned group-based data selection mechanism. In some embodiments, Algorithm 2 describes an example of the mechanism in detail. In some embodiments, the transform mechanism may utilize the indices of SRAM groups, component SRAMs and elements to locate and read the targeted element of V′h in a mathematically equivalent manner.
In some embodiments, besides the transform from Vh to V′h, the inference on entire DNN models also requires a transform from V1 of this layer to X′ of the next layer. In some embodiments, before being transformed, V1 needs to be processed by the activation units in the PEs first. Notably, embodiments of the present mathematical analysis show that such inter-layer transform is identical to the intra-layer transform described before. Therefore, when TIE is performing the computation between two consecutive layers, it may still utilize the working SRAM read scheme.
In some embodiments, the high-level functional behavior of TIE may be modeled by a bit-accurate, cycle-accurate simulator. Based on the model, an RTL model may be developed using Verilog and the functional validity of the RTL model verified. In some embodiments, the verified RTL model may be synthesized using Synopsys Design Compiler with a CMOS 28 nm library. Here the gate-level netlist may be annotated with the toggle rate obtained from the extracted switching activity during simulation. After that, Synopsys IC Compiler may be used to perform place and route and generate the layout (see
Benchmarks. To evaluate the performance of TIE on different tasks, we choose several workloads from two models used in image classification and video classification tasks, respectively. Here the same-size layers with different TT-decomposition setting are viewed as different workloads. Table 4 lists the information of four benchmark layers, including the size, TT-decomposition settings (d, n, m, and r) and compression ratio.
Design Configuration. Table 5 shows the configuration information of the TIE hardware. The entire design consists of 16 PEs with 16-bit quantization. Each PE is equipped with 16 MACs and 16 activation units, where each MAC contains one 16-bit-width multiplier and one 24-bit-width accumulator. Regarding the memory, a 16 KB weight SRAM is used to store up to 8192 16-bit weights on the chip. According to Section 2.3, such a budgeted memory capacity for the weight SRAM is sufficient for most TT-DNN models. The working SRAM contains two copies acting as a ping-pong buffer, where each copy has a capacity of 384 KB. Therefore, the total capacity of the working SRAM is 384×2=768 KB.
Hardware Resources and Performance.
4.3 Comparison with EIE, CIRCNN, and Eyeriss
In this subsection, we compare TIE with two state-of-the-art compressed DNN-targeted accelerators: EIE and CIRCNN. Different from TIE, model compression in EIE and CIRCNN comes from other sources: For EIE, model compression is achieved via network sparsification; for CIRCNN, model compression is from structuring topology. Moreover, to evaluate the performance of TIE on CONV layers, we also compare TIE with representative CONV-oriented work: Eyeriss.
Comparison with EIE. Table 7 summarizes the design parameters and hardware performance of EIE and TIE. Due to the different technology nodes adopted in the two works, the clock frequency, silicon area and power consumption of EIE are also projected under the same 28 nm technology for fair comparison. Such projection applies linear, quadratic and constant scaling rules to frequency, area and power, respectively.
Comparison with CIRCNN. Table 8 compares the hardware performance of CIRCNN and TIE. Notice that here the listed performance metrics of the two designs are obtained from their synthesis reports for fair comparison since CIRCNN reports synthesis results.
Meanwhile, due to the lack of area information of CIRCNN, we compare the overall throughput (in terms of TOPS) and energy efficiency (in terms of TOPS/W) of the two designs. After projecting the performance of CIRCNN to the same 28 nm technology for fair comparison, it is seen that TIE achieves 5.96× and 4.56× higher throughput and energy efficiency than CIRCNN, respectively.
Comparison with Eyeriss. Table 9 summarizes the design parameters and hardware performance of Eyeriss and TIE on CONV layers of VGG. For fair comparison, the clock frequency, silicon area and power consumption of Eyeriss are also projected under the 28 nm technology. We used core area and processing latency of Eyeriss instead of chip area and total latency for fair comparison with TIE.
TIE is designed to provide sufficient flexibility to support the needs of different TT models having different layer sizes and decomposition settings. As illustrated in
In some embodiments, tensor decomposition is a tool that explores the low tensor rank characteristics of large-scale tensor data. Different from other model compression methods, tensor decomposition may uniquely provide ultra-high compression ratios for DNNs, including CNN and RNN models. The advanced tensor decomposition approaches, such as tensor train (TT) and tensor ring (TR), among others, may bring more than one thousand times parameter reduction to the input-to-hidden layers of DNN models, while the corresponding classification accuracy in the video recognition task may even be significantly improved.
In some embodiments, typical tensor decomposition approaches, including TT and TR, suffer accuracy loss when compressing a trained DNN. CNN models exhibit the greatest accuracy loss.
The accuracy loss due to tensor decomposition is mainly due to the unique challenges involved in training tensor decomposed format DNN models. In general, there are typically two ways to use tensor decomposition to obtain a compressed model: 1) train from scratch in the decomposed format; or 2) decompose a pre-trained uncompressed model and then retrain. In the former case, when the required tensor decomposition-based, e.g., tensor decomposed format, model is directly trained from scratch, because the structure of the model is already pre-set to a low tensor rank format before the training, the corresponding model capacity is typically limited as compared to the full-rank structure, thereby causing the training process to be very sensitive to initialization and making it more challenging to achieve high accuracy. In the latter scenario, though the pre-trained uncompressed model provides a good initialization position, directly decomposing the full-rank uncompressed model into a low tensor rank format causes inevitable and non-negligible approximation error, which is very difficult to recover even after a long re-training period. Moreover, in either training approach, tensor decomposition always brings a linear increase in network depth, which implies that tensor decomposition-format DNNs are typically more prone to the gradient vanishing problem and hence difficult to train well.
In some embodiments, to overcome the current limitations of tensor decomposition and fully unlock its potential for model compression, a systematic framework for tensor decomposition-based model compression may employ a dual problem optimization technique, such as, e.g., alternating direction method of multipliers (ADMM), dual ascent method (DAM), dual decomposition, among other suitable dual problem optimization techniques or any combination thereof. By formulating TT decomposition-based model compression as an optimization problem with constraints on tensor ranks, the dual problem optimization technique may be leveraged to systemically solve the optimization problem in an iterative way. During this procedure the entire DNN model is trained in the original structure instead of the tensor decomposed format, but gradually enjoys the desired low tensor rank characteristics. The trained uncompressed model may then be decomposed to the tensor decomposed format and fine-tuned to finally obtain a high-accuracy trained tensor decomposed format DNN model.
In some embodiments, the systematic framework may include formulating and solving the tensor decomposition-based model compression problem. By formulating this problem as a constrained non-convex optimization problem, embodiments of the present framework gradually restrict the DNN model to the target tensor ranks without explicitly training in the tensor decomposed format, thereby maintaining the model capacity while minimizing approximation error and avoiding increased network depth.
In some embodiments, the systematic framework employs a dual problem optimization technique such as ADMM to efficiently solve the reformulated optimization problem via separately solving two sub-problems. In some embodiments, a first sub-problem may directly optimize the loss function of the DNN with a regularization term, e.g., by stochastic gradient descent or another suitable optimization method. In some embodiments, the second sub-problem may use an introduced projection to constrain the tensor ranks analytically.
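The two alternating sub-problems can be sketched end to end. In this sketch, a plain truncated-SVD matrix projection stands in for the TT-rank projection of the disclosure, and the quadratic loss, shapes, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def project_low_rank(m, rank):
    """Sub-problem 2: analytic projection onto the low-rank constraint set."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    s[rank:] = 0.0
    return (u * s) @ vt

def admm_compress(x, y, rank, rho=1.0, outer=30, inner=50, lr=0.01):
    """ADMM loop: W carries the loss, Z carries the rank constraint,
    U is the scaled dual variable tying them together."""
    w = np.zeros((x.shape[1], y.shape[1]))
    z = w.copy()
    u = np.zeros_like(w)
    for _ in range(outer):
        for _ in range(inner):
            # Sub-problem 1: gradient steps on loss + (rho/2)||W - Z + U||^2
            grad = x.T @ (x @ w - y) / len(x) + rho * (w - z + u)
            w -= lr * grad
        z = project_low_rank(w + u, rank)   # sub-problem 2: rank projection
        u += w - z                          # dual update
    return z

rng = np.random.default_rng(2)
x = rng.standard_normal((200, 8))
w_true = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))  # rank 2
y = x @ w_true
z = admm_compress(x, y, rank=2)
assert np.linalg.matrix_rank(z) <= 2
```

After the loop, Z satisfies the rank constraint exactly while still fitting the task; fine-tuning from Z then corresponds to the recovery phase described above.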
In an example implementation of the framework, an evaluation on different example DNN models for image classification and video recognition tasks is described. The example evaluation results show that embodiments of the present dual problem optimization-based tensor decomposed format models demonstrate high compression performance with high accuracy. For example, on CIFAR-100, with 2.3 times and 2.4 times compression ratios, embodiments of the present models have 1.96% and 2.21% higher top-1 accuracy than the original ResNet-20 and ResNet-32, respectively. For compressing an example ResNet-18 on ImageNet, embodiments of the present model achieve a 2.47× FLOPs reduction with no accuracy loss.
In some embodiments, 𝒜∈ℝ^(n1×n2×. . . ×nd) denotes a d-order tensor. In some embodiments, given a tensor 𝒜∈ℝ^(n1×n2×. . . ×nd), the TT decomposition factorizes 𝒜 into d TT-cores such that 𝒜(j1, . . . ,jd)=𝒢1[j1]𝒢2[j2]. . . 𝒢d[jd],
where 𝒢k∈ℝ^(rk−1×nk×rk) is the k-th TT-core, and the vector r=(r0,r1, . . . ,rd) with r0=rd=1 contains the TT-ranks.
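A TT decomposition of the kind just described can be computed by sequential truncated SVDs (the classical TT-SVD procedure). The function names and rank choices below are illustrative assumptions:

```python
import numpy as np

def tt_svd(a, max_ranks):
    """Factor a d-dimensional array into TT cores G_k of shape
    (r_{k-1}, n_k, r_k) via sequential truncated SVDs."""
    shape = a.shape
    d = len(shape)
    cores = []
    r_prev = 1
    c = a.reshape(r_prev * shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r = min(max_ranks[k], len(s))            # truncate to the target rank
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        # carry the remainder to the next mode
        c = (s[:r, None] * vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(c.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into the full array (for checking only)."""
    a = cores[0]
    for g in cores[1:]:
        a = np.tensordot(a, g, axes=([-1], [0]))
    return np.squeeze(a, axis=(0, a.ndim - 1))
```

When the input tensor genuinely has low TT-ranks, the truncated SVDs are exact and `tt_reconstruct(tt_svd(a, r))` recovers the tensor to machine precision.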
In some embodiments, for a simple fully-connected layer with weight matrix W∈ℝ^(M×N) and input x∈ℝ^N, where M=Πk=1dmk and N=Πk=1dnk, the output y∈ℝ^M may be obtained by y=Wx. In some embodiments, in order to transform this standard layer to a TT fully-connected (TT-FC) layer, the weight matrix may be tensorized to a weight tensor 𝒲∈ℝ^((m1·n1)×(m2·n2)×. . . ×(md·nd)) such that
𝒲((i1,j1), . . . ,(id,jd))=𝒢1[i1,j1]𝒢2[i2,j2]. . . 𝒢d[id,jd].
In some embodiments, each TT-core 𝒢k∈ℝ^(rk−1×mk×nk×rk), with slices 𝒢k[ik,jk]∈ℝ^(rk−1×rk), and the forward computation of the TT-FC layer may be performed as
𝒴(i1, . . . ,id)=Σj1 . . . Σjd𝒢1[i1,j1]. . . 𝒢d[id,jd]𝒳(j1, . . . ,jd),
where 𝒳∈ℝ^(n1×. . . ×nd) and 𝒴∈ℝ^(m1×. . . ×md) are the tensorized input and output.
In some embodiments, for a conventional convolutional layer, forward computation performs convolution between a 3-order input tensor 𝒳∈ℝ^(W×H×N) and a 4-order weight tensor 𝒦∈ℝ^(K×K×M×N) to produce the 3-order output tensor 𝒴∈ℝ^((W−K+1)×(H−K+1)×M). In some embodiments, in a TT convolutional (TT-CONV) layer, the input tensor is reshaped to a tensor 𝒳∈ℝ^(W×H×n1×. . . ×nd), and the weight tensor is tensorized and decomposed as
𝒦((k1,k2),(i1,j1), . . . ,(id,jd))=𝒢0[k1,k2]𝒢1[i1,j1]. . . 𝒢d[id,jd],
where M=Πk=1dmk and N=Πk=1dnk. Similar to the TT-FC layer, here 𝒢k[ik,jk]∈ℝ^(rk−1×rk) for k=1, . . . ,d, while 𝒢0[k1,k2]∈ℝ^(1×r0) absorbs the spatial kernel indices.
In some embodiments, additional description of a TT-CONV layer may be found in Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. “Ultimate tensorization: compressing convolutional and fc layers alike.” arXiv preprint arXiv:1611.03214, 2016, which is herein incorporated by reference in its entirety.
In some embodiments, once the TT-FC layer, the TT-CONV layer, and the corresponding forward propagation schemes are formulated in the tensor decomposed format, a suitable optimization method, such as the standard stochastic gradient descent (SGD) algorithm, may be used to update the TT-cores with the rank set r, which determines the target compression ratio. The initialization of the TT-cores may be either randomly set or obtained from directly TT-decomposing a pre-trained uncompressed model.
In some embodiments, with the tensor decomposed format as described above, currently a tensor decomposed format DNN is either 1) trained from scratch with randomly initialized tensor cores; or 2) trained from a direct decomposition of a pre-trained model. For the first strategy, information related to the high-accuracy uncompressed model is unused and thus lost. For the second strategy, though the knowledge of the pre-trained model is indeed utilized, because the pre-trained model generally lacks the low TT-rank property, after direct low-rank tensor decomposition the approximation error is too significant to be properly recovered even with long-time re-training. Such inherent limitations of the existing training strategies consequently cause significant accuracy loss for the compressed tensor decomposed format DNN models.
In some embodiments, to overcome the above-described limitations and maximally retain the knowledge contained in the uncompressed model, or in other words, minimize the approximation error after tensor decomposition with given target tensor ranks, an optimization problem is formulated to minimize the loss function of the uncompressed model with low tensor rank constraints. With a proper advanced optimization technique (e.g., a dual optimization approach such as ADMM, dual decomposition, etc.) regularizing the training procedure, the uncompressed DNN models may gradually exhibit low tensor rank properties. After the regularized training phase, the approximation error brought by the explicit low-rank tensor decomposition becomes negligible and may be easily recovered by fine-tuning (e.g., with stochastic gradient descent or other suitable optimization method).
In some embodiments, as described above, the first phase of the present framework may be to iteratively impose low tensor rank characteristics onto a high-accuracy uncompressed DNN model. Mathematically, this goal may be formulated as an optimization problem to minimize the loss function of the object model with constraints on TT-ranks of each layer (convolutional or fully-connected):
min_𝒲 ℓ(𝒲), s.t. rank(𝒲)≤r*, Equation (22)
where ℓ(⋅) is the loss function of the DNN, rank(⋅) is a function that returns the TT-ranks r=[r0, . . . , rd] of the weight tensor cores, and r*=[r*0, . . . , r*d] are the desired TT-ranks for the layer. To simplify the notation, here r≤r* means ri≤r*i, i=0, . . . , d, for each ri in r.
In some embodiments, solving equation (22) may be generally difficult via normal optimization algorithms since rank(⋅) is non-differentiable. In some embodiments, to overcome this challenge, equation (22) may be rewritten as:
where 𝒮={𝒲 | rank(𝒲)≤r*}. Hence, the objective form (23) is a classic non-convex optimization problem with constraints, which may be properly solved by the dual optimization approach. Specifically, an auxiliary variable 𝒵 and an indicator function g(⋅) of 𝒮 may first be introduced, i.e., g(𝒵)=0 if 𝒵 ∈ 𝒮 and g(𝒵)=+∞ otherwise.
And then equation (23) is equivalent to the following form:
min_{𝒲, 𝒵} ℓ(𝒲)+g(𝒵), s.t. 𝒲=𝒵. Equation (25)
In some embodiments, to ensure convergence without assumptions like strict convexity or finiteness of , instead of Lagrangian, the corresponding augmented Lagrangian in the scaled dual form of the above equation is given by:
L_ρ(𝒲, 𝒵, 𝒰)=ℓ(𝒲)+g(𝒵)+(ρ/2)∥𝒲−𝒵+𝒰∥_F²−(ρ/2)∥𝒰∥_F², Equation (26)
where 𝒰 is the dual multiplier, and ρ>0 is the penalty parameter. Thus, the iterative ADMM scheme may be explicitly performed as:
𝒲^{t+1}=argmin_𝒲 L_ρ(𝒲, 𝒵^t, 𝒰^t), Equation (27)
𝒵^{t+1}=argmin_𝒵 L_ρ(𝒲^{t+1}, 𝒵, 𝒰^t), Equation (28)
𝒰^{t+1}=𝒰^t+𝒲^{t+1}−𝒵^{t+1}, Equation (29)
where t is the iterative step. In some embodiments, the original equation (25) may be separated into the two sub-equations (27) and (28), which may be solved individually. In some embodiments, each sub-equation may be solved at each training iteration.
In some embodiments, with regard to the 𝒲-sub-equation (27): the 𝒲-sub-equation (27) may be reformulated as follows:
min_𝒲 ℓ(𝒲)+(ρ/2)∥𝒲−𝒵^t+𝒰^t∥_F², Equation (30)
where the first term is the loss function, e.g., cross-entropy loss in classification tasks, of the DNN model, and the second term is the L2-regularization. In some embodiments, sub-problem (30) may be directly solved by stochastic gradient descent since both of the two terms are differentiable. Correspondingly, the partial derivative of (30) with respect to 𝒲 is calculated as:
∂L_ρ/∂𝒲=∂ℓ/∂𝒲+ρ(𝒲−𝒵^t+𝒰^t), Equation (31)
And hence 𝒲 may be updated by:
𝒲^{t+1}=𝒲^t−η(∂ℓ/∂𝒲+ρ(𝒲^t−𝒵^t+𝒰^t)), Equation (32)
where η is the learning rate.
In some embodiments, with regard to the 𝒵-sub-problem (28), it may be explicitly formulated as follows:
min_𝒵 g(𝒵)+(ρ/2)∥𝒲^{t+1}−𝒵+𝒰^t∥_F², Equation (33)
where the indicator function g(⋅) of the non-convex set 𝒮 is non-differentiable. Then, updating 𝒵 in this format may be performed as:
𝒵^{t+1}=Π_𝒮(𝒲^{t+1}+𝒰^t), Equation (34)
where Π_𝒮(⋅) is the projection of singular values onto 𝒮, by which the TT-ranks of (𝒲^{t+1}+𝒰^t) are truncated to the target ranks r*. Algorithm 3 describes an example of a specific procedure of this projection in the tensor decomposed format scenario.
In some embodiments, in each dual optimization problem iteration, upon the update of 𝒲 and 𝒵, the dual multiplier 𝒰 is updated by (29). Overall, to solve (25), the entire dual optimization problem-regularized training procedure is performed in an iterative way until convergence or until reaching the pre-set maximum iteration number. The overall procedure is summarized in Algorithm 4.
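Algorithm 4 is not reproduced here; the following is a minimal, self-contained sketch of this ADMM-regularized training loop on a toy quadratic loss with a low matrix-rank constraint (the function names, toy loss, and hyperparameters are hypothetical illustrations, not the disclosed Algorithm 4; in the disclosed framework the projection would truncate TT-ranks rather than a single matrix rank):

```python
import numpy as np

def truncate_rank(A, r):
    # Euclidean projection onto {A : rank(A) <= r} via truncated SVD
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def admm_low_rank_train(W0, grad_loss, r_star, rho=1.0, lr=0.05, steps=300, inner=20):
    W, Z, U = W0.copy(), W0.copy(), np.zeros_like(W0)
    for _ in range(steps):
        # W-subproblem (27): gradient descent on loss + (rho/2)*||W - Z + U||^2
        for _ in range(inner):
            W -= lr * (grad_loss(W) + rho * (W - Z + U))
        # Z-subproblem (28): projection onto the low-rank set
        Z = truncate_rank(W + U, r_star)
        # dual update (29)
        U += W - Z
    return W, Z
```

Over the iterations W gradually acquires the low-rank structure, so the final explicit decomposition introduces little approximation error, mirroring the behavior described above.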
In some embodiments, upon dual optimization problem-regularized training, the trained uncompressed DNN model may be decomposed into the tensor decomposed format. Here the decomposition may be performed with the target TT-ranks r* for the tensor cores. Because the optimization procedure has already imposed the desired low TT-rank structure on the uncompressed model, such direct decomposition, unlike its counterpart in the existing tensor decomposed format DNN training, may not bring significant approximation error (more detail may be provided below in Section 7.1). In some embodiments, the decomposed tensor decomposed format model may then be fine-tuned using standard stochastic gradient descent or other suitable optimization method. In some embodiments, in the fine-tuning phase the loss function ℓ({𝒢i}) may be formulated without the additional regularization term that would be introduced by the dual optimization problem. In some embodiments, the fine-tuning phase may be very fast relative to the initial training, e.g., requiring only a few iterations. In some embodiments, the speed of fine-tuning may be because the decomposed TT model at the starting point of the fine-tuning phase may benefit from decreased accuracy loss relative to the original uncompressed model.
To demonstrate the effectiveness and generality of the compression framework, examples of different DNN models in different computer vision tasks may be evaluated. For image classification tasks, multiple CNN models may be evaluated on the MNIST, CIFAR-10, CIFAR-100 and ImageNet datasets. For video classification tasks, different LSTM models may be evaluated on the UCF11 and HMDB51 datasets. To simplify the rank selection procedure, some, all, or more than half of the ranks in the same layer may be set to be equal.
In some embodiments, as shown in (26), ρ is the additional hyperparameter introduced in the dual optimization problem-regularized training phase. To study the effect of ρ on the performance as well as to facilitate hyperparameter selection, the convergence and sensitivity of the ADMM-regularized training may be studied for the ResNet-32 model with different ρ settings on the CIFAR-10 dataset.
In some embodiments, as shown in
In some embodiments, considering that similar convergence behavior does not necessarily mean that different ρ would bring similar accuracy, the performance sensitivity of dual optimization problem-regularized training with respect to ρ may be analyzed. In some embodiments, after dual optimization problem-regularized training, 𝒲, in the uncompressed format, may exhibit strong low TT-rank characteristics and meanwhile enjoy high accuracy. Once 𝒲 meets such two criteria concurrently, the TT-cores {𝒢i}, whose initialization is decomposed from 𝒲, may have high accuracy even before fine-tuning.
In some embodiments, to examine the required low TT-rank behavior of 𝒲, the quantity ∥𝒲−𝒵∥_F², which measures the similarity between 𝒲 and 𝒵, may be observed in the dual optimization problem-regularized training (see
Table 10 shows the experimental results of LeNet-5 model on MNIST dataset. Embodiments of the present dual optimization problem-based tensor decomposed format model may be compared with the uncompressed model as well as a typical TT/TR-format. It is seen that embodiments of the present dual optimization problem-based compression may achieve the highest compression ratio and the best accuracy.
Table 11 compares embodiments of the present dual optimization problem-based tensor decomposed format ResNet-20 and ResNet-32 models with the typical TT/TR-format on the CIFAR-10 dataset. For ResNet-20, it is seen that standard training on TT/TR-format models causes greater accuracy loss. Even for the typical designs using advanced techniques, such as heuristic rank selection (PSTRN-M/S) and reinforcement learning (TR-RL), the performance degradation is still larger than with the dual optimization problem-regularization framework of the present disclosure, especially at a high compression ratio of 6.8 times. On the other hand, with the same high compression ratio, embodiments of the present dual optimization problem-based tensor decomposed format model have only a 0.22% accuracy drop, which is 2.53% higher accuracy than the typical PSTRN-M. Furthermore, with a moderate compression ratio of 4.5 times, embodiments of the present method may even outperform the uncompressed model with a 0.22% accuracy increase.
For ResNet-32, again, standard training on compressed models using TT or TR decomposition causes larger performance degradation than other techniques. The typical PSTRN-S/M indeed brings performance improvement, but the test accuracy is still not satisfactory. Instead, embodiments of the present highly compressed (5.8 times) dual optimization problem-regularized tensor decomposed format model have only a 0.53% accuracy loss, which is 1.36% higher accuracy than PSTRN-M with the same compression ratio. More importantly, when the compression ratio is relaxed to 4.8 times, embodiments of the present dual optimization problem-based tensor decomposed format model achieve 92.87%, which is even 0.38% higher than the uncompressed model.
Table 12 shows the experimental results on the CIFAR-100 dataset. Again, embodiments of the present dual optimization problem-based tensor decomposed format model outperform the typical techniques. For ResNet-20, with an even higher compression ratio (5.6 times for the dual optimization problem-based tensor decomposed format model versus 4.7 times for PSTRN-M), embodiments of the present model achieve a 1.3% accuracy increase. With a 2.3 times compression ratio, embodiments of the present model achieve 67.36% top-1 accuracy, which is even 1.96% higher than the uncompressed model. For ResNet-32, with the same 5.2 times compression ratio, embodiments of the present approach bring a 0.4% accuracy increase over the typical PSTRN-M. With the same 2.4 times compression ratio, embodiments of the present approach have 2.26% higher accuracy than PSTRN-S. Embodiments of the present model even outperform the uncompressed model with a 2.21% accuracy increase.
Table 13 shows the results of compressing ResNet-18 on the ImageNet dataset. Because no prior TT/TR compression works report results on this dataset, standard TT- and TR-based training may be used for comparison. Embodiments of the present approach may also be compared with other compression methods, including pruning and matrix SVD. Since these works report FLOPs reduction instead of compression ratio, the FLOPs reduction brought by tensor decomposition according to embodiments of the present dual optimization problem-based tensor decomposed format model may be compared. It is shown that with a similar FLOPs reduction ratio (4.62 times), embodiments of the present dual optimization problem-based tensor decomposed format model have 1.83% and 1.18% higher accuracy than standard TT and TR, respectively. Compared with other compression approaches with non-negligible accuracy loss, embodiments of the present dual optimization problem-based tensor decomposed format model achieve better accuracy with more FLOPs reduction. In particular, with 2.47 times FLOPs reduction, embodiments of the present model have the same accuracy as the uncompressed baseline model.
Table 14 compares embodiments of the present dual optimization problem-based tensor decomposed format LSTM with an uncompressed LSTM model and the typical TT-LSTM and TR-LSTM. Note that the performance of PSTRN-M/S on the UCF11 dataset is not reported.
From Table 14, it is seen that both TT-LSTM and TR-LSTM provide performance improvement and compression ratio improvement relative to the uncompressed LSTM, due to the feature extraction capabilities of TT/TR-format LSTM models on ultra-high-dimensional inputs. In comparison, embodiments of the present dual optimization problem-based tensor decomposed format LSTM achieve even greater performance. With fewer parameters, embodiments of the present dual optimization problem-based tensor decomposed LSTM result in 2.1% higher top-1 accuracy than the typical TR-LSTM.
An Inception-V3 model may be used as the front-end pre-trained CNN, and a back-end uncompressed LSTM model may be compared to a dual optimization problem-based tensor decomposed version.
Table 15 summarizes the experimental results. It is seen that, compared with the typical TT/TR-format designs, embodiments of the present dual optimization problem-based tensor decomposed format model result in greater performance. With the highest compression ratio (84.0 times), embodiments of the present model achieve 64.09% top-1 accuracy. Compared with the typical TR-LSTM, embodiments of the present model bring a 3.35 times higher compression ratio with an additional 0.29% accuracy increase.
As described above, the dual optimization problem-based tensor decomposed format models consistently outperform the existing TT/TR-format models with higher accuracy and higher compression ratios over various datasets, thereby comprehensively demonstrating the significant benefits brought by embodiments of the present framework.
In some embodiments, embodiments of the present disclosure may use tensor train-format DNN models. However, because dual optimization approaches such as ADMM are general optimization techniques, embodiments of the present framework may be easily applied to model compression using other tensor decomposition approaches, such as Tensor Ring (TR), Block-term (BT), Tucker, Hierarchical Tucker, singular value decomposition (SVD), higher-order singular value decomposition (HOSVD), etc. To adapt to other tensor decomposition scenarios, the main modification to embodiments of the present framework is to modify the Euclidean projection (Algorithm 3) to make the truncating methods compatible with the corresponding tensor decomposition methods.
In some embodiments, as described above, training of a tensor decomposed format DNN may be further improved by leveraging the tensor rank as a hyperparameter during training. Selecting the optimal rank is very challenging. Different from the rank selection in matrix decomposition, where the rank is a scalar, the rank selection in tensor decomposition needs to identify proper vector-format tensor ranks. Exact determination of tensor ranks for the linear tensor problem is theoretically NP-hard. Even worse, since a modern DNN model typically consists of tens or even hundreds of layers, the overall search space for determining the ranks of the decomposed DNN model is extremely large.
In some embodiments, to systematically overcome the rank selection challenge and obtain high-performance compressed DNN models within the desired memory budget (e.g., model size or computational cost), an optimization-based framework may be configured to automatically select the rank configurations for the tensor decomposed format DNN models. Embodiments of the present framework may integrate the rank selection procedure into the training procedure, and let the models automatically learn the suitable rank setting from the data. For example, as described above, the original complicated NP-hard rank selection problem may be relaxed to a tensor nuclear norm-regularized optimization problem with the constraint of model size for DNN training. After such reformulation, during the training procedure a suitable dual optimization technique may be used to solve this optimization problem via solving two sub-problems in an iterative way. Upon the end of this iterative solving, the suitable tensor ranks are automatically learned, thereby yielding highly-compressed, highly-accurate tensor decomposed format DNN models with the target compression ratio. Herein, boldface calligraphic letters denote tensors, e.g., 𝒜; a 2-dimensional tensor is a matrix, and a 1-dimensional tensor is a vector, which are represented by boldface capital letters and boldface lower-case letters, respectively, e.g., A and a. Also, non-boldface letters with indices, 𝒜(i1, . . . , id), A(i, j) and a(i), denote the entries of the d-dimensional tensor 𝒜, the matrix A and the vector a, respectively.
Let A=UΣV^T be the singular value decomposition (SVD) of matrix A. The shrinkage operation is defined as
S_τ(A)=UΣ_τV^T, Equation (36)
where Σ_τ=diag(max(σi−τ, 0)) and σi denotes the i-th largest singular value.
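The shrinkage operation of Equation (36) may be sketched as follows (a minimal NumPy illustration; the function name is illustrative):

```python
import numpy as np

def shrinkage(A, tau):
    # singular-value soft-thresholding: S_tau(A) = U * diag(max(sigma - tau, 0)) * V^T
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Singular values at or below τ are zeroed, so the result has reduced rank; this is the operation that later truncates the Tucker ranks.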
Given a tensor 𝒜 ∈ ℝ^{n1× . . . ×nd}, the mode-k matricization A(k) ∈ ℝ^{nk×(n1 . . . nd/nk)} maps the tensor entry (i1, . . . , id) to the matrix entry (ik, j), where
j=1+Σ_{p=1, p≠k}^{d}(ip−1)Jp with Jp=Π_{q=1, q≠k}^{p−1}nq, Equation (37)
Based on the matricization, the following two operations may be defined as:
unfoldk(𝒜)=A(k), foldk(A(k))=𝒜. Equation (38)
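The unfold/fold pair of Equation (38) may be sketched as follows (a minimal NumPy illustration; note that C-order reshaping may produce a column ordering different from Equation (37), while remaining a valid, invertible matricization with the same matrix rank):

```python
import numpy as np

def unfold(A, k):
    # mode-k matricization: A_(k) has shape (n_k, prod of the other dims)
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

def fold(M, k, shape):
    # inverse of unfold for a tensor of the given shape
    lead = [shape[k]] + [s for i, s in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape(lead), 0, k)
```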
The Tucker rank of a d-dimensional tensor 𝒜 is denoted as n-rank(𝒜), which is the tuple of the ranks of the mode-k matricizations:
n-rank(𝒜)=[R1, . . . , Rd], Equation (39)
where Rk=rank(A(k)), and k=1, . . . , d.
The Tucker decomposition of a tensor 𝒜 ∈ ℝ^{n1× . . . ×nd} is defined as:
𝒜(i1, . . . , id)=Σ_{r1=1}^{R1} . . . Σ_{rd=1}^{Rd} 𝒞(r1, . . . , rd)U1(i1, r1) . . . Ud(id, rd), Equation (40)
where 𝒞 ∈ ℝ^{R1× . . . ×Rd} is the core tensor and Uk ∈ ℝ^{nk×Rk}, k=1, . . . , d, are the factor matrices.
Considering a convolutional layer with kernel tensor 𝒦 ∈ ℝ^{S×C×K×K}, the input tensor 𝒳 ∈ ℝ^{W×H×C} is transformed into the output tensor 𝒴 ∈ ℝ^{W′×H′×S} by convolving with the kernel:
𝒴(w′, h′, s)=Σ_{i=1}^{K} Σ_{j=1}^{K} Σ_{c=1}^{C} 𝒦(s, c, i, j)𝒳(w, h, c) with w=w′+i−1 and h=h′+j−1, Equation (41)
where W′=W−K+1 and H′=H−K+1.
In some embodiments, in a Tucker-format convolutional layer, in order to retain the spatial information, the orders associated with the kernel size are not decomposed. This decomposing approach is called Tucker-2 decomposition, by which the kernel tensor may be represented as:
𝒦(s, c, i, j)=Σ_{r1=1}^{R1} Σ_{r2=1}^{R2} 𝒞(r1, r2, i, j)U1(s, r1)U2(c, r2), Equation (42)
where 𝒞 ∈ ℝ^{R1×R2×K×K} is the core tensor and U1 ∈ ℝ^{S×R1}, U2 ∈ ℝ^{C×R2} are the factor matrices.
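The Tucker-2 kernel representation described above may be sketched as follows (a minimal NumPy illustration; the function name and the example shapes are illustrative):

```python
import numpy as np

def tucker2_kernel(core, U1, U2):
    # core: (R1, R2, K, K), U1: (S, R1), U2: (C, R2);
    # reconstructs K(s, c, i, j) = sum_{r1, r2} core(r1, r2, i, j) * U1(s, r1) * U2(c, r2)
    return np.einsum('abij,sa,cb->scij', core, U1, U2)
```

When R1 and R2 are small, the factored form (core plus two factor matrices) stores far fewer parameters than the full S×C×K×K kernel it reconstructs.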
With the Tucker-2 format, the original full convolution layer is decomposed into a sequence of small convolutions, thus the output tensor is obtained by:
Without loss of generality, an L-layer convolutional neural network (CNN) may be compressed using Tucker decomposition. While Tucker decomposition is used here, it is just an illustration of a particular form of tensor decomposition with the tensor cores used as a hyperparameter to be trained during the training process. Any suitable tensor decomposition may be used according to similar processes described below. To compress other types of DNNs, e.g., a recurrent neural network (RNN), or to use other tensor decompositions, e.g., tensor train or tensor ring, a similar solution as described above may also be derived under the optimization framework. To simplify the notation, 𝒲 may be used without a subscript to represent the weights of all layers ({𝒲i}i=1..L, where each 𝒲i is a 4-dimensional kernel tensor), and the bias parameters are omitted. The loss function is denoted by ℓ(𝒲) (e.g., cross-entropy loss for a classification task).
With such notation, embodiments of the present optimization framework minimize the loss function and the Tucker rank of each layer simultaneously, e.g., min ℓ(𝒲) and min n-rank(𝒲), with memory budget constraints to satisfy the target compression ratio. In some embodiments, decreasing the tensor ranks may lower the capacity of the CNN model, thus making it more difficult for the loss function to reach the optimal point. Therefore, the problem may be solved by keeping the loss within an acceptable range while the tensor ranks also satisfy the compression demand. Since exact determination of tensor ranks for the linear tensor problem is NP-hard, it is impractical to obtain the optimal tensor ranks via direct searching approaches.
In some embodiments, to overcome this challenge, based on the fact that the nuclear norm or trace norm ∥⋅∥* is the tightest convex surrogate for rank(⋅), the second objective may be relaxed and the original equation may be approximated in a reformulation as follows:
where ∥𝒲∥* is the tensor nuclear norm of 𝒲, M(⋅) returns the overall model size with the current ranks, and M*, as the “memory budget”, is the desired model size after compression. The tensor nuclear norm of a d-dimensional tensor is the convex combination of the nuclear norms of all unfolded matrices along each mode:
∥𝒜∥*=Σ_{i=1}^{d} αi∥A(i)∥*, Equation (47)
where {αi}i=1..d are constants that satisfy αi≥0 and Σ_{i=1}^{d}αi=1. Recall that for the Tucker-2 convolution, only the first mode and the second mode are decomposed to retain the spatial information. Therefore, the tensor nuclear norm of the kernel tensor is the sum of the nuclear norms of the mode-1 and mode-2 matricizations:
∥𝒲∥*=α1∥W(1)∥*+α2∥W(2)∥*. Equation (48)
In some embodiments, the dimensions of the input channel and the output channel may have the same influence on the kernel tensor; thus α1 and α2 may be set to be equal such that they may be eliminated in the above equation. Hence, the formulated optimization equation (46) may be rewritten as:
Minimizing an objective that contains two entries-shared non-smooth nuclear norm terms (e.g., W(1) and W(2)), which is the problem format that Equation (43) exactly exhibits, is challenging. Therefore, in some embodiments, a dual optimization approach may provide an efficient optimization technique for problems containing non-smooth terms, e.g., via equation (49). Thus, in some embodiments, the objective may be transformed into a format that fits the dual optimization framework. For example, separate variables may be attributed to the mode-1 and mode-2 matricizations of 𝒲, e.g., auxiliary variables 𝒵1 and 𝒵2 whose shapes are identical to 𝒲 may be introduced such that 𝒵1=𝒲 and 𝒵2=𝒲 serve as the constraints.
Hence, the corresponding augmented Lagrangian with dual multipliers of (50) may be defined as:
where M1 and M2 are the dual multipliers.
In some embodiments, the dual optimization problem technique may be directly applied with the augmented Lagrangian function and solve the equation iteratively.
8.2.1 Updating Step for the Z1, Z2-Variables
In some embodiments, considering that the terms in Equation (51) contain either 𝒵1 or 𝒵2 and are independent non-negative functions, each may be minimized with respect to 𝒵1 and 𝒵2 separately. Taking the sub-equation of 𝒵1 as a detailed example, the matrix form of 𝒵1 along mode-1 is updated by:
where k is the iteration step. This sub-equation has the closed-form solution given by the shrinkage operation with threshold τ1:
𝒵1^{k+1}=fold1(S_{τ1}(W(1)^{k}+M1(1)^{k})).
For updating 𝒵2, since this procedure is independent from the 𝒵1-update, 𝒵2 may be updated in parallel in a similar way as follows:
𝒵2^{k+1}=fold2(S_{τ2}(W(2)^{k}+M2(2)^{k})).
In such updating steps, the automatic determination of the Tucker ranks R1 and R2 may be performed via the shrinkage operation S_τ(⋅) in the above equations with τ1, τ2 for the current iteration. In some embodiments, the singular value decompositions U1diag(σ1)V1^T=W(1)^{k}+M1(1)^{k} and U2diag(σ2)V2^T=W(2)^{k}+M2(2)^{k} may be computed first.
In some embodiments, τ1 and τ2 may then be calculated by solving the following 0-1 knapsack problem:
where s1 and s2 are binary vectors. In some embodiments, the lengths of s1 and s2 may be identical to those of σ1 and σ2, respectively, and the numbers of non-zero elements in s1 and s2 may be equal to R1 and R2, respectively. Thus, once τ1 and τ2 are obtained, R1 and R2 are also correspondingly determined. In such a knapsack problem, σ1 and σ2 may be the “profit” and M(⋅) is the “weight”. In some embodiments, to determine the optimal ranks, the selected “profit” may be maximized while the overall “weight” may be prevented from exceeding the memory budget. Although perfectly solving such a 0-1 knapsack problem is very challenging, an efficient greedy algorithm may obtain a sub-optimal solution with very low computational complexity. Thus, in some embodiments, the largest value in σ1 and σ2 may be selected each time. When the “weight” reaches the memory budget M*, the algorithm terminates. Hence, τ1 and τ2 may be set as the currently selected values in σ1 and σ2. Correspondingly, R1 and R2 may be naturally obtained by:
R1=max{i|σ1i>τ1, i=1, 2, . . . }, Equation (56)
R2=max{i|σ2i>τ2, i=1, 2, . . . }. Equation (57)
Upon finding the solution, the singular values that are smaller than τ1 and τ2 are truncated.
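The greedy selection described above may be sketched as follows (a minimal illustration; the per-rank constant memory “weights” and the function name are simplifying assumptions, not the disclosed Algorithm 5):

```python
import numpy as np

def greedy_rank_select(sigma1, sigma2, weight1, weight2, budget):
    # Greedy 0-1 knapsack sketch: repeatedly keep the largest remaining
    # singular value (the "profit") while the cumulative per-rank memory
    # cost (the "weight") stays within the budget; returns (R1, R2).
    r1, r2, cost = 0, 0, 0.0
    while r1 < len(sigma1) or r2 < len(sigma2):
        s1 = sigma1[r1] if r1 < len(sigma1) else -np.inf
        s2 = sigma2[r2] if r2 < len(sigma2) else -np.inf
        if s1 >= s2:
            if cost + weight1 > budget:
                break
            cost, r1 = cost + weight1, r1 + 1
        else:
            if cost + weight2 > budget:
                break
            cost, r2 = cost + weight2, r2 + 1
    return r1, r2
```

The thresholds τ1, τ2 then correspond to the last singular values kept in each mode, and everything smaller is truncated by the shrinkage operation.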
In some embodiments, after determining the tensor ranks and updating the 𝒵1, 𝒵2-variables at the current iteration, 𝒲 may be updated via minimizing the augmented Lagrangian over 𝒲. In some embodiments, fixing all variables except 𝒲, the minimization of (51) reformulates to minimizing a quadratic function:
which is differentiable. Hence, the gradient of this quadratic objective over 𝒲 is given by:
Then, in some embodiments, 𝒲 may be updated by an optimization algorithm, such as standard stochastic gradient descent, as:
where η is the learning rate for DNN training.
In some embodiments, once the 𝒲-variable is updated, another round of 𝒵1, 𝒵2-variable updates may be performed based on the latest 𝒲-variable. Such iterative updates may continue until the end of the training procedure. After that, the desired tensor decomposed weight tensors are well trained and obtained. Meanwhile, the finally selected tensor ranks are automatically determined as the R1 and R2 calculated by Equations (56) and (57) at the last iteration. Thus, in some embodiments, the overall procedure of the dual optimization problem-regularized training with automatic rank selection may be summarized in Algorithm 5.
In some embodiments, referring to
In some embodiments, the exemplary network 2305 may provide network access, data transport and/or other services to any computing device coupled to it. In some embodiments, the exemplary network 2305 may include and implement at least one specialized network architecture that may be based at least in part on one or more standards set by, for example, without limitation, Global System for Mobile communication (GSM) Association, the Internet Engineering Task Force (IETF), and the Worldwide Interoperability for Microwave Access (WiMAX) forum. In some embodiments, the exemplary network 2305 may implement one or more of a GSM architecture, a General Packet Radio Service (GPRS) architecture, a Universal Mobile Telecommunications System (UMTS) architecture, and an evolution of UMTS referred to as Long Term Evolution (LTE). In some embodiments, the exemplary network 2305 may include and implement, as an alternative or in conjunction with one or more of the above, a WiMAX architecture defined by the WiMAX forum. In some embodiments and, optionally, in combination of any embodiment described above or below, the exemplary network 2305 may also include, for instance, at least one of a local area network (LAN), a wide area network (WAN), the Internet, a virtual LAN (VLAN), an enterprise LAN, a layer 3 virtual private network (VPN), an enterprise IP network, or any combination thereof. In some embodiments and, optionally, in combination of any embodiment described above or below, at least one computer network communication over the exemplary network 2305 may be transmitted based at least in part on one or more communication modes such as but not limited to: NFC, RFID, Narrow Band Internet of Things (NBIOT), ZigBee, 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, OFDM, OFDMA, LTE, satellite and any combination thereof.
In some embodiments, the exemplary network 2305 may also include mass storage, such as network attached storage (NAS), a storage area network (SAN), a content delivery network (CDN) or other forms of computer or machine readable media.
In some embodiments, the exemplary server 2306 or the exemplary server 2307 may be a web server (or a series of servers) running a network operating system, examples of which may include but are not limited to Apache on Linux or Microsoft IIS (Internet Information Services). In some embodiments, the exemplary server 2306 or the exemplary server 2307 may be used for and/or provide cloud and/or network computing. Although not shown in
In some embodiments, one or more of the exemplary servers 2306 and 2307 may be specifically programmed to perform, in non-limiting example, as authentication servers, search servers, email servers, social networking services servers, Short Message Service (SMS) servers, Instant Messaging (IM) servers, Multimedia Messaging Service (MMS) servers, exchange servers, photo-sharing services servers, advertisement providing servers, financial/banking-related services servers, travel services servers, or any similarly suitable service-base servers for users of the member computing devices 2301-2304.
In some embodiments and, optionally, in combination of any embodiment described above or below, for example, one or more exemplary computing member devices 2302-2304, the exemplary server 2306, and/or the exemplary server 2307 may include a specifically programmed software module that may be configured to send, process, and receive information using a scripting language, a remote procedure call, an email, a tweet, Short Message Service (SMS), Multimedia Message Service (MMS), instant messaging (IM), an application programming interface, Simple Object Access Protocol (SOAP) methods, Common Object Request Broker Architecture (CORBA), HTTP (Hypertext Transfer Protocol), REST (Representational State Transfer), MLLP (Minimum Lower Layer Protocol), or any combination thereof.
In some embodiments, member computing devices 2402a through 2402n may also comprise a number of external or internal devices such as a mouse, a CD-ROM, DVD, a physical or virtual keyboard, a display, or other input or output devices. In some embodiments, examples of member computing devices 2402a through 2402n (e.g., clients) may be any type of processor-based platforms that are connected to a network 2406 such as, without limitation, personal computers, digital assistants, personal digital assistants, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices. In some embodiments, member computing devices 2402a through 2402n may be specifically programmed with one or more application programs in accordance with one or more principles/methodologies detailed herein. In some embodiments, member computing devices 2402a through 2402n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft™ Windows™ and/or Linux. In some embodiments, member computing devices 2402a through 2402n shown may include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer™, Apple Computer, Inc.'s Safari™, Mozilla Firefox, and/or Opera. In some embodiments, through the member computing client devices 2402a through 2402n, user 2412a, user 2412b through user 2412n, may communicate over the exemplary network 2406 with each other and/or with other systems and/or devices coupled to the network 2406. As shown in
In some embodiments, at least one database of exemplary databases 2407 and 2415 may be any type of database, including a database managed by a database management system (DBMS). In some embodiments, an exemplary DBMS-managed database may be specifically programmed as an engine that controls organization, storage, management, and/or retrieval of data in the respective database. In some embodiments, the exemplary DBMS-managed database may be specifically programmed to provide the ability to query, backup and replicate, enforce rules, provide security, compute, perform change and access logging, and/or automate optimization. In some embodiments, the exemplary DBMS-managed database may be chosen from Oracle database, IBM DB2, Adaptive Server Enterprise, FileMaker, Microsoft Access, Microsoft SQL Server, MySQL, PostgreSQL, and a NoSQL implementation. In some embodiments, the exemplary DBMS-managed database may be specifically programmed to define each respective schema of each database in the exemplary DBMS, according to a particular database model of the present disclosure which may include a hierarchical model, network model, relational model, object model, or some other suitable organization that may result in one or more applicable data structures that may include fields, records, files, and/or objects. In some embodiments, the exemplary DBMS-managed database may be specifically programmed to include metadata about the data that is stored.
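By way of a minimal, non-limiting sketch of the schema definition, storage, and query abilities described above, the following uses Python's built-in sqlite3 module as an illustrative DBMS engine; the table name and fields are hypothetical and not part of this disclosure:

```python
import sqlite3

# In-memory database; the DBMS engine controls organization,
# storage, and retrieval of data.
conn = sqlite3.connect(":memory:")

# Define a schema under a relational model (fields and records).
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")

# Store a record, then retrieve it through the query interface.
conn.execute("INSERT INTO records (payload) VALUES (?)", ("example",))
rows = conn.execute("SELECT payload FROM records").fetchall()
print(rows)  # [('example',)]
conn.close()
```

An actual deployment could instead use any of the DBMS products enumerated above; the query/schema pattern is the same.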
In some embodiments, the exemplary inventive computer-based systems/platforms, the exemplary inventive computer-based devices, and/or the exemplary inventive computer-based components of the present disclosure may be specifically configured to operate in a cloud computing/architecture 2425 such as, but not limited to: infrastructure as a service (IaaS) 2610, platform as a service (PaaS) 2608, and/or software as a service (SaaS) 2606 using a web browser, mobile app, thin client, terminal emulator or other endpoint 2604.
It is understood that at least one aspect/functionality of various embodiments described herein may be performed in real-time and/or dynamically. As used herein, the term “real-time” is directed to an event/action that may occur instantaneously or almost instantaneously in time when another event/action has occurred. For example, the “real-time processing,” “real-time computation,” and “real-time execution” all pertain to the performance of a computation during the actual time that the related physical process (e.g., a user interacting with an application on a mobile device) occurs, in order that results of the computation may be used in guiding the physical process.
As used herein, the term “dynamically” and term “automatically,” and their logical and/or linguistic relatives and/or derivatives, mean that certain events and/or actions may be triggered and/or occur without any human intervention. In some embodiments, events and/or actions in accordance with the present disclosure may be in real-time and/or based on a predetermined periodicity of at least one of: nanosecond, several nanoseconds, millisecond, several milliseconds, second, several seconds, minute, several minutes, hourly, several hours, daily, several days, weekly, monthly, etc.
As used herein, the term “runtime” corresponds to any behavior that is dynamically determined during an execution of a software application or at least a portion of a software application.
In some embodiments, exemplary inventive, specially programmed computing systems and platforms with associated devices are configured to operate in the distributed network environment, communicating with one another over one or more suitable data communication networks (e.g., the Internet, satellite, etc.) and utilizing one or more suitable data communication protocols/modes such as, without limitation, IPX/SPX, X.25, AX.25, AppleTalk™, TCP/IP (e.g., HTTP), near-field wireless communication (NFC), RFID, Narrow Band Internet of Things (NBIOT), 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, satellite, ZigBee, and other suitable communication modes.
In some embodiments, the NFC may represent a short-range wireless communications technology in which NFC-enabled devices are “swiped,” “bumped,” “tapped,” or otherwise moved in close proximity to communicate. In some embodiments, the NFC could include a set of short-range wireless technologies, typically requiring a distance of 10 cm or less. In some embodiments, the NFC may operate at 13.56 MHz on the ISO/IEC 18000-3 air interface and at rates ranging from 106 kbit/s to 424 kbit/s. In some embodiments, the NFC may involve an initiator and a target; the initiator actively generates an RF field that may power a passive target. In some embodiments, this may enable NFC targets to take very simple form factors such as tags, stickers, key fobs, or cards that do not require batteries. In some embodiments, the NFC's peer-to-peer communication may be conducted when a plurality of NFC-enabled devices (e.g., smartphones) are within close proximity of each other.
The material disclosed herein may be implemented in software or firmware or a combination of them or as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.
As used herein, the terms “computer engine” and “engine” identify at least one software component and/or a combination of at least one software component and at least one hardware component which are designed/programmed/configured to manage/control other software and/or hardware components (such as the libraries, software development kits (SDKs), objects, etc.).
Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGA), logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. In some embodiments, the one or more processors may be implemented as Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors; x86 instruction set compatible processors, multi-core, or any other microprocessor or central processing unit (CPU). In various implementations, the one or more processors may be dual-core processor(s), dual-core mobile processor(s), and so forth.
Computer-related systems, computer systems, and systems, as used herein, include any combination of hardware and software. Examples of software may include software components, programs, applications, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computer code, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor. Of note, various embodiments described herein may, of course, be implemented using any appropriate hardware and/or computing software languages (e.g., C++, Objective-C, Swift, Java, JavaScript, Python, Perl, QT, etc.).
In some embodiments, one or more of illustrative computer-based systems or platforms of the present disclosure may include or be incorporated, partially or entirely into at least one personal computer (PC), laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television), mobile internet device (MID), messaging device, data communication device, and so forth.
As used herein, the term “server” may be understood to refer to a service point which provides processing, database, and communication facilities. By way of example, and not limitation, the term “server” may refer to a single, physical processor with associated communications and data storage and database facilities, or it may refer to a networked or clustered complex of processors and associated network and storage devices, as well as operating software and one or more database systems and application software that support the services provided by the server. Cloud servers are examples.
In some embodiments, as detailed herein, one or more of the computer-based systems of the present disclosure may obtain, manipulate, transfer, store, transform, generate, and/or output any digital object and/or data unit (e.g., from inside and/or outside of a particular application) that may be in any suitable form such as, without limitation, a file, a contact, a task, an email, a message, a map, an entire application (e.g., a calculator), data points, and other suitable data. In some embodiments, as detailed herein, one or more of the computer-based systems of the present disclosure may be implemented across one or more of various computer platforms such as, but not limited to: (1) FreeBSD, NetBSD, OpenBSD; (2) Linux; (3) Microsoft Windows™; (4) OpenVMS™; (5) OS X (MacOS™); (6) UNIX™; (7) Android; (8) iOS™; (9) Embedded Linux; (10) Tizen™; (11) WebOS™; (12) Adobe AIR™; (13) Binary Runtime Environment for Wireless (BREW™); (14) Cocoa™ (API); (15) Cocoa™ Touch; (16) Java™ Platforms; (17) JavaFX™; (18) QNX™; (19) Mono; (20) Google Blink; (21) Apple WebKit; (22) Mozilla Gecko™; (23) Mozilla XUL; (24) .NET Framework; (25) Silverlight™; (26) Open Web Platform; (27) Oracle Database; (28) Qt™; (29) SAP NetWeaver™; (30) Smartface™; (31) Vexi™; (32) Kubernetes™ and (33) Windows Runtime (WinRT™) or other suitable computer platforms or any combination thereof. In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to utilize hardwired circuitry that may be used in place of or in combination with software instructions to implement features consistent with principles of the disclosure. Thus, implementations consistent with principles of the disclosure are not limited to any specific combination of hardware circuitry and software. 
For example, various embodiments may be embodied in many different ways as a software component such as, without limitation, a stand-alone software package, a combination of software packages, or it may be a software package incorporated as a “tool” in a larger software product.
For example, exemplary software specifically programmed in accordance with one or more principles of the present disclosure may be downloadable from a network, for example, a website, as a stand-alone product or as an add-in package for installation in an existing software application. For example, exemplary software specifically programmed in accordance with one or more principles of the present disclosure may also be available as a client-server software application, or as a web-enabled software application. For example, exemplary software specifically programmed in accordance with one or more principles of the present disclosure may also be embodied as a software package installed on a hardware device.
In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to handle numerous concurrent users that may be, but are not limited to, at least 100 (e.g., but not limited to, 100-999), at least 1,000 (e.g., but not limited to, 1,000-9,999), at least 10,000 (e.g., but not limited to, 10,000-99,999), at least 100,000 (e.g., but not limited to, 100,000-999,999), at least 1,000,000 (e.g., but not limited to, 1,000,000-9,999,999), at least 10,000,000 (e.g., but not limited to, 10,000,000-99,999,999), at least 100,000,000 (e.g., but not limited to, 100,000,000-999,999,999), at least 1,000,000,000 (e.g., but not limited to, 1,000,000,000-999,999,999,999), and so on.
In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to output to distinct, specifically programmed graphical user interface implementations of the present disclosure (e.g., a desktop, a web app., etc.). In various implementations of the present disclosure, a final output may be displayed on a displaying screen which may be, without limitation, a screen of a computer, a screen of a mobile device, or the like. In various implementations, the display may be a holographic display. In various implementations, the display may be a transparent surface that may receive a visual projection. Such projections may convey various forms of information, images, or objects. For example, such projections may be a visual overlay for a mobile augmented reality (MAR) application.
In some embodiments, illustrative computer-based systems or platforms of the present disclosure may be configured to be utilized in various applications which may include, but are not limited to, gaming, mobile-device games, video chats, video conferences, live video streaming, video streaming and/or augmented reality applications, mobile-device messenger applications, and other similarly suitable computer-device applications.
As used herein, the term “mobile electronic device,” or the like, may refer to any portable electronic device that may or may not be enabled with location tracking functionality (e.g., MAC address, Internet Protocol (IP) address, or the like). For example, a mobile electronic device may include, but is not limited to, a mobile phone, Personal Digital Assistant (PDA), Blackberry™, Pager, Smartphone, or any other reasonable mobile electronic device.
As used herein, terms “proximity detection,” “locating,” “location data,” “location information,” and “location tracking” refer to any form of location tracking technology or locating method that may be used to provide a location of, for example, a particular computing device, system or platform of the present disclosure and any associated computing devices, based at least in part on one or more of the following techniques and devices, without limitation: accelerometer(s), gyroscope(s), Global Positioning Systems (GPS); GPS accessed using Bluetooth™; GPS accessed using any reasonable form of wireless and non-wireless communication; WiFi™ server location data; Bluetooth™ based location data; triangulation such as, but not limited to, network based triangulation, WiFi™ server information based triangulation, Bluetooth™ server information based triangulation; Cell Identification based triangulation, Enhanced Cell Identification based triangulation, Uplink-Time difference of arrival (U-TDOA) based triangulation, Time of arrival (TOA) based triangulation, Angle of arrival (AOA) based triangulation; techniques and systems using a geographic coordinate system such as, but not limited to, longitudinal and latitudinal based, geodesic height based, Cartesian coordinates based; Radio Frequency Identification such as, but not limited to, Long range RFID, Short range RFID; using any form of RFID tag such as, but not limited to active RFID tags, passive RFID tags, battery assisted passive RFID tags; or any other reasonable way to determine location. For ease, at times the above variations are not listed or are only partially listed; this is in no way meant to be a limitation.
As used herein, terms “cloud,” “Internet cloud,” “cloud computing,” “cloud architecture,” and similar terms correspond to at least one of the following: (1) a large number of computers connected through a real-time communication network (e.g., Internet); (2) providing the ability to run a program or application on many connected computers (e.g., physical machines, virtual machines (VMs)) at the same time; (3) network-based services, which appear to be provided by real server hardware, and are in fact served up by virtual hardware (e.g., virtual servers), simulated by software running on one or more real machines (e.g., allowing to be moved around and scaled up (or down) on the fly without affecting the end user).
In some embodiments, the illustrative computer-based systems or platforms of the present disclosure may be configured to securely store and/or transmit data by utilizing one or more encryption techniques (e.g., private/public key pairs, Triple Data Encryption Standard (3DES), block cipher algorithms (e.g., IDEA, RC2, RC5, CAST, and Skipjack), cryptographic hash algorithms (e.g., MD5, RIPEMD-160, SHA-1, SHA-2, Tiger (TTH), WHIRLPOOL), and random number generators (RNGs)).
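As one minimal, non-limiting sketch of the cryptographic hash algorithms enumerated above, Python's standard hashlib library computes SHA-2 family digests; the input bytes here are illustrative only:

```python
import hashlib

data = b"example payload"

# SHA-2 family digest (SHA-256): a fixed-length, one-way
# fingerprint of the data, usable for integrity checking.
digest = hashlib.sha256(data).hexdigest()
print(digest)  # 64 hexadecimal characters (256 bits)

# Any change to the input yields a completely different digest.
altered = hashlib.sha256(b"example payload!").hexdigest()
print(digest != altered)  # True
```

The other listed primitives (block ciphers, key pairs, RNGs) would follow the same pattern of delegating to a vetted cryptographic library rather than a bespoke implementation.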
As used herein, the term “user” shall have a meaning of at least one user. In some embodiments, the terms “user,” “subscriber,” “consumer,” or “customer” may be understood to refer to a user of an application or applications as described herein and/or a consumer of data supplied by a data provider. By way of example, and not limitation, the terms “user” or “subscriber” may refer to a person who receives data provided by the data or service provider over the Internet in a browser session or may refer to an automated software application which receives the data and stores or processes the data.
The aforementioned examples are, of course, illustrative and not restrictive.
Publications cited throughout this document are hereby incorporated by reference in their entirety. While one or more embodiments of the present disclosure have been described, it is understood that these embodiments are illustrative only, and not restrictive, and that many modifications may become apparent to those of ordinary skill in the art, including that various embodiments of the inventive methodologies, the illustrative systems and platforms, and the illustrative devices described herein may be utilized in any combination with each other. Further still, the various steps may be carried out in any desired order (and any desired steps may be added and/or any desired steps may be eliminated).
This application is a Continuation Application relating to and claiming the benefit of commonly-owned, co-pending PCT International Application No. PCT/US2022/030864, filed May 25, 2022, which claims priority to and the benefit of commonly-owned U.S. Provisional Application 63/193,692 filed on May 27, 2021, which is incorporated herein by reference in its entirety.
This invention was made with government support under grant number 1955909 awarded by the National Science Foundation. The government has certain rights in the invention.
Number | Date | Country
---|---|---
63193692 | May 2021 | US

 | Number | Date | Country
---|---|---|---
Parent | PCT/US22/30864 | May 2022 | US
Child | 18516213 | | US