Machine learning models are used for various purposes, including decision making, predicting trends, and generating language and images. Such models are generally trained on training data and can then perform such predictive and generative actions. During the training process, a model acquires the knowledge implicitly represented by the training data and may then be applied to make decisions based on the acquired knowledge. The complex language models that are becoming common today are comprehensive models, constructed, e.g., via an artificial neural network, and may include millions of nodes across many layers. Both training and maintaining such a comprehensive model can be expensive in terms of time, space, computing resources, capital, etc. Although a comprehensive model may be used to handle only a subset of the tasks that it is trained for, doing so may not be justified considering the cost.
The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well-known methods, procedures, components, and/or systems have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.
The present teaching is directed to a framework for compressing a previously trained original model for a particular application based on an application-dependent dataset. The original model may be previously constructed to accomplish comprehensive tasks (e.g., a large language model) and may be utilized in different applications. A trained model can be complex in order to capture comprehensive knowledge. For instance, an artificial neural network (ANN) model may comprise millions of nodes in multiple layers, with many connections having weights and biases. A particular application may utilize a comprehensive model to handle only a subset of the comprehensive tasks that the model is trained for; in such a case, the model is likely excessive. In addition, executing the complex model to carry out only a subset of tasks requires excessive computing resources, thereby wasting valuable resources. Furthermore, because of their computational complexity, such models tend to take a significant amount of time to operate, often more time than is available. In some situations, using a complex model may even overly complicate the problem solution. For example, a language model for, e.g., language understanding and generation, may be previously trained to capture comprehensive knowledge in this domain. Although such a trained comprehensive model may be utilized in a particular application for, e.g., generating an interactive voice response (IVR) with a limited vocabulary, it may be overkill, and running the overall model to generate an IVR response with a limited vocabulary may result in delayed voice prompts and responses to users.
The present teaching is directed to a framework to compress a previously trained model (e.g., a more comprehensive model with a larger set of nodes and layers, and the like) to reduce the complexity, the resources required for execution, the execution time and the storage space for the model, etc. This is shown in
With the trained original model 100, data samples of the application-dependent dataset are fed to the original model 100 as input, one at a time, so that the nodes/layers in the model operate to respond to the input data sample and generate corresponding output vectors. Such output vectors from nodes/layers may then be used to assess, with respect to each of the nodes/layers in the original model, whether any of the nodes/layers has a vector representation similar to that of others, in order to detect non-contributing nodes/layers as candidates for removal. That is, if node A and node B perform the same function, one of them may be redundant and may potentially be removed. Each of the identified removal candidates (either a node or a layer) may then be evaluated for its impact on the overall performance of the entire model (e.g., based on the loss) to ensure that removal of a node/layer does not impact the overall performance of the entire model beyond a preconfigured threshold. Such a threshold may be configured based on, e.g., specific application needs. In some embodiments, the threshold may be generated dynamically by the system based on some condition, such as the least drop in performance or the largest reduction in model size and complexity (if any). Those removal candidates that do not impact the overall performance of the entire model (according to some specified criterion) may then be removed from the original model 100 to generate the compressed model 120.
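As a rough illustration of this data-driven probing step, the sketch below runs application-dependent data samples through a trained model and records each layer's output vector. It assumes a PyTorch model and forward hooks purely for concreteness; the model, the sample format, and the helper name `collect_layer_outputs` are hypothetical, and the present teaching is not limited to any particular framework.

```python
import torch

def collect_layer_outputs(model, samples):
    """Run each application-dependent data sample through the trained original
    model and record the output vector produced by every named sub-module
    (node/layer) in response to that sample."""
    outputs = {}   # layer name -> list of per-sample output vectors
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Flatten the layer output into a single vector for this sample.
            outputs.setdefault(name, []).append(output.detach().flatten())
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for sample in samples:            # one data sample at a time
            model(sample.unsqueeze(0))    # assumes each sample lacks a batch dimension

    for h in hooks:
        h.remove()
    return outputs
```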
The output vectors from nodes/layers of the original model 100, produced in response to each data sample of the application-dependent dataset, are provided to the model compression pipeline 110 for the removal operation. In some embodiments, the removal candidate determination may be carried out based on aggregated vectors from nodes/layers. That is, the output vectors of a node/layer produced in response to different data samples may be aggregated via different means. For example, in some embodiments, such output vectors from the same node/layer may be averaged to generate an aggregated output vector. The aggregated output vector associated with each node/layer may be used for removal evaluation. As discussed herein, the removal consideration may be based on an assessment of whether the functional roles of two nodes/layers are equivalent, and the assessment may be based on the aggregated output vectors associated with such nodes/layers. For example, if two nodes/layers are functionally similar, their aggregated output vectors may reach a certain level of similarity; in that case, one of the two nodes/layers may be functionally non-contributing (redundant) and may be selected as a removal candidate. In some embodiments, the removal evaluation with respect to nodes and that with respect to layers may be performed separately.
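For example, the per-sample output vectors collected by the illustrative helper above may simply be averaged per node/layer to form the aggregated output vectors, as in this brief sketch.

```python
import torch

def aggregate_by_average(outputs):
    """Average the per-sample output vectors of each node/layer into a single
    aggregated output vector used for the removal evaluation."""
    return {name: torch.stack(vectors).mean(dim=0)
            for name, vectors in outputs.items()}
```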
In this illustrated embodiment, the model compression pipeline 110 comprises an output vector aggregator 300, a node level vector comparator 310, a layer level vector comparator 320, a loss-based removal candidate determiner 330, and a compressed model configurator 340.
In some embodiments, the aggregated output vectors for each node/layer may be obtained in an iterative manner. To support the iterative operation, a vector database 370 may be provided. For example, when the first data sample is input to the original model 100, each node/layer produces an output vector, which is stored in the vector database 370. When the second data sample is input to the original model 100, the output vector aggregator 300 may retrieve the output vector for the previous data sample from the vector database 370 and aggregate it with the output vector from the node/layer for the second data sample to generate an aggregated output vector for the node/layer, which is then stored in the vector database 370. This iterative process continues until all the data samples have been provided to the original model, so that the last stored aggregated output vector for each node/layer may be used for further processing.
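One possible realization of this iterative aggregation, sketched below, keeps a running average per node/layer in a simple in-memory store standing in for the vector database 370. The class name `VectorStore` and the choice of a running mean as the aggregation rule are illustrative assumptions only.

```python
import numpy as np

class VectorStore:
    """Minimal in-memory stand-in for the vector database 370: keeps, per
    node/layer, a running average of its output vectors and a sample count."""

    def __init__(self):
        self._store = {}  # key -> (aggregated output vector, samples seen)

    def update(self, key, output_vector):
        vec = np.asarray(output_vector, dtype=float)
        if key not in self._store:
            self._store[key] = (vec, 1)              # first data sample
        else:
            agg, n = self._store[key]
            # Fold the new output vector into the stored running average.
            self._store[key] = (agg + (vec - agg) / (n + 1), n + 1)

    def aggregated(self, key):
        return self._store[key][0]
```

After the last data sample has been processed, `aggregated(key)` returns the final aggregated output vector for the node/layer, which may then be used for the similarity analysis described below.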
Based on the aggregated output vectors for all nodes/layers of the original model, the node level vector comparator 310 may then be invoked to create, at 335, a similarity matrix for all nodes in the original model.
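Assuming, for illustration, that cosine similarity between aggregated output vectors of a common dimension is used as the similarity measure (the present teaching does not prescribe a particular measure), the node-level similarity matrix may be computed as in the following sketch.

```python
import numpy as np

def similarity_matrix(aggregated_vectors):
    """Compute a pairwise cosine-similarity matrix from the aggregated output
    vectors, one vector per node (or per layer)."""
    V = np.stack([np.asarray(v, dtype=float) for v in aggregated_vectors])
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    V = V / np.clip(norms, 1e-12, None)   # guard against zero-length vectors
    return V @ V.T                        # entry (i, j) compares node i with node j
```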
In some embodiments, the similarity matrix may encompass all nodes in the original model, some of which may be at the same layer and some of which may be from different layers. In some embodiments, the original model may include multiple connected sub-ANN networks, each of which may have its own input, output, and intermediate layers. Some of such sub-networks may be of different types, such as a multilayer perceptron neural network, a feedforward neural network, a long short-term memory (LSTM) neural network, a convolutional neural network (CNN), a recurrent neural network (RNN), etc. With an original model having different types of connected sub-neural networks, the similarity matrix for nodes may include either all nodes from the sub-networks or some nodes from some of the sub-networks. The choice of which nodes are included in the similarity assessment may be determined based on, e.g., application needs or any practical considerations arising from the application at hand.
Although a similarity matrix as illustrated in
With the computed similarity matrix for nodes in the original model, the loss-based removal candidate determiner 330 may be invoked to select, at 345, nodes that are considered redundant via a loss-based evaluation. As discussed herein, this evaluation process may first select removal candidates based on similarity measures from the similarity matrix, and each of such candidates may then be assessed based on its impact on the overall performance of the model. Details related to the operation of the loss-based removal candidate determiner 330 will be provided below with reference to
The removal candidate selector 510 may be provided for selecting node/layer removal candidates based on similarity measures in a similarity matrix (either for nodes or for layers). The removal candidates may be those that perform a function sufficiently similar to that of another node/layer and are, thus, potentially redundant. In some embodiments, the sufficiency may be defined such that the similarity measure between two aggregated output vectors from two nodes or two layers is above a specified level. For instance, a similarity above 0.9 may be defined to be sufficiently similar. In this case, node 2 in
The network loss determiner 520 is provided to simulate a removal of each removal candidate (a node or a layer) and then determine a loss of the model after the simulated removal. The loss assessment unit 540 is provided for comparing the losses of the model prior to and after the simulated removal. If there is no increase in loss after the simulated removal, this indicates not only that the removal candidate performs a redundant function but also that its removal does not cause any negative impact on the performance of the model. That is, the candidate is functionally non-contributing. In this case, the removal candidate can be removed (or compressed) from the model. The loss assessment unit 540 may send information to the compressed model configurator 340 affirming that the removal candidate can be removed to compress the model.
If the loss of the model increases due to a simulated removal, this may indicate that although the candidate performs a function similar to that of another node/layer, it implicitly plays some other functionally contributing role in the model, so that the candidate is not redundant and, hence, should not be removed. In this case, the removal of the candidate is not affirmed because it does not pass the loss-based assessment. In some embodiments, such loss-based redundancy assessment may be carried out with respect to a pre-determined loss-based redundancy condition 550, which may be defined as a certain percentage of the total loss. For example, the loss-based redundancy condition 550 may be specified as 10% of the total loss, i.e., so long as the increase in loss is within 10% of the loss without the simulated removal, the removal candidate may still be removed. Such a condition may be set based on, e.g., a desired compression rate. If a higher compression rate is desired, the percentage may be increased, and vice versa. In addition, the redundancy condition for nodes may differ from that for layers. Furthermore, the loss-based redundancy condition for nodes from the same layer may differ from that for nodes from different layers.
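As a hedged illustration of how the removal candidate selector 510, the network loss determiner 520, and the loss assessment unit 540 may cooperate, the sketch below first thresholds a row of the similarity matrix (using the 0.9 level from the example above) and then affirms a candidate only if simulating its removal keeps the loss increase within the configured tolerance (10% in the example above). The helpers `eval_fn` and `simulate_removal` are hypothetical placeholders for an application-specific loss evaluation and for whatever mechanism temporarily disables a node/layer.

```python
def select_candidates(sim_matrix, row, threshold=0.9):
    """Select, from one row of the similarity matrix, the nodes/layers whose
    similarity to the node/layer at `row` exceeds the specified level."""
    return [j for j, s in enumerate(sim_matrix[row]) if j != row and s > threshold]

def is_redundant(model, candidate, eval_fn, simulate_removal, tolerance=0.10):
    """Affirm a removal candidate only if simulating its removal does not
    increase the model loss beyond the loss-based redundancy condition."""
    baseline_loss = eval_fn(model)                 # loss prior to the simulated removal
    with simulate_removal(model, candidate):       # temporarily disable the node/layer
        loss_after = eval_fn(model)
    return loss_after <= baseline_loss * (1.0 + tolerance)
```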
In some embodiments, to improve the efficiency of the model compression pipeline 110, a removal candidate that is not affirmed as redundant may be excluded from further consideration and may be deleted from the corresponding similarity matrix so that it is no longer considered in subsequent processing. This is performed by the candidate removal unit 560. For instance, if node 2 in
In operation, when the loss-based removal candidate determiner 330 receives, at 505, a similarity matrix, it may identify, at 515, a next row (corresponding to a node or a layer) in the matrix to process. With respect to the selected row, the removal candidate selector 510 selects, at 525, a next node/layer in the row that performs a similar function, i.e., whose similarity measure exceeds a pre-determined level. The selected removal candidate is then sent to the network loss determiner 520, which simulates the removal of the candidate from the model and then measures, at 535, the overall loss of the model in accordance with a defined loss function (specified in 530). The measured overall loss after the simulated removal of the candidate is forwarded to the loss assessment unit 540, where it is evaluated, at 545, whether the simulated removal has a negative impact, beyond a specified level, on the overall performance of the model. If the candidate is redundant (there is no loss increase, or the loss increase is within the specified redundancy condition), as determined at 555, the loss assessment unit 540 sends, at 565, information about the candidate to be removed to the compressed model configurator 340. If the candidate is not redundant, the candidate may optionally be deleted from the input similarity matrix at 575.
The above-disclosed removal candidate determination process repeats for each of the candidates selected from each row of the matrix as performing a similar function, until all rows are processed. Specifically, when there are more candidates to consider with respect to a row, as determined at 585, the processing goes back to step 525 to select the next candidate performing a similar function from the same row and then simulate its removal in order to determine whether that candidate is redundant based on the loss-based evaluation. When all candidates from the same row have been processed, it is determined, at 595, whether there are more rows to be processed in the input similarity matrix. If so, the processing returns to step 515 to process the next row. If all rows have been processed, the loss-based removal candidate determiner 330 returns to step 505 to wait to receive another similarity matrix.
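Tying the preceding sketches together, the overall candidate-determination loop described above may be summarized as follows; again, this is only an illustrative sketch reusing the hypothetical helpers introduced earlier, not a definitive implementation of the pipeline.

```python
def determine_removals(sim_matrix, model, eval_fn, simulate_removal,
                       threshold=0.9, tolerance=0.10):
    """Walk the similarity matrix row by row, apply the loss-based evaluation
    to each candidate, and collect the candidates affirmed for removal."""
    affirmed = []
    for row in range(len(sim_matrix)):
        for candidate in select_candidates(sim_matrix, row, threshold):
            if is_redundant(model, candidate, eval_fn, simulate_removal, tolerance):
                affirmed.append(candidate)   # to be forwarded to the compressed model configurator 340
            else:
                # Optionally drop the non-redundant candidate from further consideration.
                sim_matrix[row][candidate] = 0.0
    return affirmed
```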
As shown in
To implement various modules, units, and their functionalities as described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems, and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to the appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of workstation or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment, and as a result the drawings should be self-explanatory.
Computer 700, for example, includes COM ports 750 connected to and from a network connected thereto to facilitate data communications. Computer 700 also includes a central processing unit (CPU) 720, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 710, program storage and data storage of different forms (e.g., disk 770, read only memory (ROM) 730, or random-access memory (RAM) 740), for various data files to be processed and/or communicated by computer 700, as well as possibly program instructions to be executed by CPU 720. Computer 700 also includes an I/O component 760, supporting input/output flows between the computer and other components therein such as user interface elements 780. Computer 700 may also receive programming and data via network communications.
Hence, aspects of the methods of information analytics and management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with information analytics and management. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium, or a physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
It is noted that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server. In addition, the techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or a hardware/firmware/software combination.
In the preceding specification, various example embodiments have been described with reference to the accompanying drawings. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the broader scope of the present teaching as set forth in the claims that follow. The specification and drawings are accordingly to be regarded in an illustrative rather than restrictive sense.