The present invention relates generally to artificial neural network techniques, and more particularly to methods and systems including a multi-dimensional deep neural network.
Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks of the human brain. Such computing systems are widely used in a variety of fields, such as Natural Language Processing (NLP), Image Processing, Computer Vision, and/or the like. Typically, an artificial neural network (ANN) is a directed weighted graph with interconnected neurons (i.e., nodes). These interconnected neurons are grouped into layers. Each layer of the ANN performs a mathematical manipulation (e.g., a non-linear transformation) on input data to generate output data. The ANN may have an input layer, a hidden layer, and an output layer to process the input data. Between the layers, an activation function determines the output passed onward through the ANN. To increase the accuracy of processing the input data, some ANNs have multiple hidden layers. Such an ANN with multiple layers between the input layer and the output layer is known as a deep neural network (DNN). The interconnected neurons of the DNN contain data values. When the DNN receives the input data, the input data is propagated in the forward direction through each of the layers. Each of the layers calculates an output and provides the output as input to the next layer. Thus, the input data is propagated in a feed-forward manner. For instance, feed-forward DNNs perform function approximation by passing weighted combinations of inputs through non-linear activation functions that are organized into a cascade of fully connected hidden layers.
Such DNNs need to be trained to accomplish tasks in the variety of fields. However, training DNNs becomes a tedious process as the number of hidden layers increases for better approximation. For instance, the activation functions in the DNN give rise to problems such as the vanishing gradient problem during backpropagation of the objective function gradient through the layers of the DNN. Backpropagation determines gradients of a loss function with respect to the weights in the DNN. However, a large number of hidden layers in the DNN may drive the gradients to zero (i.e., the vanishing gradient problem), leaving the weights far from their optimum values. Further, the DNN may suffer difficulties in optimizing the weights of the neurons due to the large number of hidden layers. This may delay the training process of the DNN and may slow down improvement of the model parameters of the DNN, which affects the accuracy of the DNN. The vanishing gradient problem in the training process of the DNN may be overcome by introducing residual neural network layers into the DNN.
A residual neural network (ResNet) utilizes skip connections that add outputs from previous layers of the DNN to the inputs of other non-adjacent layers. Typically, the ResNet may be implemented with skip connections that bypass two or three layers. Furthermore, the ResNet allows skipping of layers only in the forward direction of input propagation. This prevents the formation of cycles or loops, which are computationally cumbersome in both the training and inference processes of the DNN. However, restricting propagation of the input to the forward direction may be an undesirable limitation in some situations. It may be possible to compensate for this by increasing the number of hidden layers. As a consequence, the number of parameters may increase as additional hidden layers are added, while propagating the input in the forward direction. The increase in the number of parameters may also delay the training process, which is undesirable.
Accordingly, there is a need for a technical solution to overcome the above-mentioned limitations. More specifically, there is a need to train neural networks with multiple hidden layers in an efficient and feasible manner, while avoiding the vanishing gradient problem and the problem of an increasing number of parameters.
It is an object of some embodiments to provide an artificial neural network (ANN), such as a deep neural network (DNN), having a deep architecture with multiple hidden layers that allows connections among layers regardless of their respective positions in the ANN. A DNN has a plurality of layers, where each layer of the plurality of layers may be connected to respective non-adjacent layers of the plurality of layers. Additionally, or alternatively, it is an object of some embodiments to increase the number of hidden layers of the DNN without increasing the number of trained parameters of such a DNN. Additionally, or alternatively, it is an object of some embodiments to provide a DNN architecture that allows reusing outputs of different layers to enhance performance of the DNN without increasing the number of parameters.
Some embodiments are based on an understanding of the advantages of sharing information among layers of a DNN in both directions of data propagation. For example, while outputs from previous layers of the DNN can be added to the inputs of other adjacent and non-adjacent layers of the DNN, it can also be beneficial for outputs computed at later layers to help better process the input data or intermediate outputs from earlier layers. In such a manner, data can be exchanged in both directions to add flexibility to data processing. However, propagating data in both directions may create logical loops that jeopardize training and execution of DNNs.
Some embodiments are based on the realization that this loop problem may be addressed by rolling out the DNN in a direction different from the direction of data propagation by cloning or duplicating the parameters of the DNN. For example, some embodiments are based on the realization that a sequence of hidden layers that sequentially processes an input can provide insightful information for another, parallel sequence of hidden DNN layers that also sequentially processes the same input. In some implementations, both sequences are feed-forward neural networks with identical parameters. In such a manner, having multiple sequences of hidden layers does not increase the number of parameters. In some embodiments, at least some layers of one sequence of hidden layers are connected to at least some layers of another sequence of hidden layers to combine at least some intermediate outputs of the one sequence with at least some inputs to the other sequence. Each of the sequences of hidden layers corresponds to a DNN. The sequences of hidden layers, i.e., the DNNs, are arranged in a direction different from the direction of propagation of the input within the layers of each of the DNNs. To that end, the DNNs in the sequence of DNNs are connected to one another. For example, at least some layers of a first DNN are connected to at least some layers of subsequent DNNs. As used herein, two layers are connected when at least a function of an output of one layer forms at least part of an input to the other, connected layer. The connections between the DNNs combine them into a single neural network, such as a multi-dimensional neural network. As used herein, in the multi-dimensional neural network, the input data is propagated along multiple directions, i.e., from the input to the output layer of each DNN and across the sequence of DNNs forming the multi-dimensional neural network.
In various embodiments, the multi-dimensional neural network may have different numbers of DNNs. In one embodiment, the multi-dimensional neural network includes two DNNs, namely an inner DNN and an outer DNN. Each of the DNNs, i.e., the inner DNN and the outer DNN, includes one or more intermediate (hidden) layers. When the layers of the inner DNN are connected to the layers of the outer DNN, the layers are connected at an input/output level in order to preserve the dimensions of the inner DNN and outer DNN layers. For instance, an output of a layer of the inner DNN may be combined with an input to a layer of the outer DNN by adding them together.
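As an illustration of this inner/outer arrangement, the following sketch (in PyTorch) runs two passes over the same shared layers, so no parameters are added, and adds each inner-layer output to the input of the corresponding outer layer. The layer widths, the ReLU activations, and the one-to-one layer pairing are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch: an inner and an outer pass over the same shared layers, with
# each inner-layer output added to the input of the matching outer layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoPassNetwork(nn.Module):
    def __init__(self, dim: int = 16, num_layers: int = 4):
        super().__init__()
        # One set of parameters, reused by both the inner and the outer pass.
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inner pass: ordinary feed-forward propagation, keeping every output.
        inner_outputs = []
        h = x
        for layer in self.layers:
            h = F.relu(layer(h))
            inner_outputs.append(h)

        # Outer pass: same layers (same parameters), but each layer's input is
        # combined with an inner-layer output via a hard connection (plain add).
        h = x
        for k, layer in enumerate(self.layers):
            h = F.relu(layer(h + inner_outputs[k]))
        return h


if __name__ == "__main__":
    net = TwoPassNetwork()
    out = net(torch.randn(8, 16))
    print(out.shape)  # torch.Size([8, 16]); no extra parameters vs. a single pass
```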
The layers of the inner DNN and the layers of the outer DNN have a plurality of connections. For example, all layers of the inner DNN can be connected to all layers of the outer DNN. Such a connection pattern is referred to herein as a full connection, making the multi-dimensional neural network fully connected. Alternatively, the multi-dimensional neural network can be partially connected. For example, in a partially connected multi-dimensional neural network, one or more layers of the inner DNN can be connected to multiple layers of the outer DNN. Additionally, or alternatively, in a partially connected multi-dimensional neural network, multiple layers of the inner DNN can be connected to a single layer of the outer DNN.
Different connection patterns used by different embodiments allow the multi-dimensional neural network to be adapted for different applications. For example, in some embodiments, the output of a given layer of the inner DNN may contribute only to the input of a unique layer of the outer DNN. In some embodiments, outputs of two given layers in the inner DNN may contribute to the input of the same layer of the outer DNN.
In addition to different patterns of connections between layers of different DNNs in the multi-dimensional neural network, some embodiments use connections of different types. For example, various embodiments use hard connections, soft connections, or combinations thereof. In a hard connection, the output of a layer of the inner DNN is added to the input of a layer of the outer DNN in its entirety. That is, the layers are either connected or not. If the layers are connected, the output of one layer is combined with the input of another layer without additional scaling and/or weight multiplication. If the layers are not connected, nothing from the output of the layer is added to the input of the other layer.
Hence, according to the principles of a hard connection, the output of a layer of the inner DNN either contributes to the input of a layer of the outer DNN or does not contribute to the input of that layer of the outer DNN. The principle of data propagation according to hard connections differs from the principles of data propagation between layers of a single DNN. Thus, the hard connections allow decoupling of the principles of data propagation in different directions. In turn, such a decoupling allows searching for a better pattern of hard connections on top of training the parameters of the DNNs, which adds flexibility to the architecture of the multi-dimensional neural network.
In some embodiments, during the training process of the multi-dimensional neural network, the pattern of hard connections is selected among a plurality of patterns of connections. For each selected connection pattern, a corresponding multi-dimensional neural network is trained. The trained multi-dimensional network that gives the best performance is selected among all trained multi-dimensional networks. More specifically, the hard connection patterns are selected based on a search algorithm, for example a random search algorithm. The random search algorithm randomly samples a certain number of connection patterns from the plurality of connections and trains a model for each of the connection patterns. One model is then chosen based on a performance measure (e.g., accuracy, F1, BLEU score, etc.) on a validation set. For instance, one or more connection patterns with high scores may be selected for runtime execution. In some cases, the selected connection patterns may be further manipulated by making small modifications.
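One possible realization of such a random search is sketched below: sample a handful of permutation-style connection patterns, train one model per pattern, score each on a validation set, and keep the best. The helper functions train_model and validation_score are hypothetical placeholders for an actual training loop and metric (accuracy, F1, BLEU), kept trivial here so the sketch runs.

```python
# Hypothetical random search over hard connection patterns.  Each pattern is a
# permutation: pattern[k] is the inner layer whose output feeds outer layer k.
import random

NUM_LAYERS = 6
NUM_CANDIDATES = 8


def train_model(pattern):
    # Placeholder: build and train a multi-dimensional network wired with
    # `pattern`; here the pattern itself stands in for the trained "model".
    return pattern


def validation_score(model):
    # Placeholder validation metric; a real system would decode a validation
    # set and compute e.g. BLEU.  A random score keeps the sketch runnable.
    return random.random()


def random_search():
    candidates = [random.sample(range(NUM_LAYERS), NUM_LAYERS)
                  for _ in range(NUM_CANDIDATES)]
    scored = [(validation_score(train_model(p)), p) for p in candidates]
    best_score, best_pattern = max(scored)
    return best_pattern, best_score


if __name__ == "__main__":
    pattern, score = random_search()
    print("selected pattern:", pattern, "score:", round(score, 3))
```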
Additionally, or alternatively, in a soft type of connection, only a portion of the output of one layer is combined with the input of another layer. Specifically, the output of a layer softly connected to another layer is “weighted” before being added to the input of the other layer. The weights of the soft connections may vary for different soft connections.
In some other embodiments, the plurality of connections may correspond to soft connection patterns. In the case of soft connection patterns, outputs of layers of the inner DNN are added to the input of each layer of the outer DNN along with weights. In some example embodiments, the weights of the soft connection patterns may be associated with all connections or with a subset of the connections between layers of the inner DNN and layers of the outer DNN. The weights may indicate the strength of the connections between a given layer of the inner DNN and a given layer of the outer DNN. An output of the given layer of the inner DNN may be scaled by a factor that depends on a set of connection weights prior to combination with the input of the given layer of the outer DNN. In some embodiments, during the training process of the multi-dimensional neural network, the connection weights are trained simultaneously with the parameters of the DNNs. In such a manner, in contrast with the hard connections, the estimation of the soft connections, or of the weights of the soft connections, can be implemented as a process integrated with training the neural network. Hence, the process of establishing the soft connections is more aligned with the principles of neural networks.
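A sketch of such jointly trained soft connections is shown below: the connection weights are ordinary trainable parameters, normalized here with a per-outer-layer softmax (one possible choice, assumed for concreteness), and optimized together with the shared layer parameters.

```python
# Sketch of soft connections: every inner-layer output contributes to every
# outer-layer input, scaled by learnable connection weights that are trained
# jointly with the layer parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftlyConnectedTwoPass(nn.Module):
    def __init__(self, dim: int = 16, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        # w[k, j]: raw weight of the connection from inner layer j to outer layer k.
        self.w = nn.Parameter(torch.zeros(num_layers, num_layers))

    def forward(self, x):
        inner = []
        h = x
        for layer in self.layers:
            h = F.relu(layer(h))
            inner.append(h)
        stacked = torch.stack(inner)              # (num_layers, batch, dim)

        alpha = F.softmax(self.w, dim=1)          # connection strengths per outer layer
        h = x
        for k, layer in enumerate(self.layers):   # same parameters as the inner pass
            mix = torch.einsum("j,jbd->bd", alpha[k], stacked)
            h = F.relu(layer(h + mix))
        return h


if __name__ == "__main__":
    net = SoftlyConnectedTwoPass()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)  # trains w and layers together
    loss = net(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    opt.step()
    print(F.softmax(net.w, dim=1)[0])  # learned strengths into outer layer 0
```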
For example, in some embodiments the multi-dimensional neural network is fully connected with soft connections. The full connection reflects the maximum connection pattern considered reasonable by a network designer. The nature of the soft connections allows the training to decide which connections are more important than others.
For example, in some embodiments, the trained weights of the soft connections can be pruned by retaining only subsets of the connections based on the values of the weights. For example, only connections with a weight above a threshold may be retained, or only the connection with the largest weight among all connections out of a given layer of the inner DNN may be retained, or only the connection with the largest weight among all connections into a given layer of the outer DNN may be retained. After the connections have been pruned, the network may be further trained using only the remaining connections, with the weights of the remaining connections being trained simultaneously. In another embodiment, the remaining soft connections may be converted into hard connections, and the obtained network further trained.
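The pruning options described above might be realized as in the following sketch, where a learned weight matrix is either thresholded or reduced to the strongest connection into each outer layer; the 0.3 threshold and the toy weights are purely illustrative.

```python
# Sketch: pruning trained soft-connection weights into a sparse or hard pattern.
import torch

w = torch.tensor([[0.9, 0.1, 0.2, 0.1],
                  [0.2, 0.7, 0.4, 0.1],
                  [0.1, 0.2, 0.8, 0.3],
                  [0.1, 0.1, 0.2, 0.9]])
alpha = torch.softmax(w, dim=1)          # normalized connection strengths

# Option 1: keep every connection whose strength exceeds a threshold.
mask_threshold = (alpha > 0.3).float()

# Option 2: keep only the strongest connection into each outer layer, i.e. a
# hard connection pattern that can be trained further without the weights.
hard_pattern = alpha.argmax(dim=1)       # pattern[k] = inner layer feeding outer layer k

print(mask_threshold)
print(hard_pattern.tolist())             # e.g. [0, 1, 2, 3]
```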
In another embodiment, the multi-dimensional neural network includes one or multiple hidden DNNs between the inner DNN and the outer DNN. The DNNs of the multi-dimensional neural network are connected in a forward direction from the inner DNN to the outer DNN. For instance, an input is propagated in the forward direction from the inner DNN to the outer DNN. The propagation of the input in the forward direction prevents cycles or loops among the layers, while allowing a later layer of one DNN to be connected with an earlier layer of a subsequent DNN. Hence, the addition of hidden DNNs to an existing ANN provides a deep architecture, i.e., a multi-dimensional neural network, without increasing the number of parameters and without creating any cycles among the corresponding layers.
In one example embodiment, the multi-dimensional neural network forms a multi-pass transformer (MPT) architecture for an NLP application, such as machine translation of languages. The MPT includes an inner network and an outer network. The inner network corresponds to the inner DNN of the multi-dimensional neural network and the outer network corresponds to the outer DNN of the multi-dimensional neural network. The outer network utilizes features from layers of the inner network by adding the output from layers of the inner network to the original input of at least one of the layers of the outer network. In the MPT, the same parameters of the inner network are shared with the outer network. As the same parameters are shared between the inner network and the outer network, there is no increase in the number of parameters. The MPT also performs feature refinement in an iterative manner, which significantly improves performance for machine translation. Furthermore, the MPT may be combined with a self-attention network and a convolutional neural network or a feed-forward neural network for machine translation. In some example embodiments, the MPT may be generated by performing a search (such as a heuristic-based search) over a search space of the plurality of possible connection patterns. The heuristic-based search may be performed using an evolutionary search algorithm. In some example embodiments, the MPT may include connection weights that determine the strength of the connections between layers of the inner network and layers of the outer network of the MPT. The connection weights may be learned together with the other neural network parameters. Additionally, or alternatively, the MPT model for machine translation includes layers with a dual network or path consisting of a self-attention subnetwork and a feed-forward neural (FFN) subnetwork (e.g., a convolutional neural network). Such a dual combination of the self-attention subnetwork and the FFN subnetwork can achieve better performance than a pure self-attention network.
Accordingly, one embodiment discloses a computer-based artificial intelligence (AI) system. The AI system comprises an input interface configured to accept input data; a memory configured to store a multi-dimensional neural network having a sequence of deep neural networks (DNNs) including an inner DNN and an outer DNN; a processor configured to submit the input data to the multi-dimensional neural network to produce an output of the outer DNN; and an output interface configured to render at least a function of the output of the outer DNN. In the multi-dimensional neural network, each DNN includes a sequence of layers, and corresponding layers of different DNNs have identical parameters. Each DNN is configured to process the input data sequentially by the sequence of layers along a first dimension of data propagation. The DNNs in the sequence of DNNs are arranged along a second dimension of data propagation starting from the inner DNN to the outer DNN. The DNNs in the sequence of DNNs are connected such that at least an output of an intermediate layer or a final layer of a DNN is combined with an input to at least one layer of the subsequent DNN in the sequence of DNNs. The multi-dimensional neural network receives the input data submitted by the processor to produce the output of the outer DNN.
Accordingly, another embodiment discloses a method for generating an output of a multi-dimensional neural network. The method includes accepting input data via an input interface. The method includes submitting the input data to the multi-dimensional neural network having a sequence of DNNs including an inner DNN and an outer DNN. Each DNN includes a sequence of layers, and corresponding layers of different DNNs have identical parameters. Each DNN is configured to process the input data sequentially by the sequence of layers along a first dimension of data propagation. The DNNs in the sequence of DNNs are arranged along a second dimension of data propagation starting from the inner DNN to the outer DNN. The DNNs in the sequence of DNNs are connected, and at least one intermediate or final output of a DNN is combined with an input to at least one layer of the subsequent DNN in the sequence of DNNs. The method includes generating an output of the outer DNN. The method further includes rendering at least a function of the output of the outer DNN.
The presently disclosed embodiments will be further explained with reference to the attached drawings. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, apparatuses and methods are shown in block diagram form only in order to avoid obscuring the present disclosure.
As used in this specification and claims, the terms “for example,” “for instance,” and “such as,” and the verbs “comprising,” “having,” “including,” and their other verb forms, when used in conjunction with a listing of one or more components or other items, are each to be construed as open ended, meaning that the listing is not to be considered as excluding other, additional components or items. The term “based on” means at least partially based on. Further, it is to be understood that the phraseology and terminology employed herein are for the purpose of the description and should not be regarded as limiting. Any heading utilized within this description is for convenience only and has no legal or limiting effect.
In recent years, the architecture or structure of neural networks has evolved from the recurrent neural network (RNN) to long short-term memory (LSTM), the convolutional neural network (CNN) with a convolutional sequential architecture, and the transformer. Generally, the convolutional sequential architecture and the transformer are popularly used for Natural Language Processing, such as language representation learning. For computer vision applications, a neural architecture of a neural network corresponds to a multi-path approach for efficient information flow in the layers of the neural network. For the NLP application, the neural architecture corresponds to a sequential neural architecture. The sequential neural architecture utilizes features only from the last layer (i.e., the output layer) of the neural network, which provides a limited information flow. Some embodiments are based on the realization that insights can be gained from the multi-path based neural architecture.
Specifically, some embodiments are based on an understanding of the advantages of sharing information among layers of a DNN in both directions of data propagation. For example, while outputs from previous layers of the DNN can be added to the inputs of other adjacent and non-adjacent layers of the DNN, it can also be beneficial for outputs computed at later layers to help better process the input data or intermediate outputs from earlier layers. In such a manner, data can be exchanged in both directions to add flexibility to data processing. However, propagating data in both directions may create logical loops that jeopardize training and execution of DNNs.
Some embodiments are based on the realization that this loop problem may be addressed by rolling out the DNN in a direction different from the direction of data propagation by cloning or duplicating the parameters of the DNN. For example, some embodiments are based on the realization that a sequence of hidden layers that sequentially processes an input can provide insightful information for another, parallel sequence of hidden DNN layers that also sequentially processes the same input.
To that end, some exemplary embodiments disclose a multi-stage fusion mechanism that combines residual connections and dense connections to obtain a robust neural architecture for applications such as the NLP application, the computer vision application, or a combination thereof. The residual connections enable a feature of one layer of a neural network to skip to other, non-adjacent layers of the neural network. The dense connections enable all possible connections between layers of a neural network. The residual connections and the multi-stage connections are implemented based on operations such as concatenation, addition, and recurrent fusion. The residual connections and the dense connections enable combining information of features from lower layers and higher layers of the neural network in an efficient manner. More specifically, the residual connections allow gradients (or vectors) to flow through a neural network without passing through the non-linear activation functions between layers of the neural network. In this manner, the residual connection enables skipping one or more layers of the neural network. This prevents the vanishing gradient problem in the neural network. The prevention of the vanishing gradient problem improves the training process of the neural network.
A few examples of applications of the multi-stage fusion include object detection, machine translation, and/or the like. However, such a model fails to capture multi-stage information due to the limited capacity of the concatenation, addition, and recurrent fusion operations.
Some embodiments are based on a realization that an optimal structure can be determined for constructing parameter models (e.g., image models for computer vision applications or language models for NLP applications). To that end, the optimal structure is determined based on a neural architecture search (NAS) algorithm. Additionally, or alternatively, reinforcement learning and evolutionary-algorithm-based learning may be used in the neural architecture search. Some embodiments randomly sample an output (e.g., an output feature) from different layers of the neural network during a training stage to determine the optimal structure. This results in training multiple architectures at a time and provides a form of regularization for preventing overfitting of features or parameters in the neural network. By using the output feature from the inner network of the optimal neural architecture, a parameter model (i.e., the optimal structure) is obtained. Such a parameter model may be obtained with lower computational cost because the multiple architectures are trained at the same time.
The processor 108 is configured to submit the input data to the multi-dimensional neural network 106 to produce an output of the outer DNN. In some embodiments, the processor 108 is configured to randomly sample outputs from different layers of the multi-dimensional neural network 106 during the training stage. This results in training multiple architectures for different applications at a time and improves the processing-time efficiency of the AI system 100. In some embodiments, the processor 108 is configured to establish connections between one or more pairs of a layer of a DNN and a layer of a subsequent DNN of the multi-dimensional neural network based on a plurality of connections. The connections can have different patterns and different types. The different patterns of connections connect different layers of neighboring DNNs. The different types of connections include hard connections and soft connections, as described below.
The output interface 110 is configured to render at least a function of the output of the outer DNN. For instance, the function of the output corresponds to a parameter model for applications such as an NLP application, a computer vision application, or a combination thereof. More specifically, the function may be another DNN that accepts the output of the outer DNN and outputs a class label for classification tasks such as optical character recognition, object recognition, and speaker recognition. Moreover, the function may be a decoder network that accepts the output of the outer DNN and generates a sequence of words for sentence generation tasks such as speech recognition, machine translation, and image captioning.
The DNNs 200 and 210 include corresponding layers, i.e., layers having the same parameters. For example, the layer 202 corresponds to the layer 212, the layer 204 corresponds to the layer 214, the layer 206 corresponds to the layer 216, and the layer 208 corresponds to the layer 218. The corresponding layers are arranged in the same order, making at least some portions of the structures of the DNNs 200 and 210 identical to each other. In such a manner, the variation of parameters of the multi-dimensional neural network 106 is reduced, which increases the flexibility of its structure.
The inner DNN 200 is configured to process input data sequentially by the layers i.e. the DNN layer 202, the DNN layer 204, the DNN layer 206 along a first dimension 220 of data propagation. In a similar manner, the outer DNN 210 is configured to process input data sequentially by the layers i.e. the DNN layer 212, the DNN layer 214, the DNN layer 216 along the first dimension 220 of data propagation. The inner DNN 200 and the outer DNN 210 are arranged along a second dimension 222 of data propagation.
The layers (i.e., the input layer 202, the hidden layers 204 and 206, and the output layer 208) of the inner DNN 200 are connected to the layers (i.e., the input layer 212, the hidden layers 214 and 216, and the output layer 218) of the outer DNN 210 at an input/output level. In one example embodiment, the layers of the inner DNN 200 have a plurality of connections with the layers (i.e., the input layer 212, the hidden layers 214 and 216, and the output layer 218) of the outer DNN 210. This plurality of connections corresponds to a plurality of hard connections arranged in a pattern 200a (hereinafter referred to as hard connection patterns), as shown in
Different embodiments may use different connection patterns 200a to adapt the multi-dimensional neural network for different applications. For example, in some embodiments, the output of a given layer of the inner DNN may contribute only to the input of a unique layer of the outer DNN. In some embodiments, outputs of two given layers in the inner DNN may contribute to the input of the same layer of the outer DNN. For example, all layers of the inner DNN can be connected to all layers of the outer DNN. Such a connection pattern is referred to herein as a full connection, making the multi-dimensional neural network fully connected. Alternatively, the multi-dimensional neural network can be partially connected. For example, in a partially connected multi-dimensional neural network, one or more layers of the inner DNN can be connected to multiple layers of the outer DNN. Additionally, or alternatively, in a partially connected multi-dimensional neural network, multiple layers of the inner DNN can be connected to a single layer of the outer DNN.
In addition to different patterns of connections between layers of different DNNs in the multi-dimensional neural network, some embodiments use connections of different types. For example, various embodiments use hard connections, soft connections, or combinations thereof. In a hard connection, the output of a layer of the inner DNN is added to the input of a layer of the outer DNN in its entirety. That is, the layers are either connected or not. If the layers are connected, the output of one layer is combined with the input of another layer without additional scaling and/or weight multiplication. If the layers are not connected, nothing from the output of the layer is added to the input of the other layer. The pattern 200a shows an exemplary pattern of hard connections.
Hence, according to the principles of a hard connection, the output of a layer of the inner DNN either contributes to the input of a layer of the outer DNN or does not contribute to the input of that layer of the outer DNN. The principle of data propagation according to hard connections differs from the principles of data propagation between layers of a single DNN. Thus, the hard connections allow decoupling of the principles of data propagation in different directions. In turn, such a decoupling allows searching for a better pattern of hard connections on top of training the parameters of the DNNs, which adds flexibility to the architecture of the multi-dimensional neural network.
In some embodiments, during the training process of the multi-dimensional neural network, the pattern of hard connections is selected among a plurality of patterns of connections. For each selected connection pattern, a corresponding multi-dimensional neural network is trained. The trained multi-dimensional network that gives the best performance is selected among all trained multi-dimensional networks. More specifically, the hard connection patterns are selected based on a search algorithm, for example a random search algorithm. The random search algorithm randomly samples a certain number of connection patterns from the plurality of connections and trains a model for each of the connection patterns. One model is then chosen based on a performance measure (e.g., accuracy, F1, BLEU score, etc.) on a validation set.
In some embodiments, new connection patterns may be selected for inclusion in the search algorithm. The pre-determined connection patterns may be identified based on scores associated with each of the pre-determined connection patterns. For instance, one or more pre-determined connection patterns with high scores may be selected as the new connection patterns. In some cases, the selected pre-determined connection patterns may be manipulated by making small modifications.
Additionally, or alternatively, in a soft type of connection, only a portion of the output of one layer is combined with the input of another layer. Specifically, the output of a layer softly connected to another layer is “weighted” before being added to the input of the other layer. The weights of the soft connections may vary for different soft connections.
In one example embodiment, the residual connection allows the output of one layer of the inner DNN 200 to skip to other, non-adjacent layers of the inner DNN 200. For instance, the output of the input layer 202 can be added as input to the hidden layer 206, skipping the hidden layer 204, based on the residual connection. In some example embodiments, the output of the input layer 202 passes through an activation function when the output of the input layer 202 is added as the input to the hidden layer 206. Such an activation function may be a rectified linear unit (ReLU) that applies a non-linear transformation to the output of the input layer 202. Further, in some example embodiments, the layers 202-208 of the inner DNN 200 and the corresponding layers 212-218 of the outer DNN 210 share identical parameters (e.g., weight values or feature vectors). The layers 212-218 of the outer DNN 210 process the input data to provide an output 224.
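A tiny sketch of this residual skip, with illustrative layer widths, might look as follows.

```python
# Tiny sketch of the residual skip described above: the output of the first
# layer bypasses the second layer and is added to the input of the third,
# followed by a ReLU.  Layer widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 16
layer_202, layer_204, layer_206 = (nn.Linear(dim, dim) for _ in range(3))

x = torch.randn(4, dim)
h1 = F.relu(layer_202(x))          # output of layer 202
h2 = F.relu(layer_204(h1))         # output of layer 204
h3 = F.relu(layer_206(h2 + h1))    # skip connection: h1 added past layer 204
print(h3.shape)
```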
In another embodiment, the multi-dimensional neural network 106 may have one or multiple hidden DNNs between the inner DNN 200 and the outer DNN 210, as shown in
The hidden DNNs of the multi-dimensional neural network 106, i.e., the hidden DNNs 224 and 226, are connected in a forward direction (i.e., along the second dimension 222). The connection between the hidden DNNs (e.g., the DNNs 224 and 226) along the second dimension 222 prevents loop or cyclic connections among the layers in the multi-dimensional neural network 106. Moreover, the connection along the second dimension 222 allows increasing the number of hidden DNNs between the inner DNN 200 and the outer DNN 210 to provide a more accurate output. The increase in the number of hidden DNNs (i.e., the hidden DNNs 224 and 226) does not increase the number of parameters because identical parameters are shared among the inner DNN 200, the hidden DNNs 224 and 226, and the outer DNN 210.
In one example embodiment, the inner DNN 200 provides its output to other DNNs, such as the hidden DNN 226, via a residual connection. The residual connection allows skipping one or more hidden DNNs (e.g., the hidden DNN 224) and adding the output of a layer of the inner DNN 200 to a layer of the hidden DNN 226. For instance, the output of the inner DNN 200 can be added as input to the hidden DNN 226, skipping the hidden DNN 224, based on the residual connection.
At operation 304, the input data 302 is obtained by the processor 108 from the input interface 102. At operation 306, the processor 108 submits the input data 302 to the multi-dimensional neural network 106. At operation 308, the multi-dimensional neural network 106 processes the input data 302. In some embodiments, the multi-dimensional neural network 106 processes the input data 302 by providing the output of one of the DNNs 200, 224, 226, and 210 as input to a subsequent DNN of the DNNs 200, 224, 226, and 210. In some example embodiments, the input data 302 may be processed using pre-determined connection patterns. In some cases, the pre-determined connection patterns may correspond to hard connection patterns optimized during a random search in the training process of the multi-dimensional neural network 106. In some other cases, the pre-determined connection patterns may correspond to soft connection patterns learned simultaneously with the parameters of the multi-dimensional neural network 106 during the training process of the multi-dimensional neural network 106. At operation 310, the multi-dimensional neural network 106 renders a function of an output of the outer DNN 210. The output is provided as output data 312 via the output interface 110. In some example embodiments, the function of the output of the outer DNN 210 includes an encoded form of the input data produced as the output of the AI system 100 via the output interface 110. Further, the produced output may be displayed through a graphical representation or visualization via the output interface. In one example embodiment, the encoded data may be processed by a decoder to produce decoded data as the output.
Additionally, or alternatively, the AI system 100 may determine an optimal connection pattern from the plurality of connections. In some embodiments where the connection patterns are hard connection patterns, the optimal connection pattern may be determined based on a random search algorithm. The random search algorithm selects a certain number of connection patterns randomly from the plurality of connections. A model is chosen based on a performance measure on validation data prepared for a target application. For instance, the performance measure may be recognition accuracy for classification applications and the F1 or BLEU score for machine translation applications.
The connection pattern of the MPT 402 can be formed by hard and/or soft connections. The determination of the optimal connection pattern among the hard connection patterns is explained further with reference to
In some implementations, the MPT 402 forms an encoder for the machine translation. The MPT 402 includes an inner network and an outer network. The inner network corresponds to the inner DNN 200 and the outer network corresponds to the outer DNN 210. Similar to the sharing of identical parameters between the inner DNN 200 and the outer DNN 210, the same parameters are shared between the inner network and the outer network. The output of one of the layers of the inner network is added, via a residual connection, to the input of one of the layers of the outer network in the MPT 402. Further, in some embodiments, in the training process, the MPT 402 may randomly sample the features to be used for applications, such as the machine translation, from the last layer (i.e., the output layer) of either the inner network or the outer network. In some embodiments, the MPT 402 may use the output of the outer DNN 210 for applications, such as a machine translation application.
For the machine translation, a source sentence 406A (e.g., an English sentence) is provided as input data (e.g., the input data 302) to the MPT 402 via the input interface 102. For instance, the source sentence may be provided as a speech input, a textual input, or a combination thereof. The MPT 402 translates the source sentence 406A into a target sentence 406B (e.g., a German sentence). In one example embodiment, the input interface 102 tokenizes an input sentence to form the source sentence 406A, which is sent to the layer 202 of the inner DNN 200 and the layer 212 of the outer DNN 210. The input sentence may be tokenized based on byte-pair encoding (BPE) and further transformed by a word embedding layer into a vector representation. The vector representation may include L C-dimensional vectors, where L corresponds to the sentence length, i.e., the number of tokens in the sentence, and C corresponds to the word embedding dimension. Further, the position of each word of the source sentence 406A is encoded into a position embedding space and added to the vector representation, forming the final source sentence sequence 406A used as input to the MPT 402. The MPT 402 then computes encodings from the input, wherein the encodings are obtained as the output 224 of the layer 218 of the outer DNN 210. In one embodiment, the encodings computed by the MPT 402 are provided to the decoder 404. The decoder 404 computes a target sentence 406B from the encodings and provides the target sentence as output via the output interface 110. The target sentence 406B may be provided as a speech output, a textual output, or a combination thereof.
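The embedding step described above can be sketched as follows; BPE tokenization is assumed to have already produced the token ids, and the sinusoidal position embedding is one common choice used here for concreteness rather than the specific embedding of the embodiments.

```python
# Sketch: forming the L x C input sequence from token ids (word embedding plus
# position embedding).  Vocabulary size, C, and the toy token ids are assumptions.
import torch
import torch.nn as nn

vocab_size, C = 1000, 64
token_ids = torch.tensor([[5, 42, 7, 9]])          # (batch=1, L=4), assumed BPE output

embed = nn.Embedding(vocab_size, C)

def positional_encoding(length: int, dim: int) -> torch.Tensor:
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / dim)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

source = embed(token_ids) + positional_encoding(token_ids.size(1), C)
print(source.shape)   # torch.Size([1, 4, 64]); L vectors of dimension C
```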
As shown in
An ith hard MPT architecture may be denoted using a sequence of indices corresponding to the image [τ0(i), . . . , τN(i)] of the sequence [0, . . . , N] under an associated ith permutation. In the ith hard MPT architecture, the output of layer τk(i) in the inner network 408 is added to the input of the kth attention module in the outer network 410. For example, for the inner network 408 with Nhard = 6 attention modules 408A-408F, the best model 412 for the MPT 402 is obtained with the connection pattern [0, 4, 1, 5, 2, 3], in which the output of the 0th inner layer is added to the input of the 0th outer layer, the output of the 4th inner layer is added to the input of the 1st outer layer, the output of the 1st inner layer is added to the input of the 2nd outer layer, and so on. The connection pattern [0, 1, 2, 3, 4, 5] denotes the default setup architecture of the MPT 402.
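Reading such a pattern is straightforward, as the short illustration below shows: pattern[k] gives the index of the inner layer whose output is added to the input of the kth outer attention module. The loop body is only an illustration of the index mapping, not of the full forward pass.

```python
# Reading a hard connection pattern such as [0, 4, 1, 5, 2, 3].
pattern = [0, 4, 1, 5, 2, 3]
for k, j in enumerate(pattern):
    # The input of outer module k is its ordinary input plus the output of
    # inner layer j, i.e. inner_outputs[pattern[k]].
    print(f"outer module {k} <- output of inner layer {j}")
```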
The outputs of one or more of the attention modules 408A-408F of the inner network 408 are combined with the input to one of the attention modules 410A-410F of the outer network 410. For example, the input of the attention module 410A is combined with the output of the attention module 408A. The connections between the attention modules 408A-408F and the attention modules 410A-410F may be configured from any output of an intermediate layer or the output layer of the inner network 408 to any input of the input layer or an intermediate layer of the outer network 410.
Different embodiments can use the weights 418b in a direct or indirect manner. For example, in one embodiment, each soft connection has an associated weight, and the embodiment directly uses that weight to scale the contribution of the inner layer to the corresponding outer layer. Hence, the weight of each soft connection represents its strength. In an alternative embodiment, the weight wkj of each soft connection between a layer j of the inner DNN and a layer k of the outer DNN is not used directly to determine the strength of the connection, but is instead fed to a function such as a softmax function, such that the strength of each connection depends on the weights of the other connections.
In some cases, the MPT 418 is fully connected with soft connections. The weights are learned during the training process for the residual connection between each pair of layers from the attention modules 408A-408D and the attention modules 410A-410D. For example, the output of the kth attention module in the outer network 410, denoted Skout, may be written as

Skout = AttModule(Sk−1out + Σj αkj Sjout)   (1)

where AttModule(.) denotes the attention module (e.g., the attention modules 410A-410D) including a self-attention network and a feed-forward neural network, Sk−1out is the output of the preceding attention module in the outer network 410 (i.e., the ordinary input to the kth outer layer), Sjout is the output of the jth inner layer, and αkj represents a weight for the connection from the jth inner layer to the kth outer layer. The connection weight is computed via softmax as αkj = exp(wkj)/Σj exp(wkj) with learnable parameters wkj, to enforce 0 ≤ αkj ≤ 1 and Σj αkj = 1.
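A small numerical sketch of this combination, under the formulation above, is given below: the raw weights wkj are normalized with a per-row softmax so each row of α sums to one, and the inner-layer outputs are mixed accordingly before entering the kth outer attention module. The shapes and the random stand-in tensors are illustrative assumptions.

```python
# Numerical sketch of the soft combination: alpha = softmax(w) row-wise, then
# the inner outputs are mixed and added to the running outer activation before
# being passed to the k-th outer attention module.
import torch

num_layers, L, C = 4, 5, 8
w = torch.randn(num_layers, num_layers)                  # learnable w_kj
alpha = torch.softmax(w, dim=1)                          # alpha_kj in [0, 1], rows sum to 1
inner_outputs = torch.randn(num_layers, L, C)            # output of each inner layer j

k = 2                                                    # k-th outer attention module
mixed = torch.einsum("j,jlc->lc", alpha[k], inner_outputs)
prev_outer_output = torch.randn(L, C)                    # output of outer module k-1 (assumed)
outer_input_k = prev_outer_output + mixed                # argument passed to AttModule(.)
print(alpha[k].sum().item(), outer_input_k.shape)        # ~1.0, torch.Size([5, 8])
```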
In some example embodiments, the MPT may be trained during the training process based on random minimization for an input sequence S and a target sequence T. During the training process, an objective function L(T, S), for example a cross-entropy loss of the target sequence T given the input sequence S, is minimized.
Each of the layers of the attention modules 408A-408D and the attention modules 410A-410D of the corresponding inner network 408 and the outer network 410, includes a self-attention network and a feed-forward neural network, which is described further with reference to
In an example scenario, the self-attention subnetwork 504 receives an input, such as a sentence S represented by S ∈ R^(L×C). The self-attention subnetwork 504 transforms S into a key (Sk), a query (Sq), and a value (Sv) via linear transforms. By using an attention value between Sk and Sq, each word of S aggregates information from the other words using self-attention. For a key K, a query Q, and a value V, the attention value can be calculated using equation (2):

Attention(Q, K, V) = softmax(QK^T/√dk) V   (2)
The attention value is modulated by the square root of the feature dimension, dk. After aggregating information from the other words in the self-attention subnetwork 504, the FFN subnetwork 506 combines the information in a position-wise manner. In some embodiments, the self-attention subnetwork 504 corresponds to multi-head attention. A stack of such a self-attention subnetwork 504 and the FFN subnetwork 506 constitutes the attention module 502, processing the input S as follows:
Smid = Attention(Sq, Sk, Sv)   (3)
Sout = FFN(Smid)   (4)

where Smid is a feature from an intermediate layer (e.g., one of the attention modules 408B-408E or one of the attention modules 410B-410E) inside each of the inner network 408 and the outer network 410, and Sout is the output provided by the FFN subnetwork 506.
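A minimal sketch of one such attention module, following equations (2)-(4), is given below in PyTorch. The single attention head, the chosen dimensions, and the omission of the residual and normalization sublayers of a full transformer block are simplifying assumptions made for brevity.

```python
# Sketch of one attention module per equations (2)-(4): scaled dot-product
# self-attention followed by a position-wise feed-forward subnetwork.
import math
import torch
import torch.nn as nn


class AttentionModule(nn.Module):
    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def attention(self, q, k, v):
        # Equation (2): softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, s):
        # Equation (3): S_mid = Attention(S_q, S_k, S_v)
        s_mid = self.attention(self.q_proj(s), self.k_proj(s), self.v_proj(s))
        # Equation (4): S_out = FFN(S_mid), applied position-wise
        return self.ffn(s_mid)


if __name__ == "__main__":
    module = AttentionModule()
    s = torch.randn(2, 5, 64)          # (batch, sentence length L, dimension C)
    print(module(s).shape)             # torch.Size([2, 5, 64])
```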
In some embodiments, in the stage of decoding the encodings into the output, self-attention is performed on each target sentence's embedding representation T, followed by co-attention and an FFN. The decoding stage can be denoted as follows, where SA stands for self-attention:
TqSA = Attention(Tq, Tk, Tv)   (5)
Tqout = FFN(Attention(TqSA, Sk, Sv))   (6)
The word embedding layer is shared between the encoder and the decoder of the encoder-decoder architecture. After obtaining the representation for the next word, i.e., Tqout, in the decoder 404, a linear transform and a softmax operation are applied to Tqout to obtain probabilities of possible next words. Then, a cross-entropy loss based on the probabilities of the next words is utilized for training all the connected networks using a back-propagation technique for ANNs.
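A toy sketch of these decoder-side steps (equations (5)-(6) followed by the linear transform, softmax, and cross-entropy loss) might look as follows; single-head attention, the absence of causal masking, and the random stand-in tensors and sizes are all assumptions made for illustration.

```python
# Sketch of the decoder-side steps: self-attention over the target embeddings,
# co-attention against the encoder output, an FFN, then a linear + softmax over
# the vocabulary and a cross-entropy training loss.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


dim, vocab, L_src, L_tgt = 64, 1000, 6, 5
S = torch.randn(1, L_src, dim)                      # encoder output (source keys/values)
T = torch.randn(1, L_tgt, dim)                      # target embedding representation
ffn = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))
out_proj = nn.Linear(dim, vocab)                    # output projection to the vocabulary

T_sa = attention(T, T, T)                           # equation (5): self-attention
T_out = ffn(attention(T_sa, S, S))                  # equation (6): co-attention + FFN
logits = out_proj(T_out)                            # linear transform
probs = F.softmax(logits, dim=-1)                   # probabilities of next words

targets = torch.randint(0, vocab, (1, L_tgt))       # reference next-word ids (toy data)
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
print(probs.shape, loss.item())
```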
In one example embodiment, a final output 618A of the attention module 608 is added to an input of an attention module 610 prior to the residual connection associated with the self-attention subnetwork 606A of the attention module 610. The residual connection of the self-attention subnetwork 606A thus includes the output 618A, i.e., the sum of the output 618A and the input 612 is added to the output of the self-attention subnetwork 606A in the add-and-norm sublayer 614A.
The table 700 shows that combining information before initiation of the residual connection leads to better performance. The performance difference between the best model (i.e., the MPT transformer 412) and the least performing model (i.e., the base transformer 702) is 1.1, as the MPT transformer 412 obtains 28.4 while the base transformer 702 obtains 27.3. The different models in the table 700 are analyzed to determine the factors that influence the performance of the MPT of different embodiments. Based on these factors (such as the searched network), performance tends to improve when features from deeper layers in the inner network 408 are added to features in the outer network 410, except when adding features from the last layer of the inner network 408 to features of the first layer, i.e., the layer 410A, of the outer network 410. Moreover, performance is also improved when features from shallow layers in the inner network 408 are directly linked to deeper layers in the outer network 410.
The MPT 402 with hard connections and MPT 418 with soft connections may achieve performance better than an evolved transformer, which is described next with reference to
The evolved transformer performs an architecture search over a larger search space by using an evolutionary algorithm. The architecture search may be performed depending on the size of the self-attention heads, the number of layers, different cascades between convolution and self-attention networks, and dense-residual fusion, and the architecture search is performed jointly on the encoder and decoder of an encoder-decoder architecture neural network. The evolved transformer thus has a larger search space than the MPT 402. The MPT 402 with hard connection patterns performs a random search over a restricted search space. The reduced search space enables the MPT 402 to achieve better performance than the evolved transformer. The MPT 418 may estimate the optimal connection pattern without the random search, which also provides better performance than the evolved transformer. As shown in the table 800, the BLEU score of the MPT 402 is 28.4 on the EN-DE dataset and 41.8 on the EN-FR dataset with a smaller number of parameters (i.e., 61.2 million for EN-DE and 111.4 million for EN-FR). In a similar manner, the BLEU score of the MPT 418 is 28.4 on the EN-DE dataset and 41.6 on the EN-FR dataset with a smaller number of parameters (i.e., 61.2 million for EN-DE and 111.4 million for EN-FR). However, the BLEU score of the evolved transformer is 28.2 on the EN-DE dataset and 41.3 on the EN-FR dataset with a higher number of parameters (i.e., 64.1 million for EN-DE and 221.2 million for EN-FR).
The sentential context max pooling transformer combines features from all layers in the encoder network based on addition, recurrent fusion, concatenation, or attention operators. Furthermore, operators like concatenation and recurrent fusion may significantly increase the number of parameters. For instance, the number of parameters of the sentential context max pooling transformer is 106.9 million, which is more than the number of parameters of the MPT 402. Thus, the MPT 402 can achieve much better performance than the sentential context max pooling transformer with a smaller number of parameters. Similarly, the dynamic combination with the BT and the dynamic routing with the BT share the same concept as the sentential context max pooling transformer. The dynamic combination with the BT and the dynamic routing also utilize a multi-layer information fusion mechanism based on an expectation-maximization (EM) algorithm. However, the dynamic combination with the BT and the dynamic routing increase the number of parameters, to 113.2 million and 125.8 million, respectively.
Notably, the MPTs 402 and 418 can also be compared with deeper transformers that have more layers but only one dimension, i.e., there is no sequence of DNNs and no data propagation along the second dimension. For example, a "deeper" transformer with 12 levels performs approximately as well as an MPT with six layers, but the deeper transformer uses more parameters and thus more memory.
At block 904, the input data is submitted to a multi-dimensional neural network 106 having a sequence of deep neural networks (DNNs) including an inner DNN and an outer DNN. Each DNN includes a sequence of layers, and corresponding layers of different DNNs have identical parameters. Each DNN is configured to process the input data sequentially by the sequence of layers along a first dimension of data propagation. The DNNs in the sequence of DNNs are arranged along a second dimension of data propagation starting from the inner DNN to the outer DNN, wherein the DNNs in the sequence of DNNs are connected such that at least an output of an intermediate layer or a final layer of a DNN is combined with an input to at least one layer of the subsequent DNN in the sequence of DNNs, as described above in the description of
At block 906, an output of the outer DNN is produced. At block 908, at least a function of the output of the outer DNN is rendered. The output of the outer DNN is rendered via the output interface 110.
The input interface 1002 is configured to accept the input data 1016. In some embodiments, the AI system 1000 receives the input data 1016 via the network 1014 using the NIC 1012. In some cases, the input data 1016 may be online data received via the network 1014. In some other cases, the input data 1016 may be recorded data stored in the storage device 1022. In some embodiments, the storage device 1022 is configured to store a training dataset for training the multi-dimensional neural network 1008.
The processor 1004 is configured to submit the input data 1016 to the multi-dimensional neural network 1008 to produce an output of the outer DNN 210. From the output of the outer DNN 210, at least a function is rendered and provided via the output interface 1018. The output interface 1018 is further connected to an output device 1020. Some examples of the output device 1020 include, but are not limited to, a monitor, a display screen, and a projector.
Each of the machine translation devices 1104A, 1104B, and 1104C may include a corresponding interface controller 1106A, 1106B, or 1106C. For instance, the interface controllers 1106A, 1106B, and 1106C may be arranged in the NIC 1012 connected to a display, speaker(s), and a microphone of the machine translation devices 1104A, 1104B, and 1104C. The interface controllers 1106A, 1106B, and 1106C may be configured to convert speech signals of the corresponding operators (i.e., the operators 1102A, 1102B, and 1102C) received as the input data 1016 from the network 1014. The network 1014 may be the Internet, a wired communication network, a wireless communication network, or a combination of at least two of them.
The input data 1016 is processed by each of the machine translation devices 1104A, 1104B, and 1104C. The processed input data 1016 is translated into a desired language by the corresponding machine translation devices 1104A, 1104B, and 1104C. The translated speech is provided as output to the corresponding operators 1102A, 1102B, and 1102C. For instance, the operator 1102A sends a speech signal in the English language to the operator 1102B using the machine translation device 1104A. The speech in the English language is received by the machine translation device 1104B. The machine translation device 1104B translates the English-language speech into speech in the German language. The translated speech is provided to the operator 1102B. Further, in some example embodiments, the machine translation devices 1104A, 1104B, and 1104C may store/record conversations among the operators 1102A, 1102B, and 1102C in a storage unit, such as the storage device 1022. The conversations may be stored as audio data or textual data using a computer-executable speech-to-text program stored in the memory 1006 or in the storage device 1022.
In this manner, operators in different locations speaking different languages may communicate efficiently using the machine translation device equipped with the AI system 1000. Such communications enable the operators to perform cooperative operations as is shown and described in
Some embodiments are based on a recognition that the cooperative operation system 1110 may provide a process data format for maintaining/recording the whole process data of manufacturing lines based on predetermined languages when an operator 1114 speaks a language different from other operators, such as the operators 1102A, 1102B, and 1102C, who work on manufacturing lines constructed in a single facility or in different facilities in different countries. In this case, the process data format may be recorded in individual languages even when the operators 1102A, 1102B, 1102C, and 1114 use different instruction languages.
The NIC 1012 of the AI system 1000 may be configured to communicate with a manipulator, such as a robot 1116 via the network 1014. The robot 1116 may include a manipulator controller 1118 and a sub-manipulator 1120 connected to a manipulator state detector 1122, in which the sub-manipulator 1120 is configured to assemble workpieces 1124 for manufacturing parts of a product or finalizing the product. Further, the NIC 1012 may be connected to an object detector 1126, via the network 1014. The object detector 1126 may be arranged so as to detect a state of the workpiece 1124, the sub-manipulator 1120, and the manipulator state detector 1122 connected to the manipulator controller 1118 arranged in the robot 1116. The manipulator state detector 1122 detects and transmits manipulator state signals (S) to the manipulator controller 1118. The manipulator controller 1118 then provides process flows or instructions based on the manipulator state signals (S).
The display 1112 may display the process flows or instructions representing process steps for assembling products based on a (predesigned) manufacturing method. The manufacturing method may be received via the network 1014 and stored in the memory 1006 or the storage device 1022. For instance, when the operator 1114 checks the condition of assembled parts of a product or of an assembled product (while performing a quality control process according to a format, such as a process record format), an audio input may be provided via the microphone of the cooperative operation system 1110 to record the quality check. The quality check may be performed based on the product manufacturing process and product specifications that may be indicated on the display 1112. The operator 1114 may also provide instructions to the robot 1116 to perform operations for the product assembly lines. Using the speech-to-text program stored in the memory 1006 or the storage device 1022, the cooperative operation system 1108 can store results confirmed by the operator 1114 in the memory 1006 or the storage device 1022 as text data. The results may be stored with time stamps along with item numbers assigned to each assembled part or assembled product for a manufacturing product record. Further, the cooperative operation system 1108 may transmit the records to a manufacturing central computer (not shown in
The following description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing one or more exemplary embodiments. Contemplated are various changes that may be made in the function and arrangement of elements without departing from the spirit and scope of the subject matter disclosed as set forth in the appended claims.
Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements in the subject matter disclosed may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Further, like reference numbers and designations in the various drawings indicate like elements.
Also, individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may be terminated when its operations are completed, but may have additional steps not discussed or included in a figure. Furthermore, not all operations in any particularly described process may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, the function's termination can correspond to a return of the function to the calling function or the main function.
Furthermore, embodiments of the subject matter disclosed may be implemented, at least in part, either manually or automatically. Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium. A processor(s) may perform the necessary tasks.
Various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Embodiments of the present disclosure may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts concurrently, even though shown as sequential acts in illustrative embodiments. Further, the use of ordinal terms such as "first" and "second" in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term).
Although the present disclosure has been described with reference to certain preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the present disclosure. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the present disclosure.