The present invention(s) relate to deep learning networks or graphs, in particular, hypercomplex deep learning systems, methods, and architectures for multimodal small, medium, and large-scale data representation, analysis, and applications.
Deep learning is a method of discovering hierarchical representations and abstractions of data; the representations help make order out of unstructured data such as images, audio, and text. Existing deep neural networks, however, assume that all data is unstructured, yet many practical engineering problems involve multi-channel data that has important inter-channel relationships which existing deep neural networks cannot model effectively. For example: color images contain three or four related color channels; multispectral images are images of the same object at different wavelengths and therefore have significant inter-wavelength relationships; and multi-sensor applications (e.g. sensor arrays for audio, radar, vision, biometrics, etc.) also employ data relationships between channels. Because existing deep learning structures have difficulty modeling multi-channel data, they require vast amounts of data for training in order to approximate simple inter-channel relationships.
An important topic in traditional, real-valued deep learning literature is the “vanishing gradient” problem, wherein the error terms that are propagated through the network tend towards zero. This occurs due to the use of nonlinearities, such as the hyperbolic tangent, that compress the data around zero; that is, they map values of larger magnitude to values of smaller magnitude. After repeated applications of such a nonlinearity, the value tends towards zero.
Many application areas have a sparsity of labeled data for deep representation learning. For example, in the fashion industry, there are extensive datasets of images with clothing and labeled attributes such as color, sleeve length, and so on. However, all of these images are taken with models in well-lit images. If one wants to identify the same clothing in images on social media, a mapping from the social media images to the well-lit images would be helpful. Unfortunately, this mapping dataset is extremely small and is difficult to expand. Correspondingly, methods for learning data representations on the large, source dataset of model images and transferring that learning to the target task of clothing identification in social media are necessary.
In addition to transferring learning from one domain to another, it is important to transfer learning from human to machine. Traditionally, machine expert systems were created according to the following steps: problem domain experts would engineer a set of data representations called features; these features would be used to train a fully-connected neural network; and, finally, the neural network would be used for prediction and classification tasks. Deep learning has automated the domain expert creation of features, provided that a sufficiently large dataset exists. However, large datasets do not exist for many applications, and therefore a method of employing human expert features to reduce training set size would be helpful.
One important emerging application of representation learning is multi-sensor fusion and transfer learning in biometric identity verification systems. The primary objective of biometric identity verification systems is to create automated methods to uniquely recognize individuals using: (i) anatomical characteristics, for example, DNA, signature, palm prints, fingerprints, finger shape, face, hand geometry, vascular technology, iris, and retina; (ii) behavioral characteristics, for example, voice, gait, typing rhythm, and gestures; (iii) demographic indicators, for example, height, race, age, and gender; and (iv) artificial characteristics such as tattoos. Biometric systems are rapidly being deployed for border security identity verification, prevention of identity theft, and digital device security.
The key drivers of performance in biometric identity verification systems are multimodal data from sensors and information processing techniques to convert that data into useful features. In existing biometric identity verification systems, the data features are compared against a database to confirm or reject the identity of a particular subject. Recent research in biometrics has focused on improving data sensors and creating improved feature representations of the sensor data. By deep learning standards, typical biometric datasets are small: Tens to tens of thousands of images is a typical size.
There have been significant recent advances in biometric identity verification systems. However, these methods are still inadequate in challenging identification environments. For example, in multimedia applications such as social media and digital entertainment, one attempts to match Internet face images in social media. However, biometric images in these applications usually exhibit dramatic variations in pose, illumination, and expression, which substantially degrade the performance of traditional biometric algorithms. Moreover, multimedia applications present an additional challenge due to the large scale of the image databases available, which leads to many users and an increased probability of incorrect identification.
Traditional biometric identity verification systems are usually based on two-dimensional biometric images captured in the visible light spectrum; these images are corrupted by different environmental conditions such as varied lighting, camera angles, and resolution. Multispectral imaging, where visible light is combined with other spectra such as near-infrared and thermal, has recently been applied to biometric identity verification systems. This research has demonstrated that multimodal biometric data fusion can significantly improve the accuracy of biometric identity verification systems due to complementary information from multiple modalities. The addition of palmprint data appears to enhance system performance further.
Curated two-dimensional biometric datasets have been created for a variety of biometric tasks. While each dataset contains information for only a narrowly-defined task, the combination of all datasets would enable the creation of a rich knowledge repository. Moreover, combination of existing biometric datasets with the large repository of unlabeled data on the Internet presents new opportunities and challenges. For example, while there is an abundance of unlabeled data available, many problems either do not have a labeled training set or only have a small training set available. Additionally, the creation of large, labeled training sets generally requires significant time and financial resources.
Recent advances in three-dimensional range sensors and sensor processing have made it possible to overcome some limitations, such as distortion due to illumination and pose changes, of two-dimensional biometric identity system modalities. Three-dimensional multispectral images will provide more geometric and shape information than their two-dimensional counterparts. Using this information, new methods of processing depth-based images and three-dimensional surfaces will enable the improvement of biometric identity verification systems.
The present invention(s) include systems, methods, and apparatuses for, or for use in: (i) approximating mathematical relationships between one or more sets of data and for, or use in, creating hypercomplex representations of data; (ii) transferring learned knowledge from one hypercomplex system to another; (iii) multimodal hypercomplex learning; and, (iv) biometric identity verification using multimodal, multi-sensor data and hypercomplex representations.
The Summary introduces key concepts related to the present invention(s). However, the description, figures, and images included herein are not intended to be used as an aid to determine the scope of the claimed subject matter. Moreover, the Summary is not intended to limit the scope of the invention.
In some embodiments, the present invention(s) provide techniques for learning hypercomplex feature representations of data, such as, for example: audio; images; biomedical data; biometrics such as fingerprints, palm prints, iris information, facial images, demographic data, behavioral characteristics, and so on; gene sequences; text; unstructured data; writing; geometric patterns; or any other information source. In some embodiments, the data representations may be employed in the applications of, for example, classification, grading, function approximation, system preprocessing, pretraining of other graph structures, and so on. Some embodiments of the present invention(s) include, for example, hypercomplex convolutional neural network layers and hypercomplex neural network layers with internal state elements.
In some embodiments, the present invention(s) allow for supervised learning given training data inputs with associated desired responses. The present invention(s) may be arranged in any directed or undirected graph structure, including graphs with feedback or skipped nodes. In some embodiments, the present invention(s) may be combined with other types of graph elements, including, for example, pooling, dropout, and fully-connected neural network layers. In some embodiments, the present invention includes convolutional hypercomplex layers and/or locally-connected layers. Polar (angular) representation of hypercomplex numbers is employed in some embodiments of the invention(s), and other embodiments may quantize or otherwise non-linearly process the angular values of the polar representation.
Some embodiments of the present invention(s) propagate training errors through the network graph using an error-correction learning rule. Some embodiments of the learning rule rely upon multiplication of the error with the hypercomplex inverse of hypercomplex weights in the same or other graph elements which, for example, include neural network layers.
In some embodiments of the present invention(s), hypercomplex mathematical operations are performed using real-valued mathematical operators and additional mathematical steps. Some embodiments of the hypercomplex layers may include real-valued software libraries that have been adapted for use with hypercomplex numbers. Exemplary implementations of the present invention(s) perform highly-optimized computations through real-valued matrix multiply and convolution routines. These real-valued routines, for example, run on readily available computer hardware and are available for download. Examples include the Automatically Tuned Linear Algebra Software (ATLAS) and NVIDIA cuDNN libraries.
In some embodiments, techniques or applications are fully automated and are performed by a computing device, such as, for example, a central processing unit (CPU), graphics processing unit (GPU), field programmable gate array (FPGA), and/or application specific integrated circuit (ASIC).
Some embodiments of the present invention(s) include various graph structures involving hypercomplex operations. Examples include hypercomplex feedforward networks, networks with pooling, recurrent networks (i.e. with feedback), networks where connections skip over one or more layers, networks with state and layers with internal state elements, and any combination of the aforementioned or other elements.
In some embodiments, the present invention(s) may be employed for supervised learning and/or unsupervised learning. Through a novel pretraining technique, embodiments of the present invention(s) may combine knowledge of application-specific features generated by human experts with machine learning of features in hypercomplex deep neural networks. Some embodiments of the invention employ both labeled and unlabeled datasets, where a large labeled dataset in a source domain may assist in classification tasks for an unlabeled, target domain.
Some embodiments of the present invention(s) have applications in broad areas including, for example: image super resolution; image segmentation; image quality evaluation; image steganalysis; face recognition; event embedding in natural language processing; machine translation between languages; object recognition; medical applications such as breast cancer mass classification; multi-sensor data processing; multispectral imaging; image filtering; biometric identity verification; and clothing identification.
Some embodiments of the present invention(s) incorporate multimodal data for biometric identity verification, for example, anatomical characteristics, behavioral characteristics, demographic indicators, and/or artificial characteristics. Some embodiments of the present invention(s) learn biometric features on a source dataset, for example a driver's license face database, and apply recognition in a target domain, such as social media photographs. Some embodiments of the present invention(s) incorporate multispectral imaging and/or palmprint data as additional modalities. Some embodiments of the present invention(s) employ three-dimensional and/or two-dimensional sensor data. Some embodiments of the present invention(s) incorporate unlabeled biometric data to aid in hypercomplex data representation training.
Advantages of the present invention will become apparent to those skilled in the art with the benefit of the following detailed description of embodiments and upon reference to the accompanying drawings in which:
While the invention may be susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. The drawings may not be to scale. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
It is to be understood that the present invention is not limited to particular devices or methods, which may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms “a”, “an”, and “the” include singular and plural referents unless the content clearly dictates otherwise. Furthermore, the word “may” is used throughout this application in a permissive sense (i.e., having the potential to, being able to), not in a mandatory sense (i.e., must). The term “include,” and derivations thereof, mean “including, but not limited to.” The term “coupled” means directly or indirectly connected.
A hypercomplex neural network layer is defined presently. In the interest of clarity, a quaternion example is defined below. However, it is to be understood that the method described is applicable to any hypercomplex algebra, including but not limited to biquaternions, exterior algebras, group algebras, matrices, octonions, and quaternions.
An exemplary hypercomplex layer is shown in
Mathematically, in the quaternion case, these steps are defined as follows:
Let $a \in \mathbb{H}^{m \times n}$ denote the input to the quaternion layer. The first step, convolution, produces the output $s \in \mathbb{H}^{r \times t}$, as defined in Equation 1:
where $k \in \mathbb{H}^{p \times q}$ represents the convolution filter kernel and the × symbol denotes quaternion multiplication. An alternative notation for hypercomplex convolution is the asterisk, where $s = k *_h a$. Details about hypercomplex convolution are explained in Sections 1.2 and 1.4.
Continuing with the quaternion example, the activation function is applied to $s \in \mathbb{H}^{r \times t}$ and may be any mathematical function. An exemplary function used herein is a nonlinear function that converts the quaternion values into polar (angular) representation, quantizes the phase angles, and then recomputes updated quaternion values on an orthonormal basis (1,i,j,k). More information about this function is provided in Sections 1.6 and 1.7.
Exemplary methods for hypercomplex convolution are described presently. The examples described herein all pertain to quaternion convolution.
A quaternion example of hypercomplex convolution is pictured in
A specific example of the approach described above is shown in
Because convolution is a linear operator, the multiplication in Equation 2 may be replaced by the convolution operator of Equation 1. Correspondingly, a convolution algorithm to compute s(x,y)=k(x,y)*a(x,y) for quaternion matrices is shown in Equation 3:
where *h denotes quaternion convolution, *r denotes real-valued convolution, and (x,y) are the 2d array indices.
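The sixteen-convolution form can be sketched in NumPy as follows. This is a minimal illustration, not the implementation of any embodiment: the function names, the trailing axis of length four for the (1, i, j, k) components, and the "valid" sliding-window convention without kernel flipping are all assumptions made for the example. The signs follow the Hamilton product of the kernel (left operand) with the input (right operand).

```python
import numpy as np

def conv2d_real(kernel, image):
    """Real-valued 'valid' 2-D sliding-window sum (assumed convention: no kernel flip)."""
    p, q = kernel.shape
    m, n = image.shape
    out = np.zeros((m - p + 1, n - q + 1))
    for u in range(p):
        for v in range(q):
            out += kernel[u, v] * image[u:u + m - p + 1, v:v + n - q + 1]
    return out

def quat_conv2d_16(k, a):
    """Quaternion convolution built from sixteen real-valued convolutions.

    k has shape (p, q, 4) and a has shape (m, n, 4); the last axis holds
    the (1, i, j, k) components of each quaternion.
    """
    ak, bk, ck, dk = (k[..., c] for c in range(4))
    aa, ba, ca, da = (a[..., c] for c in range(4))
    r = conv2d_real
    s_1 = r(ak, aa) - r(bk, ba) - r(ck, ca) - r(dk, da)
    s_i = r(ak, ba) + r(bk, aa) + r(ck, da) - r(dk, ca)
    s_j = r(ak, ca) - r(bk, da) + r(ck, aa) + r(dk, ba)
    s_k = r(ak, da) + r(bk, ca) - r(ck, ba) + r(dk, aa)
    return np.stack([s_1, s_i, s_j, s_k], axis=-1)

# With 1x1 arrays the convolution reduces to a single quaternion product.
i_unit = np.array([[[0.0, 1.0, 0.0, 0.0]]])
j_unit = np.array([[[0.0, 0.0, 1.0, 0.0]]])
i_times_j = quat_conv2d_16(i_unit, j_unit)[0, 0]   # i x j = k
j_times_i = quat_conv2d_16(j_unit, i_unit)[0, 0]   # j x i = -k
```

The two products at the end differ in sign, which is the non-commutativity that the component-wise sign pattern has to preserve.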
One may observe that Equation 3 requires sixteen real-valued convolution operations to perform a single quaternion convolution. However, due to the linearity of convolution, high-speed techniques for quaternion multiplication may also be applied to the convolution operation. For example,
The eight-real-multiply convolution takes inputs $k(x,y)=a_k(x,y)+b_k(x,y)i+c_k(x,y)j+d_k(x,y)k$ and $a(x,y)=a_a(x,y)+b_a(x,y)i+c_a(x,y)j+d_a(x,y)k$, as in Equation 3. However, rather than computing the convolution directly, the following intermediate variables are computed as a first step:
$$
\begin{aligned}
t_1(x,y) &= a_k(x,y) *_r a_a(x,y) \\
t_2(x,y) &= d_k(x,y) *_r c_a(x,y) \\
t_3(x,y) &= b_k(x,y) *_r d_a(x,y) \\
t_4(x,y) &= c_k(x,y) *_r b_a(x,y) \\
t_5(x,y) &= \big(a_k(x,y)+b_k(x,y)+c_k(x,y)+d_k(x,y)\big) *_r \big(a_a(x,y)+b_a(x,y)+c_a(x,y)+d_a(x,y)\big) \\
t_6(x,y) &= \big(a_k(x,y)+b_k(x,y)-c_k(x,y)-d_k(x,y)\big) *_r \big(a_a(x,y)+b_a(x,y)-c_a(x,y)-d_a(x,y)\big) \\
t_7(x,y) &= \big(a_k(x,y)-b_k(x,y)+c_k(x,y)-d_k(x,y)\big) *_r \big(a_a(x,y)-b_a(x,y)+c_a(x,y)-d_a(x,y)\big) \\
t_8(x,y) &= \big(a_k(x,y)-b_k(x,y)-c_k(x,y)+d_k(x,y)\big) *_r \big(a_a(x,y)-b_a(x,y)-c_a(x,y)+d_a(x,y)\big)
\end{aligned} \tag{4}
$$
In Equation 4, *r represents a real-valued convolution; one will observe that there are eight real-valued convolutions.
To complete the quaternion convolution s(x,y), the temporary terms ti are scaled and summed as shown in Equation 5:
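The eight-convolution scheme can be checked numerically with the following sketch. The recombination coefficients in `quat_conv2d_8` were obtained by expanding the eight intermediate terms and matching them against the Hamilton product; they are offered as one consistent reading and are not a transcription of Equation 5. All function names, the array layout (last axis of length four), and the "valid" window convention are assumptions of the example.

```python
import numpy as np

def conv2d_real(kernel, image):
    """Real-valued 'valid' 2-D sliding-window sum (assumed convention: no kernel flip)."""
    p, q = kernel.shape
    m, n = image.shape
    out = np.zeros((m - p + 1, n - q + 1))
    for u in range(p):
        for v in range(q):
            out += kernel[u, v] * image[u:u + m - p + 1, v:v + n - q + 1]
    return out

def qmul(p, q):
    """Hamilton product on arrays whose last axis holds (1, i, j, k)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def quat_conv2d_direct(k, a):
    """Reference: slide the kernel and apply the Hamilton product at each tap."""
    p, q = k.shape[:2]
    m, n = a.shape[:2]
    out = np.zeros((m - p + 1, n - q + 1, 4))
    for u in range(p):
        for v in range(q):
            out += qmul(k[u, v], a[u:u + m - p + 1, v:v + n - q + 1])
    return out

def quat_conv2d_8(k, a):
    """Quaternion convolution from eight real convolutions (t1..t8)."""
    ak, bk, ck, dk = (k[..., c] for c in range(4))
    aa, ba, ca, da = (a[..., c] for c in range(4))
    r = conv2d_real
    t1, t2, t3, t4 = r(ak, aa), r(dk, ca), r(bk, da), r(ck, ba)
    t5 = r(ak + bk + ck + dk, aa + ba + ca + da)
    t6 = r(ak + bk - ck - dk, aa + ba - ca - da)
    t7 = r(ak - bk + ck - dk, aa - ba + ca - da)
    t8 = r(ak - bk - ck + dk, aa - ba - ca + da)
    # Recombination derived by expanding t5..t8 (an assumption, see lead-in).
    s_1 = 2 * t1 - (t5 + t6 + t7 + t8) / 4
    s_i = -2 * t2 + (t5 + t6 - t7 - t8) / 4
    s_j = -2 * t3 + (t5 - t6 + t7 - t8) / 4
    s_k = -2 * t4 + (t5 - t6 - t7 + t8) / 4
    return np.stack([s_1, s_i, s_j, s_k], axis=-1)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 4))
image = rng.normal(size=(6, 6, 4))
same = np.allclose(quat_conv2d_8(kernel, image), quat_conv2d_direct(kernel, image))
```

The trade is fewer real convolutions in exchange for additional element-wise additions on the inputs and outputs.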
Octonions represent another example of hypercomplex numbers. Octonion convolution may be performed using quaternion convolution as outlined presently:
Let $o_n \in \mathbb{O}$:
$$
o_n = e_0 o_{0n} + e_1 o_{1n} + e_2 o_{2n} + e_3 o_{3n} + e_4 o_{4n} + e_5 o_{5n} + e_6 o_{6n} + e_7 o_{7n} \tag{6}
$$
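To convolve octonions $o_a *_o o_b$, first represent each argument as a pair of quaternions, resulting in $w, x, y, z \in \mathbb{H}$:
$$
\begin{aligned}
w &= 1\,o_{0a} + i\,o_{1a} + j\,o_{2a} + k\,o_{3a} \\
x &= 1\,o_{4a} + i\,o_{5a} + j\,o_{6a} + k\,o_{7a} \\
y &= 1\,o_{0b} + i\,o_{1b} + j\,o_{2b} + k\,o_{3b} \\
z &= 1\,o_{4b} + i\,o_{5b} + j\,o_{6b} + k\,o_{7b}
\end{aligned} \tag{7}
$$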
Next, perform quaternion convolution, for example, as described in Section 1.2:
$$
\begin{aligned}
s_L &= w *_h y - z *_h x^{*} \\
s_R &= w *_h z - y *_h x
\end{aligned} \tag{8}
$$
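where a superscript * denotes quaternion conjugation and $*_h$ denotes quaternion convolution. Finally, recombine $s_L$ and $s_R$ to form the final result $s = o_a *_o o_b \in \mathbb{O}$:
$$
\begin{aligned}
s_L &= 1\,a_L + i\,b_L + j\,c_L + k\,d_L \\
s_R &= 1\,a_R + i\,b_R + j\,c_R + k\,d_R \\
s &= e_0 a_L + e_1 b_L + e_2 c_L + e_3 d_L + e_4 a_R + e_5 b_R + e_6 c_R + e_7 d_R
\end{aligned} \tag{9}
$$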
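The split-multiply-recombine pattern can be illustrated for the simplest case, a single octonion product, which is what a 1×1 convolution reduces to. The sketch below uses the textbook Cayley-Dickson rule (a, b)(c, d) = (ac − d*b, da + bc*); its sign and conjugate placement follow that convention and are not a transcription of Equation 8, and the function names are hypothetical. For arrays, each quaternion product would be replaced by a quaternion convolution.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product on arrays whose last axis holds (1, i, j, k)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def qconj(q):
    """Quaternion conjugate: negate the i, j, k components."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def octonion_mul(oa, ob):
    """Multiply two octonions (length-8 arrays) through quaternion pairs."""
    w, x = oa[:4], oa[4:]          # first argument as a quaternion pair
    y, z = ob[:4], ob[4:]          # second argument as a quaternion pair
    s_left = qmul(w, y) - qmul(qconj(z), x)
    s_right = qmul(z, w) + qmul(x, qconj(y))
    return np.concatenate([s_left, s_right])   # components e0..e7

rng = np.random.default_rng(1)
o_a, o_b = rng.normal(size=8), rng.normal(size=8)
product = octonion_mul(o_a, o_b)
# Octonion norms are multiplicative, which gives a quick sanity check.
norm_ok = np.isclose(np.linalg.norm(product),
                     np.linalg.norm(o_a) * np.linalg.norm(o_b))
```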
The exemplary process of octonion convolution described above is shown in
Sections 1.2 and 1.3 describe examples of hypercomplex convolution of two-dimensional arrays. The techniques described above may be employed in the multi-dimensional convolution that is typically used for neural network tasks.
For example,
The approach described above may also be extended to input arrays with a depth greater than 1, thereby causing the filter kernel to become 4-dimensional. Conceptually, a loop over the input and output dimensions may be performed. The inside of the loop contains two-dimensional convolutions as described in the prior section. Note further that each two-dimensional convolution takes input values from all depth levels of the input array. If, for example, the 10×10 input array had an input depth of two, then each convolution kernel would have 3×3×2 weights, rather than 3×3 weights as described above. Therefore, the typical shape of a 4-dimensional hypercomplex convolution kernel is (Do, Di, Kx, Ky), where Do represents the output depth, Di represents the input depth, and Kx and Ky are the 2d filter kernel dimensions. Finally, in the present hypercomplex example of quaternions, all of the data points above are quaternion values and all convolutions are quaternion in nature. As will be discussed in the implementation sections below, existing computer software libraries typically do not support quaternion arithmetic. Therefore, an additional dimension may be added in software to represent the four components (1, i, j, k) of a quaternion.
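The array shapes involved can be made concrete with a short NumPy sketch. It assumes a kernel of shape (Do, Di, Kx, Ky, 4) and an input of shape (Di, H, W, 4), where the trailing axis of length four is the additional software dimension for the (1, i, j, k) components. The function names and the "valid" window convention are assumptions of the example, and the explicit loops are written for clarity and not for speed.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product on arrays whose last axis holds (1, i, j, k)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def quat_conv_layer(kernel, x):
    """kernel: (Do, Di, Kx, Ky, 4); x: (Di, H, W, 4) -> (Do, H-Kx+1, W-Ky+1, 4)."""
    Do, Di, Kx, Ky, _ = kernel.shape
    _, H, W, _ = x.shape
    out = np.zeros((Do, H - Kx + 1, W - Ky + 1, 4))
    for o in range(Do):                  # loop over output depth
        for i in range(Di):              # every output sums over all input depths
            for u in range(Kx):
                for v in range(Ky):
                    window = x[i, u:u + H - Kx + 1, v:v + W - Ky + 1]
                    out[o] += qmul(kernel[o, i, u, v], window)
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(2, 10, 10, 4))          # 10x10 input with an input depth of two
kernel = rng.normal(size=(5, 2, 3, 3, 4))    # five output maps, 3x3x2 quaternion weights each
y = quat_conv_layer(kernel, x)               # shape (5, 8, 8, 4)
```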
The convolutions described thus far are sums of local, weighted windows of the input, where the filter kernel represents a set of shared weights for all window locations. Rather than using shared weights in a filter kernel, a separate set of weights may be used for each window location. In this case, one has a locally-connected hypercomplex layer, rather than a convolutional hypercomplex layer.
The exemplary quaternion conversion to and from polar (angular) form is defined presently. Let a single quaternion value output from the convolution step be denoted as $s \in \mathbb{H}$:
$$
s = a + bi + cj + dk \tag{10}
$$
The polar conversion representation is shown in Equation 11:
$$
s_p = |s|\, e^{i\phi} e^{j\theta} e^{k\psi} \tag{11}
$$
where $s_p \in \mathbb{H}$ represents a single quaternion number and (ϕ,θ,ψ) represent the quaternion phase angles. Note that the term representing the norm of the quaternion, |s|, is intentionally set to one during the polar conversion.
The angles (ϕ,θ,ψ) are calculated as shown in Equation 12:
The definition of the two-argument arctangent, $\tan_2^{-1}(x,y)$, is given in Equation 13:
Most software implementations of Equation 13 return zero for the case of x=0, y=0, rather than returning an error or a Not a Number (NaN) value.
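The behavior at the origin can be confirmed in Python, where both the standard library and NumPy return zero. Note that these functions take their arguments in the order (y, x); how that maps onto the (x, y) notation used here is an assumption the reader should check against Equation 13.

```python
import math
import numpy as np

# Two-argument arctangent evaluated at the origin.
angle_math = math.atan2(0.0, 0.0)     # standard library
angle_numpy = np.arctan2(0.0, 0.0)    # NumPy

# An ordinary case for comparison: the point (x=1, y=1) lies at 45 degrees.
angle_45 = math.atan2(1.0, 1.0)
```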
Finally, to convert the quaternion polar form in Equation 11 to the standard form of Equation 10, one applies Euler's formula as shown in Equation 14:
In Equation 14, $s_u \in \mathbb{H}$ has the “u” subscript because it is the unit-norm version of our original variable $s \in \mathbb{H}$. Not restricting |s| to 1 in Equation 10 would result in $s_u = s$.
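The conversion from phase angles back to standard form can be sketched by applying Euler's formula to each exponential factor and multiplying the three factors in order. The function names are hypothetical, and the sketch covers only the polar-to-standard direction; it does not reproduce the angle formulas of Equation 12.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product on arrays whose last axis holds (1, i, j, k)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def polar_to_quaternion(phi, theta, psi):
    """Evaluate e^{i phi} e^{j theta} e^{k psi}; the result has unit norm."""
    e_i = np.array([np.cos(phi), np.sin(phi), 0.0, 0.0])      # cos(phi) + i sin(phi)
    e_j = np.array([np.cos(theta), 0.0, np.sin(theta), 0.0])  # cos(theta) + j sin(theta)
    e_k = np.array([np.cos(psi), 0.0, 0.0, np.sin(psi)])      # cos(psi) + k sin(psi)
    return qmul(qmul(e_i, e_j), e_k)

s_u = polar_to_quaternion(0.3, -0.2, 0.1)
s_u_norm = np.linalg.norm(s_u)    # each factor has unit norm, so the product does too
```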
The quantization process described above creates a set of three new angles, (ϕp, θp, ψp), as shown in
A learning rule for a single hypercomplex neuron is described presently. A layer of hypercomplex neurons is merely a collection of hypercomplex neurons that run in parallel, therefore the same learning rule applies to all neurons independently. The output of each neuron forms a single output of a hypercomplex layer. This learning rule is an example of the quaternion case and is not meant to limit the scope of the claims in this document.
Stochastic gradient descent has gained popularity in the machine learning community for training real-valued neural network layers. However, because the activation function described in Section 1.7 and shown in
A typical error correction weight update rule for a fully-connected neural network layer is shown in Equation 15:
$$
w_i(k+1) = w_i(k) + \mu \cdot \delta_i \cdot \bar{x}_i \tag{15}
$$
Equation 15 represents training cycle k of a neuron, where xi is the input value to the weight, δi is related to the current training error, μ is the learning rate, wi(k) is the current value of the weight, wi(k+1) is the updated value of the weight, and i is the index of all the weights for the neuron.
To extend the fully-connected error correction rule in Equation 15 to convolutional layers, the multiplication between δi and
Returning to the fully-connected example, the goal of training a neuron is to have its output, y(k+1), equal some desired response value $d \in \mathbb{H}$. Solving for δi:
Note that Equation 16 employs the relationship:
$$
\|x\| = \sqrt{x \cdot \bar{x}}
$$
where ∥x∥ is the 2-norm of x and
Further note that the norm of each input is assumed to equal one, as the norm of the output from the activation function in Sections 1.6 and 1.7 is equal to one. This is not a restriction of the algorithm, as weight updates in the learning algorithm (discussed in Sections 1.8 and 1.9) may be scaled to account for non-unit-norm inputs, and/or inputs may be scaled to unit norm.
Equation 15 clearly has more unknowns than equations and therefore does not have a unique solution. Accordingly, the authors assume without proof that each neural weight shares equal responsibility for the final error, and that all of the δi variables should equal one another. This leads to the following weight update rule, assuming that there are n weights:
In Equation 18, k represents the current training cycle, xi is the input value to the weight, n is the number of weights, μ is the learning rate, wi(k) is the current value of the weight, wi(k+1) is the updated value of the weight, and i is the index of all the weights for the neuron.
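The equal-responsibility assumption can be illustrated with a minimal sketch of one update for a single quaternion neuron. Several simplifications are assumed here and are not taken from the equations: the neuron output is the plain weighted sum of w_i × x_i with no bias and no activation function, the inputs have unit norm, and the total error d − y is split equally across the n weights. The function names are hypothetical.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product on arrays whose last axis holds (1, i, j, k)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def qconj(q):
    """Quaternion conjugate: negate the i, j, k components."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def neuron_output(w, x):
    """Weighted sum of w_i x x_i over n weights; w and x have shape (n, 4)."""
    return qmul(w, x).sum(axis=0)

def error_correction_update(w, x, d, mu=1.0):
    """One error-correction step with the error shared equally by n weights."""
    n = w.shape[0]
    delta = d - neuron_output(w, x)                # total output error
    return w + (mu / n) * qmul(delta, qconj(x))    # w_i + (mu/n) * delta * conj(x_i)

rng = np.random.default_rng(3)
x = rng.normal(size=(5, 4))
x /= np.linalg.norm(x, axis=-1, keepdims=True)     # unit-norm inputs
w = rng.normal(size=(5, 4))
d = rng.normal(size=4)
w_new = error_correction_update(w, x, d, mu=1.0)
y_new = neuron_output(w_new, x)                    # matches d when mu = 1
```

Under these assumptions, a single step with μ = 1 drives the linear output exactly to the desired response, because conj(x_i) × x_i equals one for unit-norm inputs.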
Extending this approach to convolution and multiple neurons, the new weight update rule is:
where W(k+1) is the updated weight array, W(k) is the current weight array, n is the number of neurons, μ is the learning rate, D is the desired response vector, Y(k) is the current response vector, * represents quaternion convolution, and
This section provides an example of error propagation between hypercomplex layers to enable learning in multi-layer graph structures. This particular example continues the quaternion example of Section 1.8. Following the approach in Section 1.8, the multi-layer learning rule will be derived using a fully-connected network, and then, by linearity of convolution, the appropriate hypercomplex multiplication will be converted into a quaternion convolution operation.
As described in Section 1.8, the neurons each use an error correction rule for learning and, consequently, cannot be used with the gradient descent learning methods that are popular in existing machine learning literature.
Following the approach in Section 1.8, we set the output of the neural network equal to the desired response d and solve for the error terms δi:
Solving for the training error in terms of δAj and δBi:
As in the single-neuron case, there is more than one solution to Equation 21. In order to resolve this problem, the assumption is that each neural network layer contributes equally to the final network output error, implying:
Accordingly, errors are propagated through the graph (or network) by scaling the errors by the hypercomplex multiplicative inverse of the connecting weights.
To propagate the error from the output of a layer to its inputs:
where $e_l$ represents the error at the current layer (or the network error at the output layer), $[w_l(k)^{-1}]^T$ is the transposed hypercomplex elementwise inverse of the current layer's weights at training cycle (k), $N_{l-1}$ is the number of inputs to the (l−1)st layer, and $e_{l-1}$ is the new error term for the (l−1)st layer. In Equation 23, the layers are indexed by $l \in [0, L]$, with the Lth layer corresponding to the output layer and the 0th layer corresponding to the input layer. The error terms $e_l$ are computed from output to input, therefore propagating the error backward through the network.
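A minimal sketch of this backward step follows, assuming a fully-connected layer whose weights are stored as an array of shape (n_out, n_in, 4). The ordering of the quaternion product (inverse weight on the left, error on the right) and the normalization by the number of inputs are assumptions based on the description above and are not a transcription of Equation 23. The function names are hypothetical.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product on arrays whose last axis holds (1, i, j, k)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def qinv(q):
    """Elementwise multiplicative inverse: conjugate divided by squared norm."""
    conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return conj / np.sum(q * q, axis=-1, keepdims=True)

def propagate_error(weights, err):
    """weights: (n_out, n_in, 4); err: (n_out, 4) -> previous-layer error (n_in, 4)."""
    n_in = weights.shape[1]
    w_inv_t = np.swapaxes(qinv(weights), 0, 1)     # transposed elementwise inverse
    return qmul(w_inv_t, err[None, :, :]).sum(axis=1) / n_in

rng = np.random.default_rng(4)
weights = rng.normal(size=(3, 6, 4))     # 6 inputs, 3 outputs
err_out = rng.normal(size=(3, 4))
err_prev = propagate_error(weights, err_out)     # shape (6, 4)
q = rng.normal(size=4)
identity = qmul(q, qinv(q))                      # approximately (1, 0, 0, 0)
```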
Once the error terms el have been computed, the weight update is similar to the single-layer case of Section 1.8:
where n represents the number of neurons in layer l, and *h represents hypercomplex convolution.
For inputs that are not of unit-norm, Equation 24 may be modified to scale the weight updates:
The hypercomplex learning algorithms of Sections 1.8 and 1.9 both presume that the initial weights for each layer are selected randomly. However, this need not be the case. For example,
Suppose there is a multi-layer hypercomplex neural network in which the input layer is “Layer A”, the first hidden layer is “Layer B”, the next layer is “Layer C”, and so on. Unsupervised learning of Layer A may be performed by removing it from the network, attaching a fully-connected layer, and training the new structure as an autoencoder, which means that the desired response is equal to the input. A diagram of an exemplary hypercomplex autoencoder has been drawn in
One may use the pre-trained output from Layer A to pre-train Layer B in the same manner. This is shown in the lower half of
Once pre-training of all layers is complete, the hypercomplex weight settings of each layer may be copied to the original multi-layer network, and fine-tuning of the weight parameters may be performed using any appropriate learning algorithm, such as those developed in Sections 1.8 and 1.9.
This approach is superior to starting with random weights and propagating errors for two reasons: First, it allows one to use large quantities of unlabeled data for the initial pre-training, thereby expanding the universe of useful training data; and, second, since weight adjustment through multi-layer error propagation takes many training cycles, the pre-training procedure significantly reduces overall training time by reducing the workload for the error propagation algorithm.
Moreover, the autoencoders described above may be replaced by any other unsupervised learning method, for example, restricted Boltzmann machines (RBMs). Applying contrastive divergence to each layer, from the lowest to the highest, results in a hypercomplex deep belief network.
Historically, multi-layer perceptron (MLP) neural network classification systems involve the following steps: Input data, such as images, are converted to a set of features, for example, local binary patterns. Each feature, which, for example, takes the form of a real number, is stacked into a feature vector that is associated with a particular input pattern. Once the feature vector for each input pattern is computed, the feature vectors and system desired response (e.g. classification) values are presented to a fully-connected MLP neural network for training. Many have observed that overall system performance is highly dependent upon the selection of features, and therefore domain experts have spent extensive time engineering features for these systems.
One may think of hypercomplex convolutional networks as a way to algorithmically learn hypercomplex features. However, it may be desirable to incorporate expert features into the system as well. An exemplary method for this is shown in
As with the unsupervised pre-training method described in Section 1.10, the hypercomplex layer or layers are pre-trained and then their weights are copied back to the original hypercomplex graph. Learning rules, for example those of Sections 1.8 and 1.9, may then be applied to the entire graph to fine-tune the weights.
Using this method, one can start with feature mappings defined by domain experts and then improve the mappings further with hypercomplex learning techniques. Advantages include use of domain-expert knowledge and reduced training time.
Deep learning algorithms perform well in systems where an abundance of training data is available for use in the learning process. However, many applications do not have an abundance of data; for example, many medical classification tasks do not have large databases of labeled images. One solution to this problem is transfer learning, where other data sets and prediction tasks are used to expand the trained knowledge of the hypercomplex neural network or graph.
An illustration of transfer learning is shown in
There are numerous ways of performing the knowledge transfer, though methods will generally seek to represent the “other tasks” and target task in the same feature space through an adaptive (and possibly nonlinear) transform, e.g., using a hypercomplex neural network.
1.13 Hypercomplex Layer with State
An exemplary neural network layer is discussed in 1.1 and shown in
The hypercomplex tensor layer enables evaluation of whether or not hypercomplex vectors are in a particular relationship. An exemplary quaternion layer is defined in Equation 26.
In Equation 26, a∈ℍ^(d×1) and b∈ℍ^(d×1) represent quaternion input vectors to be compared. There are k relationships R that may be established between a and b, and each relationship has its own hypercomplex weight array WR[i]∈ℍ^(d×d), where 1≤i≤k (and WR[1:k]∈ℍ^(d×d×k)). The output of aT·WR[1:k]·b is computed by slicing WR[1:k] for each value of i in [1,k] and performing two-dimensional hypercomplex matrix multiplication. Furthermore, the hypercomplex weight array VR∈ℍ^(k×2d) acts as a fully-connected layer. Finally, a set of hypercomplex bias weights bR∈ℍ^(k×1) may optionally be present.
The function ƒ is the layer activation function and may, for example, be the hypercomplex angle quantization function of Section 1.7. Finally, the weight vector uR∈ℍ^(k×1) is transposed and multiplied to create the final output y∈ℍ. When an output vector in ℍ^(k×1) is desired rather than a single quaternion number, the multiplication with uRT may be omitted.
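For reference, a layer consistent with the dimensions given above may be written in the usual neural tensor form. This is a sketch under that assumption and is not necessarily identical to Equation 26; the stacked vector [a; b] of size 2d×1 is introduced here so that the product with VR is well defined.

```latex
y = g(a, R, b)
  = u_R^{T}\, f\!\left( a^{T}\, W_R^{[1:k]}\, b
      + V_R \begin{bmatrix} a \\ b \end{bmatrix}
      + b_R \right)
```

Because quaternion multiplication is not commutative, the left-to-right order of the factors in each product matters and should be kept as written.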
A major advantage to hypercomplex numbers in this layer structure is that hypercomplex multiplication is not commutative, which helps the learning structure understand that g(a, R, b)≠g(b, R, a). Since many relationships are nonsymmetrical, this is a useful property. For example, the relationship, “The branch is part of the tree,” makes sense, whereas the reverse relationship, “The tree is part of the branch,” does not make sense. Moreover, due to the hypercomplex nature of a and b, one can compare relationships between tuples rather than only single elements.
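The non-commutativity mentioned above can be seen with a short, illustrative Python sketch of the Hamilton product. The function name `hamilton` and the (real, i, j, k) storage order are assumptions made for this example only.

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,   # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,   # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,   # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,   # k part
    ])

i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])

ij = hamilton(i, j)   # equals +k
ji = hamilton(j, i)   # equals -k, so the order of the operands matters
```

Swapping the operands flips the sign of the result, which is the property that lets the layer distinguish g(a, R, b) from g(b, R, a).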
In order to prevent data overfitting, dropout operators are frequently found in neural networks. A real-valued dropout operator sets each element in a real-valued array to zero with some nonzero (but usually small) probability. Hypercomplex dropout operators may also be employed in hypercomplex networks, again to prevent data overfitting. However, in a hypercomplex dropout operator, if one of the hypercomplex components (e.g. (1, i, j, k) in the case of a quaternion) is set to zero, then all other components must also be zero to preserve inter-channel relationships within the hypercomplex value. If unit-norm output is desired or required, the real component may be assigned a value of one and all other components may be assigned a value of zero.
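A minimal sketch of such a quaternion dropout operator is given below. It assumes that the last array axis holds the four quaternion components; the function name `quaternion_dropout` and its parameters are hypothetical, and rescaling of the retained values is deliberately omitted.

```python
import numpy as np

def quaternion_dropout(x, p, rng, unit_norm=False):
    """Drop whole quaternions from x (shape (..., 4)) with probability p."""
    # One keep/drop decision per quaternion, not per real component.
    keep = rng.random(x.shape[:-1]) >= p
    # Broadcasting zeroes all four components of a dropped quaternion together.
    out = x * keep[..., None]
    if unit_norm:
        # Dropped entries become the unit quaternion 1 + 0i + 0j + 0k.
        out[~keep, 0] = 1.0
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 4))
y = quaternion_dropout(x, p=0.2, rng=rng)
z = quaternion_dropout(x, p=0.2, rng=rng, unit_norm=True)
```

Every row of `y` is either an untouched input quaternion or entirely zero, so no quaternion is left with only some of its components removed.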
It is frequently desirable to downsample data within a hypercomplex network using a hypercomplex pooling operation. An exemplary method for performing this operation is to take a series of local windows, apply a function to each window (e.g. maximum function), and finally represent the data using only the function's output for each window. In hypercomplex networks, the function will, for example, take arguments that involve all hypercomplex components and produce an output that preserves inter-channel hypercomplex relationships.
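One possible pooling function, shown here purely as an illustration, selects the quaternion of largest norm in each window and keeps all four of its components. The choice of the norm as the ranking criterion, the (4, H, W) layout, and the name `quaternion_max_pool` are assumptions of this sketch.

```python
import numpy as np

def quaternion_max_pool(x, w=2):
    """Non-overlapping w-by-w pooling of x with shape (4, H, W).

    In each window, the quaternion with the largest norm is kept whole.
    """
    c, H, W = x.shape
    Ho, Wo = H // w, W // w
    # Split rows and columns into windows: (4, Ho, w, Wo, w).
    win = x[:, :Ho * w, :Wo * w].reshape(c, Ho, w, Wo, w)
    # Gather the w*w samples of each window on the last axis.
    win = win.transpose(0, 1, 3, 2, 4).reshape(c, Ho, Wo, w * w)
    norms = np.sqrt((win ** 2).sum(axis=0))          # (Ho, Wo, w*w)
    best = norms.argmax(axis=-1)                     # (Ho, Wo)
    # Use the same winning index for all four components.
    idx = np.broadcast_to(best[None, :, :, None], (c, Ho, Wo, 1))
    return np.take_along_axis(win, idx, axis=-1)[..., 0]

x = np.zeros((4, 4, 4))
x[:, 1, 1] = [1.0, 2.0, 3.0, 4.0]
x[:, 0, 3] = [0.0, 0.0, 0.0, 5.0]
pooled = quaternion_max_pool(x)
```

Because one index is selected per window and applied to every component, the output quaternion is always one of the input quaternions rather than a mixture of components from different pixels.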
A major difficulty in the practical application of learning systems is the computational complexity of the learning process. In particular, the convolution step described in Equation 1 typically cannot be computed efficiently via Fast Fourier Transform due to small filter kernel sizes. Accordingly, significant effort has been expended by academia and industry to optimize real-valued convolution for real-valued neural network graphs. However, no group has optimized hypercomplex convolution for these tasks, nor is there literature on hypercomplex deep learning. This section presents methods for adapting real-valued computational techniques to hypercomplex problems. The techniques in this section are critical to making the hypercomplex systems described in Section 1 practical for engineering use.
One approach to computing the hypercomplex convolution of Equation 1 is to write a set of for loops that shift the hypercomplex kernel to every appropriate position in the input image, perform a small set of multiplications and additions, and then continue to the next position. While such an approach would theoretically work, modern computer processors are optimized for large matrix-matrix multiplication operations, rather than multiplication between small sets of numbers. Correspondingly, the approach presented in this subsection reframes the hypercomplex convolution problem as a large hypercomplex multiplication, and then explains how to use highly-optimized, real-valued multiplication libraries to complete the computation. Examples of such real-valued multiplication libraries include: Intel's Math Kernel Library (MKL); Automatically Tuned Linear Algebra Software (ATLAS); and Nvidia's cuBLAS. All of these are implementations of a Basic Linear Algebra Subprograms (BLAS) library, and all provide matrix-matrix multiplication functionality through the GEMM function.
In order to demonstrate a hypercomplex neural network convolution, an example of the quaternion case is discussed presently. To aid the discussion, define the following variables:
X=inputs of shape (G,Di,4,Xi,Yi)
A=filter kernel of shape (Do,Di,4,Kx,Ky)
S=outputs of shape (G,Do,4,Xo,Yo)   (27)
where G is the number of data patterns in a processing batch, Di is the input depth, Do is the output depth, and the last two dimensions of each size represent the rows and columns for each variable. Note that the variables X, A, and S have been defined to correspond to real-valued memory arrays common in modern computers. Since each memory location holds a single real number and not a hypercomplex number, each variable above has been given an extra dimension of size 4. This dimension is used to store the quaternion components of the array.
The goal of this computation is to compute the quaternion convolution S=A*X. As discussed above, one strategy is to reshape the A and X matrices such that a single quaternion matrix multiply could be employed to compute the convolution. Reshaped matrices A′ and X′ are shown in Equation 28.
In A′ of Equation 28, ai are row vectors. The variable i indexes the output dimension of the kernel, Do. Each row vector is of length Di·Kx·Ky, corresponding to a filter kernel at all input depths. Observe that the A′ matrix is of depth 4 to store the quaternion components; therefore, A′ may be thought of as a two-dimensional quaternion matrix.
In X′ of Equation 28, xr,s are column vectors. The variable r indexes the data input patterns dimension, G. Each column vector is of length Di·Kx·Ky, corresponding to the input window covered by a filter kernel at all input depths. Since the filter kernel must be applied to each location in the image, for each input pattern, there are many columns xr,s. The s subscript indexes the filter locations; each input pattern contains M filter locations, where M equals:
M=(Xi−Kx+1)·(Yi−Ky+1)   (29)
The total number of columns of X′ is equal to G·M. Like A′, X′ is a two-dimensional quaternion matrix and is stored in three real dimensions in computer memory.
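The reshaping of X into X′ may be sketched as follows. The sketch uses explicit loops for clarity rather than speed, places the quaternion component on the first axis of the result, and uses the hypothetical name `quaternion_im2col`; the exact element ordering within each column is an assumption and must simply match the ordering used for A′.

```python
import numpy as np

def quaternion_im2col(X, Kx, Ky):
    """Rearrange X of shape (G, Di, 4, Xi, Yi) into X' of shape
    (4, Di*Kx*Ky, G*M), where M = (Xi-Kx+1)*(Yi-Ky+1)."""
    G, Di, _, Xi, Yi = X.shape
    Xo, Yo = Xi - Kx + 1, Yi - Ky + 1
    cols = np.empty((4, Di * Kx * Ky, G * Xo * Yo), dtype=X.dtype)
    col = 0
    for g in range(G):                 # input pattern index r
        for r in range(Xo):            # filter location index s (rows)
            for c in range(Yo):        # filter location index s (columns)
                patch = X[g, :, :, r:r + Kx, c:c + Ky]   # (Di, 4, Kx, Ky)
                # Component axis first, then flatten depth and window.
                cols[:, :, col] = patch.transpose(1, 0, 2, 3).reshape(4, -1)
                col += 1
    return cols

X = np.arange(2 * 3 * 4 * 6 * 5, dtype=float).reshape(2, 3, 4, 6, 5)
Xp = quaternion_im2col(X, Kx=3, Ky=2)
M = (6 - 3 + 1) * (5 - 2 + 1)
```

For this example G=2, Di=3, and M=16, so X′ holds 18 rows and 32 columns for each of the four quaternion components.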
The arithmetic for quaternion convolution is performed in Equation 28, using a quaternion matrix-matrix multiply function that is described below in Sections 2.1.1 and 2.1.2.
One will observe that the result S′ from Equation 28 is still a two-dimensional quaternion matrix. This result is reshaped to form the final output S.
An example of a quaternion matrix multiply routine that may be used to compute Equation 28 is discussed presently. One approach is to employ Equation 2, which computes the product of two quaternions using sixteen real-valued multiply operations. The matrix form of this equation is identical to the scalar version in Equation 2, and each multiply may be performed using a highly-optimized GEMM call to the appropriate BLAS library. Moreover, the sixteen real-valued multiply operations may be performed in parallel if hardware resources permit.
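The sixteen-product approach may be sketched as below, assuming that Equation 2 is the standard Hamilton product and that each quaternion matrix is stored as a real array whose first axis holds the (real, i, j, k) components. The name `quaternion_matmul_16` is hypothetical, and each `@` stands in for one real-valued GEMM call.

```python
import numpy as np

def quaternion_matmul_16(K, A):
    """Quaternion matrix product K.A using sixteen real matrix products.

    K has shape (4, m, n) and A has shape (4, n, p).
    """
    ak, bk, ck, dk = K
    aa, ba, ca, da = A
    return np.stack([
        ak @ aa - bk @ ba - ck @ ca - dk @ da,   # real part
        ak @ ba + bk @ aa + ck @ da - dk @ ca,   # i part
        ak @ ca - bk @ da + ck @ aa + dk @ ba,   # j part
        ak @ da + bk @ ca - ck @ ba + dk @ aa,   # k part
    ])

# 1x1 example: the quaternion i times the quaternion j gives k.
K = np.zeros((4, 1, 1)); K[1] = 1.0
A = np.zeros((4, 1, 1)); A[2] = 1.0
P = quaternion_matmul_16(K, A)
```

The sixteen products are independent of one another, which is why they may be issued in parallel when hardware resources permit.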
Another example of a quaternion matrix multiply routine only requires eight GEMM calls, rather than the sixteen calls of Equation 2. This method takes arguments k(x,y)=ak(x,y)+bk(x,y)i+ck(x,y)j+dk(x,y)k and a(x,y)=aa(x,y)+ba(x,y)i+ca(x,y)j+da(x,y)k, and performs the quaternion operation k·a. The first step is to compute eight intermediate values using the real-valued GEMM call:
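\[
\begin{aligned}
t_1(x,y) &= a_k(x,y)\cdot a_a(x,y)\\
t_2(x,y) &= d_k(x,y)\cdot c_a(x,y)\\
t_3(x,y) &= b_k(x,y)\cdot d_a(x,y)\\
t_4(x,y) &= c_k(x,y)\cdot b_a(x,y)\\
t_5(x,y) &= \bigl(a_k(x,y)+b_k(x,y)+c_k(x,y)+d_k(x,y)\bigr)\cdot\bigl(a_a(x,y)+b_a(x,y)+c_a(x,y)+d_a(x,y)\bigr)\\
t_6(x,y) &= \bigl(a_k(x,y)+b_k(x,y)-c_k(x,y)-d_k(x,y)\bigr)\cdot\bigl(a_a(x,y)+b_a(x,y)-c_a(x,y)-d_a(x,y)\bigr)\\
t_7(x,y) &= \bigl(a_k(x,y)-b_k(x,y)+c_k(x,y)-d_k(x,y)\bigr)\cdot\bigl(a_a(x,y)-b_a(x,y)+c_a(x,y)-d_a(x,y)\bigr)\\
t_8(x,y) &= \bigl(a_k(x,y)-b_k(x,y)-c_k(x,y)+d_k(x,y)\bigr)\cdot\bigl(a_a(x,y)-b_a(x,y)-c_a(x,y)+d_a(x,y)\bigr)
\end{aligned}
\tag{30}
\]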
To complete the quaternion multiplication s(x,y), the temporary terms ti are scaled and summed as shown in Equation 31:
Because the GEMM calls represent the majority of the compute time, the method in this section executes more quickly than the method of Section 2.1.1.
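The eight-product method may be sketched as follows. The scaling and summation step shown here was derived by expanding the eight intermediate products against the Hamilton product; it is offered as an assumption and may differ in form from Equation 31. The function names are hypothetical, and a sixteen-product reference is included only to check the result.

```python
import numpy as np

def quaternion_matmul_16(K, A):
    """Reference: Hamilton product with sixteen real matrix products."""
    ak, bk, ck, dk = K
    aa, ba, ca, da = A
    return np.stack([
        ak @ aa - bk @ ba - ck @ ca - dk @ da,
        ak @ ba + bk @ aa + ck @ da - dk @ ca,
        ak @ ca - bk @ da + ck @ aa + dk @ ba,
        ak @ da + bk @ ca - ck @ ba + dk @ aa,
    ])

def quaternion_matmul_8(K, A):
    """Quaternion matrix product K.A using eight real matrix products."""
    ak, bk, ck, dk = K
    aa, ba, ca, da = A
    # Eight intermediate products; each '@' is one real GEMM call.
    t1 = ak @ aa
    t2 = dk @ ca
    t3 = bk @ da
    t4 = ck @ ba
    t5 = (ak + bk + ck + dk) @ (aa + ba + ca + da)
    t6 = (ak + bk - ck - dk) @ (aa + ba - ca - da)
    t7 = (ak - bk + ck - dk) @ (aa - ba + ca - da)
    t8 = (ak - bk - ck + dk) @ (aa - ba - ca + da)
    # Scale and sum (assumed form of the final combination step).
    return np.stack([
        2 * t1 - (t5 + t6 + t7 + t8) / 4,    # real part
        -2 * t2 + (t5 + t6 - t7 - t8) / 4,   # i part
        -2 * t3 + (t5 - t6 + t7 - t8) / 4,   # j part
        -2 * t4 + (t5 - t6 - t7 + t8) / 4,   # k part
    ])

rng = np.random.default_rng(0)
K = rng.standard_normal((4, 3, 5))
A = rng.standard_normal((4, 5, 2))
fast = quaternion_matmul_8(K, A)
slow = quaternion_matmul_16(K, A)
```

The additions and subtractions added by this method are elementwise and inexpensive compared with the matrix products they replace.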
2.2 Convolution Implementations: cuDNN
One pitfall of the GEMM-based approach described in the prior section is that formation of the temporary quaternion matrices A′ and X′ is memory-intensive. Graphics cards, for example those manufactured by Nvidia, are frequently used for matrix multiplication. Unfortunately, these cards have limited onboard memory, and therefore inefficient use of memory is a practical engineering problem.
For real-valued convolution, memory-efficient software such as Nvidia's cuDNN library has been developed. This package performs real-valued convolution in a memory- and compute-efficient manner. Therefore, rather than using the GEMM-based approach above, adapting cuDNN or another convolution library to hypercomplex convolution may be advantageous. Because, like multiplication, convolution is a linear operation, the algorithms for quaternion multiplication may be directly applied to quaternion convolution by replacing real-valued multiplication with real-valued convolution. This leads to Equations 3, 4, and 5 of Section 1.2 and is explained in more detail there. The real-valued convolutions in these equations may be carried out using an optimized convolution library such as cuDNN, thereby making quaternion convolution practical on current computer hardware.
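The substitution of real-valued convolution for real-valued multiplication may be illustrated with the sketch below. A plain NumPy routine, `conv2d_valid`, stands in for an optimized library call; both function names are hypothetical, the kernel is taken as the left operand of the quaternion product, and the arrangement of terms follows the Hamilton product rather than being copied from Equations 3, 4, and 5.

```python
import numpy as np

def conv2d_valid(x, k):
    """Real-valued 2-D 'valid' convolution (kernel flipped), plain NumPy."""
    kx, ky = k.shape
    kf = k[::-1, ::-1]
    out = np.zeros((x.shape[0] - kx + 1, x.shape[1] - ky + 1))
    for r in range(kx):
        for c in range(ky):
            out += kf[r, c] * x[r:r + out.shape[0], c:c + out.shape[1]]
    return out

def quaternion_conv2d(K, X, conv=conv2d_valid):
    """Quaternion convolution of kernel K (4, Kx, Ky) with image X (4, Xi, Yi),
    built from sixteen real-valued convolutions."""
    ak, bk, ck, dk = K
    ax, bx, cx, dx = X
    return np.stack([
        conv(ax, ak) - conv(bx, bk) - conv(cx, ck) - conv(dx, dk),   # real
        conv(bx, ak) + conv(ax, bk) + conv(dx, ck) - conv(cx, dk),   # i
        conv(cx, ak) - conv(dx, bk) + conv(ax, ck) + conv(bx, dk),   # j
        conv(dx, ak) + conv(cx, bk) - conv(bx, ck) + conv(ax, dk),   # k
    ])

# A 1x1 kernel equal to i applied to an image whose pixels all equal j
# yields k at every pixel.
K = np.zeros((4, 1, 1)); K[1] = 1.0
X = np.zeros((4, 3, 3)); X[2] = 1.0
S = quaternion_conv2d(K, X)
```

The `conv` argument may be replaced by any real-valued convolution routine with the same calling convention, which is the point at which an optimized library would be substituted.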
The hypercomplex neural network layers described thus far are ideal for use in arbitrary graph structures. A graph structure is a collection of layers (e.g. mathematical functions, hypercomplex or otherwise) with a set of directed edges that define the signal flow from each layer to the other layers (and potentially itself). Extensive effort has been expended to create open-source graph solving libraries, for example, Theano.
Three key mathematical operations are required to use the hypercomplex layers with a graph library such as Theano: first, the forward computation through the layer must be performed (Sections 1.2 to 1.7); second, a weight update must be computed by a learning rule for a single layer (Section 1.8); and, third, errors must be propagated through the layer to the next graph element (Section 1.9). Since these three operations have been introduced in this document, it is possible to use the hypercomplex layer in an arbitrary graph structure.
The authors have implemented an exemplary set of Theano operations to enable the straightforward construction of arbitrary graphs of hypercomplex layers. The authors employ the memory storage layout of using a three-dimensional real-valued array to represent a two-dimensional quaternion array; this method is described further in Section 2.1 in the context of GEMM operations. Simulation results in Section 4 have been produced using the exemplary Theano operations and using the cuDNN library as described in Section 2.2.
As has been alluded to throughout this document, computational efficiency is a key criterion that must be met in order for hypercomplex layers to be practical in solving engineering challenges. Because convolution and matrix multiplication are computationally intensive, the standard approach is to run these tasks on specialized hardware, such as a graphics processing unit (GPU), rather than on a general-purpose processor (e.g. from Intel). The cuDNN library referenced in Section 2.2 is specifically written for Nvidia GPUs. The exemplary implementation of hypercomplex layers indeed employs the cuDNN library and therefore operates on the GPU. However, the computations may be performed on any other computational device and the implementation discussed here is not meant to limit the scope or claims of this patent.
An important bottleneck in GPU computational performance is the time delay to transfer data to and from the GPU memory. Because the computationally-intensive tasks of convolution and multiplication are implemented on the GPU in the exemplary software, it is critical that all other graph operations take place on the GPU to reduce transfer time overhead. Therefore, the activation functions described in Sections 1.6 and 1.7 have also been implemented using the GPU, as have pooling operators and other neural network functions.
Sections 1 and 2 of this document have provided examples of hypercomplex neural network layers, using quaternion numbers for illustrative purposes. This section discusses exemplary graph structures (i.e. “neural networks”) of hypercomplex layers. The layers may be arranged in any graph structure and therefore have wide applicability to the engineering problems discussed in Section 4. The Theano implementation of the hypercomplex layer, discussed in Section 2.3, allows for construction of arbitrary graphs of hypercomplex layers (and other components) using minimal effort.
3.2 Neural Network with Pooling
The feedforward neural network of
Graphs of hypercomplex layers also may contain feedback loops, as shown in
3.4 Neural Network with Layer Jumps
As shown in
In the entirety of this document, the term “hypercomplex layer” is meant to encompass any variation of a hypercomplex neural network layer, including, but not limited to, fully-connected layers such as those of
Hypercomplex layers and/or graphs may, for example, be combined in parallel with other structures. The final output of such a system may be determined by combining the results of the hypercomplex layers and/or graph and the other structure using any method of fusion, e.g. averaging, voting systems, maximum likelihood estimation, maximum a posteriori estimation, and so on. An example of this type of system is depicted in
Hypercomplex layers, graphs, and/or modules may be combined in series with nonhypercomplex components; an example of this is shown in
The hypercomplex layers may be arranged in any graph-like structure; Sections 3.1 to 3.8 provide examples but are not meant to limit the arrangement or interconnection of hypercomplex layers. As discussed in Section 2.3, the exemplary quaternion hypercomplex layer has specifically been implemented to allow for the creation of arbitrary graphs of hypercomplex layers. These layers may be combined with any other mathematical function(s) to create systems.
This section provides exemplary engineering applications of the hypercomplex neural network layer.
An exemplary application of hypercomplex neural networks is image super resolution. In this task, a color image is enlarged such that the image is represented by a larger number of pixels than the original. Image super resolution is performed by digital cameras, where it is called, “digital zoom,” and has many applications in surveillance, security, and other industries.
Image super resolution may be framed as an estimation problem: Given a low-resolution image, estimate the higher-resolution image. Real-valued, deep neural networks have been used for this task. To use a neural network for image super resolution, the following steps are performed: To simulate downsampling, full-size original images are blurred using a Gaussian kernel; the blurred images are paired with their original, full-resolution sources and used to train a neural network as input and desired response, respectively; and, finally, new images are presented to the neural network and the network output is taken to be the enhanced image.
One major limitation of the above procedure is that real-valued neural networks do not understand color, and therefore most approaches in literature are limited to grayscale images. We adapt the hypercomplex neural network layer introduced in this patent application to the super resolution application in
The above steps were performed using a quaternion neural network with 3 layers. The first convolutional layer has the parameters Di=1, Do=64, Kx=Ky=9. The second layer takes input directly from the first and has parameters Di=64, Do=32, Kx=Ky=1. Finally, the third layer takes input directly from the second and has parameters Di=32, Do=1, Kx=Ky=5. For information on parameter definitions, please see Section 2.1 of this document. Note that the convolution operations were performed on all valid points (i.e. no zero padding), so the prediction image is of size 20×20 color pixels, which is somewhat smaller than the 32×32 pixel input size.
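The output size quoted above follows from applying the valid-convolution size rule of Equation 29 once per layer. A short illustrative check, with a hypothetical helper name, is:

```python
def valid_output_size(size, kernel):
    """Side length after a 'valid' convolution with no zero padding."""
    return size - kernel + 1

size = 32                       # input side length in pixels
sizes = [size]
for kernel in (9, 1, 5):        # Kx = Ky for the three layers
    size = valid_output_size(size, kernel)
    sizes.append(size)
```

The side length passes through 32, 24, 24, and 20 pixels, matching the 20×20 prediction image.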
This experiment was repeated using a real-valued convolutional neural network with the same parameters. However, the quaternion polar form (Equations 12 and 14) is not used for the real-valued network. Rather, the images are presented to the network as a depth 3 input, where each input depth corresponds to one of the colors. Consequently, Di=3 for the first layer rather than 1 in the quaternion case, and Do=3 in the last layer. This allows the real neural network to process color images, but the real network does not understand that there is a significant relationship between the three color channels. Each neural network was trained using 256 input images. The networks were trained for at least 150,000 cycles; in all cases, the training error had reached steady state before training was deemed complete. An additional set of 2544 images was used for testing.
The mean PSNR values are shown for the training and testing datasets in Table 1. Higher PSNR values represent better results, and one can observe that the hypercomplex network outperforms the real-valued network.
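For readers unfamiliar with the metric, peak signal-to-noise ratio may be computed as sketched below. The peak value of 255 and the pooling of all color channels into one mean squared error are assumptions of this example; the document does not state which conventions were used for Table 1.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in decibels between two images."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")     # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

same = psnr(np.zeros((4, 4, 3)), np.zeros((4, 4, 3)))
worst = psnr(np.zeros((4, 4, 3)), np.full((4, 4, 3), 255.0))
```

A mean PSNR over a dataset is then the average of this value across the image pairs.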
Sample visual outputs from each algorithm are shown in
Another exemplary application of the hypercomplex neural network is color image segmentation. Image segmentation is the process of assigning pixels in a digital image to multiple sets, where each set typically represents a meaningful item in the image. For example, in a digital image of airplanes flying, one may want to segment the image into sky, airplane, and cloud areas.
When combined with the pooling operator, real-valued convolutional neural networks have seen wide application to image classification. However, the shift-invariance caused by pooling makes for poor image localization, which is necessary for image segmentation. One strategy to combat this problem is to upsample the image during convolution. Such upsampling is standard practice in signal processing, where one inserts zeros between consecutive data values to create a larger image. This upsampled image is then used to train a hypercomplex neural network.
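Zero-insertion upsampling may be sketched as follows. The sketch acts on the last two axes so that any leading quaternion component axis is carried along unchanged; the function name is hypothetical, and the convention of also appending zeros after the final sample (giving an output exactly `factor` times larger) is an assumption.

```python
import numpy as np

def zero_insert_upsample(x, factor=2):
    """Insert factor-1 zeros after each sample along the last two axes."""
    out_shape = x.shape[:-2] + (x.shape[-2] * factor, x.shape[-1] * factor)
    up = np.zeros(out_shape, dtype=x.dtype)
    up[..., ::factor, ::factor] = x     # original samples land on a coarse grid
    return up

x = np.array([[1, 2],
              [3, 4]])
up = zero_insert_upsample(x)
```

For the 2×2 example, the result is a 4×4 array in which the original values occupy the even rows and columns and every other entry is zero.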
An exemplary system for hypercomplex image segmentation is shown in
The hypercomplex networks introduced in this patent may also be used for automated image quality evaluation.
Image quality analysis is fundamentally a task for human perception, and therefore humans provide the ultimate ground truth in image quality evaluation tasks. However, human ranking of images is time-consuming and expensive, and therefore computational models of human visual perception are of great interest. Humans typically categorize images using natural language, i.e., with words like, “good,” or, “bad.” Existing studies have asked humans to map these qualitative descriptions to a numerical score, for example, 1 to 100. Research indicates that people are not good at the mapping process, and therefore the step of mapping to a numerical score adds noise. Finally, existing image quality measurement systems have attempted to learn the mapping from image to numerical score, and are impeded by the noisy word-to-score mapping described previously.
Another approach is to perform blind image quality assessment, where the image quality assessment system learns the mapping from image to qualitative description. Thus, a blind image quality assessment system is effectively a multiclass classifier, where each class corresponds to a word such as, “excellent,” or, “bad.” Such a system is shown in
The first processing step of the system in
A key feature of the hypercomplex network is its ability to process all color channels simultaneously, thereby preserving important inter-channel relationships. Existing methods either process images in grayscale or process color channels separately, thereby losing important information.
Another exemplary application of hypercomplex deep learning is image steganalysis. Image steganography is the technique of hiding information in images by slightly altering the pixel values of the image. The alteration of pixels is performed such that the human visual system cannot perceptually see a difference between the original and altered image. Consequently, steganography allows for covert communications over insecure channels and has been used by a variety of terrorist organizations for hidden communications. Popular steganographic algorithms include HUGO, WOW, and S-UNIWARD.
Methods for detecting steganography in images have been developed. However, all of the methods either process images in grayscale or process each color channel separately, thereby losing important inter-channel relationships. The hypercomplex neural network architecture in this patent application overcomes both limitations; an example of a steganalysis system is shown in
Due to hypercomplex networks' advantages in color image processing, face recognition is another exemplary application. One method of face recognition is shown in
An important task in natural language processing is event embedding, the process of converting events into vectors. An example of an event may be, “The cat ate the mouse.” In this case, the subject O1 is the cat, the predicate P is “ate”, and the object O2 is the poor mouse. Open-source software such as ZPar can be used to extract the tuple (O1,P,O2) from the sentence, and this tuple is referred to as the event.
However, the tuple (O1,P,O2) is still a tuple of words, and machine reading and translation tasks typically require the tuple to be represented as a vector. An exemplary application of a hypercomplex network for tuple event embedding is shown in
A natural extension of the event embedding example of Section 4.6 is machine translation: Embedding a word, event, or sentence as a vector, and then running the process in reverse but using neural networks trained in a different language. The result is a system that automatically translates from one written language to another, e.g. from English to Spanish.
The most general form of a machine translation system is shown in
Another example of a machine translation system is shown in
The deep neural network graphs described in this patent application are typically trained using labeled data, where each training input pattern has an associated desired response from the network. However, in many applications, extensive sets of labeled data are not available. Therefore, a way of learning from unlabeled data is valuable in practical engineering problems.
An exemplary approach to using unlabeled data is to train autoassociative hypercomplex neural networks, where the desired response of the network is the same as the input. Provided that the hypercomplex network has intermediate representations (i.e. layer outputs within the graph) that are of smaller size than the input, the autoassociative network will create a sparse representation of the input data during the training process. This sparse representation can be thought of as a form of nonlinear, hypercomplex principal component analysis (PCA) and extracts only the most “informative” pieces of the original data. Unlike linear PCA, the hypercomplex network performs this information extraction in a nonlinear manner that takes all hypercomplex components into account.
An example of autoassociative hypercomplex neural networks is discussed in Section 1.10, where autoassociative structures are employed for pre-training neural network layers.
Additionally, unsupervised learning of hypercomplex neural networks may be used for feature learning in computer vision applications. For example, in
Hypercomplex neural networks and graph structures may be employed, for example, to control systems with state, i.e. a plant.
Another exemplary application of hypercomplex networks is breast cancer classification or severity grading. For this application, use of the word, “classification,” will refer both to applications where cancer presence is determined on a true or false basis and to applications where the cancer is graded according to any type of scale, i.e., grading the severity of the cancer.
A simple system for breast cancer classification is shown in
Note that a variety of multi-dimensional mammogram techniques are currently under development, and that false color is currently added to existing mammograms. Therefore, the advantages that a hypercomplex network has when processing color and multi-dimensional data apply to this example.
Since breast cancer classification has been studied extensively, a large database of expert features has already been developed for use with this application. A further improvement upon the described hypercomplex breast cancer classification system is shown in
Finally, additional postprocessing may be performed after the output of the hypercomplex network. An example of this is shown in
Most modern sensing applications employ multiple sensors, either in the context of a sensor array using multiple sensors of the same type or by using different types of sensors (e.g. as in a smartphone). Because these sensors are typically measuring related, or the same, quantities, the output from each sensor is usually related in some way to the outputs from the other sensors.
Multi-sensor data may be represented by hypercomplex numbers and, accordingly, processed in a unified manner by the introduced hypercomplex neural networks. An example of speech recognition is shown in
The goal is to perform similar speech recognition on a speaker who is far away from a microphone array. To accomplish this, a microphone array captures far away speech and represents it using hypercomplex numbers. Next, a deep hypercomplex neural network (possibly including state elements) is trained to output corrected cepstrum coefficients by using the close talking input and its cepstrum converter to create labeled training data. Finally, during the recognition phase, the close talking microphone can be disabled completely and the hypercomplex neural network feeds the speech recognizer directly, delivering speech recognition quality similar to that of the close-talking system.
Multispectral imaging is important for a variety of material analysis applications, such as remote sensing, art and ancient writing investigation and decoding, fruit quality analysis, and so on. In multispectral imaging, images of the same object are taken at a variety of different wavelengths. Since materials have different reflectance and absorption properties at different wavelengths, multispectral images allow one to perform analysis that is not possible with the human eye.
A typical multispectral image dataset is processed using a classifier to determine some property or score of the material. This process is shown in
In
Another exemplary application of the hypercomplex neural networks introduced in this patent is color image filtering, where the colors are processed in a unified fashion as shown in
While processing of multichannel, color images has been discussed at length in this application, hypercomplex structures may also be employed for single-channel, gray-level images.
Standard color image enhancement and analysis techniques, such as, for example, luminance computation, averaging filters, morphological operations, and so on, may be employed in conjunction with the hypercomplex graph components/neural networks described in this application. For example,
Biometric authentication has become popular in recent years due to advancements in sensors and imaging algorithms. However, any single biometric credential can still easily be corrupted due to noise: For example, camera images may be occluded or have unwanted shadows; fingerprint readers may read fingers covered in dirt or mud; iris images may be corrupted due to contact lenses; and so on. To increase the accuracy of biometric authentication systems, multimodal biometric measurements are helpful.
An exemplary application of the hypercomplex neural networks in this patent application is multimodal biometric identity matching, where multiple biometric sensors are combined at the feature level to enable decision making based on a fused set of data.
Advantages of unified multimodal processing include higher accuracy of classification and better system immunity to noise and spoofing attacks.
4.19 Multimodal Biometric Identity Matching with Autoencoder
To further extend the example in Section 4.18, one may add an additional hypercomplex autoencoder, as pictured in
4.20 Multimodal Biometric Identity Matching with Unlabeled Data
In biometrics, it is frequently the case that large quantities of unlabeled data are available, but only a small dataset of labeled data can be obtained. It is desirable to use the unlabeled data to enhance system performance through pre-training steps, as shown in
Particularly in the case of facial images, most labeled databases have well-lit images in a controlled environment, while unlabeled datasets (e.g. from social media) have significant variations in pose, illumination, and expression. Therefore, creating data representations that are capable of representing both types of data will enhance overall system performance.
4.21 Multimodal Biometric Identity Matching with Transfer Learning
A potential problem with the system described in Section 4.20 is that the feature representation created by the (stacked) autoencoder may result in a feature space where each modality is represented by separate elements; while all modalities theoretically share the same feature representations, they do not share the same numerical elements within those representations. An exemplary solution to this is shown in
In this proposed system, matching sets of data are formed from, for example: (i) anatomical characteristics, for example, fingerprints, signature, face, DNA, finger shape, hand geometry, vascular technology, iris, and retina; (ii) behavioral characteristics, for example, typing rhythm, gait, gestures, and voice; (iii) demographic indicators, for example, age, height, race, and gender; and (iv) artificial characteristics such as tattoos and other body decoration. While training the autoencoder, one or more modalities from each matching dataset are omitted and replaced with zeros. However, the autoencoder/decoder pair is still trained to reconstruct the missing modality, thereby ensuring that information from the other modalities is used to represent the missing modality. As with
The fashion industry presents a number of interesting exemplary applications of hypercomplex neural networks. In particular, most labeled data for clothing comes from retailers, who hire models to demonstrate clothing appearance and label the photographs with attributes of the clothing, such as sleeve length, slim or loose, color, and so on. However, many practical applications of clothing detection are relevant for so-called “in the street” clothes. For example, if a retailer wants to know how often a piece of clothing is worn, scanning social media could be an effective approach, provided that the clothing can be identified from photos that are taken in an uncontrolled environment.
Moreover, in biometric applications, clothing identification may be helpful as an additional modality.
An exemplary system for creating a store-clothing-to-attribute classification system is shown in
At 502, a hypercomplex representation of input training data may be created. In some embodiments, the input training data may comprise image data (e.g., traditional image data, multispectral image data, or hyperspectral image data), and the hypercomplex representation may comprise a first tensor. For example, in some embodiments, each spectral component of the multispectral image data may be associated in the hypercomplex representation with a separate dimension in hypercomplex space. In these embodiments, a first dimension of the first tensor may correspond to a first dimension of pixels in an input image, a second dimension of the first tensor may correspond to a second dimension of pixels in the input image, and a third dimension of the first tensor may correspond to different spectral components of the input image in a multispectral or hyperspectral representation. In other words, the first tensor may separately represent each of a plurality of spectral components of the input images as respective matrices of pixels. In some embodiments, the first tensor may have an additional dimension corresponding to depth (e.g., for a three-dimensional “image”, i.e., a multispectral 3D model).
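One simple way to form such a first tensor from a three-channel color image is sketched below. Placing the red, green, and blue components on the i, j, and k axes with a zero real part is an assumption chosen for illustration; it is not the polar-form encoding referenced elsewhere in this document, and the function name is hypothetical.

```python
import numpy as np

def rgb_to_quaternion(image):
    """Map an (H, W, 3) RGB image to an (H, W, 4) pure-quaternion tensor."""
    h, w, _ = image.shape
    q = np.zeros((h, w, 4), dtype=float)
    q[..., 1:] = image          # real part stays zero; R, G, B -> i, j, k
    return q

image = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
tensor = rgb_to_quaternion(image)
```

The first two dimensions of `tensor` index pixels and the third indexes the hypercomplex component, which matches the dimension ordering described above.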
In other embodiments, the input training data may comprise text and the CNN may be designed for speech recognition or another type of text-processing application. In these embodiments, the hypercomplex representation of input training data may comprise a set of data tensors. In these embodiments, a first dimension of each data tensor may correspond to words in an input text, a second dimension of each data tensor may correspond to different parts of speech of the input text in a hypercomplex representation, and each data tensor may have additional dimensions corresponding to one or more of data depth, text depth, and sentence structure.
At 504, hypercomplex convolution of the hypercomplex representation (e.g., the first tensor in some embodiments, or the set of two or more data tensors in other embodiments) with a second tensor may be performed to produce a third output tensor. For image processing applications, the second tensor may be a hypercomplex representation of weights or adaptive elements that relate one or more distinct subsets of the pixels in the input image with each pixel in the output tensor. In some embodiments, the subsets of pixels are selected to comprise a local window in the spatial dimensions of the input image. In some embodiments, a subset of the weights maps the input data to a hypercomplex output using a hypercomplex convolution function.
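The local-window convolution of step 504 can be sketched, by way of example only, for the quaternion case. The sketch below assumes that the hypercomplex algebra is the quaternions, that each weight multiplies the pixel from the left, that the window is not flipped, and that only fully overlapping window positions are evaluated. The names hamilton and quaternion_conv2d are hypothetical, and other hypercomplex algebras or multiplication orders are equally consistent with the description.

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def quaternion_conv2d(image, kernel):
    """Slide a quaternion kernel over a quaternion image (valid positions only).

    `image` and `kernel` are 2D grids (lists of rows) of quaternion tuples.
    Assumption of this sketch: weights multiply pixels from the left.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = (0.0, 0.0, 0.0, 0.0)
            # Each output pixel depends only on a local kh x kw window.
            for i in range(kh):
                for j in range(kw):
                    prod = hamilton(kernel[i][j], image[r + i][c + j])
                    acc = tuple(x + y for x, y in zip(acc, prod))
            row.append(acc)
        output.append(row)
    return output
```

Because the Hamilton product is not commutative, swapping the order of the weight and the pixel in hamilton would in general produce a different output tensor, so the order is a design choice that a concrete embodiment must fix.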
In text processing applications, 504 may comprise performing hypercomplex multiplication using a first data tensor of the set of two or more data tensors of the hypercomplex representation with a third hypercomplex tensor to produce a hypercomplex intermediate tensor result, and then multiplying the hypercomplex intermediate result with a second data tensor of the set of two or more data tensors to produce a fourth hypercomplex output tensor, wherein the third hypercomplex tensor is a hypercomplex representation of weights or adaptive elements that relate the hypercomplex data tensors to one another. In some embodiments, the fourth hypercomplex output tensor may be optionally processed through additional transformations and then serve as input data for a subsequent layer of the neural network. In some embodiments, the third output tensor may serve as input data for a subsequent layer of the neural network.
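One possible reading of the two-stage multiplication described above, offered only as an illustrative sketch, treats the data tensors and the weight tensor as matrices of quaternions and applies two successive matrix products. The quaternion algebra, the matrix shapes, and the names qmatmul and bilinear_text_layer are assumptions of this example and are not specified by the description.

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def qmatmul(A, B):
    """Multiply two matrices whose entries are quaternion tuples."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("inner dimensions do not match")
    cols = len(B[0])
    result = []
    for row in A:
        out_row = []
        for c in range(cols):
            acc = (0, 0, 0, 0)
            for k in range(inner):
                prod = hamilton(row[k], B[k][c])
                acc = tuple(a + p for a, p in zip(acc, prod))
            out_row.append(acc)
        result.append(out_row)
    return result


def bilinear_text_layer(first, weights, second):
    """Two-stage product: (first x weights), then (intermediate x second)."""
    intermediate = qmatmul(first, weights)  # hypercomplex intermediate result
    return qmatmul(intermediate, second)    # hypercomplex output tensor
```

In this sketch, first would be an n-by-d matrix, weights a d-by-d matrix, and second a d-by-m matrix, so that the output is n-by-m; the weights thereby relate every entry of the first data tensor to every entry of the second.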
At 506, the weights or adaptive elements in the second tensor may be adjusted such that an error function related to the input training data and its hypercomplex representation is reduced. For example, a steepest descent or other minimization calculation may be performed on the weights or adaptive elements in the second tensor such that the error function is reduced.
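By way of a minimal example, the steepest descent adjustment of step 506 can be sketched for a single quaternion weight. The sketch assumes a squared-error function E = |w x - t|^2 between the output w x and a target t; the error function, the learning rate, and the names conjugate, error, descent_step, and train are illustrative assumptions rather than the error function of any particular embodiment. Under these assumptions, the gradient of E with respect to the four real components of w is 2 (w x - t) conj(x).

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def conjugate(q):
    """Quaternion conjugate: negate the three imaginary components."""
    return (q[0], -q[1], -q[2], -q[3])


def error(w, x, t):
    """Assumed error function: squared distance between w*x and target t."""
    y = hamilton(w, x)
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def descent_step(w, x, t, lr):
    """One steepest-descent update of the weight w."""
    y = hamilton(w, x)
    e = tuple(yi - ti for yi, ti in zip(y, t))
    # Gradient of |w*x - t|^2 w.r.t. the real components of w: 2 * e * conj(x).
    grad = tuple(2.0 * g for g in hamilton(e, conjugate(x)))
    return tuple(wi - lr * gi for wi, gi in zip(w, grad))


def train(w, x, t, lr=0.1, steps=100):
    """Repeat the update; each step reduces the assumed error for small lr."""
    for _ in range(steps):
        w = descent_step(w, x, t, lr)
    return w


# Learn the weight that maps x = i onto t = k.
w0 = (1.0, 0.0, 0.0, 0.0)
x = (0.0, 1.0, 0.0, 0.0)
t = (0.0, 0.0, 0.0, 1.0)
w_trained = train(w0, x, t)
```

For this input the error is minimized at w = -j, since (-j) i = k, and the trained weight approaches that value; in a full network the same update would be applied to every weight in the second tensor using gradients obtained by backpropagation.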
At 508, the adjusted weights may be stored in the memory medium to obtain a trained neural network. Steps 502-508 may then be repeated on subsequent respective input data to iteratively train the neural network.
Further modifications and alternative embodiments of various aspects of the invention will be apparent to those skilled in the art in view of this description. Accordingly, this description is to be construed as illustrative only and is for the purpose of teaching those skilled in the art the general manner of carrying out the invention. It is to be understood that the forms of the invention shown and described herein are to be taken as examples of embodiments. Elements and materials may be substituted for those illustrated and described herein, parts and processes may be reversed, and certain features of the invention may be utilized independently, all as would be apparent to one skilled in the art after having the benefit of this description of the invention. Changes may be made in the elements described herein without departing from the spirit and scope of the invention as described in the following claims.
This application claims priority to U.S. Provisional Application Ser. No. 62/551,901 entitled “Hypercomplex Deep Learning Methods, Architectures, and Apparatus for Multimodal Small, Medium, and Large-Scale Data Representation, Analysis, and Applications” filed Aug. 30, 2017, which is incorporated herein by reference in its entirety.
This invention was made with government support under RD-0932339 awarded by the National Science Foundation. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62551901 | Aug 2017 | US