This invention relates generally to the artificial intelligence field, and more specifically to new and useful systems and methods for deep learning with small training sets for neural networks in the artificial intelligence field.
Despite advances in computer vision, image processing, and machine learning, recognizing visual objects remains a task where computers fail in comparison with the capabilities of humans. Recognizing an object from an image requires not only recognizing the object in a scene but also recognizing objects in various positions, in different settings, and with slight variations. For example, to recognize a chair, the innate properties that make a chair a chair must be understood. This is a simple task for a human. Computers struggle to deal with the vast variety of types of chairs and the situations in which a chair may be present. Models capable of performing visual object recognition must be trained to provide explanations for visual datasets in order to recognize objects present in those visual datasets. Unfortunately, most methods for training such models fall short in performance, require large training sets, or both.
This issue is not confined solely to visual object recognition, but more generally applies to pattern recognition, which may be used in speech recognition, natural language processing, and other fields. Thus, there is a need in the artificial intelligence field to create new and useful systems and methods for deep learning with small training sets. This invention provides such new and useful systems and methods.
The following description of embodiments of the invention is not intended to limit the invention to these embodiments, but rather to enable any person skilled in the art to make and use this invention.
Neural networks and related systems, including recursive cortical networks (RCNs), convolutional neural networks (CNNs), hierarchical compositional networks (HCNs), HMAX models, Slow Feature Analysis (SFA) systems, and Hierarchical Temporal Memory (HTM) systems may be used for a wide variety of tasks that are difficult to complete using standard rule-based programming. These tasks include many in the important fields of computer vision and speech recognition.
Neural networks and related systems can be represented as distributed processing elements that implement summation, multiplication, exponentiation, or other functions on the elements' incoming messages/signals. Such networks can be enabled and implemented through a variety of implementations. For example, a system may be implemented as a network of electronically coupled functional node components. The functional node components can be logical gates arranged or configured in a processor to perform a specified function. As a second example, the system may be implemented as a network model programmed or configured to be operative on a processor. The network model is preferably electronically stored software that encodes the operation and communication between nodes of the network. Neural networks and related systems may be used in a wide variety of applications and can use a wide variety of data types as input such as images, video, audio, natural language text, analytics data, widely distributed sensor data, or other suitable forms of data.
In particular, convolutional neural networks (CNNs) may be useful for performing inference on data for which feature recognition is independent of one or more dimensions of the data; for example, when detecting shapes in an image, the detected shapes are not dependent on their position in the image—the same features used to detect a square in one part of the image may be used to detect a square in another part of the image as well. These dimensions may be spatial (as in the 2D image example), but may additionally or alternatively be temporal or any suitable dimensions (e.g., a frequency dimension for audio or multispectral light data).
CNNs, as shown in
CNNs may also include pooling layers, which function to reduce the size of the output of a set of neurons (typically the output of a convolutional layer, but pooling layers may be used for any set of neurons; e.g., on top of the input neurons). For example, a pooling layer may take the maximum activation of a set of neurons as an output (i.e., max-pooling). Pooling layers are applied to each feature map separately. Commonly, pooling layers are used between convolutional layers in CNNs. CNNs also may include other layers, such as input layers, output layers, etc.
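As a purely illustrative sketch (not tied to any particular figure or to the HCN 100 implementation), the following Python snippet shows max-pooling applied independently to each feature map, assuming non-overlapping windows and array shapes chosen for the example:

```python
import numpy as np

def max_pool(feature_maps, window=2):
    """Max-pool each feature map independently over non-overlapping windows.

    feature_maps: array of shape (num_features, height, width); height and
    width are assumed to be divisible by `window` for simplicity.
    """
    f, h, w = feature_maps.shape
    pooled = feature_maps.reshape(f, h // window, window, w // window, window)
    return pooled.max(axis=(2, 4))  # maximum activation within each window

# Example: three 4x4 feature maps pooled down to three 2x2 maps.
maps = np.random.rand(3, 4, 4)
print(max_pool(maps).shape)  # (3, 2, 2)
```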
As shown in
Each feature map is in turn connected to a set of pooling neurons in PL. As shown, PL has a pooling window of 2.
The output of PL is used as an input to the second convolution layer CL2, which has a receptive field of 2. Note here that each neuron of CL2 connects to each feature map of CL1/PL; in other words, the feature map of CL2 (there is only one as shown in
Finally, the output of CL2 is used as input to the output layer OL. Note here that OL is fully connected to CL2.
By limiting neural network connections via exploiting the locality of receptive fields according to data dimensionality, CNNs can perform inference with a fraction of the complexity required by a fully-connected model. However, training CNNs often requires a large training set, and CNNs may be ‘fooled’ by certain images (as the training samples may be poorly representative of the very high dimensional input space).
A Hierarchical Compositional Network (HCN) 100 includes a set of pooling layers 120 and a set of convolutional layers 130, as shown in
The HCN 100 is a neural network based on a generative model (unlike a standard feed-forward CNN, which is based on a discriminative model) that incorporates some of the structural biases of the previously described CNN. For instance, like a CNN, the HCN 100 preferably takes advantage of the dimensional structure of data by connecting neurons only to a small region of input data, selected using properties of dimensions of the data (e.g., the x and y location of pixels in image data).
In addition, the specific architecture of the HCN 100 enables the HCN 100 to create images by composing parts, introduce variability through pooling, and perform explaining away during inference. Training of the HCN 100 (discussed in further detail in the section on the method 200) may be performed in both supervised and unsupervised scenarios; after training, discrimination can be achieved via a fast forward pass with the functional form of a CNN. This is a unique aspect enabling the HCN 100 to capture advantages of both CNNs (which are discriminative) and non-HCN generative models.
As shown in
Various instances and instantiations of HCN sub-networks are preferably constructed, connected, and used recursively in the hierarchy of the HCN 100. The architecture of the hierarchical network may be constructed in any manner. The HCN 100 preferably includes alternating convolutional layers 130 and pooling layers 120 (alternatively, the layers of the HCN 100 may be arranged in any manner). The sub-networks have feature input nodes and feature output nodes, and the feature nodes are used to bridge or connect the sub-networks. Each node of the hierarchical network preferably has parent node connections and child node connections. Generally, the parent node connections are inputs during generation and outputs during inference. Conversely, the child node connections are outputs during generation and inputs during inference. In a single-layer (or non-hierarchical) variation, sub-networks are arranged as siblings.
The sub-networks may be set up in a variety of different configurations within a network. Many of the configurations are determined by constraint nodes that define the node-selection within a sub-network, between sub-networks, or even between networks. Additionally, sub-networks can be set up to have distinct or shared child features. The sub-networks are additionally arranged in hierarchical layers. In other words, a first sub-network may be the parent of a second sub-network. Similarly, the second sub-network may additionally be the parent of a third sub-network. The layers of sub-networks are preferably connected through shared parent feature nodes and child feature nodes. Preferably, a child feature node of a top layer sub-network is the parent feature node of a lower sub-network. Conversely, the parent feature nodes of a sub-network can participate as the child feature nodes of a higher sub-network. The parent feature nodes of the top-level sub-networks are preferably the inputs into the system. The child features of the bottom/lowest sub-networks are preferably the outputs of the system. Connecting multiple sub-networks can introduce multi-parent interactions at several nodes in the network. These interactions can be modeled using different probabilistic models in the nodes.
Connecting the sub-networks in a hierarchy can function to promote compact and compressed representations through sub-network re-use. Parent feature nodes of one sub-network can participate as child feature nodes in multiple parent sub-networks. A similar benefit is that invariant representations of a child sub-network can be re-used in multiple parent sub-networks. One example of where this would be applicable is in the case of an HCN 100 representing visual objects. The lower-level sub-networks can correspond to parts of objects and the higher-level sub-networks (i.e., upper layer sub-networks) can represent how those parts come together to form the object. For example, the lower-level sub-networks can correspond to representations for the body parts of an image of a cow. Each body part will be invariantly represented (enabled by pooling layers 120) and will be tolerant to location transformations like translations, scale variations, and distortions. The higher-level sub-network then will specify how the body parts come together to represent a cow. Some of the lower-level body parts of a cow could be re-used at a higher level for representing a goat. For example, the legs of both of these animals move similarly and hence those parts could potentially be re-used. This means that the invariant representations learned for the legs of cows can be automatically re-used for representing goats.
The HCN 100 may be used both for generating data explanations (e.g., classifying objects in an image) and for generating data predictions (e.g., an image containing some set of objects). During data explanation generation, nodes of the HCN 100 preferably operate on input data features and propagate the node selection/processing through the hierarchy of the HCN 100 until an output is obtained from a parent feature of a top-layer sub-network. A combination of propagating information up in the hierarchy (to higher parent layers) and downwards (towards the final child features) may be used to accomplish this output. During data prediction generation, the HCN 100 preferably starts from a general generation request that is directed, fed, or delivered to the parent feature nodes of the top layer sub-networks. The nodes preferably operate on the information and propagate the node selection/processing down the hierarchy of the HCN 100 until an output is obtained from the child feature nodes of the bottom-layer sub-networks.
As shown in
The nodes of the network preferably are configured to operate, perform or interact with probabilistic interactions that determine node activation, selection, ON/OFF, or other suitable states. When activated by a parent node, the node will preferably trigger activation of connected child nodes according to the selection function of the node. While a selection function may be any function, examples include logical AND, OR, and XOR selection functions (all Boolean), as well as tanh (and other sigmoid) selection functions. The nodes preferably represent binary random variables or multinomial random variables. As shown in
In one implementation of an invention embodiment, the HCN 100 may be modeled in factor graph form using a set of binary random variable nodes and factor nodes, as shown in
In this representation, the HCN 100 may be represented using binary random variable nodes (represented by circle symbols in
The AND factor, as shown in
The OR factor, as shown in
The POOL factor, as shown in
These factor nodes, along with the previously mentioned binary random variable nodes, may be used to form the structure of the HCN 100.
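Because the defining text for each factor accompanies figures not reproduced here, the following is a hedged Python sketch of the standard Boolean semantics suggested by the surrounding description: an AND factor is satisfied when its output variable equals the conjunction of its inputs, an OR factor when its output equals the disjunction, and a POOL factor when an active parent activates exactly one pool member (and an inactive parent activates none). The function names and signatures are illustrative assumptions.

```python
def and_factor(output, inputs):
    """True when the binary output equals the AND of all binary inputs."""
    return output == all(inputs)

def or_factor(output, inputs):
    """True when the binary output equals the OR of all binary inputs."""
    return output == any(inputs)

def pool_factor(parent, members):
    """True when an active parent selects exactly one pool member,
    and an inactive parent selects none."""
    return sum(members) == (1 if parent else 0)

# A pool with an active parent must have exactly one active member.
print(pool_factor(True, [0, 1, 0]))   # True
print(pool_factor(True, [1, 1, 0]))   # False
print(pool_factor(False, [0, 0, 0]))  # True
```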
The convolution layer 130 functions to combine a sparsification S with a set of weights W to produce a representation R. The weights can be thought of as a dictionary describing the elements of S in terms of the elements of R. For an HCN 100 operable on two-dimensional data (e.g., 2D image data), tensor S has size H_S×W_S×F_S, where the first two dimensions (height and width) are spatial, while the third dimension indexes the represented feature f. All of the elements of S with the same f produce the same expansion in R at different locations. As W contains the expansion at the representation layer of each feature, its size is H_W×W_W×F_W×F_Wbelow, where F_W = F_S and F_Wbelow = F_R. Tensor R in turn has size H_R×W_R×F_R. Probabilistically, R is a deterministic transformation of W and S given by p(R|W,S) = [R = bconv(W,S)], where the binary convolution is defined (elementwise over the output feature index f_R) as

bconv(W,S)_{:,:,f_R} = min(1, Σ_{f=1}^{F_S} conv2D(S_{:,:,f}, W_{:,:,f,f_R})),

where conv2D is the usual 2D convolution operator. Thus, an element of R will be set to 1 if any of the elements of the sparsification S activates it once expanded according to W. From the definition of binary convolution previously stated, it can be shown that each element of R can be expressed by ORing a total of H_W W_W F_W intermediate terms, with each intermediate term corresponding to the AND of a term from S and a term from W. Each intermediate term is used in exactly one OR and each element of R is connected to exactly one OR. However, the elements of W and S are connected to multiple ANDs.
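A minimal NumPy sketch of the binary convolution as described above, assuming the indexing convention stated for S, W, and R and a 'full' output size (an assumption; the exact output dimensions depend on the network construction). Summing the 2D convolutions over the feature index of S and clipping to one implements the OR-of-ANDs interpretation:

```python
import numpy as np
from scipy.signal import convolve2d

def bconv(W, S):
    """Binary convolution sketch: R[:, :, f_R] = min(1, sum_f conv2D(S[:, :, f], W[:, :, f, f_R])).

    S: binary sparsification of shape (Hs, Ws, Fs).
    W: binary weights of shape (Hw, Ww, Fw, Fw_below), with Fw == Fs.
    Returns a binary R of shape (Hs + Hw - 1, Ws + Ww - 1, Fw_below).
    """
    Hs, Ws, Fs = S.shape
    Hw, Ww, Fw, Fr = W.shape
    assert Fw == Fs, "third weight dimension must index the features of S"
    R = np.zeros((Hs + Hw - 1, Ws + Ww - 1, Fr), dtype=int)
    for fr in range(Fr):
        acc = np.zeros(R.shape[:2])
        for f in range(Fs):
            acc += convolve2d(S[:, :, f], W[:, :, f, fr], mode="full")
        R[:, :, fr] = (acc >= 1).astype(int)  # OR of the AND terms
    return R

# Tiny example: one active unit in S stamps the weight pattern into R.
S = np.zeros((4, 4, 1), dtype=int); S[1, 2, 0] = 1
W = np.zeros((2, 2, 1, 1), dtype=int); W[:, :, 0, 0] = [[1, 0], [0, 1]]
print(bconv(W, S)[:, :, 0])
```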
As previously mentioned, the pooling layer 120 functions to introduce variability to the HCN 100. The pooling layer 120 accomplishes this by shifting the position of active units of the representation R of a given layer ℓ, resulting in the sparsification S of the layer below (ℓ−1). Note that here ‘position’ refers to the correspondence of units of the HCN 100 to dimensionality of the input (e.g., a unit may correspond to a 9×9 region of neighboring pixels). Each pooling layer 120 preferably shifts the active units of R within a local pooling window of size H_P×W_P×1; alternatively, the pooling layer 120 may shift active units of R in any manner.
When two or more active units in R^ℓ are shifted towards the same position in S^(ℓ−1), they result in a single activation, so the number of active units in S^(ℓ−1) is preferably equal to or smaller than the number of activations in R^ℓ.
The shifting performed by the pooling layer may be expressed using a set of intermediate binary variables U^ℓ_{Δr,Δc,r,c,f}, each of which is associated with a shift of Δr, Δc of the element R^ℓ_{r,c,f}, where this element corresponds to a given layer ℓ, a position r,c, and a feature index f. The H_P W_P intermediate variables associated with the same element R^ℓ_{r,c,f} are denoted U^ℓ_{r,c,f}. Since an element can be shifted to only a single position per realization when it is active, the elements in U^ℓ_{r,c,f} may be grouped into a pool satisfying Σ_{Δr,Δc} U^ℓ_{Δr,Δc,r,c,f} = R^ℓ_{r,c,f},
and S^(ℓ−1) may be obtained from U^ℓ by ORing the H_P W_P intermediate variables of U^ℓ that can potentially turn it on: S^(ℓ−1)_{r,c,f} = ⋁_{Δr,Δc} U^ℓ_{Δr,Δc,r−Δr,c−Δc,f}.
Note that, as described in the previous paragraphs, the pooling layers shift active units ‘spatially’ only (i.e., across input dimensions). Additionally or alternatively, pooling layers may also shift across features, which may introduce richer variance in features.
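The following hedged sketch illustrates the spatial shifting described above in the generative direction: each active unit of R is assigned one shift within the pooling window (drawn at random here purely for illustration), and the results are ORed, so coinciding shifts collapse into a single activation. The window centering and boundary clipping are assumptions.

```python
import numpy as np

def pool_shift(R, Hp=3, Wp=3, rng=None):
    """Generative pooling sketch: shift each active unit of R by a random
    offset inside an Hp x Wp pooling window centered on the unit, then OR the
    results to obtain the sparsification of the layer below."""
    rng = np.random.default_rng() if rng is None else rng
    H, W, F = R.shape
    S_below = np.zeros_like(R)
    for r, c, f in zip(*np.nonzero(R)):
        dr = rng.integers(-(Hp // 2), Hp // 2 + 1)   # one shift per active unit
        dc = rng.integers(-(Wp // 2), Wp // 2 + 1)
        rr = min(max(r + dr, 0), H - 1)
        cc = min(max(c + dc, 0), W - 1)
        S_below[rr, cc, f] = 1  # ORing: coinciding shifts give one activation
    return S_below

R = np.zeros((5, 5, 1), dtype=int); R[2, 2, 0] = 1
print(np.nonzero(pool_shift(R, rng=np.random.default_rng(0))))
```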
As shown in
The sections as shown in
The example network sections as shown in
As shown in
Note that in the example subnetwork, PF1 is connected to multiple CF nodes corresponding to a given feature index. A feature node of a given layer is preferably connected to all feature nodes within some region of the feature node; e.g., if the feature node is associated with a position (r1, c1), the feature node may be connected to all feature nodes in the layer below corresponding to all (r, c) that satisfy |r1−r|<Z1, |c1−c|<Z2 where the region defined by Z1 and Z2 is referred to as a receptive field.
The receptive field may be set in any manner and may vary based on position, layer, or any other function of the HCN 100. For example, feature variable nodes of higher layers may have a smaller receptive field than those of lower layers. As another example, feature variable nodes corresponding to a given layer and a position relatively central to the network input dimensions may have larger receptive fields than those in the same layer but at the periphery of the network input dimensions.
Likewise, the overlap between receptive fields may be varied. The receptive field and overlap in turn may correspond to the difference in number of feature nodes in two connecting layers; for example, if a first layer has a feature node for each position in a 2D array of size 4×4 (16 nodes) and the receptive field for the layer above is 2×2 with no overlap, the above layer will contain 4 feature nodes. In the case of 50% overlap (e.g., the receptive window shifts by one instead of by two), the above layer instead contains 9 feature nodes. Similar to the receptive field, overlap may be varied based on position, layer, or any other function of the HCN 100.
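The counts in this example can be verified with a short helper that computes the number of receptive-field positions along one dimension for a given field size and shift (stride); this is illustrative arithmetic rather than code from the implementation:

```python
def num_feature_nodes(input_size, field, stride):
    """Number of receptive-field positions along one dimension."""
    return (input_size - field) // stride + 1

# 4x4 layer below with a 2x2 receptive field in the layer above:
no_overlap = num_feature_nodes(4, 2, stride=2) ** 2    # window shifts by two
half_overlap = num_feature_nodes(4, 2, stride=1) ** 2  # window shifts by one
print(no_overlap, half_overlap)  # 4 9
```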
In the example subnetwork of
Given the above explanation of the general functioning and structure of the HCN 100, the following sections will discuss the elements in more detail.
The feature variable nodes function to identify individual features, which are in turn composed of lower-level features (i.e., child features) and may themselves serve in describing higher-level features (i.e., parent features). In other words, a feature variable node in a first layer may serve as a parent feature to nodes in layers below the first layer and as a child feature to nodes in layers above the first layer. As shown in
The feature variable nodes of the HCN 100 are preferably connected to a pooling layer 120 above (i.e., the output of the feature variable node in inference/forward message passing and the input of the feature variable node in generation/backward message passing). Below (i.e., the input of the feature variable node in inference/forward message passing and the output of the feature variable node in generation/backward message passing), the feature variable nodes are preferably connected to one or more AND factor nodes.
As shown in
Note that while the AND and OR factor nodes are represented in
The AND factor nodes function to enable the selective activation of connections between parent feature variable nodes and child variable nodes based on the weight variable nodes. The OR factor nodes coupled to the AND factor nodes (alternatively stated, the CONV factor nodes) in turn function to enable neighboring parent feature variable nodes to couple to pooling layers in a manner such that active units shifted toward the same position result in a single activation (preventing unnecessarily many activations) as discussed in the section on the convolution layer 130.
Along with the AND factor nodes, the weight variable nodes function to enable the selective activation of connections between parent feature variable nodes and child variable nodes. Stated alternatively, the connection between a feature variable node and a pool variable node (which in turn couples to child variable nodes) may be disabled or enabled based on the weight variable node. The use of weight variable nodes is a unique part of the HCN 100; they enable the HCN 100 to learn not only network parameters but also (in a sense) network structure. This functionality is not possible in traditional neural networks.
Weight variable nodes are unique among the variable nodes of the HCN 100 in that their role and function may change depending on the training state of the network; that is, weight variable nodes preferably function differently during primary training than after it.
As shown in
In one variation of an invention embodiment, the weight variable nodes are coupled to message storage. In this variation, the weight variable nodes may remember messages passed to them during one part of training (e.g., a first element of a training set) and provide message updates based on the remembered messages in future parts of training. For example, the output of weight variable nodes during training on one training element (or a set of training elements) may be based on stored messages received during training on previous training elements.
Through use of memory and/or connections to multiple HCNs 100, weight variable nodes may enable purely batch training (looking at many training set elements at a time), purely online learning (looking at a single training set element at a time), or any combination of the above (e.g., weight variable nodes may train on multiple sets of five images, each set after another). This may enable ideal use of computational resources.
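A hedged sketch of how a weight variable node with message storage might accumulate the messages received across training elements, which supports online, batch, or mixed schedules; accumulating by summation in the log domain is an assumption for illustration, not a rule taken from the implementation:

```python
class WeightMessageStore:
    """Remembers log-domain messages sent to a weight variable node and
    exposes their running sum as the node's accumulated evidence."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def remember(self, message):
        self.total += message      # store evidence from one training element
        self.count += 1

    def accumulated(self):
        return self.total          # used when forming the node's outgoing message

store = WeightMessageStore()
for msg in [0.7, -0.2, 1.1]:       # e.g., one message per training image
    store.remember(msg)
print(store.count, store.accumulated())  # 3 1.6
```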
After training, weight variable nodes (if previously coupled to multiple HCNs 100) are preferably de-coupled (i.e., fixed to a single HCN 100) and fixed in value. Together with the AND nodes, this is functionally equivalent to activating or de-activating network connections. For example, the connections of the HCN 100 shown in
To take advantage of the dimensionality of input data, weight variable nodes are preferably shared across connections of feature variable nodes corresponding to the same feature and relative location.
For example, as shown in
Alternatively, weight variable nodes may be shared across only some (or none) of the feature node connections associated with the same feature and relative location.
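The sharing rule can be sketched as a single weight table indexed by relative offset, parent feature, and child feature (matching the H_W×W_W×F_W×F_Wbelow shape of W described earlier), which every feature node consults regardless of its position; the lookup function below is an illustrative assumption:

```python
import numpy as np

def connection_enabled(W, parent_feature, d_row, d_col, child_feature):
    """All parent nodes with the same feature index share one weight entry per
    relative offset (d_row, d_col) and child feature, regardless of position."""
    return bool(W[d_row, d_col, parent_feature, child_feature])

# One shared weight table for a layer: 2x2 offsets, 3 parent features, 4 child features.
W = np.zeros((2, 2, 3, 4), dtype=int)
W[0, 1, 2, 0] = 1  # enable offset (0, 1) between parent feature 2 and child feature 0
# Every parent feature node with feature index 2 sees the same gating:
print(connection_enabled(W, 2, 0, 1, 0), connection_enabled(W, 2, 1, 0, 0))  # True False
```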
Together, the feature variable nodes, weight variable nodes, and AND/OR factor nodes comprise a representation of the convolution layer 130. The convolution layer 130 is coupled to one or more pooling layers 120 below and/or above it. As shown in
The pool variable nodes function as transformation-invariant (e.g., translation invariant across a discrete set of translations, transformation across feature indexes) representations of child feature variable nodes. Each pool variable node is coupled to a POOL factor node, which in turn couples to (via intermediate variable nodes and OR factor nodes) child feature variable nodes.
Each pool variable node is preferably associated with a position (or other index of input dimension) and a feature index (though in some variations, pool variable nodes may POOL across features; in this case, pool variable nodes may be represented by a composite feature index or multiple feature indexes).
Pool variable nodes preferably couple to parent feature variable nodes of multiple (e.g., all) feature indexes of the layer above. In the example as shown in
Each pool variable node couples to a POOL factor node below, which in turn couples to several intermediate variable nodes below, which in turn are connected via OR factor nodes to a feature variable node. This structure enables the introduction of variability to HCN 100 generation. The POOL factor node, for a given pool variable node, enables the selection of a child feature variable node with a shift (noting that this shift may be across either data dimensions and/or feature index, and noting also that the shift may be zero). For example, as shown in
Each OR factor node preferably couples to a single child feature variable node below; thus, each OR factor node is preferably associated with a dimension index (e.g., a position) and feature index (the dimension index and feature index of the child feature variable node). Likewise, the set of intermediate variable nodes coupled to each OR factor node represent a pool also corresponding to this dimension and feature index.
For example, as shown in
The number of intermediate variable nodes is related to the aforementioned pooling window associated with a pool (and/or a pool variable node). For example, as shown in
The pooling window may be set in any manner and may vary based on position, layer, or any other function of the HCN 100. For example, pool variable nodes of higher layers may have a smaller pooling window than those of lower layers. As another example, pool variable nodes corresponding to a given layer and a position relatively central to the network input dimensions may have larger pooling windows than those in the same layer but at the periphery of the network input dimensions. Likewise, the overlap between pooling windows may be varied (e.g., similar to receptive fields).
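As an illustrative sketch, the wiring for one pool variable node can be enumerated from its pooling window: an H_P×W_P window yields H_P·W_P intermediate variable nodes, each associated with one shift and tied (through an OR factor) to the child feature node at the shifted position. The 3×3 window and index conventions below are assumptions chosen for the example:

```python
def pool_wiring(r, c, f, Hp=3, Wp=3):
    """Intermediate variable indices for the pool variable node at (r, c, f).

    Each entry is (shift, child_position): the intermediate node for that shift
    feeds the OR factor of the child feature node at the shifted position."""
    wiring = []
    for dr in range(-(Hp // 2), Hp // 2 + 1):
        for dc in range(-(Wp // 2), Wp // 2 + 1):
            wiring.append(((dr, dc), (r + dr, c + dc, f)))
    return wiring

nodes = pool_wiring(4, 4, 0)
print(len(nodes))  # 9 intermediate variable nodes for a 3x3 pooling window
print(nodes[0])    # ((-1, -1), (3, 3, 0))
```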
The intermediate variable nodes and OR factor nodes function to prevent multiple activations when two or more active units in the layer above are shifted toward the first position. While the pooling layer 120 preferably includes these nodes, in a variation of an invention embodiment, the POOL factor node may connect directly to child feature variable nodes, as shown in
In a variation of an invention embodiment, the pooling layer 120 may additionally include lateral constraint factor nodes, as shown in
The modified POOL factor nodes are preferably substantially similar to previously described POOL factor nodes, except that when the extended state variable node does not correctly represent the state of the pool, the POOL factor node is forced to a particular value (e.g., −∞). Note that the extended state variable node is preferably a multinomial representative of the state of the pool; i.e., which intermediate variables are active and which are not. The constraint factor node preferably enforces a desired correlation between extended state variable nodes of different pools (e.g., if Ua of P1 is active, so then must Uc of P2 be active). Constraint factor nodes can enforce restrictions, rules, and constraints within selection of nodes in other pools, in other sub-networks, and/or in different times. The HCN 100 is preferably evaluated in an ordered fashion such that nodes that are connected through a constraint factor node are preferably not evaluated simultaneously. Subsequently, restrictions of the constraint variable node are activated/enforced on other connected (i.e., constrained) nodes.
In addition to pooling layers 120 and convolutional layers 130 and their elements as mentioned above, the HCN 100 may also include a pre-processing layer 110 and/or a class layer 140.
The pre-processing layer 110 functions to process input data to prepare it for use by the HCN 100 (and thus may be considered the lowest level of the HCN 100). Examples of pre-processing that may be performed by the pre-processing layer 110 include edge detection, resolution reduction, contrast enhancement, pitch detection, frequency analysis, or mel-frequency cepstral coefficient generation. Additionally or alternatively, the pre-processing layer 110 may perform any transformation or processing on input data. In one example implementation, the pre-processing layer is a noisy channel layer operating on individual pixels of input image data. This noisy channel may generate a bottommost sparsification S0 from an input image X using bit flip probabilities of p(X_rcf=1|S^0_rcf=0)=P_10 and p(X_rcf=0|S^0_rcf=1)=P_01.
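A minimal sketch of the noisy channel described above: given a candidate bottommost sparsification S0, it evaluates the per-pixel likelihood of the observed image X under the stated bit-flip probabilities (the particular probability values and the vectorized NumPy form are illustrative choices):

```python
import numpy as np

def noisy_channel_likelihood(X, S0, P10=0.05, P01=0.05):
    """Per-pixel likelihood p(X | S0) for the bit-flip channel:
    p(X=1 | S0=0) = P10 and p(X=0 | S0=1) = P01 (values here are placeholders)."""
    X, S0 = np.asarray(X, dtype=bool), np.asarray(S0, dtype=bool)
    lik = np.where(S0,
                   np.where(X, 1.0 - P01, P01),   # S0 = 1: flips to 0 with prob P01
                   np.where(X, P10, 1.0 - P10))   # S0 = 0: flips to 1 with prob P10
    return lik

X = np.array([[1, 0], [1, 1]])
S0 = np.array([[1, 0], [0, 1]])
print(noisy_channel_likelihood(X, S0))
```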
The class layer 140, in contrast to the pre-processing layer 110, serves as the highest layer of the HCN 100. The class layer 140 functions to select classification categories and templates. For example, an HCN may contain several categories, one associated with ‘furniture’. The ‘furniture’ category may in turn contain several different representations of furniture; e.g., a chair, a table, etc. The category/template structure represents a two-level classification strategy; however, the class layer 140 may include a classification strategy incorporating any number of levels.
The class layer 140 preferably includes at least one set of classification variable nodes, as shown (represented by circle K symbols) in
A method 200 for learning a hierarchical compositional network (HCN) includes receiving an initial HCN structure S210, receiving a training dataset S220, and learning a set of HCN parameters S230, as shown in
The method 200 functions to set tunable parameters of an HCN such that the HCN is trained to perform data inference (and/or generation) based on a set of data used to train the HCN (i.e., the training dataset received in S220).
The method 200, taking advantage of the unique structure of HCNs, may be performed in both unsupervised and supervised settings, as well as with either of complete and incomplete datasets. Further, in some implementations of the method 200, the method may be used to learn classifiers with the functional form of a CNN (and even generate such a network, as in S240).
The method 200 is preferably implemented on the HCN 100, but may additionally or alternatively be implemented by any neural network capable of implementing the steps of the method 200. The method 200 is preferably implemented by a computing system (e.g., computer, distributed computing system, etc.).
S210 includes receiving an initial HCN structure. S210 preferably includes receiving information describing the structure of an HCN—e.g., data that specifies the neurons of the HCN and their connections. This information may be specified in a number of forms; for example, HCN structure may be specified by specifying each variable node and factor node and their connections. Alternatively, HCN structure may be specified relying on known structural rules (e.g., a two-layer HCN, each layer containing a convolutional sub-layer and a pooling sub-layer, with connections specified by stated pooling windows and receptive fields). HCN structure information may include any HCN structural or parametric information described in the section on the HCN 100 as well as any additional information that may be used in the course of the method 200.
S210 may additionally include receiving hyperparameters of the HCN 100; e.g., bit-flip probabilities for input data (e.g., P_01, P_10). As a second example, S210 may include receiving fixed per-layer sparse priors for the weight variable nodes. Additionally or alternatively, S210 may include receiving any other hyperparameters or parameters related to the HCN.
S220 includes receiving a training dataset. S220 functions to receive a set of training data (henceforth referred to as X). The set of training data preferably includes multiple elements (e.g., {X^n}_{n=1}^N); for example, each element may correspond to a different training image of an image dataset. Training data may additionally include corresponding classifying information; for example, a dataset may include a set of labels C: {X^n, C^n}_{n=1}^N.
Accordingly, training data may be unlabeled, partially labeled, or fully labeled. Likewise, training data may be complete (e.g., information is provided for each input neuron of the HCN) or incomplete (e.g., information is not provided for all input neurons).
Training data may be any set of data for which inference or generation is desired; e.g., images, video, audio, speech, medical sensor data, natural language data, financial data, application data, traffic data, environmental data, etc.
S230 includes learning a set of HCN parameters. S230 functions to learn values for tunable parameters of the HCN based on the training dataset (allowing the HCN to perform inference or generation for data objects similar to those the HCN is trained on). These tunable parameters are preferably chosen by attempting to maximize the network's likelihood given the training set data by varying the parameters of the HCN iteratively (until HCN parameters corresponding to a maximum likelihood value are found).
HCN parameters learned in S230 preferably include values for the HCN weight variable nodes (which may be used to modify the structure of the HCN; this is described in more detail in later sections), but may additionally or alternatively include any other HCN parameters (e.g., constraint variable node values, connection weights, etc.).
In one example, the joint probability of multiple training set elements is given by

p({X^n, C^n, H^n}_{n=1}^N, W) = p(W) ∏_{n=1}^N p(X^n, C^n, H^n | W),

where H^n is a collection of the latent variables corresponding to the nth element. In this example, S230 may include selecting W to maximize this joint probability. Note that this is distinct from selecting W by maximizing a discriminative loss of the type log p({C^n}_{n=1}^N | {X^n}_{n=1}^N, W); in this case, all the prior information p(X) about the structure of the images is lost, resulting in more samples being required to achieve the same performance (and less invariance to new test data).
S230 preferably includes performing HCN learning using a max-product message passing algorithm in multiple (forward and backward pass) iterations. Alternatively, S230 may include performing HCN learning using any type of message passing.
In a first example, S230 includes performing network initialization by initializing bottom-up messages with the logarithm of samples from a uniform distribution in (0,1], initializing top-down messages to −∞, and setting constant bottom-up messages to S0: m(S^0_rcf) = (k_1−k_0)X_rcf + k_0, with constants k_0 and k_1 determined by the bit-flip probabilities P_10 and P_01.
In this example, after initialization, S230 repeats the following:
For each layer ℓ from 1, . . . , L (forward pass), update the messages for that layer
update messages from all class layer POOLs to S^L and hard-assign C^n if a label is available
For each layer ℓ from L, . . . , 1 (backward pass), update the messages for that layer
This loop is repeated, finally generating max-marginal differences of W for each layer. Note that while this is an example of a particular message updating technique, this technique may be modified in any manner. For example, updates could be serial (instead of parallel), the damping factor could be different (lambda could take other values between zero and one, not necessarily 0.5), and the order of the updates could be modified (for instance, the messages for a given layer could be updated multiple times, or the message to W could be updated only occasionally, etc.).
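A hedged skeleton of the initialization and damping described in this example; the per-layer message updates themselves depend on the factor structure and are omitted, the constants k0 and k1 are left as placeholders, and the convex-combination form of the damped update is an assumption about how the damping factor is applied:

```python
import numpy as np

def initialize_messages(X, shape, rng=None):
    """Initialization from the example: bottom-up messages are logs of samples
    from a uniform distribution on (0, 1] (approximated here), top-down messages
    are -inf, and the constant bottom-up message to S0 is (k1 - k0) * X + k0."""
    rng = np.random.default_rng() if rng is None else rng
    bottom_up = np.log(rng.uniform(np.finfo(float).tiny, 1.0, size=shape))
    top_down = np.full(shape, -np.inf)
    # k0, k1 depend on the bit-flip probabilities; placeholders are used here.
    k1, k0 = 1.0, -1.0
    m_S0 = (k1 - k0) * np.asarray(X, dtype=float) + k0
    return bottom_up, top_down, m_S0

def damped_update(m_old, m_proposed, lam=0.5):
    """Damped max-product update: a convex combination of old and proposed messages."""
    return lam * m_proposed + (1.0 - lam) * m_old
```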
In a second example, multiple forward and backward/reverse transformations can be performed, such as described in U.S. application Ser. No. 14/822,730, filed Aug. 10, 2015, which is incorporated herein in its entirety by this reference.
In iterative learning techniques such as those described above, S230 may include iterating based on any condition (e.g., set time, set number of iterations, result achieving a threshold, etc.).
After the learning algorithm has completed, the max-marginal differences of W may be used to fix W in the HCN; for example, S230 may include setting values of W to binary values based on a set threshold, and then hard-coding these binary values into the HCN by removing AND factor nodes as appropriate (as shown in the conversion from
Note that while the preceding paragraphs describe batch learning processes, S230 may additionally or alternatively include performing online learning by storing messages sent to W during training on one training element (or set of training elements) and using these stored messages to aid in training on later training elements.
S240 includes generating a CNN from the HCN parameters, preferably by copying the binary weights learned by an HCN to a CNN with linear activations. S240 takes advantage of the similarities in HCN and CNN performance; thus, HCN parameters can be used to create either or both of trained CNNs and trained HCNs.
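A hedged sketch of this conversion: the learned max-marginal differences for W are thresholded into binary weights (as in S230), which are then used in a CNN-style layer with linear activations. The valid-mode correlation and the absence of biases are illustrative assumptions:

```python
import numpy as np
from scipy.signal import correlate2d

def binarize_weights(max_marginal_diff, threshold=0.0):
    """Fix W by thresholding the max-marginal differences into binary values."""
    return (np.asarray(max_marginal_diff) > threshold).astype(float)

def linear_conv_layer(x, W_binary):
    """One CNN-style layer with linear activations using the binary HCN weights.

    x: input of shape (H, W, F_in); W_binary: weights of shape (h, w, F_in, F_out)."""
    H, Wd, F_in = x.shape
    h, w, _, F_out = W_binary.shape
    out = np.zeros((H - h + 1, Wd - w + 1, F_out))
    for fo in range(F_out):
        for fi in range(F_in):
            out[:, :, fo] += correlate2d(x[:, :, fi], W_binary[:, :, fi, fo], mode="valid")
    return out  # linear activation: no nonlinearity applied

W_bin = binarize_weights(np.random.randn(3, 3, 1, 2))
x = np.random.rand(8, 8, 1)
print(linear_conv_layer(x, W_bin).shape)  # (6, 6, 2)
```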
The methods of the preferred embodiment and variations thereof can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions are preferably executed by computer-executable components preferably integrated with a hierarchical compositional network. The computer-readable medium can be stored on any suitable computer-readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, or any suitable device. The computer-executable component is preferably a general or application specific processor, but any suitable dedicated hardware or hardware/firmware combination device can alternatively or additionally execute the instructions.
As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the preferred embodiments of the invention without departing from the scope of this invention defined in the following claims.
This application is a continuation in part of U.S. application Ser. No. 15/708,383, filed on Sep. 19, 2017, which claims the benefit of U.S. Provisional Application Ser. No. 62/396,657, filed on Sep. 19, 2016, each of which is incorporated in its entirety by this reference. This application is related to U.S. application Ser. No. 14/822,730, filed Aug. 10, 2015, which is incorporated herein in its entirety by this reference.
Provisional application data:
Number | Date | Country
62396657 | Sep 2016 | US

Parent/child case data:
Relation | Number | Date | Country
Parent | 15708383 | Sep 2017 | US
Child | 17560000 | | US