A major challenge in information retrieval and computational systems biology is to study how complex interactions among system inputs influence final system outputs. In information retrieval, we often need to find the documents, webpages, or product descriptions most relevant to a query in scenarios such as online search, and modeling deep, semantically complex interactions among words and phrases is very important. For example, “bark” interacting with “dog” means something different than “bark” interacting with “tree”. In computational biology, high-throughput genome-wide molecular assays simultaneously measure the expression levels of thousands of genes and thereby probe cellular networks from different perspectives. These measurements provide a “snapshot” of transcription levels within the cell. As one of the most recent techniques, Chromatin Immunoprecipitation followed by parallel sequencing (ChIP-Seq) makes it possible to accurately identify Transcription Factor (TF) bindings and histone modifications at a genome-wide scale. These data enable us to study the combinatorial interactions involving TF bindings and histone modifications. As another example in computational biology, proteins normally carry out their functions by grouping or binding with other proteins. Modeling high-order protein interaction groups that appear only in disease samples but not in normal samples, for accurate disease status prediction such as cancer diagnosis, remains a very challenging problem.
In information retrieval, our previous approach called Supervised Semantic Indexing (SSI), based on linear transformations and polynomial expansions, has been used for document retrieval, but it does not consider complex high-order interactions among words and it has a shallow model architecture with limited learning capability. In computational biology, previous attempts focus on genome-wide pairwise co-association analysis using simple correlations, clustering, or Bayesian Networks. These approaches either do not reveal higher-order dependencies between input variables (genes), such as how the activity of one gene can affect the relationship between two or more other genes, or impose non-existent cause-effect relationships among genes.
We disclose systems and methods for determining complex interactions among system inputs by: using semi-Restricted Boltzmann Machines (semi-RBMs) with factorized gated interactions of different orders to model the interactions among the system inputs; applying the semi-RBMs to train a deep neural network with high-order within-layer interactions for learning a distance metric and a feature mapping; and tuning the deep neural network by minimizing margin violations between positive query-document pairs and corresponding negative pairs.
Implementations of the above aspect can include one or more of the following. Probabilistic graphical models are widely used for extracting insightful semantic or biological mechanistic information from input data and often provide a concise representation of complex system input interactions. A new framework can be used for discovering interactions among words and phrases based on discretized TF-IDF representation of documents and among Transcription Factors (TFs) based on multiple ChIP-Seq measurements. We extend Restricted Boltzmann Machine (RBM) to discover input feature interactions of arbitrary order. Instead of just focusing on modeling image mean and covariance as in mean-covariance RBM, our semi-RBMs here have gated interactions with a combination of orders ranging from 1 to m to approximate the arbitrary-order combinatorial input feature interactions in words and in TFs. The hidden units of our semi-RBMs act as binary switches controlling the interactions between input features. We use factorization to reduce the number of parameters. The semi-RBM with gated interaction of order 1 exactly corresponds to the traditional RBM. The discrete nature of our input data enables us to get samples from our semi-RBMs by using either fast deterministic damped mean-field updates or prolonged Gibbs sampling. The parameters of semi-RBMs are learned using Contrastive Divergence. After a semi-RBM is learned, we can treat the inferred hidden activities of input data as new data to learn another semi-RBM. This way, we can form a deep belief net with gated high-order interactions. Given pairs of discrete representations of a query and a document, we use these semi-RBMs with gated arbitrary-order interactions to pre-train a deep neural network generating a similarity score between the query and the document, in which the penultimate layer corresponds to a very powerful non-linear feature embedding of the original system input features. Then we use back-propagation to fine-tune the parameters of this deep gated high-order neural network to make positive pairs of query and document always have larger similarity scores than negative pairs based on margin maximization.
The system uses semi-RBMs with factorized gated interactions of a combination of different orders to model complex interactions among system inputs. Applications include modeling the complex interactions between different words in documents and queries and predicting the bindings of some TFs given other TFs, which provides insight into deep semantic information for information retrieval and into TF binding redundancy and TF interactions for gene regulation.
The semi-RBMs are used to efficiently train a deep neural network with high-order within-layer interactions, which is one of the first deep neural networks capable of dealing with high-order lateral connections for learning a distance metric and a feature mapping.
The deep neural network is fine-tuned by minimizing margin violations between positive query-document pairs and corresponding negative pairs, which is one of the first attempts at combining large-margin learning and deep gated neural networks.
Advantages of the system may include one or more of the following. The system extends the Restricted Boltzmann Machine (RBM) to discover input feature interactions of arbitrary order. The system is capable of capturing combinatorial interactions between system inputs. In addition to modeling real continuous image data, the system can handle discrete data. Instead of just focusing on modeling image mean and covariance as in the mean-covariance RBM, our semi-RBMs here have gated interactions with a combination of orders ranging from 1 to m to approximate the arbitrary-order combinatorial input feature interactions in words and in TFs. The system can be used to identify complex non-linear system input interactions for data de-noising and data visualization, especially in biomedical applications and scientific data exploration. The system can also be used to improve the performance of current search engines, collaborative filtering systems, online advertisement recommendation systems, and many other e-commerce systems.
The framework of the system, a deep gated high-order neural network for nonlinear semantic indexing, is now described in detail.
As in traditional SSI, training is conducted by minimizing the following margin ranking loss on a tuple (q, d+, d−):

L(q, d+, d−)=max(0, 1−f(q, d+)+f(q, d−)),
where q is the query, d+ is a relevant document, d− is an irrelevant document, and f(·,·) is the similarity score.
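For concreteness, a minimal Python/NumPy sketch of this ranking loss is given below. The margin of 1 and the dot-product similarity are illustrative assumptions only; in the system, f(·,·) is produced by the deep gated network described later.

```python
import numpy as np

def margin_ranking_loss(f, q, d_pos, d_neg, margin=1.0):
    """Hinge loss max(0, margin - f(q, d+) + f(q, d-)) for one training triplet."""
    return max(0.0, margin - f(q, d_pos) + f(q, d_neg))

# Illustrative usage with a simple dot-product similarity (an assumption for this sketch).
f = lambda a, b: float(np.dot(a, b))
q, d_pos, d_neg = np.random.rand(3, 50)
print(margin_ranking_loss(f, q, d_pos, d_neg))
```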
Next, we will discuss implementations of the RBM system. The RBM is an undirected graphical model with one visible layer v and one hidden layer h. There are symmetric connections W between the hidden layer and the visible layer, but there are no within-layer connections. For an RBM with stochastic binary visible units v and stochastic binary hidden units h, the joint probability distribution of a configuration (v, h) of the RBM is defined based on its energy as follows:

E(v, h)=−Σibivi−Σjcjhj−Σi,jviWijhj, (1)

p(v, h)=exp(−E(v, h))/Z,
where b and c are biases, and Z is the partition function with Z=Σu,gexp(−E(u,g)). Due to the bipartite structure of the RBM, given the visible states, each hidden unit is conditionally independent, and given the hidden states, the visible units are conditionally independent.
This nice property allows us to get unbiased samples from the posterior distribution of the hidden units given an input data vector. By minimizing the negative log-likelihood of the observed input data vectors using gradient descent, the update rule for the weight W is as follows,
ΔWij=ε(<vihj>data−<vihj>∞). (5)
where ε is the learning rate, <·>data denotes the expectation with respect to the data distribution, and <·>∞ denotes the expectation with respect to the model distribution. In practice, we do not have to sample from the equilibrium distribution of the model, and even one-step reconstruction samples work very well [?].
ΔWij=ε(<vihj>data−<vihj>recon), (6)
Although the above update rule does not follow the gradient of the log-likelihood of the data exactly, it works very well in practice. In [?], it is shown that a deep belief net based on stacked RBMs can be trained greedily layer by layer. Given some observed input data, we train an RBM to get the hidden representations of the data. We can view the learned hidden representations as new data and train another RBM. We can repeat this procedure many times to pretrain a deep neural network, and then use backpropagation to fine-tune all the network connection weights.
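The following NumPy sketch performs one CD-1 update of the form in Equation (6) for a binary RBM; the learning rate, batch handling, and random seed are illustrative assumptions. Greedy stacking amounts to calling this routine with the inferred hidden probabilities of the layer below as the new data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, eps=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for a binary RBM on a batch v0 of shape (N, n)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0 = sigmoid(v0 @ W + c)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # One-step reconstruction of the visibles, then the corresponding hidden probabilities.
    v1 = sigmoid(h0_sample @ W.T + b)
    h1 = sigmoid(v1 @ W + c)
    # Approximate gradients <v h>_data - <v h>_recon, averaged over the batch.
    N = v0.shape[0]
    W += eps * (v0.T @ h0 - v1.T @ h1) / N
    b += eps * (v0 - v1).mean(axis=0)
    c += eps * (h0 - h1).mean(axis=0)
    return W, b, c
```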
In the RBM, the marginal distribution of the visible units is as follows,

p(v)=(1/Z)exp(Σibivi)Πj(1+exp(cj+ΣiviWij)),
The above distribution shows that the RBM can be viewed as a Product of Experts (PoE) model, in which each hidden unit corresponds to a mixture expert, and the non-linear dependencies between visible units are implicitly encoded owing to the non-factorization property of each expert.
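This product form gives a simple way to score visible vectors up to the constant log Z (the negative free energy). A short NumPy sketch, with illustrative shapes, is:

```python
import numpy as np

def log_unnormalized_marginal(v, W, b, c):
    """log p(v) + log Z = sum_i b_i v_i + sum_j log(1 + exp(c_j + sum_i v_i W_ij))."""
    return float(v @ b + np.sum(np.logaddexp(0.0, c + v @ W)))
```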
Next we discuss the use of the Semi-Restricted Boltzmann Machine for discrete categorical data. An RBM without lateral connections captures dependencies between visible units (features) in a less convenient way, requiring much more coordination among hidden units than semi-RBMs. In the following, we describe two different types of semi-RBMs tailored for modeling feature dependencies in discrete categorical data.
We extend the energy function of the RBM in Equation 1 to handle both discrete categorical data and feature dependencies through explicit lateral connections, and we call the resulting model “lateral semi-RBM” (lsRBM). The energy function of the lsRBM is,
where we use K softmax binary visible units to represent each discrete feature taking values from 1 to K, vik=1 if and only if the discrete value of the i-th feature is k, Wijk is the connection weight between the k-th softmax binary unit of feature i and hidden unit j, Zi is the normalization term enforcing that the probabilities of feature i taking all possible discrete values, that is, the marginal probabilities {p(vik=1|h, v)}k, sum to 1, and Lii′kk′ is the lateral connection weight between feature i taking value k and feature i′ taking value k′ (except where explicitly mentioned otherwise, in all subsequent descriptions we use i for indexing visible units, j for indexing hidden units, and Z for denoting normalization terms). If we have n features and K possible discrete values for each feature, we have n(n−1)K²/2 lateral connection weights. The lateral connections between visible units do not affect the conditional distributions for the hidden units p(hj|v), which are still conditionally independent as in the RBM, but the conditional distributions p(vik|h) are no longer independent. We use “damped mean-field” updates to get approximate samples {r(vik)} from p(v|h). Then we have,
where T is the maximum number of iterations of the mean-field updates; instead of using p(vik=1|h) from the RBM to initialize {r0(vik)}, we can also use a data vector v for initialization here.
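A minimal sketch of these damped mean-field updates is given below, assuming W is stored as an n x K x m tensor, L as an n x K x n x K tensor with zero diagonal blocks, and a damping coefficient lam; these choices are illustrative rather than prescribed by the formulation above.

```python
import numpy as np

def damped_mean_field(h, W, L, b, r0, lam=0.5, T=10):
    """Approximate samples r(v_ik) from p(v | h) for the lateral semi-RBM (lsRBM).

    h  : (m,)         hidden configuration
    W  : (n, K, m)    visible-hidden weights W_ijk
    L  : (n, K, n, K) lateral weights L_ii'kk' (assumed symmetric with zero diagonal blocks)
    b  : (n, K)       visible biases
    r0 : (n, K)       initial soft assignments, e.g. the one-hot encoding of a data vector
    """
    r = r0.copy()
    for _ in range(T):
        # Input to each softmax visible unit from the hiddens and the current r of the other features.
        a = b + np.einsum('ikm,m->ik', W, h) + np.einsum('ikjl,jl->ik', L, r)
        p = np.exp(a - a.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)   # softmax over the K values of each feature
        r = (1.0 - lam) * r + lam * p       # damped update
    return r
```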
As in the RBM, we use contrastive divergence to update the connection weights of the lsRBM to approximately maximize the log-likelihood of the observed data.
ΔWijk=ε(<vikhj>data−<vikhj>recon),
ΔLii′kk′=ε(&lt;vikvi′k′&gt;data−&lt;rT(vik)rT(vi′k′)&gt;recon),
Δbik=ε(<vik>data−<rT(vik)>recon),
Δcj=ε(<hj>data−<hj>recon),
where we also use a small number of steps of sampled reconstructions to approximate the terms under the model distribution.
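Under the same assumed tensor layout as in the previous sketch, the lateral-weight update compares pairwise visible statistics on the data with those of the mean-field reconstruction:

```python
import numpy as np

def lateral_cd_update(L, v_data, r_recon, eps=0.01):
    """Delta L_ii'kk' = eps * (<v_ik v_i'k'>_data - <r_T(v_ik) r_T(v_i'k')>_recon).

    v_data : (n, K) one-hot encoding of an observed data vector
    r_recon: (n, K) mean-field reconstruction r_T(v) for the same vector
    """
    pos = np.einsum('ik,jl->ikjl', v_data, v_data)
    neg = np.einsum('ik,jl->ikjl', r_recon, r_recon)
    L += eps * (pos - neg)
    # Keep self-interactions out of the lateral weights (an illustrative convention).
    idx = np.arange(L.shape[0])
    L[idx, :, idx, :] = 0.0
    return L
```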
In the lsRBM, the marginal distribution p(v) takes the following form,
where vik=1 if and only if the discrete value of feature i is k. This marginal distribution shows that the pairwise dependencies between features are captured only by the explicit lateral connection weights Lii′ acting as bias terms. As in the RBM, the hidden units of the lsRBM also play the role of defining mixture experts, and the higher-order dependencies between features are implicitly captured by the product of the mixture experts.
Next we will consider the semi-RBM with factored multiplicative interaction terms. One exemplary semi-RBM that uses hidden units to directly modulate the interactions between features can be defined with the following energy function (we omit bias terms here for convenience of description),

E(v, h)=Σi,i′,jWii′jvivi′hj,
However, in this energy function, we need mn² parameters provided that we have n visible units and m hidden units. Factorization is used to approximate the three-way interaction weight Wii′j by ΣfWifWi′fUjf. In this way, the above energy function with three-way interactions can be written as Σf(ΣiWifvi)²(ΣjUjfhj). In the following, we extend factored semi-RBMs for modeling discrete categorical data with an arbitrary order of feature interactions. Using K softmax binary units to represent a discrete feature with K possible values as in the previous section, the energy function of the factored semi-RBM for discrete data is,
where d is a user-defined parameter that controls the order of interactions between features. If d=2, the above energy function captures all possible pairwise feature interactions, which is a factored version of Equation 13. We call the semi-RBM defined by this energy function “factored semi-RBM” (fsRBM). In fsRBM, the marginal distribution of the visible units is,
The marginal distribution of the fsRBM can also be viewed as a PoE model, and each expert is a mixture model. However, unlike in the lsRBM, each hidden unit can be used to choose a mixture component modeling d-th order interactions between features, thereby modulating high-order interactions between features directly. As in the lsRBM, complex non-linear dependencies between features are also implicitly encoded by the PoE model.
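The factored order-d interaction term described above can be evaluated as in the sketch below; the tensor shapes and the sign convention (matching the factored expression Σf(ΣiWifvi)²(ΣjUjfhj) quoted earlier) are illustrative assumptions.

```python
import numpy as np

def fsrbm_interaction_energy(v, h, Wk, U, d=2):
    """Factored order-d interaction term: sum_f (sum_{i,k} W_ifk v_ik)^d * (sum_j U_jf h_j).

    v  : (n, K)    one-hot visible configuration
    h  : (m,)      hidden configuration
    Wk : (n, K, F) visible-to-factor weights
    U  : (m, F)    hidden-to-factor weights
    """
    proj_v = np.einsum('ik,ikf->f', v, Wk)   # per-factor projection of the visibles
    proj_h = h @ U                           # per-factor gating by the hiddens
    return float(np.sum((proj_v ** d) * proj_h))
```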
In the above fsRBM, only d-th order interactions are explicitly considered in the energy function; we now extend it to include interactions of all possible orders smaller than or equal to d, and we call the resulting model “factored polynomial semi-RBM” (fpsRBM). The energy function of the fpsRBM is,
where {W(a)k}, U(a), and h(a) are, respectively, the connection weights between visible units and factors, the connection weights between hidden units and factors, and the interaction-modulating hidden units for order a. Please note that, when a=1, the energy term Σf(Σi,kWif(1)kvik)(ΣjUjf(1)hj(1)) is a factored version of the traditional RBM. In fpsRBM, we can view {h(a)} as a complete set of hidden representations gating different orders of feature interactions up to order d.
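Reusing the fsrbm_interaction_energy helper from the sketch above, the fpsRBM interaction energy simply sums the per-order terms with separate (assumed) parameter sets {W(a)k}, U(a), and h(a) for each order a = 1, ..., d:

```python
def fpsrbm_energy(v, hs, Wks, Us):
    """Sum of the factored interaction terms of fpsRBM over orders a = 1..d.

    hs  : list of hidden configurations h^(a), one per order
    Wks : list of (n, K, F_a) visible-to-factor weights W^(a)k
    Us  : list of (m_a, F_a) hidden-to-factor weights U^(a)
    """
    return sum(fsrbm_interaction_energy(v, h, Wk, U, d=a + 1)
               for a, (h, Wk, U) in enumerate(zip(hs, Wks, Us)))
```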
If we only use one set of hidden units h, connection weights U, and {Wk} for all the interaction terms with all possible orders from 1 to d, the above energy function simplifies to the following form,
We call the semi-RBM defined by the above energy function “weight sharing factored polynomial semi-RBM” (ws-fpsRBM).
The inference in factored semi-RBMs is similar to that of the lsRBM: the hidden units are conditionally independent given the visibles, but the visible units are dependent given the hiddens, so we need to use “mean-field” updates to get approximate samples for the visibles.
The conditionals and the mean-field updates for fpsRBM and ws-fpsRBM are as follows (the ones for fsRBM are almost the same as those for ws-fpsRBM due to the high similarity in their energy functions),
where rt(vik) is the approximate sample for feature i taking value k by the “damped mean-field” update at the t-th iteration, given the hidden configuration h; and T is the maximum number of iterations of the mean-field updates. We initialize r0(v) to be a data vector here.
Taking a similar form to the updates in the lsRBM, the updates of the connection weights and biases for fpsRBM and ws-fpsRBM by contrastive divergence are as follows,
where fpsRBM and ws-fpsRBM share the same update for the biases of the visible units. Comparing fpsRBM to ws-fpsRBM, we see that the former is more complex and flexible than the latter, and both models have more orders of explicit feature interactions than fsRBM.
Next we will discuss the semi-supervised semi-RBM and the conditional distribution for visibles. The semi-RBMs for modeling discrete categorical data described in the previous section can be easily extended to a semi-supervised setting, giving semi-supervised semi-RBMs (s3RBMs). To do that, we simply view the multi-class label of a data vector as an additional softmax visible input. For convenience of description, we assume that the number of classes is equal to the number of possible discrete values taken by input features. Thereby, the energy functions of s3RBMs are almost the same as the energy functions of the semi-RBMs described in the previous section, except that we call one of the visible units (for example, the i-th one) {yk} instead of {vik}, and yk=1 if and only if the class label of an input data vector is k.
For unlabeled data, we treat {yk} as missing values, and we train a separate semi-RBM without the class unit y, which shares all the other weights and biases with the semi-RBM containing visible unit y.
In the s3RBM, given an input vector, we can easily predict its class label. The conditional distributions p(y|v) for the lsRBM, fpsRBM, and ws-fpsRBM have the following respective forms,
where byk is the bias term for yk. Because y in the subscript indexes the special visible unit corresponding to the class label of v, we can use exactly the same equations above to calculate the conditional distributions p(vik|v−i) by simply replacing the subscript index y with i.
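The closed-form conditionals referenced above are model-specific; as a generic illustration only, the sketch below recovers p(yk=1|v) for any of these energy-based models by scoring each candidate label with an assumed free_energy routine (where p(v) is proportional to exp(−F(v))) and normalizing.

```python
import numpy as np

def predict_label(v_wo_label, free_energy, K):
    """p(y_k = 1 | v) by enumerating the K possible label values.

    v_wo_label : (n, K) one-hot visibles without the label unit
    free_energy: callable mapping a completed (n + 1, K) configuration to F(v)
    """
    scores = np.empty(K)
    for k in range(K):
        y = np.zeros((1, K))
        y[0, k] = 1.0
        scores[k] = -free_energy(np.vstack([v_wo_label, y]))
    scores -= scores.max()                 # stabilize before exponentiating
    p = np.exp(scores)
    return p / p.sum()
```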
Although we can efficiently compute the conditionals p(yk=1|v) and p(vik|v−i), we must sum over an exponential number of configurations of v−(S∪V) to compute p(vS|vV) for all the factored semi-RBMs with multiplicative interactions, where S and V denote two arbitrary subsets of visible units. We take a similar approach to the one in [?]. But unlike in RBM, we cannot compute p(h|vV) analytically due to the interaction terms involving visible units other than those in V. Instead, we approximate the conditional distribution over hiddens by treating the other visible units v−(S∪V) as missing values and ignoring them. Given the approximate conditional distribution over hiddens p̂(h|vV), we run the damped mean-field updates by clamping the observed visibles vV at each iteration t, and we use the final output of the mean-field updates {rT(vik)}i∈S,k∈{1, . . . , K} to approximate p(vS|vV).
For the lsRBM, we can compute p(vS|vV) exactly as follows,
where [·] is an indicator function. We must enumerate K^size(S) possible configurations to compute the conditional distributions above, but we can use a mean-field approximation strategy similar to the one for fsRBMs to approximate p(vS|vV) for the lsRBM.
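As an illustration of this K^size(S) enumeration, the sketch below computes the conditional over a block S with all remaining visibles clamped to the supplied configuration; it uses an assumed free_energy routine that sums out the hidden units and, as a simplification, does not marginalize over units outside S∪V.

```python
import itertools
import numpy as np

def conditional_over_subset(v, S, free_energy, K):
    """Approximate p(v_S | v_V) by enumerating all K^|S| joint settings of the features in S.

    v : (n, K) one-hot visibles with the observed block v_V already clamped
    S : list of feature indices whose joint conditional is required
    """
    configs = list(itertools.product(range(K), repeat=len(S)))
    scores = np.empty(len(configs))
    for c, ks in enumerate(configs):
        v_c = v.copy()
        for i, k in zip(S, ks):
            v_c[i] = 0.0
            v_c[i, k] = 1.0
        scores[c] = -free_energy(v_c)       # unnormalized log-probability of the completed vector
    scores -= scores.max()
    p = np.exp(scores)
    return dict(zip(configs, p / p.sum()))
```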
Next, one application of the system, nonlinear semantic indexing for document retrieval, is discussed.
To train the deep gated high-order neural network for nonlinear semantic indexing, the semi-RBMs described above are used to pre-train the network layer by layer, and back-propagation is then used to fine-tune the network so that positive query-document pairs receive larger similarity scores than the corresponding negative pairs, based on the margin ranking loss described earlier.
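A deliberately simplified fine-tuning sketch is shown below: the pretrained network is abstracted as a frozen embedding phi(·) (for example, the penultimate-layer features), the similarity is taken to be a bilinear score f(q, d) = phi(q)ᵀ M phi(d), and only M is updated on margin-violating triplets. The actual system back-propagates through all layers of the deep gated network; this reduced form only illustrates the large-margin objective.

```python
import numpy as np

def finetune_bilinear(phi, triplets, dim, margin=1.0, eps=0.01, epochs=5):
    """Large-margin tuning of a bilinear similarity f(q, d) = phi(q)^T M phi(d).

    phi      : pretrained (frozen) embedding function
    triplets : iterable of (q, d_pos, d_neg) raw representations
    """
    M = np.eye(dim)
    for _ in range(epochs):
        for q, d_pos, d_neg in triplets:
            eq, ep, en = phi(q), phi(d_pos), phi(d_neg)
            violation = margin - eq @ M @ ep + eq @ M @ en
            if violation > 0:               # only margin-violating triplets contribute
                M += eps * (np.outer(eq, ep) - np.outer(eq, en))
    return M
```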
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
The present application claims priority to Provisional Application Ser. No. 61/810,812 filed on Apr. 11, 2013, the content of which is incorporated by reference.