Annotating or labeling observation sequences arises in many applications across a variety of scientific disciplines, most prominently in natural language processing, information extraction, speech recognition, and bio-informatics. Recently, the predominant formalism for modeling and predicting label sequences has been based on discriminative models and their variants. Conditional Random Fields (CRFs) are perhaps the most commonly used technique for probabilistic sequence modeling.
The following detailed description references the drawings.
As detailed above, CRFs are commonly used for probabilistic sequence modeling. Structured data are widely prevalent in the real world, and observation sequences tend to have distinct internal sub-structure and to exhibit predictable relationships between individual class labels, especially for natural language. For example, in the task of noun phrase chunking, a noun phrase begins with a noun or a pronoun and may be accompanied by a set of modifiers; such a noun phrase may itself contain one or more base noun phrases. In the named entity recognition task, named entities have particular characteristics in their composition. A location name can end with a location salient word but cannot end with an organization salient word. A complex, nested organization name may be composed of a person name, a location name, or even another organization name. Such complex and expressive structures can strongly influence predictions. The efficiency of the CRF approach depends heavily on its first-order Markov property: given the observation, the label of a token is assumed to depend only on the labels of its adjacent tokens. Thus, although the CRF approach models the transitions between class labels and thereby enjoys advantages of both generative and discriminative methods, it captures only these external dynamics without consideration for internal sub-structure.
In examples described herein, the internal sub-structure in sequence data is directly modeled by augmenting a set of observed variables with additional latent, or hidden, state variables that model relevant sub-structure in a given sequence, resulting in a new discriminative framework, Hidden Dynamic Conditional Random Fields (HDCRFs). The model learns external dependencies by modeling a continuous stream of class labels and learns internal sub-structure by utilizing intermediate hidden states. HDCRFs define a conditional distribution over the class labels and hidden state labels conditioned on the observations, where dependencies between the hidden variables can be expressed by an undirected graph. Such modeling is able to deal with features that can be arbitrary functions of the observations. Efficient parameter estimation and inference can be carried out using standard graphical model algorithms such as belief propagation.
For example, in web data extraction from encyclopedic pages such as WIKIPEDIA®, each encyclopedic page has a major topic or concept represented by a principal data record such as “Beijing”. A goal of HDCRFs is to extract all of the data records of interest, such as “Beijing municipality”, “October 28”, “1420”, and “Qing Dynasty”, and to assign class labels to these data records. In this example, the class labels can include pre-defined labels such as “person”, “date”, “year”, and “organization” assigned to each data record, while hidden state variables identify substructures such as the relationship between “Beijing” and “municipality” or between “Qing” and “Dynasty.” If the substructure between “Beijing” and “municipality” is identified, “Beijing municipality” can be properly labeled as an “organization.” WIKIPEDIA® is a registered trademark of the Wikimedia Foundation, Inc., which is headquartered in San Francisco, Calif.
In some examples, a conditional probability distribution for labeling data record segments is defined, where the conditional probability distribution models dependencies between class labels and internal substructures of the data record segments. Data record segments may be observed data such as content from web pages, text from books, documents, etc. At this stage, optimal parameter values are determined for the conditional probability distribution by applying a quasi-Newton gradient ascent method to training data, where the conditional probability distribution is restricted to a disjoint set of hidden states for each of the class labels. The conditional probability distribution and the optimal parameter values are used to determine a most probable labeling sequence for the data record segments.
Referring now to the drawings, an example computing device 100 for analyzing data using hidden dynamic systems includes a processor 110, an interface 115, and a machine-readable storage medium 120.
Processor 110 may be central processing unit(s) (CPUs), microprocessor(s), and/or other hardware device(s) suitable for retrieval and execution of instructions stored in machine-readable storage medium 120. Processor 110 may fetch, decode, and execute instructions 122, 124, 126 to enable analyzing data using hidden dynamic systems (e.g., hidden states). As an alternative or in addition to retrieving and executing instructions, processor 110 may include electronic circuits comprising a number of electronic components for performing the functionality of instructions 122, 124, 126.
Interface 115 may include a number of electronic components for communicating with a server device. For example, interface 115 may be an Ethernet interface, a Universal Serial Bus (USB) interface, an IEEE 1394 (Firewire) interface, an external Serial Advanced Technology Attachment (eSATA) interface, or any other physical connection interface suitable for communication with the server device. Alternatively, interface 115 may be a wireless interface, such as a wireless local area network (WLAN) interface or a near-field communication (NFC) interface. In operation, as detailed below, interface 115 may be used to send and receive data to and from a corresponding interface of a server device.
Machine-readable storage medium 120 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, machine-readable storage medium 120 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. As described in detail below, machine-readable storage medium 120 may be encoded with executable instructions for analyzing data using hidden dynamic systems.
Probability distribution defining instructions 122 define a probability distribution for labeling observation sequences. Suppose X is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label sequences. The distribution defines mappings between an observation sequence X=(x1, x2, . . . , xT) and the corresponding label sequence Y=(y1, y2, . . . , yT). Each yj is a member of the set of possible class labels. For each sequence, a vector of sub-structure variables S=(s1, s2, . . . , sT) is assumed; these variables are not observed in training examples and thus form a set of hidden variables. Each sj is a member of a finite set Syj of possible hidden states for the class label yj. Let 𝒮 denote the union of all the sets Sy, i.e., the set of all possible hidden states. Each sj corresponds to a labeling of xj with some member of 𝒮, which may correspond to sub-structure of the sequence.
Given the above definitions, a hidden dynamic probabilistic model can be defined as follows:
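A sketch of such a model, following the standard latent-dynamic formulation implied by the definitions above (the original display equation may differ in notation), marginalizes over the hidden-state sequences:

```latex
\[
P(Y \mid X, \Lambda) \;=\; \sum_{S} P(Y \mid S, X, \Lambda)\, P(S \mid X, \Lambda)
\]
```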
By definition, sequences which have any sj ∉ Syj will obviously have P(Y|S, X)=0, so the model above can be rewritten as:
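A sketch of the rewritten form, summing only over hidden-state sequences that are consistent with the label sequence:

```latex
\[
P(Y \mid X, \Lambda) \;=\; \sum_{S \,:\, \forall j,\; s_j \in S_{y_j}} P(S \mid X, \Lambda)
\]
```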
Similar to CRFs, the conditional probability distribution P(S|X) can take the form:
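A sketch of this exponential form, using the feature functions and weights defined below:

```latex
\[
P(S \mid X, \Lambda) \;=\; \frac{1}{Z(X)} \exp\!\left( \sum_{j} \sum_{k} \lambda_k\, f_k(s_{j-1}, s_j, X, j) \right)
\]
```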
where Z(X) is an instance-specific normalization function:
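As a sketch, this normalization sums over all possible hidden-state sequences:

```latex
\[
Z(X) \;=\; \sum_{S} \exp\!\left( \sum_{j} \sum_{k} \lambda_k\, f_k(s_{j-1}, s_j, X, j) \right)
\]
```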
and {ƒk(sj−1, sj, X, j)}, k = 1, . . . , K, is a set of real-valued feature functions. Λ={λ1, . . . , λK} is a parameter vector that reflects the confidence of the feature functions. Each feature function can be either a transition function tk(sj−1, sj, X, j) over the entire observation sequence and the hidden variables at positions j and j−1, or a state function sk(sj, X, j) that depends on a single hidden variable at position j. Note that the model is different from hidden conditional random fields (HCRFs), which model the conditional probability of a single class label y given the observation sequence X through:
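A sketch of the HCRF conditional being contrasted here, under the assumption (not stated in the text) that the HCRF feature functions additionally depend on the single class label y:

```latex
\[
P(y \mid X, \Lambda) \;=\; \frac{1}{Z'(X)} \sum_{S} \exp\!\left( \sum_{j} \sum_{k} \lambda_k\, f_k(y, s_{j-1}, s_j, X, j) \right)
\]
```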
where the partition function Z′(X) is given by:
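Correspondingly, a sketch of the partition function, summing over all candidate class labels y′ and all hidden-state sequences:

```latex
\[
Z'(X) \;=\; \sum_{y'} \sum_{S} \exp\!\left( \sum_{j} \sum_{k} \lambda_k\, f_k(y', s_{j-1}, s_j, X, j) \right)
\]
```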
HDCRFs combine the strengths of CRFs and HCRFs by modeling both external dependencies between class labels and internal sub-structure. Specifically, the weights Λ associated with the transition functions tk(sj−1, sj, X, j) model both the internal sub-structure and the external dependencies between different class labels: weights associated with a transition function for hidden states that are in the same subset Syj model the sub-structure patterns, while weights associated with transition functions for hidden states from different subsets model the external dependencies between labels.
Optimal parameter determining instructions 124 determine optimal parameters for the probability distribution. Given training data consisting of n labeled sequences D={(X1, Y1), (X2, Y2), . . . , (Xn, Yn)}, the parameters Λ={λk} are set to maximize the conditional log-likelihood. Following previous work on CRFs, the following objective function can be used to estimate the parameters:
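A sketch of this objective, the conditional log-likelihood of the n training sequences:

```latex
\[
L(\Lambda) \;=\; \sum_{i=1}^{n} \log P(Y_i \mid X_i, \Lambda)
\]
```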
To avoid over-fitting, the log-likelihood can be penalized by a prior distribution over the parameters that provides smoothing to help with sparsity in the training data. A commonly used prior is a zero-mean Gaussian with variance σ2. With a Gaussian prior, the log-likelihood is penalized as follows:
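A sketch of the penalized form, writing the squared parameter norm as ‖Λ‖² = Σk λk²:

```latex
\[
L(\Lambda) \;=\; \sum_{i=1}^{n} \log P(Y_i \mid X_i, \Lambda) \;-\; \frac{\lVert \Lambda \rVert^{2}}{2\sigma^{2}}
\]
```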
Structural constraints can be encoded with an undirected graph structure, where the hidden variables {s1, s2, . . . , sT} correspond to vertices in the graph. To ensure that training and inference remain tractable, the model can be restricted to have disjoint sets (i.e., sets that contain no elements in common) of hidden states associated with each class label. A quasi-Newton gradient ascent method can be used to search for the optimal parameter values, Λ* = arg maxΛ L(Λ), under this criterion.
In the gradient computations below, P(sj=a|Y,X) and P(sj=a, sk=b|Y,X) are marginal distributions over individual variables sj or over pairs of variables {sj, sk} corresponding to edges in the graph. The gradient of L(Λ) can be defined in terms of these marginal distributions and can therefore be calculated efficiently.
We first consider derivatives with respect to the parameters λk associated with a state function sk. Taking derivatives results in:
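A sketch of the resulting derivative for a single training sequence, following the standard latent-dynamic CRF gradient (when the Gaussian prior is used, a term −λk/σ² is added):

```latex
\[
\frac{\partial L(\Lambda)}{\partial \lambda_k}
  \;=\; \sum_{j} \sum_{a \in \mathcal{S}}
    \Big[ P(s_j = a \mid Y, X) - P(s_j = a \mid X) \Big]\, s_k(a, X, j)
\]
```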
This shows that the derivative ∂L(Λ)/∂λk can be expressed in terms of components P(sj=a|X) and P(Y|X), which can be computed using belief propagation.
For derivatives with respect to the parameters λi corresponding to a transition function ti, a similar calculation provides:
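A sketch of the corresponding derivative for a chain-structured graph, again for a single training sequence:

```latex
\[
\frac{\partial L(\Lambda)}{\partial \lambda_i}
  \;=\; \sum_{j} \sum_{a,\, b \,\in\, \mathcal{S}}
    \Big[ P(s_{j-1} = a,\, s_j = b \mid Y, X) - P(s_{j-1} = a,\, s_j = b \mid X) \Big]\, t_i(a, b, X, j)
\]
```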
Hence ∂L(Λ)/∂λi can also be expressed in terms of quantities (e.g., the marginal probabilities P(sj=a, sk=b|Y,X)) that can be computed efficiently using belief propagation. Gradient ascent can be performed with the limited-memory quasi-Newton BFGS (L-BFGS) optimization technique.
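As one possible illustration (a minimal, hypothetical sketch rather than the implementation of the present disclosure), chain-structured belief propagation, i.e., a log-domain forward-backward pass, can compute log Z(X) and the per-position hidden-state marginals P(sj=a|X) from node and edge log-potentials built from the feature weights:

```python
# Hypothetical sketch of chain-structured belief propagation (forward-backward).
# node_logpot[j, a] would be sum_k lambda_k * s_k(a, X, j); edge_logpot[a, b]
# would collect the transition-function contributions.
import numpy as np
from scipy.special import logsumexp


def chain_marginals(node_logpot, edge_logpot):
    """Return (log_Z, marginals) for a linear-chain model.

    node_logpot: (T, H) per-position hidden-state log-potentials.
    edge_logpot: (H, H) transition log-potentials.
    marginals[j, a] approximates P(s_j = a | X).
    """
    T, H = node_logpot.shape
    alpha = np.zeros((T, H))   # forward log-messages
    beta = np.zeros((T, H))    # backward log-messages

    alpha[0] = node_logpot[0]
    for j in range(1, T):
        # alpha[j, b] = logsumexp_a(alpha[j-1, a] + edge[a, b]) + node[j, b]
        alpha[j] = logsumexp(alpha[j - 1][:, None] + edge_logpot, axis=0) + node_logpot[j]

    for j in range(T - 2, -1, -1):
        # beta[j, a] = logsumexp_b(edge[a, b] + node[j+1, b] + beta[j+1, b])
        beta[j] = logsumexp(edge_logpot + node_logpot[j + 1] + beta[j + 1], axis=1)

    log_z = logsumexp(alpha[-1])
    marginals = np.exp(alpha + beta - log_z)   # P(s_j = a | X)
    return log_z, marginals
```

With these quantities (and the analogous "clamped" marginals obtained by restricting each position j to the hidden states in Syj), the penalized log-likelihood and its gradient can be assembled and passed to an off-the-shelf limited-memory BFGS routine, for example scipy.optimize.minimize(fun, x0, jac=True, method="L-BFGS-B").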
Labeling sequence determining instructions 126 determine a labeling sequence for observation data (e.g., data record segments). Given a new test sequence X, the most probable labeling sequence Y*, i.e., the sequence that maximizes the conditional model, can be estimated:
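That is, as a sketch:

```latex
\[
Y^{*} \;=\; \arg\max_{Y}\; P(Y \mid X, \Lambda^{*})
\]
```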
where the parameters are learned via a training process. Assuming each class label is associated with a disjoint set of hidden states, the previous equation can be rewritten as:
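A sketch of the rewritten, per-position estimate, summing the hidden-state marginals within each label's disjoint set:

```latex
\[
y_j^{*} \;=\; \arg\max_{y}\; \sum_{a \in S_{y}} P(s_j = a \mid X, \Lambda^{*})
\]
```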
The marginal probabilities P(sj=a|X, Λ*) can be computed for all possible hidden states a ∈ 𝒮 to estimate the label yj*. These marginal probabilities may then be summed according to the disjoint sets of hidden states Syj, and the label associated with the optimal set can be selected. As discussed in the previous subsection, these marginal probabilities can also be computed efficiently using belief propagation. The maximal marginal probabilities approach can be used to estimate the sequence of labels because it minimizes the expected per-position labeling error.
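As a small, hypothetical illustration of this decoding step (the names decode_labels, marginals, and label_states are assumptions for the sketch, not taken from the disclosure): sum each position's hidden-state marginals within each label's disjoint state set and pick the label with the largest sum.

```python
# Hypothetical sketch of maximal-marginal decoding over disjoint hidden-state sets.
import numpy as np


def decode_labels(marginals, label_states):
    """marginals: (T, H) array of P(s_j = a | X, Lambda*).
    label_states: dict mapping each class label to its disjoint list of
    hidden-state indices. Returns the estimated label sequence."""
    labels = list(label_states)
    # score[j, y] = sum of P(s_j = a | X, Lambda*) over a in S_y
    scores = np.stack(
        [marginals[:, label_states[y]].sum(axis=1) for y in labels], axis=1
    )
    return [labels[i] for i in scores.argmax(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = rng.random((5, 4))
    m /= m.sum(axis=1, keepdims=True)          # normalize like real marginals
    print(decode_labels(m, {"organization": [0, 1], "other": [2, 3]}))
```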
In the embodiment shown in the drawings, computing device 200 communicates with server devices 250A, 250N over a network 245 and includes an interface module 210, a modeling module 220, and an analysis module 230.
Interface module 210 may manage communications with the server devices 250A, 250N. Specifically, the interface module 210 may initiate connections with the server devices 250A, 250N and then send or receive observation data (e.g., data record segments) to/from the server devices 250A, 250N.
Modeling module 220 generates hidden dynamic probabilistic models for analyzing data. Specifically, modeling module 220 may generate a probabilistic model as described above.
Training module 226 is to estimate parameters of the probabilistic model. Specifically, training module 226 uses training data to maximize the conditional log-likelihood function.
Analysis module 230 is to determine the most probable labeling sequence for observation data (e.g., data record segments). Specifically, labeling sequence module 234 of analysis module 230 computes marginal probabilities for all possible hidden states to estimate each label. These marginal probabilities are then summed according to the disjoint sets of hidden states, and the label associated with the optimal set is chosen.
Server devices 250A, 250N may be any servers accessible to computing device 200 over a network 245 that is suitable for executing the functionality described below. As detailed below, each server device 250A, 250N may include a series of modules 260-264 for providing web content.
API module 260 is configured to provide access to observation data of server device A 250A. Content module 262 of API module 260 is configured to provide the observation data as content over the network 245. For example, the content can be provided as HTML pages that are configured to be displayed in web browsers. In this example, computing device 200 obtains the HTML pages from the content module 262 for processing as observation data as described above.
Metadata module 264 of API module 260 manages metadata related to the content. The metadata describes the content and can be included in, for example, web pages provided by the content module 262. In this example, keywords describing various page elements can be embedded as metadata in the web pages.
Method 300 may start in block 305 and continue to block 310, where computing device 100 generates a hidden dynamic probabilistic model for analyzing data using hidden dynamic systems. The probabilistic model can include hidden states for modeling the internal sub-structure of an observation sequence. Further, weights associated with a transition function for hidden states that are in the same subset model the sub-structure patterns, while weights associated with transition functions for hidden states from different subsets model the external dependencies between labels.
In block 315, computing device 100 determines optimal parameters of the probabilistic model by applying a quasi-Newton gradient ascent method. In block 320, computing device 100 uses the probabilistic model and the optimal parameters to determine the most probable labeling sequence for observation data. Method 300 may then continue to block 325, where method 300 may stop.
In the example graph 400, each transition function defines an edge feature, while each state function defines a node feature, as described above.
The foregoing disclosure describes a number of examples for analyzing data using hidden dynamic systems. In this manner, the examples disclosed herein improve labeling of observation data by modeling both the external dependencies between class labels and the internal sub-structure of the observation data.