Two-dimensional conditional random fields for web extraction

Information

  • Patent Application
  • Publication Number
    20070150486
  • Date Filed
    December 14, 2005
  • Date Published
    June 28, 2007
Abstract
A labeling system uses a two-dimensional conditional random fields technique to label the object elements. The labeling system represents transition features and state features that depend on object elements that are adjacent in two dimensions. The labeling system represents the grid as a graph of vertices and edges with a vertex representing an object element and an edge representing a relationship between the object elements. The labeling system represents each diagonal of the graph as a sequence of states. The labeling system selects a labeling for the vertices of the diagonals that has the highest probability based on transition probabilities between vertices of adjacent diagonals and on the state probabilities of a position within a diagonal.
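The diagonal decomposition described in the abstract can be illustrated with a short sketch. It is not taken from the application itself: the function names `grid_diagonals` and `edges_between` are hypothetical, and the sketch assumes that a diagonal collects the grid positions (i, j) that share the same value of i + j, and that edges join only horizontally and vertically adjacent positions. Under those assumptions every edge of the grid graph links two adjacent diagonals, which is what allows the diagonals to be treated as a sequence.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def grid_diagonals(rows: int, cols: int) -> List[List[Cell]]:
    """Group grid cells (i, j) by i + j; each group is one diagonal.

    Assumption: diagonals run from upper right to lower left, so the
    first diagonal holds only (0, 0) and the last holds only
    (rows - 1, cols - 1).
    """
    diagonals: List[List[Cell]] = [[] for _ in range(rows + cols - 1)]
    for i in range(rows):
        for j in range(cols):
            diagonals[i + j].append((i, j))
    return diagonals


def edges_between(diagonals: List[List[Cell]], d: int) -> List[Tuple[Cell, Cell]]:
    """Return the grid edges that link diagonal d to diagonal d + 1.

    Each cell is connected to its lower and right neighbours when they
    exist; both neighbours always lie on the next diagonal.
    """
    next_cells = set(diagonals[d + 1])
    return [
        ((i, j), neighbour)
        for (i, j) in diagonals[d]
        for neighbour in ((i + 1, j), (i, j + 1))
        if neighbour in next_cells
    ]
```

For a grid with 2 rows and 3 columns, `grid_diagonals(2, 3)` yields four diagonals, and summing `len(edges_between(...))` over consecutive pairs of diagonals accounts for all seven edges of the grid.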
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 illustrates a web object representing a doll having object elements.

FIG. 2 illustrates the graphical structure of a two-dimensional CRF technique.

FIG. 3 illustrates the diagonals of the graphical structure.

FIG. 4 illustrates the two-dimensionally indexed object block of FIG. 1.

FIG. 5 illustrates the association of each element with only one state in one embodiment.

FIG. 6 illustrates virtual states resulting from the associations.

FIG. 7 is a block diagram illustrating components of the labeling system in one embodiment.

FIG. 8 is a flow diagram that illustrates the processing of the identify labels component of the labeling system in one embodiment.

FIG. 9 is a flow diagram that illustrates the processing of the calculate numerator component of the labeling system in one embodiment.

FIG. 10 is a flow diagram that illustrates the processing of the align object elements component of the labeling system in one embodiment.


Claims
  • 1. A method for labeling observations, the method comprising: receiving observations having relationships in two dimensions; and determining a labeling for the observations using a conditional random fields technique that factors in the relationships in two dimensions.
  • 2. The method of claim 1 wherein the observations are represented as a grid and the determining includes identifying diagonals of the grid and calculating a probability for sets of labels based on transition probabilities between diagonals.
  • 3. The method of claim 1 including deriving weights for feature functions based on training data and wherein the determining includes calculating a probability for a set of labels based on the derived weights.
  • 4. The method of claim 3 wherein the deriving includes optimizing a log-likelihood function based on the training data.
  • 5. The method of claim 4 wherein the optimizing uses a gradient-based L-BFGS technique.
  • 6. The method of claim 3 wherein the calculating uses a variable-state Viterbi technique.
  • 7. The method of claim 1 wherein the observations are represented as a grid and when an observation represents multiple positions within the grid, representing the observation using a real state at one position and a virtual state at another position within the grid.
  • 8. The method of claim 7 wherein a real state and the corresponding virtual state are constrained to have the same values in a transition.
  • 9. The method of claim 8 including identifying diagonals of the grid and calculating a probability for sets of labels based on transition probabilities between diagonals.
  • 10. A system for identifying object elements of a web object, comprising: a component that identifies a two-dimensional relationship between the object elements of the web object; and a component that applies a two-dimensional conditional random fields technique to identify a set of labels based on the two-dimensional relationship between the object elements of the web object.
  • 11. The system of claim 10 wherein the object elements are represented as a grid and the component that applies the two-dimensional conditional random fields technique identifies diagonals of the grid and calculates a probability for sets of labels based on transition probabilities between diagonals.
  • 12. The system of claim 11 including a component deriving weights for feature functions based on training data of object elements and labels and wherein the component that applies calculates a probability for a set of labels based on the derived weights.
  • 13. The system of claim 12 wherein the deriving uses a gradient-based L-BFGS technique.
  • 14. The system of claim 13 wherein the applying uses a variable-state Viterbi technique.
  • 15. The system of claim 10 wherein the object elements are represented as a grid and when an object element represents multiple positions within the grid, representing the object element using a real state at one position and a virtual state at another position within the grid.
  • 16. The system of claim 15 wherein a real state and the corresponding virtual state are constrained to have the same values in a transition.
  • 17. A computer-readable medium containing instructions for controlling a computer system to identify object elements of a web object, by a method comprising: representing a two-dimensional relationship between the object elements of the web object as a grid having positions, a position representing a real state or a virtual state, a real state indicating that an object element corresponds to the position and a virtual state indicating that an object element encompasses multiple positions; and applying a two-dimensional conditional random fields technique to identify a set of labels based on the two-dimensional relationship between the object elements of the web object as indicated by diagonals of the grid.
  • 18. The computer-readable medium of claim 17 wherein the applying includes calculating a probability for sets of labels based on transition probabilities between diagonals.
  • 19. The computer-readable medium of claim 17 including deriving weights for feature functions based on training data of object elements and labels and wherein the applying calculates a probability for a set of labels based on the derived weights.
  • 20. The computer-readable medium of claim 19 wherein the deriving uses a gradient-based L-BFGS technique and the applying uses a variable-state Viterbi technique.
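
The diagonal-based selection of a highest-probability labeling recited in the claims can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the claimed variable-state Viterbi technique, whose details are not given in this section. The sketch treats each joint labeling of a diagonal as one state of a chain, so the number of states varies from diagonal to diagonal, and runs a standard Viterbi recursion over that chain. The function `diagonal_viterbi`, the scoring callbacks, and the toy scores are all hypothetical; scores are unnormalized log-potentials standing in for weighted feature functions, and real and virtual states are not modeled.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Cell = Tuple[int, int]
StateScore = Callable[[Cell, str], float]
EdgeScore = Callable[[Cell, str, Cell, str], float]


def diagonal_viterbi(rows: int, cols: int, labels: Sequence[str],
                     state_score: StateScore,
                     edge_score: EdgeScore) -> Tuple[float, Dict[Cell, str]]:
    """Find a highest-scoring labeling of a rows x cols grid.

    Each diagonal (cells with equal i + j) is one chain position whose
    state is a joint labeling of its cells. The cost grows exponentially
    with diagonal length, so this is only practical for small grids.
    """
    diagonals: List[List[Cell]] = [[] for _ in range(rows + cols - 1)]
    for i in range(rows):
        for j in range(cols):
            diagonals[i + j].append((i, j))

    def local(diag, assign):
        # Sum of per-cell (state) scores within one diagonal.
        return sum(state_score(c, y) for c, y in zip(diag, assign))

    def transition(prev_diag, prev_assign, diag, assign):
        # Sum of edge scores between a diagonal and the one before it.
        prev = dict(zip(prev_diag, prev_assign))
        total = 0.0
        for (i, j), y in zip(diag, assign):
            for n in ((i - 1, j), (i, j - 1)):
                if n in prev:
                    total += edge_score(n, prev[n], (i, j), y)
        return total

    # best maps a joint labeling of the current diagonal to
    # (best score so far, list of joint labelings that achieves it).
    best = {a: (local(diagonals[0], a), [a])
            for a in product(labels, repeat=len(diagonals[0]))}
    for d in range(1, len(diagonals)):
        new_best = {}
        for a in product(labels, repeat=len(diagonals[d])):
            score, path = max(
                ((s + transition(diagonals[d - 1], p, diagonals[d], a), pth)
                 for p, (s, pth) in best.items()),
                key=lambda t: t[0])
            new_best[a] = (score + local(diagonals[d], a), path + [a])
        best = new_best

    score, path = max(best.values(), key=lambda t: t[0])
    labeling = {c: y for diag, a in zip(diagonals, path)
                for c, y in zip(diag, a)}
    return score, labeling


# Hypothetical toy scores, for illustration only.
LABELS = ("A", "B")


def toy_state_score(cell: Cell, label: str) -> float:
    # Favour "A" on cells where i + j is even and "B" elsewhere.
    return 1.0 if (label == "A") == ((cell[0] + cell[1]) % 2 == 0) else 0.0


def toy_edge_score(c1: Cell, y1: str, c2: Cell, y2: str) -> float:
    # Mildly favour neighbouring cells that share a label.
    return 0.3 if y1 == y2 else 0.0
```

Calling `diagonal_viterbi(2, 2, LABELS, toy_state_score, toy_edge_score)` returns the best score together with a dictionary mapping each grid position to its label. In a trained model the two callbacks would be replaced by weighted sums of feature functions, with the weights derived from training data.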