Two-dimensional conditional random fields for web extraction

Description

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a web object representing a doll having object elements.

FIG. 2 illustrates the graphical structure of a two-dimensional CRF technique.

FIG. 3 illustrates the diagonals of the graphical structure.

FIG. 4 illustrates the two-dimensionally indexed object block of FIG. 1.

FIG. 5 illustrates the association of each element with only one state in one embodiment.

FIG. 6 illustrates virtual states resulting from the associations.

FIG. 7 is a block diagram illustrating components of the labeling system in one embodiment.

FIG. 8 is a flow diagram that illustrates the processing of the identify labels component of the labeling system in one embodiment.

FIG. 9 is a flow diagram that illustrates the processing of the calculate numerator component of the labeling system in one embodiment.

FIG. 10 is a flow diagram that illustrates the processing of the align object elements component of the labeling system in one embodiment.

Claims

1. A method for labeling observations, the method comprising: receiving observations having relationships in two dimensions; and determining a labeling for the observations using a conditional random fields technique that factors in the relationships in two dimensions.
2. The method of claim 1 wherein the observations are represented as a grid and the determining includes identifying diagonals of the grid and calculating a probability for sets of labels based on transition probabilities between diagonals.
3. The method of claim 1 including deriving weights for feature functions based on training data and wherein the determining includes calculating a probability for a set of labels based on the derived weights.
4. The method of claim 3 wherein the deriving includes optimizing a log-likelihood function based on the training data.
5. The method of claim 4 wherein the optimizing uses a gradient-based L-BFGS technique.
6. The method of claim 3 wherein the calculating uses a variable-state Viterbi technique.
7. The method of claim 1 wherein the observations are represented as a grid and when an observation represents multiple positions within the grid, representing the observation using a real state at one position and a virtual state at another position within the grid.
8. The method of claim 7 wherein a real state and the corresponding virtual state are constrained to have the same values in a transition.
9. The method of claim 8 including identifying diagonals of the grid and calculating a probability for sets of labels based on transition probabilities between diagonals.
10. A system for identifying object elements of a web object, comprising: a component that identifies a two-dimensional relationship between the object elements of the web object; and a component that applies a two-dimensional conditional random fields technique to identify a set of labels based on the two-dimensional relationship between the object elements of the web object.
11. The system of claim 10 wherein the object elements are represented as a grid and the component that applies the two-dimensional conditional random fields technique identifies diagonals of the grid and calculates a probability for sets of labels based on transition probabilities between diagonals.
12. The system of claim 11 including a component deriving weights for feature functions based on training data of object elements and labels and wherein the component that applies calculates a probability for a set of labels based on the derived weights.
13. The system of claim 12 wherein the deriving uses a gradient-based L-BFGS technique.
14. The system of claim 13 wherein the applying uses a variable-state Viterbi technique.
15. The system of claim 10 wherein the object elements are represented as a grid and when an object element represents multiple positions within the grid, representing the object element using a real state at one position and a virtual state at another position within the grid.
16. The system of claim 15 wherein a real state and the corresponding virtual state are constrained to have the same values in a transition.
17. A computer-readable medium containing instructions for controlling a computer system to identify object elements of a web object, by a method comprising: representing a two-dimensional relationship between the object elements of the web object as a grid having positions, a position representing a real state or a virtual state, a real state indicating that an object element corresponds to the position and a virtual state indicating that an object element encompasses multiple positions; and applying a two-dimensional conditional random fields technique to identify a set of labels based on the two-dimensional relationship between the object elements of the web object as indicated by diagonals of the grid.
18. The computer-readable medium of claim 17 wherein the applying includes calculating a probability for sets of labels based on transition probabilities between diagonals.
19. The computer-readable medium of claim 17 including deriving weights for feature functions based on training data of object elements and labels and wherein the applying calculates a probability for a set of labels based on the derived weights.
20. The computer-readable medium of claim 19 wherein the deriving uses a gradient-based L-BFGS technique and the applying uses a variable-state Viterbi technique.
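
The grid representation recited in claims 2, 7, and 17 can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes that a diagonal is the set of grid positions (i, j) sharing the same sum i + j, and that an element spanning several positions takes its real state at the first listed position and virtual states at the rest; the claims do not fix either choice. The names grid_diagonals and assign_states are hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def grid_diagonals(rows: int, cols: int) -> List[List[Position]]:
    """Group the positions of a rows x cols grid into diagonals.

    Assumption: a diagonal is the set of positions (i, j) with equal i + j.
    """
    diagonals: List[List[Position]] = [[] for _ in range(rows + cols - 1)]
    for i in range(rows):
        for j in range(cols):
            diagonals[i + j].append((i, j))
    return diagonals


def assign_states(spans: Dict[str, List[Position]]) -> Dict[Position, Tuple[str, str]]:
    """Mark each grid position as a real or virtual state of its element.

    Assumption: the first position listed for an element holds the real
    state; every other position the element covers holds a virtual state.
    """
    states: Dict[Position, Tuple[str, str]] = {}
    for element, positions in spans.items():
        for k, pos in enumerate(positions):
            states[pos] = (element, "real" if k == 0 else "virtual")
    return states


if __name__ == "__main__":
    # A 2 x 3 grid has 2 + 3 - 1 = 4 diagonals.
    for d, cells in enumerate(grid_diagonals(2, 3)):
        print(d, cells)
    # A hypothetical element "name" spans two positions; "price" spans one.
    print(assign_states({"name": [(0, 0), (0, 1)], "price": [(1, 0)]}))
```

Under these assumptions, a labeling procedure could step from one diagonal to the next, with each virtual state constrained to carry the same label as its real state.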
