The embodiments disclosed herein relate to clustering of document page collections, and more particularly to methods for clustering document page collections based on page layout attributes.
Clustering document collections into conceptually meaningful clusters is a well-studied problem. In many clustering tasks, unlabeled data is plentiful but labeled data is limited and expensive to generate. Consequently, semi-supervised clustering, which employs a small amount of labeled data to aid and bias the clustering of unlabeled data, has been developed. Existing methods for semi-supervised clustering fall into two general approaches, constraint-based methods and distance-based (metric-based) methods. In constraint-based approaches, the clustering algorithm itself is modified so that the available labels or constraints are used to bias the search for an appropriate clustering of the data. In distance-based approaches, an existing clustering algorithm that uses a distance measure is employed; however, the distance measure is first trained to satisfy the labels or constraints in the supervised data. Various methods of clustering document collections are described in U.S. Pat. No. 5,619,709 entitled “System and Method of Context Vector Generation and Retrieval”, U.S. Pat. No. 6,542,635 entitled “Method for Document Comparison and Classification Using Document Image Layout”, U.S. Pat. No. 6,598,054 entitled “System and Method for Quantitatively Representing Data Objects in Vector Space”, U.S. Pat. No. 6,658,626 entitled “User Interface for Displaying Document Comparison Information”, and U.S. Pat. No. 6,922,699 entitled “System and Method for Quantitatively Representing Data Objects in Vector Space”, all of which are incorporated by reference in their entireties for the teachings therein.
Prior attempts at clustering document collections typically rely on extracting unique content-bearing words from the set of documents, treating these words as features, and then representing each document as a vector of weighted word frequencies in this feature space. Even a moderately sized set of documents commonly contains a few thousand distinct words or more; hence the document vectors are very high-dimensional. Thus, there is a need in the art for methods of clustering document pages based on layout rather than content. By using a distance-based approach to semi-supervised clustering, document page collections can be clustered efficiently based on document page layout attributes.
Methods for clustering a document page collection based on page layout attributes are disclosed herein.
According to aspects illustrated herein, there is provided a method for computing a distance metric for a document page collection that includes obtaining a document page collection, each document page in the collection having one or more features, the one or more features defining a page layout attribute; extracting information from the one or more features on each document page; constructing a feature vector for the one or more features on each document page; assigning a feature weight to each feature; and computing a distance metric based on the feature weight and the feature vector.
According to aspects illustrated herein, there is provided a method for evaluating a generated clustering for a document page collection that includes obtaining a document page collection, each document page in the collection having one or more features, the one or more features defining a page layout attribute; choosing a sample of document pages from the collection; computing a reference clustering for the sample of document pages; extracting information from the one or more features on each document page in the sample; constructing a feature vector for the one or more features on each document page; assigning a feature weight to each feature; computing a distance metric between any two pages in the sample of document pages based on the feature weight and the feature vector; clustering the sample of document pages using the distance metric in a clustering algorithm to obtain a generated clustering for the sample of document pages; and comparing the reference clustering to the generated clustering.
According to aspects illustrated herein, there is provided a method for clustering a document page collection that includes obtaining a document page collection, each document page in the collection having one or more features, the one or more features defining a page layout attribute; extracting information from the one or more features on each document page and constructing a feature vector; computing a distance metric based on an assigned feature weight for each feature; and clustering the document page collection using the distance metric.
The presently disclosed embodiments will be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings are not necessarily to scale, the emphasis having instead been generally placed upon illustrating the principles of the presently disclosed embodiments.
While the above-identified drawings set forth presently disclosed embodiments, other embodiments are also contemplated, as noted in the discussion. This disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.
A method for clustering a document page collection is disclosed. In the method, a reference clustering is computed on a sample of document pages from the collection; one or more features are extracted from each of the document pages in the sample and each feature is assigned a weight; a distance metric between two pages in the sample is computed based on the assigned feature weights; the sample of document pages is supplied to a clustering algorithm, which generates a clustering of the sample; the generated clustering is compared to the reference clustering and, if modifications are necessary, new feature weights are assigned; and the document page collection is then supplied to the clustering algorithm using the learned feature weights.
“Document” as used herein refers to any printed or written item containing visually perceptible data, as well as to any electronic or data file which may be used to produce a printed or written item. A document may be a hardcopy, an electronic document file, one or a plurality of electronic images, electronic data from a printing operation, a file attached to an electronic communication or data from other forms of electronic communication. A “document page collection” or “collection of document pages” as used herein includes, but is not limited to, at least two pages, sheets, labels, boxes, packages, tags, boards, signs and any other item which contains or includes a “writing surface” as defined herein below. Typically, a document page collection includes more than two pages. In an embodiment, the document page collection includes at least six pages. In an embodiment, the document page collection includes at least twenty pages. In an embodiment, the document page collection includes at least fifty pages. “Writing surface” as used herein includes, but is not limited to, paper, cardboard, acetate, plastic, fabric, metal, wood, adhesive backed materials and similar surfaces.
“Features” as used herein refers to attributes found on a document including, but not limited to, paragraphs, images (icons, graphics, pictures, clip art), page numbers, tables and graphs. “Information” extracted from the features includes, but is not limited to: the number of paragraphs on a document page (1 feature); the total area of all paragraphs on a document page (1 feature); the coordinates of the upper left and lower right corners of the paragraphs (20 features: there are four coordinates for every paragraph, namely the upper left x-coordinate (X1), upper left y-coordinate (Y1), lower right x-coordinate (X2), and lower right y-coordinate (Y2), and each coordinate is represented by five values, namely the minimum, the maximum, the mean, and the quartiles); the paragraph widths and heights (10 features); the number of textboxes per paragraph (5 features); the font size of the paragraphs (5 features); the number of images on a page (1 feature); the total area of images on a page (1 feature); the image widths and heights (10 features); the number of SVG-type images (1 feature); the vertical fill degree (1 feature: all text and images are projected onto the Y-axis, and the percentage of “occupied” space on the Y-axis is used as the feature); the number of vertical spaces (1 feature: the number of spaces between lines of text and images, which gives an indication of the fill degree and fragmentation of the page); the size of the vertical spaces (5 features: the size of each vertical space on the page is recorded and the five summary values are used as features); the number of textboxes ending with a number (1 feature); the left, right, one-sided, and two-sided paragraph areas (4 features: the set of all paragraphs is divided into those that are completely in the left half of the page, those that are completely in the right half of the page, and those that overlap both halves; the total area of the first set (left paragraph area), the total area of the second set (right paragraph area), the total area of both the first and the second sets (one-sided paragraph area), and the total area of the third set (two-sided paragraph area) are added as features); the left, right, one-sided, and two-sided image areas (4 features); and the page number (1 feature).
Some of the features may be derived from other features; for example, width and height can be computed from the coordinates. For some features more than one representation is selected. For example, the number of textboxes per paragraph could be represented by the average over all paragraphs on a page. To get a better picture of the overall distribution, the minimum, the maximum, the mean, and the quartiles (the values at 25% and 75% of the overall spectrum) are used.
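By way of illustration only, the five-value representation and the vertical fill degree described above might be computed as in the following sketch. The function names, the linear interpolation used for the quartiles, the (top, bottom) span representation, and the all-zero result for a page without values are assumptions made for this example and are not prescribed by the embodiments.

```python
from typing import List, Optional, Sequence, Tuple


def percentile(sorted_values: Sequence[float], fraction: float) -> float:
    """Linearly interpolated percentile of a sorted, non-empty sequence."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def five_value_summary(values: Sequence[float]) -> List[float]:
    """Return [minimum, 25% quartile, mean, 75% quartile, maximum].

    Assumption: a page with no values (e.g., no paragraphs) yields zeros.
    """
    if not values:
        return [0.0] * 5
    ordered = sorted(values)
    mean = sum(ordered) / len(ordered)
    return [ordered[0], percentile(ordered, 0.25), mean,
            percentile(ordered, 0.75), ordered[-1]]


def vertical_fill_degree(spans: Sequence[Tuple[float, float]],
                         page_height: float) -> float:
    """Fraction of the Y-axis occupied by the union of (top, bottom) spans."""
    covered = 0.0
    current_top: Optional[float] = None
    current_bottom: Optional[float] = None
    for top, bottom in sorted(spans):
        if current_bottom is None or top > current_bottom:
            # A gap on the Y-axis: close the previous run and start a new one.
            if current_bottom is not None:
                covered += current_bottom - current_top
            current_top, current_bottom = top, bottom
        else:
            # Overlapping projection: extend the current run.
            current_bottom = max(current_bottom, bottom)
    if current_bottom is not None:
        covered += current_bottom - current_top
    return covered / page_height
```

Applying `five_value_summary` to each of X1, Y1, X2, and Y2 over all paragraphs of a page would yield the 20 coordinate features mentioned above.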
The method starts at 400 and includes obtaining a document page collection that a user wishes to cluster, as shown in step 407. Each of the document pages of the collection has one or more features. In step 414, a sample of document pages from the collection is selected. The sample of document pages is annotated to compute a reference clustering in step 421. Step 421 includes a user browsing the sample of document pages and clustering the sample by hand to produce a reference clustering. The annotation process will be further described in
After the sample of document pages is clustered by hand and the reference clustering is computed, the user inputs the annotated sample of document pages into an electronic document processing system in step 428. Typically, the electronic document processing system includes an input device for electronically capturing the general appearance (i.e., the content and the basic graphical layout) of a hardcopy sample of document pages; programmed computers for enabling the user to create, edit and otherwise manipulate an electronic version of the sample of document pages; and printers for producing hardcopy renderings of the electronic version of the sample of document pages. The input device may include one or more of the following known devices: a copier, a xerographic system, an electrostatographic machine, a digital image scanner (e.g., a flat bed scanner or a facsimile device), a disk reader having a digital representation of the sample of document pages on removable media (CD, floppy disk, rigid disk, tape, or other storage medium) therein, or a hard disk or other digital storage media having the sample of document pages as images recorded thereon. Those skilled in the art will recognize that the method would work with any device suitable for storing a digitized representation of a sample of document pages.
The sample of document pages may be in any electronic format from which the one or more features can be extracted, including, but not limited to, the following open formats: ASCII, PostScript, PDF, HTML, and XML (in particular XHTML and SVG). Document types such as Microsoft Word, Excel, and PowerPoint can be converted into XML format by appropriate software (available as PDF2XML or CambridgeDocs, for example). In an embodiment, the sample of document pages is in XML format. The XML format may display features including, but not limited to, TEXT, PARAGRAPH, and IMAGE. The one or more features are marked with attributes indicating the x-position and y-position of the one or more features on the document page, the width and height of the one or more features, and further information, such as text font name and size. Information regarding the one or more features in the XML document may be extracted for each document page in the sample as shown in step 435.
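By way of illustration only, extraction of layout information from such an XML page might look like the following sketch. The PAGE element, the attribute names (`x`, `y`, `width`, `height`, `font`, `size`), and the sample markup are hypothetical, since the exact schema produced by a given converter is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical page markup; real converters may use different names.
SAMPLE_PAGE = """
<PAGE width="600" height="800" number="1">
  <PARAGRAPH x="50" y="40" width="500" height="100">
    <TEXT x="50" y="40" width="500" height="12" font="Times" size="10">Title 1</TEXT>
  </PARAGRAPH>
  <PARAGRAPH x="50" y="200" width="240" height="300">
    <TEXT x="50" y="200" width="240" height="12" font="Times" size="10">Body</TEXT>
    <TEXT x="50" y="214" width="240" height="12" font="Times" size="10">More body</TEXT>
  </PARAGRAPH>
  <IMAGE x="310" y="200" width="240" height="180"/>
</PAGE>
"""


def extract_layout(page_xml: str) -> Dict[str, List[Dict[str, float]]]:
    """Collect the bounding box of every PARAGRAPH and IMAGE on one page."""
    root = ET.fromstring(page_xml)
    layout: Dict[str, List[Dict[str, float]]] = {"PARAGRAPH": [], "IMAGE": []}
    for tag in layout:
        for element in root.iter(tag):
            box = {name: float(element.get(name, 0.0))
                   for name in ("x", "y", "width", "height")}
            # Number of TEXT children, used for "textboxes per paragraph".
            box["textboxes"] = float(len(element.findall("TEXT")))
            layout[tag].append(box)
    return layout


if __name__ == "__main__":
    page = extract_layout(SAMPLE_PAGE)
    paragraph_area = sum(p["width"] * p["height"] for p in page["PARAGRAPH"])
    print(len(page["PARAGRAPH"]), paragraph_area, len(page["IMAGE"]))
```

Counts and areas obtained this way correspond to features such as the number of paragraphs, the total paragraph area, and the number of images.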
Once the feature information is extracted for each document page, an n-dimensional feature vector is created as shown in step 442. For example, for two pages pi and pj the feature vectors ƒi and ƒj are created. The distance metric d(pi, pj) between page pi and page pj is the weighted sum of the distances between the different features of the pages:
$$d(p_i, p_j) = \sum_{k=1}^{n} \lambda_k \, d_k(f_i[k], f_j[k])$$
The n distance functions dk for the features are often just the absolute value of the difference of the feature values, |ƒi[k]−ƒj[k]|. For some features, in particular area features (i.e., area of paragraphs, area of images), the square root of that distance |ƒi[k]−ƒj[k]| is used instead. The disclosed embodiments are not limited to any particular choice. An important step is to learn the feature weights λk in step 449. A search is performed to find values for the feature weights. The feature weights are assigned initial values and the distance metric is computed from these initial values. The distance metric is used in a clustering algorithm to generate a clustering for the sample of document pages. The generated clustering is evaluated against the reference clustering, and based on this evaluation the feature weights may be modified or kept the same. The search and evaluation steps are further described in
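By way of illustration only, the weighted distance between two feature vectors might be computed as in the following sketch, which uses the absolute difference for ordinary features and its square root for area features. The function names and the `area_features` index set are assumptions made for this example.

```python
import math
from typing import AbstractSet, Sequence


def feature_distance(a: float, b: float, is_area: bool = False) -> float:
    """Per-feature distance d_k: |a - b|, or its square root for area features."""
    difference = abs(a - b)
    return math.sqrt(difference) if is_area else difference


def page_distance(f_i: Sequence[float], f_j: Sequence[float],
                  weights: Sequence[float],
                  area_features: AbstractSet[int] = frozenset()) -> float:
    """Weighted sum of the per-feature distances between two pages."""
    return sum(weight * feature_distance(x, y, k in area_features)
               for k, (weight, x, y) in enumerate(zip(weights, f_i, f_j)))


if __name__ == "__main__":
    # Feature 0: number of paragraphs; feature 1: total paragraph area.
    print(page_distance([3, 100.0], [5, 64.0], [0.5, 0.5], area_features={1}))
```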
After the search and evaluation steps are performed to determine the feature weights, step 470, the method continues to step 477. Initially, the entire document page collection is processed through the electronic processing system, so that the same features are extracted from the entire document page collection as shown in step 456. The feature extraction process will result in a much larger set of feature vectors as shown in step 463. The feature weights determined from the sample of document pages are now used to compute the distance metric for the overall collection, and the distance metric is supplied to a clustering algorithm as shown in step 477. The result is a clustering of the complete document page collection as shown in step 484. The method terminates at step 491.
In the simple search approach, a sample of document pages 600 from a document page collection is obtained and the feature information associated with each page is extracted to create a feature vector set 610. Initially, all feature weights 620 are given a value of 1/n, where n is the total number of features. A distance 630 between two document pages in the sample is determined, as described above, and then the document pages are given to a clustering algorithm 640. The clustering algorithm 640 produces a generated clustering 650, and the generated clustering 650 is compared 670 to a reference clustering 660, also known as the “correct” clustering. Then, the features are reviewed one by one and the weight 620 of each feature is increased by multiplying it by a factor a. If this weight 620 update yields a better clustering 650, then the update is made permanent. The iterative procedure is repeated until no further improvement is achieved. In an embodiment, the value of a ranges from about 1.1 to about 20.
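By way of illustration only, the simple search might be sketched as follows. The `evaluate` callback is a hypothetical stand-in for the steps of computing distances, clustering, and comparing against the reference clustering. It is assumed to return a score for which lower is better. The `max_rounds` safeguard is likewise an assumption of this example.

```python
from typing import Callable, List


def simple_weight_search(n_features: int,
                         evaluate: Callable[[List[float]], float],
                         factor: float = 2.0,
                         max_rounds: int = 100) -> List[float]:
    """Greedy search: scale one weight at a time and keep improving updates."""
    weights = [1.0 / n_features] * n_features   # initial value 1/n
    best_score = evaluate(weights)
    for _ in range(max_rounds):
        improved = False
        for k in range(n_features):
            candidate = list(weights)
            candidate[k] *= factor              # increase weight k by factor a
            score = evaluate(candidate)
            if score < best_score:              # better clustering: keep update
                weights, best_score = candidate, score
                improved = True
        if not improved:                        # no further improvement
            break
    return weights
```

In practice `evaluate` would wrap the clustering algorithm and the chosen evaluation index; any function of the weights with the same signature can be substituted for experimentation.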
In the genetic algorithm approach, the feature weights 620 are encoded as chromosomes. A pool of chromosomes is created; in every chromosome every feature weight 620 is initialized to be a random number between 0.0 and 1.0. The usual operations of mutation (reinitialization to a random value), crossover and selection are applied. Selection is based on the fitness of a chromosome, which translates to the evaluation of the clustering 650 imposed by the feature weights 620 encoded in the chromosome. Besides the size of the pool, there are other parameters: the number of generations, the probability of a mutation, the probability of a crossover, and other parameters known to those skilled in the art.
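By way of illustration only, a genetic search over feature weights might be sketched as follows. The `fitness` callback (lower is better) is a hypothetical stand-in for clustering and evaluation. Keeping the fitter half of the pool, one-point crossover, the default parameter values, and the fixed random seed are assumptions of this example rather than requirements of the approach.

```python
import random
from typing import Callable, List

Chromosome = List[float]


def evolve_weights(n_features: int,
                   fitness: Callable[[Chromosome], float],
                   pool_size: int = 20,
                   generations: int = 30,
                   p_mutation: float = 0.1,
                   p_crossover: float = 0.7,
                   seed: int = 0) -> Chromosome:
    """Return the fittest chromosome found; requires pool_size >= 4."""
    rng = random.Random(seed)
    # Every weight starts as a random number between 0.0 and 1.0.
    pool = [[rng.random() for _ in range(n_features)] for _ in range(pool_size)]
    for _ in range(generations):
        ranked = sorted(pool, key=fitness)
        survivors = ranked[: pool_size // 2]    # selection: keep the fitter half
        children: List[Chromosome] = []
        while len(survivors) + len(children) < pool_size:
            mother, father = rng.sample(survivors, 2)
            child = list(mother)
            if rng.random() < p_crossover:      # one-point crossover
                cut = rng.randrange(1, n_features) if n_features > 1 else 0
                child = mother[:cut] + father[cut:]
            for k in range(n_features):         # mutation: reinitialize a weight
                if rng.random() < p_mutation:
                    child[k] = rng.random()
            children.append(child)
        pool = survivors + children
    return min(pool, key=fitness)
```

Because the fittest chromosomes survive each generation unchanged, the best fitness in the pool never gets worse from one generation to the next in this sketch.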
In an embodiment, the clustering algorithm used is a hierarchical agglomerative clustering algorithm 640, including single-link, complete-link, and average-link clustering. In agglomerative clustering each object is initially treated as a separate group (cluster). Then, clusters are successively combined based on similarity until there is only one cluster remaining or a specified termination condition is satisfied. In an embodiment, the clustering algorithm is an average-link clustering algorithm. Those skilled in the art will recognize that the methods disclosed herein can be used with any clustering algorithm and still be within the scope and spirit of the presently disclosed embodiments.
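By way of illustration only, average-link agglomerative clustering might be sketched as follows. The `max_distance` termination condition and the function name are assumptions of this example. The straightforward implementation recomputes all cluster-to-cluster averages in every round and is therefore suited only to small samples.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def average_link_clustering(items: Sequence[T],
                            distance: Callable[[T, T], float],
                            max_distance: float) -> List[List[int]]:
    """Merge the two closest clusters until none is within max_distance.

    Returns clusters as lists of indices into `items`.
    """
    clusters: List[List[int]] = [[i] for i in range(len(items))]

    def link(a: List[int], b: List[int]) -> float:
        # Average-link: mean distance over all cross-cluster pairs.
        total = sum(distance(items[i], items[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best: Optional[Tuple[float, int, int]] = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = link(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > max_distance:                    # termination condition
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    points = [0.0, 0.2, 0.1, 5.0, 5.3]
    print(average_link_clustering(points, lambda a, b: abs(a - b), 1.0))
```

For document pages, `items` would hold the feature vectors and `distance` would be the weighted page distance with the current feature weights.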
To give feedback to the search algorithm, the clustering produced by a particular choice of feature weights has to be evaluated. That is, the generated clustering has to be compared to the reference clustering. Various evaluation indexes have been proposed to compare two clusterings including, but not limited to, the Rand index, the Jaccard similarity index, the split/join distance and the variation of information measure. In an embodiment, the variation of information measure is used as the evaluation method.
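By way of illustration only, the variation of information between a generated clustering and a reference clustering might be computed as in the following sketch. Each clustering is assumed to be given as a list of cluster labels, one per page. The use of the natural logarithm is a choice made for this example. A value of zero indicates identical clusterings, and larger values indicate greater disagreement.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def variation_of_information(labels_a: Sequence[Hashable],
                             labels_b: Sequence[Hashable]) -> float:
    """VI(A, B) = H(A | B) + H(B | A), computed from label co-occurrences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same pages")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # log(p_ab / p_a) + log(p_ab / p_b), written with raw counts.
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi
```

The measure depends only on how the pages are grouped, not on the label names, so a generated clustering whose labels are merely renamed relative to the reference still scores zero.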
In the maximum entropy approach, the maximum entropy classification method is used to detect the weights 830 of the features. Two classes are created: “same cluster” and “different cluster”. For the maximum entropy classifier 820, a training sample is created for each pair of points (document pages) of the original clustering problem. Each new training sample has n features, namely the n “feature distance” values dk(ƒi[k],ƒj[k]). Each training sample is assigned the class “same cluster” if both points of the pair are in the same cluster in the reference clustering 870, otherwise the sample is assigned the class “different cluster”. Maximum entropy classification is performed with the created sample set. The maximum entropy algorithm creates a model in which each feature is assigned a certain weight. The n weights are extracted from the model and output as the learned feature weights 830 for the original problem.
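By way of illustration only, the construction of the pairwise training samples and the fitting of a two-class model might be sketched as follows. For two classes, a maximum entropy classifier has the same form as logistic regression, so this sketch fits a logistic model by plain gradient descent. The function names, the learning rate, and the number of epochs are assumptions of this example, and the absolute difference is used for every feature distance.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], int]


def build_pair_samples(vectors: Sequence[Sequence[float]],
                       reference_labels: Sequence[int]) -> List[Sample]:
    """One sample per page pair: n feature distances plus a class label.

    Class 0 is "same cluster" and class 1 is "different cluster".
    """
    samples: List[Sample] = []
    for i, j in combinations(range(len(vectors)), 2):
        distances = [abs(a - b) for a, b in zip(vectors[i], vectors[j])]
        label = 0 if reference_labels[i] == reference_labels[j] else 1
        samples.append((distances, label))
    return samples


def train_binary_maxent(samples: Sequence[Sample],
                        learning_rate: float = 0.1,
                        epochs: int = 500) -> List[float]:
    """Fit a logistic model and return one learned weight per feature."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in samples:
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))      # P("different cluster")
            for k in range(n):
                grad_w[k] += (p - y) * x[k]
            grad_b += p - y
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights
```

A feature whose distance is large mainly for pairs from different clusters receives a large positive weight, whereas a feature that never differs between pages keeps a weight of zero.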
In the linear program approach, the output weights 830 are calculated in one go by reformulating the optimization goal. The goal is to derive a linear program from the original problem, which can then be solved using standard techniques. All pairs of points (document pages) (pi,pj) are considered. S is the set of point pairs, where both points belong to the same cluster, and T is the set of point pairs, where the points belong to a different cluster.
If pi and pj are in the same cluster (i.e., (pi,pj)∈S), then the two document pages are used to formulate the optimization goal. The goal is to find feature weights 830 that minimize the distances 840 between points in the same cluster. So, the optimization goal is to minimize the sum of all distances 840 between point pairs from S:
$$\min_{\lambda} \sum_{(p_i,p_j) \in S} d(p_i, p_j) = \min_{\lambda} \sum_{(p_i,p_j) \in S} \sum_{k=1}^{n} \lambda_k \, d_k(f_i[k], f_j[k])$$
If pi and pj are not in the same cluster (i.e., (pi,pj)∈T), a constraint is formulated. For each such pair, the distance between those two points should be larger than the distance between points from the same cluster:
$$d(p_i, p_j) - \frac{1}{|S|} \sum_{(p_a,p_b) \in S} d(p_a, p_b) \geq \varepsilon$$
In the constraint, the first summand is the distance between the two points pi and pj from T. The second term is the normalized optimization goal, that is, the average distance between points from the same cluster. The distance between points from different clusters should be larger than that by a certain amount ε>0. Through this definition a large number of constraints are obtained. All the weights are constrained to be nonnegative. By solving the linear program so defined, a set of feature weights 830 is obtained. The linear program may not have a solution, but those skilled in the art will recognize that methods exist to produce an approximate solution.
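By way of illustration only, the coefficients of such a linear program might be assembled as in the following sketch, which uses the absolute difference for every feature distance. The function name, the default value of ε, and the "less than or equal" matrix form of the constraints are assumptions of this example. The nonnegativity of the weights would be expressed as variable bounds in whichever solver is used.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def build_linear_program(vectors: Sequence[Sequence[float]],
                         reference_labels: Sequence[int],
                         epsilon: float = 0.1
                         ) -> Tuple[List[float], List[List[float]], List[float]]:
    """Return (c, a_ub, b_ub): minimize c.w subject to a_ub.w <= b_ub, w >= 0."""
    n = len(vectors[0])
    same: List[List[float]] = []        # feature distances for pairs in S
    different: List[List[float]] = []   # feature distances for pairs in T
    for i, j in combinations(range(len(vectors)), 2):
        d = [abs(a - b) for a, b in zip(vectors[i], vectors[j])]
        if reference_labels[i] == reference_labels[j]:
            same.append(d)
        else:
            different.append(d)
    if not same:
        raise ValueError("the reference clustering has no same-cluster pair")
    # Objective: sum of distances over S, collected per feature weight.
    c = [sum(d[k] for d in same) for k in range(n)]
    average = [c_k / len(same) for c_k in c]
    # Constraint d(pi, pj) - average >= epsilon, negated into "<=" form.
    a_ub = [[-(d[k] - average[k]) for k in range(n)] for d in different]
    b_ub = [-epsilon] * len(different)
    return c, a_ub, b_ub
```

Assuming SciPy is available, the returned values could be passed to a standard solver, for example `scipy.optimize.linprog(c, A_ub=a_ub, b_ub=b_ub, bounds=(0, None))`.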
A method for computing a distance metric for a document page collection includes obtaining a document page collection, each document page in the collection having one or more features; extracting information from the one or more features on each document page; constructing a feature vector for the one or more features on each document page; assigning a feature weight to each feature; and computing a distance metric based on the feature weight and the feature vector.
A method for evaluating a generated clustering for a document page collection includes obtaining a document page collection, each document page in the collection having one or more features; choosing a sample of document pages from the collection; computing a reference clustering for the sample of document pages; extracting information from the one or more features on each document page in the sample; constructing a feature vector for the one or more features on each document page; assigning a feature weight to each feature; computing a distance metric between any two pages in the sample of document pages based on the feature weight and the feature vector; clustering the sample of document pages using the distance metric in a clustering algorithm to obtain a generated clustering for the sample of document pages; and comparing the reference clustering to the generated clustering.
A method for clustering a document page collection includes obtaining a document page collection, each document page in the collection having one or more features; extracting information from the one or more features on each document page and constructing a feature vector; computing a distance metric based on an assigned feature weight for each feature; and clustering the document page collection using the distance metric.
Although the methods disclosed herein relate to clustering a document page collection, those skilled in the art will recognize that the methods can be used in other clustering applications, including, but not limited to, a scientist clustering proteins into homology groups, a user clustering document pages for legacy document conversion, a company clustering customers into customer groups, a person clustering web pages into catalogs, and a person clustering images into different groups.
All patents, patent applications, and published references cited herein are hereby incorporated by reference in their entirety. It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.