The present innovations generally address tools for extracting structure and header information from documents. Large professional documents such as those found in the legal domain are normally hierarchically structured into sections which contain sub-sections which further contain sub-sub-sections and so on. In addition, each of these sections may contain lists, with sub-lists, etc. This structure can convey important information when analyzing a document for many downstream tasks such as information retrieval, information extraction, document presentation and/or document navigation.
Using a computer to reliably extract a document's structure for real world documents is challenging not only because many documents don't follow a consistent template but also because of errors introduced by document conversion and user error. Furthermore, the structure of a document can be obscured by boilerplate text such as page headers and footers that are captured during an optical character recognition (“OCR”) process and must be reliably identified and removed.
The existing literature about document structure analysis can be roughly divided into the identification of physical, logical and/or semantic structure. See Dengel and Shafait (Andreas Dengel and Faisal Shafait. [n.d.]. Analysis of the Logical Layout of Documents. In Handbook of Document Image Processing and Recognition, David Doermann and Karl Tombre (Eds.). Springer London, 177-222.) and Mao et al. (Song Mao, Azriel Rosenfeld, and Tapas Kanungo. [n.d.]. Document structure analysis algorithms: a literature survey, Tapas Kanungo, Elisa H. Barney Smith, Jianying Hu, and Paul B. Kantor (Eds.). 197-207.) for reviews, both of which are incorporated herein in their entireties. Physical structure extraction deals with capturing a digital representation of a paper document and involves image processing/enhancement, grouping the pixels of an image of a document into sections, identifying the type of each section (e.g. text or image) and performing OCR on text sections. Logical structure analysis involves identifying relationships between physical components, e.g. the caption of a figure, the agglomeration of coherent sections of text, the document's reading order and possibly its section hierarchy. Logical structure analysis may also be performed on natively digital documents where structure information is not readily available, as in PDF documents. Semantic analysis normally involves identifying section types specific to a certain domain, although this is sometimes grouped under logical structure analysis. These processes are generally applied sequentially, and errors in one process can accumulate in downstream processes.
The present inventions may fall into the domain of logical structure analysis. They take as input text blocks in reading order that are annotated with layout and formatting information and produce a hierarchy of sections and/or list items in the form of a tree. The present inventions deal, therefore, not only with scanned documents but also with natively electronic documents that do not have structure annotations.
Tuarob et al. (S. Tuarob, P. Mitra, and C. L. Giles. [n.d.]. A hybrid approach to discover semantic hierarchical sections in scholarly documents. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015-08). 1081-1085.) identify and classify sections and create a hierarchy using domain specific rules for scholarly articles. Constantin et al. (Alexandru Constantin, Steve Pettifer, and Andrei Voronkov. [n.d.]. PDFX: fully-automated PDF-to-XML conversion of scientific literature. ACM Press, 177.) identify the logical parts of scientific documents using rules based on some font characteristics. While both of these use font characteristics to identify section headings and/or boundaries, neither is completely sufficient. Rahman and Finin (Muhammad Mahbubur Rahman and Tim Finin. [n.d.]. Understanding the Logical and Semantic Structure of Large Documents. ([n. d.]). arXiv:1709.00770) also work in the domain of scholarly articles; however, they only identify structure as a byproduct of identifying a constrained number of section headings using an ML approach. All of these approaches only derive structure to a limited depth. Finally, Rausch et al. (Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, and Stefan Feuerriegel. [n.d.]. DocParser: Hierarchical Structure Parsing of Document Renderings. ([n. d.]). arXiv:1911.01702) derive a logical hierarchy using rules which rely on the relationship between bounding boxes of elements extracted from a document rendering, but this is still insufficient. Each of these references is incorporated by reference herein in its entirety.
The present inventions differ from these in many ways, but at least in that we propose a system based on optimizing a global measure of the internal coherence of a document's structure as opposed to domain specific rules that are applied locally or machine learned models that identify section headings in isolation.
Accordingly, the present inventions address the need for improvements in computer functionality to extract structure and header information from digital documents.
In order to develop a reader's understanding of the innovations, disclosures have been compiled into a single description to illustrate and clarify how aspects of these innovations operate independently, interoperate as between individual innovations, and/or cooperate collectively. The application goes on to further describe the interrelations and synergies as between the various innovations; all of which is to further compliance with 35 U.S.C. § 112.
The present invention provides a system and method for structure and/or header extraction.
In one aspect, a method for extracting headers comprises receiving an input body of text containing a plurality of chunks of text, identifying a set of features of each chunk, classifying each text chunk as a potential header depending on whether the chunk includes a mark or title text, identifying any boilerplate in each potential header and removing it to form cleaned potential headers, and comparing the cleaned potential headers to each other and to a remainder of the input body of text not included in the cleaned potential headers to confirm whether each cleaned potential header is a header.
In one example, the features include typography characteristics. For example, the features may include at least two or more of font family, font size, italic, bold, underline, space above, space left, space left first line, and justification.
In another example, the features include orthography characteristics.
In another example, the features include page layout.
In another example, the features include at least two or more of typography characteristics, orthography characteristics and page layout.
In another example, the method further comprises determining if a chunk includes title text by at least comparing features of the chunk to features of a remainder of the input body of text and identifying title text if its features differ from those of a majority of the remainder.
In another example, the comparison of cleaned headers includes comparing the number of characters included in the cleaned potential headers and chunks of text in the input body of text covered by the cleaned potential headers to a total number of characters in the input body of text.
In another example, the comparison of cleaned potential headers includes determining a similarity among all of the cleaned potential headers based on their features.
In another example, the comparison of cleaned potential headers includes discounting groups of similar cleaned potential headers based on an average number of characters among the cleaned potential headers.
In another example, identifying boilerplate includes comparing an average number of characters in a group of potential headers with similar features to a threshold.
In another example, identifying boilerplate includes comparing an average number of characters in a group of potential headers with similar features to a number of character edits required to transform each potential header in the group into a subsequent potential header in the group.
In another example, identifying boilerplate includes comparing an average number of characters in a group of potential headers with similar features to a threshold and to a number of character edits required to transform each potential header in the group into a subsequent potential header in the group.
In another example, identifying boilerplate includes comparing potential headers to a set of one or more predetermined non-boilerplate words.
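The edit-based comparison in the examples above can be sketched as follows. This is only a minimal illustration: the use of plain Levenshtein distance, the function names and the ratio threshold of 0.2 are assumptions made here and are not specified by this description. A group of similar-looking potential headers is flagged when consecutive members differ by few character edits relative to their average length, as repeated page footers typically do.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def looks_like_boilerplate(group: list[str], max_ratio: float = 0.2) -> bool:
    """Hypothetical test: successive headers need few edits relative to their length."""
    if len(group) < 2:
        return False
    avg_len = sum(len(h) for h in group) / len(group)
    edits = [edit_distance(x, y) for x, y in zip(group, group[1:])]
    avg_edits = sum(edits) / len(edits)
    return avg_edits <= max_ratio * avg_len


page_footers = ["Page 1 of 30", "Page 2 of 30", "Page 3 of 30"]    # illustrative data
section_titles = ["Definitions", "Indemnity", "Termination"]       # illustrative data
```

With these illustrative inputs the footers are flagged and the section titles are not; a fixed character-count threshold, as in the other examples, could be combined with the same test.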
In another aspect, a method for extracting structure among headers comprises receiving a plurality of headers in reading order as they appear in a document, identifying a set of features for each header, determining a similarity between all pairs of headers based on their features, segmenting the headers into groups of one or more similar adjacent headers based on similarities between adjacent headers in the reading order, and matching non-adjacent groups of similar adjacent headers based on feature similarities between headers of the groups.
In one example, the matching of non-adjacent groups of similar adjacent headers is based on similarities between last headers in one group and first headers in another group.
In another example, the headers are segmented into groups of one or more similar adjacent headers based on zero crossings of a second derivative of adjacent heading similarities along the reading order. In one example, the adjacent heading similarities are smoothed before the second derivative is performed. For example, the smoothing may include convolution with a smoothing kernel.
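A minimal sketch of this segmentation step is given below. The three-tap box kernel, the edge replication at the borders and the reading of each zero crossing as a candidate cut point are assumptions made for illustration, as are the function names and the sample similarity values.

```python
def smooth(values, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Convolve with a small kernel, replicating the edge values at the borders."""
    half = len(kernel) // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(values))]


def second_difference(values):
    """Discrete second derivative v[i-1] - 2*v[i] + v[i+1] at interior points."""
    return [values[i - 1] - 2 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]


def zero_crossings(values):
    """Indices i where the sign changes between values[i] and values[i + 1]."""
    return [i for i in range(len(values) - 1)
            if values[i] * values[i + 1] < 0]


# Hypothetical similarities between each heading and the next one in reading order.
adjacent_similarity = [0.9, 0.9, 0.8, 0.2, 0.9, 0.9, 0.85]
cuts = zero_crossings(second_difference(smooth(adjacent_similarity)))
```

Smoothing before differencing keeps isolated noisy similarity values, for instance those caused by OCR errors, from producing spurious cut points.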
In another example, the method further comprises cutting any headers that cross one another, resulting in only non-crossing headers.
In another example, the matched non-adjacent groups of similar adjacent headers form sequences and the matching includes maximizing a document-wide sum of similarities between adjacent headers within each sequence.
In another example, the features include typography characteristics.
In another example, the features include at least two or more of font family, font size, italic, bold, underline, space above, space left, space left first line, and justification.
In another example, the features include orthography characteristics.
In another example, the features include page layout.
In another example, determining a similarity between pairs of headers includes comparing marks of the headers.
In another example, determining a similarity between pairs of headers includes determining whether marks of the headers are derived from a same template.
In another example, determining a similarity between pairs of headers includes determining whether marks of the headers are in sequence from the header that is first in reading order to the header that is later in reading order.
In another aspect, a method for extracting structure among headers comprises receiving a plurality of headers in reading order as they appear in a document, identifying a set of features for each header, determining a similarity between all pairs of headers based on their features, and sequencing the headers into one or more sequences by maximizing a document-wide sum of similarities between adjacent headers within each sequence.
In another example, the features include typography characteristics.
In another example, the features include orthography characteristics.
In another example, the features include page layout.
In another example, determining a similarity between pairs of headers includes determining whether marks of the headers are derived from a same template.
In another example, determining a similarity between pairs of headers includes determining whether marks of the headers are in sequence from the header that is first in reading order to the header that is later in reading order.
The accompanying drawings illustrate various non-limiting, example, innovative aspects in accordance with the present descriptions:
Embodiments of systems and methods for extracting structure and header information from documents are described herein. While aspects of the described systems and methods can be implemented in any number of different configurations, the embodiments are described in the context of the following exemplary configurations. The descriptions and details of well-known components and structures are omitted for simplicity of the description, but would be readily familiar to those having ordinary skill in the art.
The description and figures merely illustrate exemplary embodiments of the inventive systems and methods. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the present subject matter. Furthermore, all examples recited herein are intended to be for illustrative purposes only to aid the reader in understanding the principles of the present subject matter and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the present subject matter, as well as specific examples thereof, are intended to encompass all equivalents thereof.
In general, the systems and methods described herein may relate to improvements to aspects of using computers to extract structure and header information from documents. These improvements not only improve how such a computer (or any number of computers employed in extracting structure and header information from documents) is able to operate to serve the user's document analysis goals, but also improve the accuracy, efficiency and usefulness of the structure and heading extraction results that are returned to the user.
The tools described herein are particularly suited to legal documents and are generally discussed in that context; however, it will be appreciated that many other types of documents, texts and users will benefit from the inventive tools disclosed and claimed herein.
Large professional documents such as those found in the legal domain are normally hierarchically structured into sections which contain sub-sections which further contain sub-sub-sections and so on. In addition, each of these sections may contain lists, with sub-lists, etc. This hierarchical structure (or just hierarchy) contains important information the author intended to convey to the reader and properly extracting it can aid many downstream tasks such as information retrieval, information extraction, document presentation and/or document navigation. Furthermore, if a document is created by an OCR scanning process the text of any headers, footers, and/or page numbers or other “boilerplate” must be identified and removed.
Many real-world documents, however, are corrupted by many kinds of errors or “noise”. Errors in a document's text, format and/or layout can be introduced during transformation from one form to another, e.g. OCR scanning or rendering of web content, and by user error. Furthermore, the well-known difficulty in accurately identifying boilerplate and section boundaries introduces additional noise when attempting to reconstruct a document's hierarchy if not done well.
Many of the documents to which the present innovations are particularly relevant do not adhere to any specific formatting style and can vary between geographies, domains, organizations, and even individuals. Because of this balkanization it is difficult to create template-based or machine learned models that can generalize well not only because of the cost of acquiring training data but because it is expected that previously unseen document styles will be routinely observed at runtime.
As one example, shown in
In another example, on the second page of an exemplary document, shown in
In another example, on another page of an exemplary document, shown in
As shown in these examples, real-world documents can contain many kinds of “three dimensional” information that is hard to interpret via a linear scan of the text. In describing this information as “three dimensional,” this accounts for the two physical dimensions of the page plus the third dimension of text format, e.g. bold, underline, caps, font, etc. As humans, we can easily chunk this text because we see in two dimensions of the page and can easily distinguish the third format dimension. When processing the document as linear text, this three-dimensional structure needs to be recreated or inferred to properly extract structure. Finally, it is important to keep in mind that other documents will look very different from this one. Their sections may be identified with roman numerals or have negative left indentation, different fonts or styles. Their page numbers may be in the header instead of the footer and may contain other kinds of boilerplate, etc.
Hierarchical structure extraction is suitable not only for scanned documents but for natively electronic ones as well, because hierarchical information may not be readily exportable from some formats, e.g. PDF, and/or a document may contain noise due to repeated editing. This last situation arises frequently in legal documents.
This work introduces the section hierarchy problem as the problem of identifying a document's hierarchy of sections (including lists) and its boilerplate in the presence of noise. Unlike other structure extraction work, we are interested not only in the top-level sections of a document but in all of its hierarchical components including itemized lists. We are also specifically interested in working with large (50+ page) documents. We characterize the complexity of the section hierarchy problem as NP-hard and present a tractable solution. Our approach does not necessarily require the use of training data but instead attempts to identify the hierarchy of a document in a way that optimizes a function of the coherence, or readability, of a document. This function is relatively computationally easy to construct and encodes knowledge about how an author creates documents in order to convey sectioning information to a reader. One key aspect of the authoring process relied on is that those section headings or boilerplate of a document that are logically connected are assumed to look the same.
Since we propose to handle as many different kinds of documents as possible, most of which we don't have examples for, we take the approach of designing an algorithm based on a generalized notion of how a document might be segmented by its creator. A key idea is that when a lawyer (or any user or creator for that matter) writes a document, she will likely make textual elements that should be grouped together look the same. So, for example, all top-level headings may likely look the same, e.g. use the same format and relative horizontal position. All clause titles may likely have a similar format such as “<number> Bold Title in Camel Case: First sentence of clause . . . ” or “CENTERED TEXT IN ALL CAPS”. Similarly, all sub-sections will likely look the same, e.g. “<roman numeral>—text goes here . . . ” or “(<lowercase letter>)—text begins on next line”.
Also, since, in particular, we are dealing with OCRed input, any solution is preferably resilient to errors. For example, text that is in bold or italics or underlining may be interpreted as plain or vice versa. Spaces may be missing and odd characters inserted at random, etc.
One approach may be to identify and extract possible titles or headers from a document and then to group them into sequences that belong together based on their “appearance” in the document. These titles or headers can then be merged into a hierarchy or structure, ones that have certain properties can be eliminated as boilerplate, and we can search the remaining sequences for the most “clause like”. In a simple case, each title or heading in a “best” sequence is then used to delimit each clause while in a more complex case, sequences can be merged to produce a full hierarchy of titles, e.g. sections, clauses, sub-clauses, etc.
In one exemplary embodiment, the text of a document is received in reading order and a tree-structured hierarchy of the document's text is produced, where each node of the tree represents a logical section of text. All potential sections are identified, from top-level sections with headings down to bulleted list items. The present innovations are not restricted to a finite nesting depth. See
Extracting a document's hierarchy can be a deceptively difficult problem for several reasons. The first is that documents, even natively electronic documents, may be corrupted by many sources of noise. These include:
The effect of this noise is to render many downstream tasks much more difficult, specifically the task of identifying section boundaries. This is particularly important for our task as each node in the output tree may represent a section.
For the documents that may be a subject of the present innovations, a section may be identified as text between headings such as “Introduction” and “Related Work” or list items such as “a) the area . . . ” and “b) notwithstanding . . . ”. Many authors have documented the difficulty of section identification in real-world documents even in very specific domains. Therefore, any system that solves the section identification and hierarchy identification problems, such as the one embodied by the present innovations, will preferably be configured to deal with both false positives and false negatives in section and hierarchy identification.
Furthermore, the section hierarchy problem is complicated by the fact that logically related headings or boilerplate may appear very far from each other in a document with many different kinds of intervening headings. This is particularly true of headings towards the root of the hierarchy, but can also apply to deeper headings as demonstrated by the relationship between the headings at lines 7 and 11 of
In one exemplary approach, we define a document D=(c1, . . . , cl) as an ordered list of l text chunks ci in reading order of the document. Each text chunk is endowed with a set of features fci such as font size, emphasis and justification. The features of each text chunk are assumed to be corrupted by noise, e.g. a chunk that is actually underlined may instead be reported as italic. Some exemplary raw text features that may be used in this work are listed in Table 1. Furthermore, we do not assume the input contains page break markers which makes identifying boilerplate much more difficult.
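As an illustration of this representation, a document can be held as an ordered list of chunk records. The sketch below is an assumption, not the data model of the present innovations: the field names mirror the typography features named in this description (font family, font size, italic, bold, underline, space above, space left, space left first line, justification), while the class names, types and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkFeatures:
    """Possibly noisy layout and formatting features of one text chunk."""
    font_family: Optional[str] = None
    font_size: Optional[float] = None
    italic: bool = False
    bold: bool = False
    underline: bool = False
    space_above: Optional[float] = None
    space_left: Optional[float] = None
    space_left_first_line: Optional[float] = None
    justification: Optional[str] = None   # e.g. "left" or "center"


@dataclass(frozen=True)
class TextChunk:
    text: str
    features: ChunkFeatures


# A document D is an ordered list of chunks in reading order (sample values only).
document = [
    TextChunk("ARTICLE 1", ChunkFeatures(font_size=14.0, bold=True, justification="center")),
    TextChunk("The parties agree ...", ChunkFeatures(font_size=11.0, justification="left")),
]
```

Keeping every feature optional reflects that any of them may be missing or misreported in the input.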
Let a heading be any text chunk that is a section heading, boilerplate item or list item where a boilerplate item is a page header, page footer, page number or any similar kind of repeated text in a document. Let H=ϕ(D)={h1, . . . , hn} be those text chunks of D that are classified as headings by a binary classifier ϕ. In real world documents, H may be corrupted by both false positives and false negatives.
Let any Q⊆H be referred to as a sequence since Q is naturally fully ordered based on the reading order of the document. Our goal is to identify those sequences of H that the author intended to be considered as a coherent list of logically related sections.
For example, the headings “Article 1”, “Article 2” and “Article 3” form a sequence of sections which are related and should appear at the same level in a document's hierarchy. Real world documents, however, contain many sequences which are not so easily identified as belonging together.
We introduce some notation to make working with sequences easier. Let a sequence S=(s1, . . . , sm) be represented as an ordered list of monotonically increasing indices into [1, n] such that si<sj⇔i<j. Each sequence naturally represents an ordered list of headings S=(h_{s1}, . . . , h_{sm}).
We show that finding a partition P of H where each sequence in P is maximally coherent is a solution to the section hierarchy problem.
The difficulty of partitioning H becomes evident when one considers that headings in a sequence may be very far apart in the original document. For example, consider a document with long multi-page sections at the top of its hierarchy where each section contains many sub-sections, lists and boilerplate. The headings of the top-level sections must be placed into the same sequence as opposed to sequences containing any intermediary headings. This is a difficult problem considering that the set of identified headings H is noisy (i.e. it contains false headings and is missing some true headings). Furthermore, comparing elements in H to determine if they belong in the same sequence is complicated by the fact that the features ƒ for each text chunk, from which the headings in H are derived, are also noisy. For example, some elements in H may have their text corrupted by OCR errors, their emphasis may be wrong or their text size may be inaccurate.
To find the best partition of H we define a coherence function g: 2^H→ℝ+ which assigns a positive value to any sequence in proportion to how logically consistent its elements are with each other. For example, for the two sequences A=(“INTRODUCTION”, “PROBLEM DEFINITION”, “HISTORY”) and B=(“INTRODUCTION”, “Orientation”, “PROBLEM DEFINITION”, “HISTORY”) it should be the case that g(A)>g(B) because the second title in A is more coherent with the other titles in A than the second title in B is with the other titles in B.
Nearly all of what constitutes an acceptable document structure can be encoded in g by ensuring that higher values are given to sequences whose headings are more homogeneous with respect to their “look”. The function g is relatively easy to construct for large classes of documents, as discussed in more detail below.
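A toy coherence function that reproduces the ordering g(A)>g(B) from the example above is sketched below. It is an illustrative assumption only: it reduces the “look” of a heading to its letter case and scores a sequence by the smoothed share of heading pairs that share that look, whereas an actual g would draw on the full set of layout and formatting features.

```python
from itertools import combinations


def look(heading: str) -> str:
    """Crude orthographic signature: the case class of the heading text."""
    if heading.isupper():
        return "upper"
    if heading.istitle():
        return "title"
    if heading.islower():
        return "lower"
    return "mixed"


def g(sequence) -> float:
    """Toy coherence: smoothed share of heading pairs with the same signature."""
    pairs = list(combinations(sequence, 2))
    same = sum(look(a) == look(b) for a, b in pairs)
    return (1 + same) / (1 + len(pairs))   # always positive


A = ("INTRODUCTION", "PROBLEM DEFINITION", "HISTORY")
B = ("INTRODUCTION", "Orientation", "PROBLEM DEFINITION", "HISTORY")
assert g(A) > g(B)
```

The added constant in the numerator and denominator is a design choice of this sketch that keeps the value positive even when no pair of headings matches.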
The coherence of a partition P can be represented as:
The present innovations focus on finding that partition which maximizes G over all possible partitions P of H:
The optimization problem described in equation (2) can be shown to be NP-hard using a simple reduction from the weighted set cover problem. The present innovations present a principled method to estimate a solution to (2) that runs in polynomial time and produces good results.
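One formulation of the coherence of a partition and of its maximization that is consistent with the definitions above can be written as follows. It assumes that the coherence of a partition is additive over its sequences; the symbols \mathcal{P}(H), for the set of all partitions of H, and P^{*}, for the maximizing partition, are introduced here only for illustration.

```latex
G(P) = \sum_{S \in P} g(S)
\qquad\text{and}\qquad
P^{*} = \operatorname*{arg\,max}_{P \in \mathcal{P}(H)} G(P)
```

Under this reading, the difficulty comes from the size of \mathcal{P}(H), which grows faster than exponentially in the number of headings.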
Before describing the solution, recall that a noncrossing partition P of H is a partition whose subsets don't overlap each other. Specifically, P is noncrossing if for any S1, S2∈P with sa<sb<sc<sd, sa, sc∈S1 and sb, sd∈S2 then S1=S2. A noncrossing partition of H captures the notion that each section of a document is a contiguous portion of text that does not bisect any other section.
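The noncrossing condition can be checked directly from its definition. In the minimal sketch below, a partition is assumed to be given as a list of blocks, each block being a list of heading indices; the function names are hypothetical.

```python
from itertools import combinations


def crosses(s1, s2) -> bool:
    """True if some a < b < c < d exists with a, c in s1 and b, d in s2."""
    for a, c in combinations(sorted(s1), 2):
        inside = [b for b in s2 if a < b < c]
        after = [d for d in s2 if d > c]
        if inside and after:
            return True
    return False


def is_noncrossing(partition) -> bool:
    """A partition is noncrossing if no two distinct blocks cross in either order."""
    return not any(crosses(p, q) or crosses(q, p)
                   for p, q in combinations(partition, 2))


nested = [[1, 4], [2, 3]]            # sub-headings 2, 3 sit inside section 1..4
interleaved = [[1, 4, 6], [2, 5]]    # e.g. page numbers cutting across sections
```

Nested blocks such as `nested` are allowed, whereas `interleaved` is a crossing partition of the kind produced by boilerplate such as page numbers.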
The proposed solution begins by identifying section headings H from the text of a document. A crossing partition of H is derived in polynomial time and is used to identify boilerplate sequences. This is because boilerplate sequences will normally cross/bisect other sequences; consider, e.g., a sequence of page numbers. Once boilerplate sequences are removed, the remaining sequences are “shattered” to create a noncrossing partition of H whose coherence can be monotonically increased using search-based optimization techniques. A tree can be produced directly from the resulting noncrossing partition.
The steps of an exemplary solution according to one embodiment are shown generally in
In another exemplary embodiment, the steps of an exemplary solution may be simplified to finding the headings 10, identifying and removing boilerplate 14 and constructing final partitions and sequences of headings 20.
Heading classification is a non-trivial task and in fact our system is designed specifically because heading detection is expected to produce both type I and type II errors. In our system a heading can have up to four parts as exemplified in
A heading should preferably consist of at least a mark, e.g. “a)”, or a title, e.g. “SUMMARY”. We identify headings based on character patterns, layout and formatting information using a small number of regular expressions and some control logic. Formatting includes things like emphasis, e.g. bold, and character case. Layout includes left indent and justification. Importantly, we do not use word-based features or rules except for a small number of phrases used to reject a heading, e.g. “Signature:”. This allows us to generalize the system to more domains quickly.
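A heading test of this kind might be sketched as follows. The sketch covers character patterns only, not layout or formatting, and the specific mark patterns, the all-caps title rule and the function name are assumptions made for illustration; only the rejection phrase “Signature:” is taken from the text above.

```python
import re

# Hypothetical mark patterns: "a)", "(b)", "1.", "2.3", "IV."
MARK = re.compile(r"""^\s*(
    \(?[a-z]\)              # a)  or (a)
  | \d+\.(?:\d+\.?)*        # 1.  1.2  1.2.3
  | [IVXLC]+\.              # IV.
)(?=\s|$)""", re.VERBOSE)

REJECT = ("Signature:",)    # phrases used only to reject a candidate


def split_heading(chunk: str):
    """Return (mark, rest) when the chunk starts with a mark, (None, title) for an
    all-caps title, and None when the chunk is not a heading candidate."""
    text = chunk.strip()
    if text.startswith(REJECT):
        return None
    m = MARK.match(chunk)
    if m:
        return m.group(1), chunk[m.end():].strip()
    if text.isupper():
        return None, text
    return None
```

For instance, `split_heading("a) the area")` yields the mark `"a)"` with the remaining text, while `split_heading("Signature: ____")` is rejected. Because no vocabulary is involved apart from the rejection list, the same patterns carry over to other domains.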
As outlined above, identifying a partition of H that maximizes the coherence of the document is NP-hard. This is because the domain of maximization in (2) is over all partitions of H which is necessary because g is computed using an entire sequence all at once. To overcome this problem, we propose to generate a candidate partition which maximizes the coherence of a partition over pairs of headings as opposed to over entire sequences of headings.
Let a problem graph G=(H, E) be a graph representation of a document where the vertices of the graph are the headings H and each edge (hi, hj)∈E represents that the pair of nodes hi and hj are in the same sequence. The edges of G are initially populated with all allowable associations: E={(hi, hj): i<j}. A subgraph of G represents a partition of H into sequences when every vertex has at most one outgoing edge and at most one incoming edge:
h+≤1;h∈H
h−≤1;h∈H (3)
where h+/− are the out-degree/in-degree of the vertex h. Let Gp=(H, Ep) be a partition graph of the problem graph G where Ep⊂E and Gp satisfies (3).
Finding a partition graph is equivalent to finding a vertex disjoint path cover for G. We transform the problem graph G into a bipartite graph G′ and apply a matching algorithm to identify the set of vertex disjoint paths/sequences.
Let G′=(L∪R, E′) be constructed from G as follows: each vertex hi∈H is split into two vertices li∈L and ri∈R and each edge (hi, hj)∈E is added as (li, rj)∈E′. The graph G′ is bipartite as all edges start with a vertex in L and end with a vertex in R.
A matching in a bipartite graph is a set of edges that are vertex disjoint, i.e. each node in L is joined to at most one node in R and each node in R is joined to at most one node in L. A perfect matching is one in which every node is incident to exactly one edge. If G is acyclic, any matching in G′ corresponds one-to-one to a vertex-disjoint path cover of G. Since the problem graph G is acyclic, finding a matching in G′ will provide a vertex disjoint path cover in G and hence a partition of H.
The problem then is how to choose a matching between the nodes in L and R in G′ that optimizes the coherence of the sequences represented by the resulting partition. A convenient choice is g itself but restricted to pairs of headings, i.e. sequences of length 2. Let each edge in G′ be assigned a weight:
The edges of G′ are thus enhanced with a score indicating the similarity of each pair of headings. A matching of the nodes in G′ that maximizes these weights would represent a partition P of H that maximizes the pairwise coherency between the elements of each sequence in P. In addition, since we expect each header to participate in a sequence, we add the additional constraint that all nodes should be matched. This is an instance of a maximum weight perfect matching problem which can be cast as a linear sum assignment problem (LSAP) for which there exist polynomial time solutions.
Let a square adjacency matrix be defined as:
Then a mathematical formulation of LSAP is:
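In its standard textbook form, and writing a_{ij} for the entries of the adjacency matrix and x_{ij} for the entries of the assignment matrix X (symbol names assumed here for illustration), the LSAP can be stated as:

```latex
\max_{X} \; \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, x_{ij}
\quad \text{subject to} \quad
\sum_{j=1}^{n} x_{ij} = 1 \;\; \forall i, \qquad
\sum_{i=1}^{n} x_{ij} = 1 \;\; \forall j, \qquad
x_{ij} \in \{0, 1\}.
```

The two equality constraints express that every node in L and every node in R is matched exactly once.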
A solution to (4) provides an adjacency matrix X which indicates which edges are part of a perfect matching between L and R. A partition graph is easily constructed from X by keeping only those edges in E′ with non-zero entries in X, merging nodes li and ri back into a single node hi and incorporating all incident edges of li and ri into hi.
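The path from a matching to sequences can be sketched in a few lines. The sketch makes several assumptions that are not part of the description above: the assignment is found by brute force over permutations, which is only practical for a handful of headings and stands in for a polynomial-time LSAP solver such as the Hungarian algorithm; entries with j ≤ i, which are not edges of G′, receive a hypothetical weight `no_link` meaning that hi has no successor; and the pairwise score is a toy function.

```python
from itertools import permutations


def best_assignment(weights):
    """Brute-force LSAP: the permutation p maximizing sum(weights[i][p[i]])."""
    n = len(weights)
    return max(permutations(range(n)),
               key=lambda p: sum(weights[i][p[i]] for i in range(n)))


def sequences_from_matching(pair_score, n, no_link=0.0):
    """Build the weight matrix, solve the assignment and read off the paths."""
    weights = [[pair_score(i, j) if i < j else no_link for j in range(n)]
               for i in range(n)]
    match = best_assignment(weights)
    succ = {i: j for i, j in enumerate(match) if i < j}   # kept edges l_i -> r_j
    starts = [i for i in range(n) if i not in succ.values()]
    paths = []
    for i in starts:                                      # follow each disjoint path
        path = [i]
        while path[-1] in succ:
            path.append(succ[path[-1]])
        paths.append(path)
    return paths


looks = ["upper", "lower", "upper", "lower"]              # toy 'look' per heading


def score(i, j):
    """Toy pairwise coherence: high when two headings share a look."""
    return 1.0 if looks[i] == looks[j] else 0.1
```

With the toy data, `sequences_from_matching(score, 4)` groups headings 0 and 2 into one sequence and headings 1 and 3 into another. The two sequences cross each other, which illustrates why this step yields a crossing partition.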
The partition graph in turn directly represents a partition S=(S1, . . . , Sk) of H into k sets. Each subset Si=(hs
In another exemplary embodiment, the titles in a document are represented in a totally ordered sequence as T=(t1, . . . , tn), i.e. title ti appears before tj iff i<j. Let mi,j:=m(i,j) be a similarity measure between titles ti and tj. Higher scores indicate the titles are more similar. The similarity between two titles will depend on things like their font style, indentation, format, marks, etc. as discussed elsewhere herein. The similarity score is not, however, based on a language model but more on typography (the way text “looks”), orthography (the way the text is spelled, e.g. all uppercase) and page layout (e.g. space between text). The “content” of each title is for the most part ignored to make the solution as general as possible.
For example, consider these five titles:
1: “Clause 1. USE”
2: “Clause 2: INDEMNITY”
3: “Exercise 3: Solutions”
4: “page 3:”
5: “Clause 3: RelEASE”
We expect m1,2>m1,3, m2,3<m2,5, m2,4<m2,5 and m1,2>m1,5.
Our goal is to group titles into related sequences using only values of m and so avoid defining any thresholds that might be needed for example when using a supervised learning approach. For example, all “Clause” titles in the above list should be in the same sequence whereas titles 3 and 4 should be in other sequences or in their own sequence. We will call the set of extracted sequences a “sequence pool” so that it is easier to identify this concept.
The set of titles and similarity metrics can be visualized as a graph where each node is a title, there is an edge between each node ti and each node tj where j>i, and the weight on edge (i, j) is mi,j. Alternatively, this can be viewed as an upper triangular matrix M of similarity values where the entry at row i and column j is mi,j.
Grouping titles into groups of distinct sequences means that for each title ti we need to determine the most likely subsequent title s(ti):=si. For example, in the above list of titles s(1)=2, s(2)=5, s(3)=None and s(4)=None. Furthermore, given these values for s( ) we can extract one sequence from these 5 titles, specifically [1, 2, 5].
One way to grow sequences from T would be a greedy approach where we first order the set U={mi,j: 1<=i<j<=n} by decreasing value and define the function s as follows:
1. While U is not empty:
2. Use s( ) to create the sequence pool
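The greedy idea can be sketched in a few lines of Python. The steps inside the loop are not preserved in the text above, so the acceptance rule used here (take the highest remaining pair if title i has no successor yet and title j has no predecessor yet) is an assumption, as are the similarity values and the function names `greedy_successors` and `sequence_pool`.

```python
def greedy_successors(m):
    """m maps (i, j) with i < j to a similarity score; returns s() as a dict."""
    succ, has_pred = {}, set()
    # Visit candidate pairs from most to least similar (assumed greedy rule).
    for (i, j), score in sorted(m.items(), key=lambda kv: -kv[1]):
        if score > 0 and i not in succ and j not in has_pred:
            succ[i] = j
            has_pred.add(j)
    return succ


def sequence_pool(n, succ):
    """Follow successor links from each unused title to build the sequence pool."""
    pool, seen = [], set()
    for start in range(1, n + 1):
        if start in seen:
            continue
        seq, cur = [], start
        while cur is not None:
            seq.append(cur)
            seen.add(cur)
            cur = succ.get(cur)
        pool.append(seq)
    return pool


# Hypothetical similarity values for the five example titles.
M = {(1, 2): 0.9, (1, 3): 0.3, (1, 4): 0.1, (1, 5): 0.8,
     (2, 3): 0.3, (2, 4): 0.1, (2, 5): 0.9,
     (3, 4): 0.0, (3, 5): 0.3, (4, 5): 0.1}

clean = sequence_pool(5, greedy_successors(M))

# A small error that pushes m(1,5) above m(1,2) derails the greedy choice.
noisy = dict(M)
noisy[(1, 5)] = 0.95
derailed = sequence_pool(5, greedy_successors(noisy))
```

With these assumed values, `clean` is `[[1, 2, 5], [3], [4]]`, while the perturbed input yields `[[1, 5], [2, 3], [4]]`: a single inflated score changes two sequences.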
Such approaches can run into difficulties because of their greedy nature and can be susceptible to small changes in how m( ) is calculated and/or errors from OCR. Consider the configuration of titles from a real document as shown in
In the example shown in
In this case we see that m(1,5)=1.50 which is greater than m(1, 2) even though title 2 is the correct subsequent title for title 1. The value of m(1,5) is slightly greater than m(1,2) because of small errors in the similarity metric due to OCR errors not visible here. Instead of getting stuck in the “rabbit hole” of continually trying to improve m( ) for every exception discovered, we wish to devise an algorithm that will still work even when m( ) produces such “errors.”
The correct solution consists of two sequences: ([1, 2, 3], [4, 5]). The greedy solution would produce ([1, 5], [2, 3], [4, 6]). Using m( ) as a rough measure of quality, we see that the sum of m( ) values for the correct set of sequences is m(1,2)+m(2,3)+m(4,5)=4.37 while that for the greedy solution is m(1,5)+m(2, 3)+m(4, 6)=4.28.
We implement techniques from combinatorial optimization to solve this problem by refactoring the graph representation of our problem as depicted in
Let G=(T, E) be a graph representing our problem as in
We wish to find a matching between the nodes in L and those in R that maximizes the sum of the weights of the edges in the matching. This is a LSAP for which multiple polynomial time algorithms exist. This can be represented mathematically as follows. First the solution is represented by a matrix X:
Let M be the matrix of similarity measures between each node in G′ such that mij=m(ti, tj) if edge (ti, tj)∈E′ and 0 otherwise. The problem can then be expressed as:
Such that:
The two constraints ensure that each node in L is connected to exactly one node in R and each node in R is connected to only one node in L.
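For reference, the formulation described in words above may be written out as follows. This is a reconstruction of the standard LSAP statement from the surrounding definitions (X, M and the two constraints), not a verbatim copy of the original equations; n denotes the number of titles.

```latex
x_{ij} =
\begin{cases}
1 & \text{if node } l_i \text{ is matched to node } r_j \\
0 & \text{otherwise}
\end{cases}

\max_{X} \; \sum_{i=1}^{n} \sum_{j=1}^{n} m_{ij}\, x_{ij}

\text{such that} \quad
\sum_{j=1}^{n} x_{ij} = 1 \;\; (i = 1, \dots, n), \qquad
\sum_{i=1}^{n} x_{ij} = 1 \;\; (j = 1, \dots, n), \qquad
x_{ij} \in \{0, 1\}
```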
Given a solution X* to this problem we can extract a set of title sequences by noting that title ti is followed by title tj in some sequence iff xi,j=1, i<j and m(i,j)>0. By repeatedly starting with the first index that is not already part of a sequence until no more indexes are available we end up with a set of title sequences S=(S1, . . . , Sm).
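The matching and the sequence extraction can be illustrated with a small Python sketch. For brevity it solves the assignment by brute force over all permutations, which is only practical for a handful of titles and stands in for a polynomial-time LSAP solver. The similarity values and the function names `solve_lsap_brute_force` and `extract_sequences` are assumptions made for illustration.

```python
from itertools import permutations


def solve_lsap_brute_force(m, n):
    """Return a tuple p where p[i-1] is the column matched to row i.

    Brute force over all n! assignments; a stand-in for a real LSAP solver.
    """
    def total(perm):
        return sum(m.get((i, j), 0.0) for i, j in enumerate(perm, start=1))
    return max(permutations(range(1, n + 1)), key=total)


def extract_sequences(perm, m):
    """Title i is followed by title j iff x_ij = 1, i < j and m(i, j) > 0."""
    n = len(perm)
    succ = {i: j for i, j in enumerate(perm, start=1)
            if i < j and m.get((i, j), 0.0) > 0}
    pool, seen = [], set()
    for start in range(1, n + 1):
        if start in seen:
            continue
        seq, cur = [], start
        while cur is not None:
            seq.append(cur)
            seen.add(cur)
            cur = succ.get(cur)
        pool.append(seq)
    return pool


# Hypothetical scores for the five example titles; m(1,5) is deliberately
# set above m(1,2) to mimic a small error in the similarity measure.
M = {(1, 2): 0.9, (1, 3): 0.3, (1, 4): 0.1, (1, 5): 0.95,
     (2, 3): 0.3, (2, 4): 0.1, (2, 5): 0.9,
     (3, 4): 0.0, (3, 5): 0.3, (4, 5): 0.1}

sequences = extract_sequences(solve_lsap_brute_force(M, 5), M)
```

Under these assumed scores the matching keeps the pairs (1, 2) and (2, 5), because their combined weight of 1.8 exceeds any assignment that uses the inflated pair (1, 5), so `sequences` is `[[1, 2, 5], [3], [4]]`.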
With an initial partitioning and sequencing of the document titles or headers completed, boilerplate is identified and removed. This step may also be completed before initial partitioning and sequencing in some embodiments.
The boilerplate sequences in S are those that consist of things like headers, footers or page numbers. These can be identified by their regularity, that is, the texts of the headings in a boilerplate sequence differ from each other only by a small amount. Let d(hi, hj) be the Levenshtein distance between the text of two headings. See (V. I. Levenshtein. [n.d.]. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. 10 ([n. d.]), 707), which is incorporated by reference herein in its entirety.
For example, in one embodiment, to determine which sequence in S is the true set of “clauses” of a document, we use a set of heuristics. First, we eliminate from S any sequence S for which any of the following are true:
The first item can eliminate sequences of things like simple page numbers. The second item eliminates other kinds of boilerplate that contain a lot of repeated text. Recall that the Levenshtein distance between two strings A and B is the number of edits (insertions, deletions or substitutions) that are needed to transform A into B and can be quickly computed using dynamic programming techniques. It is a good proxy for a measure of the similarity between two strings. For example, the Levenshtein distance between two strings like “Appendix 4” and “Appendix 5” will be relatively low since each title contains a kind of boilerplate string, e.g. “Appendix”. In contrast, the Levenshtein distance between two strings like “Permitted Use” and “Environmental Requirements” is much higher. This is consistent with our intuition that clause titles should convey a lot of information, i.e. they should be quite different from each other.
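As an illustration of the dynamic programming computation mentioned above, the following Python sketch computes the Levenshtein distance using two rows of the edit table. The function name `levenshtein` is chosen here for illustration; any thresholds applied to the resulting distances are left out because the text does not state them.

```python
def levenshtein(a, b):
    """Number of insertions, deletions or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


near = levenshtein("Appendix 4", "Appendix 5")
far = levenshtein("Permitted Use", "Environmental Requirements")
```

Here `near` is 1, since only the final digit differs, while `far` is at least 13 because the second title is 13 characters longer than the first.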
In another embodiment, a sequence is determined to be boilerplate if its elements cluster into a single group using any suitable clustering algorithm and threshold parameter. We used the DBSCAN algorithm with a value of 3 for both the epsilon and min points parameters. See (Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. [n.d.]. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD Proceedings (1996). AAAI, 6.), which is incorporated by reference herein in its entirety. We exclude from consideration sequences that contain special domain specific words such as “Article” or “Appendix”. This is the only other domain specific knowledge used besides the definition of g.
With boilerplate removed, the final sequencing and partitioning of the remaining header or title sequences may begin. For the remainder of the proposed solution it may be assumed that boilerplate sequences have been removed from S.
In one embodiment, sequences may be further limited according to a set of heuristics intended to isolate those sequences of the document that represent the substantive content of the document—“the clauses.” For example, the sequence or set of sequences that have the highest value of h( ) may be chosen as a set of “clauses” where h is defined as:
where c(Si) is the number of characters of the original document covered by the titles and their corresponding textual sections in the sequence Si, N is the total number of characters in the document, |Si| is the total number of titles in a sequence and are the number of characters in the title tj. The three parts of this heuristic can be interpreted as: (for the a1 term) the percentage of the document covered by the titles in Si, (for the a2 term) the mean strength of the similarity between titles in Si, and (for the a3 term) a discount for sequences with very long titles. Values for the weights a{1,2,3} may be manually set or may be set via a learned model. Exemplary weights for a{1,2,3} could be, for example, 0.2, 0.7 and 0.1 respectively.
The solution to the LSAP (4), referenced above, introduces two problems into S. The first is that because it implements a perfect matching it encourages sub-sequences to be merged with their parent sequences and adjacent sequences to be appended. As an example of the former problem, consider the following set of headings which are presented in their correct hierarchical relationship with line numbers to their left and the value of mi,i+1 to their right:
Using line numbers, the correct partition of these headings should be into two sequences: (1, 2, 3, 6) and (4, 5). The solution of (4), however, will be the single sequence (1, 2, 3, 4, 5, 6) because it includes 5 edges whose weights add up to be greater than the four edges of the correct solution. That is, even though m3,6>m3,4 and m3,6>m5,6 because of the mismatch in sequence marks (e.g. “d” is more likely to follow “c” than “ii” in a numbered list) it is likely that m3,4+m5,6>m3,6. Again, this is a consequence of deriving a perfect matching.
To remedy this problem we cut sequences at edges of reduced similarity. For a sequence S, the values Sm=(mi,i+1: 0<i<|S|) can be considered as a discrete time-varying signal. The “edges” in this signal identify sub-sequence boundaries. A common approach to detecting edges in a function ƒ(t) corrupted by noise is to search for the zero crossings of its second derivative, ƒ″(t)=0, after first smoothing the data. For discrete functions that are corrupted by noise, as is the case for Sm, this can be accomplished by convolving the signal with a smoothing kernel and with a kernel that numerically approximates the second derivative. Since edges in our case have zero width and manifest between adjacent sample points, the support for the kernels should be very small. An exemplary smoothing kernel is KS=[0.15, 0.7, 0.15], and an exemplary kernel for numerically approximating the second derivative is KD=[1, −2, 1]. Sequences are cut at zero crossings of KD∗KS∗Sm, where ∗ is the convolution operator, and the sub-sequences are added back to S.
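A minimal Python sketch of this edge detection step follows. The two kernels are the exemplary ones given above; the boundary handling (repeating the end values), the small tolerance `eps`, and the function names are assumptions. The sketch only reports where the filtered signal changes sign; how a crossing index is mapped to a cut between two particular headings is not specified here.

```python
K_S = [0.15, 0.7, 0.15]  # exemplary smoothing kernel
K_D = [1, -2, 1]         # exemplary second-derivative kernel


def convolve_same(signal, kernel):
    """Centered convolution; the signal is padded by repeating its end values."""
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    flipped = kernel[::-1]
    return [sum(k * padded[i + t] for t, k in enumerate(flipped))
            for i in range(len(signal))]


def zero_crossings(similarities, eps=1e-9):
    """Indices i where the filtered signal changes sign between i and i + 1."""
    response = convolve_same(convolve_same(similarities, K_S), K_D)
    return [i for i in range(len(response) - 1)
            if (response[i] < -eps and response[i + 1] > eps)
            or (response[i] > eps and response[i + 1] < -eps)]


# Hypothetical similarity signal with a single drop after the third value.
crossings = zero_crossings([0.9, 0.9, 0.9, 0.2, 0.2, 0.2])
```

For this step-shaped input the filtered response is negative at index 2 and positive at index 3, so `crossings` is `[2]`, which locates the drop between the third and fourth similarity values. A constant signal produces no crossings.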
In addition, those sub-sequences that have marks which form runs of roman numerals are identified, cut and the pieces added back to S. This is an example of how the interpretation of the hierarchical position of a heading depends on the other elements in its sequence.
Finally, all crossing sequences in S are cut so that the resulting set of sequences are noncrossing.
if G(Sa/b)>G(Sb/a) cut Sb otherwise cut Sa (5)
“Shattering” the whole partition S is done greedily by cutting one of each pair of crossing sequences and updating S with the results of the cut. Let P be the resulting set of noncrossing sequences.
Some valid sequences in S may, however, have been cut when the set was shattered. In this stage, we merge sequences in P that increase global coherence while maintaining the set as noncrossing. The candidates for merging are those sequences which lie between edges in P. A sequence S is directly covered by an edge j, k if the indices of its headings lie between j and k and there is no other edge i, m in any sequence of P that also covers S such that j<i<m<k. Let Pj,k be the set of sequences that are directly covered by the edge j, k in P.
As before, we will partition Pj,k into subsets which maximize the total coherence of the sequences represented by each subset. In this case, however, we will consider only noncrossing partitions and the elements are not headings but sequences (although they may be sequences of length 1).
Over the set of all noncrossing partitions of Pj,k, let P*j,k be the set of sequences which maximizes total coherence:
The size of the domain needed to be searched in (6) is much smaller than that in (2) not only because |Pj, k|<|H| but also because the number of noncrossing partitions of a set is much smaller than the number of partitions of that set. This makes solving (6) using a beam search feasible.
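The number of noncrossing partitions of a set of n elements is given by the Catalan number:
$$C_n = \frac{1}{n+1}\binom{2n}{n}$$
whereas the number of all partitions of that set is the Bell number:
$$B_n = \sum_{k=0}^{n} S(n, k)$$
where S(n, k) is the Stirling number of the second kind.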
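To give a sense of the gap in size, the following Python sketch compares the Catalan numbers (which count noncrossing partitions) with the Bell numbers (which count all partitions and are obtained by summing Stirling numbers of the second kind). The function names are chosen here for illustration.

```python
from math import comb


def catalan(n):
    """Number of noncrossing partitions of an n-element set."""
    return comb(2 * n, n) // (n + 1)


def stirling2_row(n):
    """Row n of the Stirling numbers of the second kind: [S(n,0), ..., S(n,n)]."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            # Recurrence: S(m, k) = k * S(m-1, k) + S(m-1, k-1)
            above = row[k] if k < len(row) else 0
            new[k] = k * above + row[k - 1]
        row = new
    return row


def bell(n):
    """Number of all partitions of an n-element set."""
    return sum(stirling2_row(n))


sizes = {n: (catalan(n), bell(n)) for n in (5, 10)}
```

For 5 elements the counts are 42 noncrossing partitions against 52 partitions overall; for 10 elements they are 16796 against 115975, and the ratio keeps widening as n grows.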
We can therefore feasibly enumerate the most viable candidates among these noncrossing partitions using a beam search where we keep only a constant number of the top noncrossing partitions by coherence value. A maximum beam of just 50 candidates was enough to ensure optimal performance for the documents in our test set. For each edge j, k∈P the elements of Pj,k are replaced with those of P*j,k. It is easy to see that each such update of P does not decrease G(P) and maintains it as a noncrossing set. Finally, to ensure any top level sequences are properly merged, a virtual edge which spans the entire document is included in P.
The hierarchical structure of the document can be easily constructed from P by setting the depth and parent of each heading in P. For example, the algorithm shown in
The coherence function g assigns higher values to sequences which are more coherent, or consistent with themselves. Its definition is the core of the proposed system and the only thing the user needs to define besides a heading classifier and some constants to aid in boilerplate classification. The fewer features used in its definition the more generalizable the system is at the expense of accuracy. In our case, we did not use any word-based features but only layout, formatting (typography) and case. In all, 17 features were used which fall into the four main classes listed in Table 2.
Since all features will be corrupted by noise it is important that g vary smoothly with changes in its inputs and produce its lowest value when the features for a sequence are most inconsistent with each other. If there are k features then:
where ƒi(S) are the values of feature i for sequence S and H is the entropy function.
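The following Python sketch shows one way an entropy-based coherence score could behave. The exact formula for g is not preserved in the text above, so the form used here (the negated sum of per-feature entropies) is purely an assumption, as are the feature names and the function names `entropy` and `coherence`. It is meant only to illustrate the stated property that the score is lowest when feature values within a sequence are most inconsistent.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def coherence(features):
    """Assumed form of g: higher when every feature is constant across the sequence.

    features maps a feature name to the list of its values, one per heading.
    """
    return -sum(entropy(vals) for vals in features.values())


# Hypothetical layout and typography features for two three-heading sequences.
consistent = {"bold": [True, True, True], "indent": [0, 0, 0]}
inconsistent = {"bold": [True, False, True], "indent": [0, 4, 8]}

g_consistent = coherence(consistent)
g_inconsistent = coherence(inconsistent)
```

Under this assumed form a sequence whose headings share identical features scores 0, the maximum, while mixed features push the score below zero.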
The algorithm was evaluated on a test set T of 35 randomly selected documents that were originally part of publicly available financial disclosure documents filed with the SEC. These documents had been digitally scanned and their text extracted using OCR technology. In addition, they had been processed through a complex pipeline of transformations, many of which have introduced small but significant errors in the format and/or text of the document. The headings of each document were classified by hand into one of 14 classes. These consisted of TitlePage, TableOfContents, Heading-X, ListItem-X and Other where X can be 1 through 5 indicating the depth of the heading or list-item. All boilerplate is classified as Other. Table of Contents headings are ignored from both the test documents and our algorithm output.
Our primary performance metric is how well our algorithm re-constructs the tree of the original document whereas less emphasis has been placed on detecting section headings. This is because identified headings will always contain errors and a focus of this work is to reconstruct a document's hierarchy in the presence of such errors. The F1 score for heading classification is 0.89 with more details given in Table 3.
The quality of a document's reconstructed hierarchy is determined by comparing each ground truth tree T from the test set with the one from our algorithm T′. In order to separate the performance measurement of heading identification from hierarchy reconstruction, we compare the versions of T and T′ that have been restricted to those nodes that are common between them. In this way, we measure how well the algorithm reconstructed the document's hierarchy independently of how well it has identified headings. Denote these restricted trees as M and M′ as depicted in
Table 3 lists the PC score for the test set.
Finally, the ability to identify boilerplate is determined by the macro-averaged precision and recall of the Other headings classified as boilerplate by the algorithm and which are also reported in Table 3.
The systems and methods described herein may be embodied in a standalone system, a system accessible by other systems or any combination. For example, in a standalone system embodiment, the structure and header extraction tools may be comprised in a standalone application residing on a user's computing device or accessed via a network or internet link from the user's device. Such a standalone application may be configured to obtain standard documents such as standard playbooks or standard contracts from a contract analytics tool or other library through a web, network and/or API link, for example. Such an application may be configured to create user dashboards, visualizations and detection result exports. Such an application may be configured to interact with another application configured to perform any of the steps described herein.
The systems and methods described herein may also be embodied in a structure and/or header extraction service accessible to other applications via a web, network or API link. For example, a contract evaluation tool may be configured to access a structure and/or header extraction service independently via an API.
In software implementations, computer software (e.g., programs or other instructions) and/or data is stored on a machine readable medium as part of a computer program product, and is loaded into a computer system or other device or machine via a removable storage drive, hard drive, or communications interface. Computer programs (also called computer control logic or computer readable program code) are stored in a main and/or secondary memory, and executed by one or more processors (controllers, or the like) to cause the one or more processors to perform the functions of the disclosure as described herein. In this document, the terms “machine readable medium,” “computer program medium” and “computer usable medium” are used to generally refer to media such as a random access memory (RAM); a read only memory (ROM); a removable storage unit (e.g., a magnetic or optical disc, flash memory device, or the like); a hard disk; or the like.
Notably, the figures and examples above are not meant to limit the scope of the present disclosure to a single embodiment, as other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where certain elements of the present disclosure can be partially or fully implemented using known components, only those portions of such known components that are necessary for an understanding of the present disclosure are described, and detailed descriptions of other portions of such known components are omitted so as not to obscure the disclosure. In the present specification, an embodiment showing a singular component should not necessarily be limited to other embodiments including a plurality of the same component, and vice-versa, unless explicitly stated otherwise herein. Moreover, the applicant does not intend for any term in the specification or claims to be ascribed an uncommon or special meaning unless explicitly set forth as such. Further, the present disclosure encompasses present and future known equivalents to the known components referred to herein by way of illustration.
The foregoing description of the specific embodiments so fully reveals the general nature of the disclosure that others can, by applying knowledge within the skill of the relevant art(s), readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Such adaptations and modifications are therefore intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance presented herein, in combination with the knowledge of one skilled in the relevant art(s).
In order to address various issues and advance the art, the entirety of this application for SYSTEMS AND METHODS FOR STRUCTURE AND HEADER EXTRACTION (including the Cover Page, Title, Abstract, Headings, Cross-Reference to Related Application, Background, Brief Summary, Brief Description of the Drawings, Detailed Description, Claims, Figures, and otherwise) shows, by way of illustration, various embodiments in which the claimed innovations may be practiced. The advantages and features of the application are of a representative sample of embodiments only, and are not exhaustive and/or exclusive. They are presented only to assist in understanding and teach the claimed principles. It should be understood that they are not representative of all claimed innovations. As such, certain aspects of the disclosure have not been discussed herein. That alternate embodiments may not have been presented for a specific portion of the innovations or that further undescribed alternate embodiments may be available for a portion is not to be considered a disclaimer of those alternate embodiments. It will be appreciated that many of those undescribed embodiments incorporate the same principles of the innovations and others are equivalent. Thus, it is to be understood that other embodiments may be utilized and functional, logical, operational, organizational, structural and/or topological modifications may be made without departing from the scope and/or spirit of the disclosure. As such, all examples and/or embodiments are deemed to be non-limiting throughout this disclosure. Also, no inference should be drawn regarding those embodiments discussed herein relative to those not discussed herein other than it is as such for purposes of reducing space and repetition. 
For instance, it is to be understood that the logical and/or topological structure of any combination of any program components (a component collection), other components and/or any present feature sets as described in the figures and/or throughout are not limited to a fixed operating order and/or arrangement, but rather, any disclosed order is exemplary and all equivalents, regardless of order, are contemplated by the disclosure. Furthermore, it is to be understood that such features are not limited to serial execution, but rather, any number of threads, processes, services, servers, and/or the like that may execute asynchronously, concurrently, in parallel, simultaneously, synchronously, and/or the like are contemplated by the disclosure. As such, some of these features may be mutually contradictory, in that they cannot be simultaneously present in a single embodiment. Similarly, some features are applicable to one aspect of the innovations, and inapplicable to others. In addition, the disclosure includes other innovations not presently claimed. Applicant reserves all rights in those presently unclaimed innovations including the right to claim such innovations, file additional applications, continuations, continuations in part, divisions, and/or the like thereof. As such, it should be understood that advantages, embodiments, examples, functional, features, logical, operational, organizational, structural, topological, and/or other aspects of the disclosure are not to be considered limitations on the disclosure as defined by the claims or limitations on equivalents to the claims. It is to be understood that, depending on the particular needs and/or characteristics of an individual and/or enterprise user, database configuration and/or relational model, data type, data transmission and/or network framework, syntax structure, and/or the like, various embodiments may be implemented that enable a great deal of flexibility and customization. 
While various embodiments and discussions have included reference to applications in the legal context, and more specifically in the context of contract review, it is to be understood that the embodiments described herein may be readily configured and/or customized for a wide variety of other applications and/or implementations.
This application claims the benefit of and priority to U.S. Provisional Application Nos. 62/965,516, filed Jan. 24, 2020; 62/965,520, filed Jan. 24, 2020; 62/965,523, filed Jan. 24, 2020; and 62/975,514, filed Feb. 12, 2020, which are hereby incorporated by reference in their entireties. This application for letters patent disclosure document describes inventive aspects that include various novel innovations (hereinafter “disclosure”) and contains material that is subject to copyright, mask work, and/or other intellectual property protection. The respective owners of such intellectual property have no objection to the facsimile reproduction of the disclosure by anyone as it appears in published Patent Office file/records, but otherwise reserve all rights.
Number | Date | Country | |
---|---|---|---|
20210319177 A1 | Oct 2021 | US |
Number | Date | Country | |
---|---|---|---|
62975514 | Feb 2020 | US | |
62965516 | Jan 2020 | US | |
62965523 | Jan 2020 | US | |
62965520 | Jan 2020 | US |