Titles and sections of a document aid users in reaching a preliminary understanding of the document's contents. Electronic documents (e.g., OOXML document, PDF document, etc.) include tags that help users identify these titles and sections. However, depending on how the electronic documents are created, not all titles and sections may be identified by tags, and incorrect tagging of titles and sections may occur. Regardless, users still wish to be able to accurately identify the titles and sections of these electronic documents.
In general, in one aspect, the invention relates to a method for processing an electronic document (ED) to infer titles and sections in the ED. The method comprises: applying visual analysis to the ED and identifying candidate titles and candidate sections of the ED; filtering the candidate titles based on the candidate sections; filtering the candidate sections based on the filtered candidate titles; applying semantic analysis to the ED and identifying topics and portions of the ED; refining, based on the identified topics and the portions, the filtered candidate titles and the filtered candidate sections; and generating a marked-up version of the ED that identifies the refined candidate titles and the refined candidate sections.
In general, in one aspect, the invention relates to a non-transitory computer readable medium (CRM) storing computer readable program code for processing an electronic document (ED) to infer titles and sections in a parsed version of the ED embodied therein. The computer readable program code causes a computer to: apply visual analysis to the ED and identify candidate titles and candidate sections of the ED; filter the candidate titles based on the candidate sections; filter the candidate sections based on the filtered candidate titles; apply semantic analysis to the ED and identify topics and portions of the ED; refine, based on the identified topics and the portions, the filtered candidate titles and the filtered candidate sections; and generate a marked-up version of the ED that identifies the refined candidate titles and the refined candidate sections.
In general, in one aspect, the invention relates to a system for processing an electronic document (ED) to infer titles and sections in a parsed version of the ED. The system comprises: a memory; and a processor coupled to the memory. The processor: applies visual analysis to the ED and identifies candidate titles and candidate sections of the ED; filters the candidate titles based on the candidate sections; filters the candidate sections based on the filtered candidate titles; applies semantic analysis to the ED and identifies topics and portions of the ED; refines, based on the identified topics and the portions, the filtered candidate titles and the filtered candidate sections; and generates a marked-up version of the ED that identifies the refined candidate titles and the refined candidate sections.
Other aspects of the invention will be apparent from the following description and the appended claims.
Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.
In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
In general, embodiments of the invention provide a method, a non-transitory computer readable medium (CRM), and a system for processing an electronic document (ED) to infer titles and sections of the ED. Specifically, an ED including one or more pages and at least one section is obtained. The ED may or may not include a title. One or more processes applying a combination of visual and semantic analyses are executed on the ED to obtain content information (e.g., candidate titles, candidate sections, topics, and portions of the ED). With the contents of the ED identified, the titles and sections of the ED can be inferred even if they are not explicitly identified (i.e., labeled and/or tagged).
The buffer (102) may be implemented in hardware (i.e., circuitry), software, or any combination thereof. The buffer (102) is configured to store an electronic document (ED) (104). The ED (104) may include a combination of one or more lines of text made up of characters and non-text objects (e.g., images, graphics, tables, charts, graphs, etc.). The ED (104) may be obtained (e.g., downloaded, scanned, etc.) from any source. The ED (104) may be a single-paged document or a multi-paged document. Further, the ED (104) may be of any size and in any format (e.g., PDF, OOXML, ODF, HTML, etc.).
The system (100) includes the inference engine (106). The inference engine (106) may be implemented in hardware (i.e., circuitry), software, or any combination thereof. The inference engine (106) parses the ED (104) to extract content, layout, and styling information of the characters in the ED (104) and generates a parsed version of the ED (104) based on the extracted information. The parsed version of the ED (104) may be stored in the buffer (102). Alternatively, the inference engine (106) renders the ED (104) into a bitmap object and stores the rendered bitmap of the ED (104) in the buffer (102).
The inference engine (106) further applies visual analysis to the ED (104) to identify candidate (i.e., potential) titles and sections based on the layout and styling information of the characters in the parsed version or the rendered bitmap of the ED (104). Visual analysis may be applied using any system, program, software, or combination thereof (herein referred to as “visual inferencers”) that is able to accurately recognize candidate titles and sections using the layout and styling information of the characters and/or the rendered bitmap of the ED (104). For example, the visual inferencers may be any one of a Convolutional Neural Network, a Recurrent Neural Network, or a combination thereof that is trained (e.g., using artificial intelligence) to recognize the titles and sections of a document.
A candidate title may include any text or combination of texts that identify any one of: a name of the ED (104) as a whole, a section of the ED (104), and/or any non-text objects within the ED (104). Candidate titles may be visually distinct from other texts in the ED (104) (e.g., candidate titles may have larger font sizes, different font styles, different font colors, or a combination thereof). The ED (104) need not necessarily include any candidate titles.
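As a rough illustration of what "visually distinct" can mean in practice, the following sketch flags lines that are bold or noticeably larger than the typical line. It is a simple heuristic stand-in, not one of the trained visual inferencers described above; the `Line` record, its fields, and the `ratio` threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Line:
    """One line of text with hypothetical styling fields."""
    text: str
    font_size: float
    bold: bool = False


def candidate_titles(lines, ratio=1.2):
    """Return lines whose styling stands out from the typical line.

    `ratio` is an assumed threshold: a line counts as distinct when its
    font size is at least `ratio` times the median font size.
    """
    if not lines:
        return []  # an ED need not contain any candidate titles
    typical = median(line.font_size for line in lines)
    return [
        line for line in lines
        if line.bold or line.font_size >= ratio * typical
    ]
```

Called on a page with one 18-point line, one bold 10-point line, and two plain 10-point lines, this returns the first two as candidate titles. A real visual inferencer would weigh many more cues (color, spacing, position on the page).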
A candidate section may include a piece of the ED (104) with content that is visually distinct from other contents of the ED (104) (e.g., a paragraph or a group of paragraphs, any of the non-text objects, etc.). A candidate section may be a major section that includes two or more minor sections that are nested or presented in a hierarchical manner. The ED (104) must include at least one candidate section (e.g., a candidate section covering an entirety of the ED). Each candidate section of the ED (104) may be associated with a candidate title.
The inference engine (106) further applies semantic analysis to the ED (104) to identify topics and portions based on the content information of the characters in the parsed version or based on the rendered bitmap of the ED (104). The semantic analysis may be applied using any system, program, software, or combination thereof (herein referred to as “semantic inferencers”) that is able to accurately recognize the semantics (i.e., meaning and logic) of the texts in the ED (104). For example, the semantic analysis may be applied using one or more Natural Language Processing (NLP) techniques.
In one or more embodiments, a topic of the ED (104) is the subject matter of the entire ED (104) or of one or more parts of the ED (104). The ED (104) must have at least one topic. A topic of the ED (104) may be associated with one or more of the candidate titles and sections.
In one or more embodiments, a portion of the ED (104) is a part (i.e., area) of the ED (104) identified based on differentiating the contents of the ED (104). For example, assume that the ED (104) includes part A with content A and part B with content B. Further assume that content A and content B are different. Part A and part B of the ED (104) would each be identified as a portion of the ED (104). In one or more embodiments, each non-text object in the ED (104) is identified as a portion of the ED (104). Differentiating the contents of the ED (104) may be based on the topics (i.e., different topics are treated as different content). The ED (104) includes at least one portion (i.e., the entirety of the ED (104) is treated as a single portion). A portion may include one or more other portions that are nested or presented in a hierarchical manner within the portion. A portion of the ED (104) may be associated with one or more of the candidate titles and sections (i.e., a portion of the ED (104) may be associated with one or more topics of the ED (104)).
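The part A / part B example can be sketched as follows, assuming a semantic inferencer has already assigned one topic label to each paragraph. The function name and the `(topic, first, last)` tuple layout are hypothetical; nested portions and non-text objects are left out for brevity.

```python
from itertools import groupby


def split_into_portions(paragraph_topics):
    """Group consecutive paragraphs sharing a topic into portions.

    `paragraph_topics` holds one topic label per paragraph, in reading
    order. Each portion is returned as (topic, first_index, last_index).
    """
    portions = []
    index = 0
    for topic, run in groupby(paragraph_topics):
        length = len(list(run))
        portions.append((topic, index, index + length - 1))
        index += length
    return portions
```

With labels `["A", "A", "B"]` the result is two portions, one per topic. When every paragraph carries the same label, the whole document comes back as a single portion, matching the minimum of one portion per ED.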
In one or more embodiments, a single visual inferencer may be used to identify the candidate titles and sections in the ED (104). Alternatively, multiple visual inferencers may be used to identify the candidate titles and sections (e.g., one or more visual inferencers for the candidate titles and one or more visual inferencers for the candidate sections). Similarly, a single semantic inferencer may be used to identify the topics and portions in the ED (104). Alternatively, multiple semantic inferencers may be used to identify the topics and portions (e.g., one or more semantic inferencers for the topics and one or more semantic inferencers for the portions).
The system (100) includes the convergence engine (108). The convergence engine (108) may be implemented in hardware (i.e., circuitry), software, or any combination thereof. The convergence engine (108) works in tandem with the inference engine (106) to execute an iterative process of one or more embodiments for inferring the titles and sections of the ED (104) by applying the visual and semantic analysis in a predetermined order. The iterative process of one or more embodiments is described in more detail below with reference to the flowchart shown in
The convergence engine (108) further generates a marked-up version of the ED (104) with the candidate titles and sections identified (i.e., distinguished from the other contents of the ED (104) for the user using boxes, highlighting, etc.). In one or more embodiments, the results of the identified titles and sections in the marked-up version of the ED (104) may vary based on the type(s) of visual and semantic inferencers applied to the ED (104).
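One possible way to produce such a marked-up version for a text-only ED is sketched below. The `<section>` and `<title>` tags are hypothetical placeholders chosen for this example; the description itself only requires that titles and sections be distinguished (e.g., by boxes or highlighting). Titles are assumed to be given as line indexes and sections as inclusive `(start, end)` line ranges.

```python
from html import escape


def mark_up(lines, title_lines, sections):
    """Return the lines of an ED with titles and sections tagged.

    `title_lines` is a set of line indexes inferred to be titles;
    `sections` is a list of inclusive (start, end) line ranges.
    """
    out = []
    for i, text in enumerate(lines):
        # Open every section that starts on this line.
        out.extend("<section>" for start, _ in sections if start == i)
        body = escape(text)
        out.append(f"<title>{body}</title>" if i in title_lines else body)
        # Close every section that ends on this line.
        out.extend("</section>" for _, end in sections if end == i)
    return out
```

Because every section is opened at its start line and closed at its end line, properly nested ranges (a major section containing minor sections) yield properly nested tags.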
Although the system (100) is shown as having three components (102, 106, 108), in other embodiments of the invention, the system (100) may have more or fewer components. Further, the functionality of each component described above may be split across components. Further still, each component (102, 106, 108) may be utilized multiple times to carry out an iterative operation.
Initially, an ED is obtained (STEP 205). The ED may include a combination of one or more lines of text made up of characters, non-text objects, etc. The ED (104) may be obtained (e.g., downloaded, scanned, etc.) from any source. The ED (104) may be a single-paged document or a multi-paged document. Further, the ED (104) may be of any size and in any format (e.g., PDF, OOXML, ODF, HTML, etc.). The ED includes at least one section, at least one topic, and at least one portion, and may or may not include a title.
In STEP 210A, using the visual inferencers as discussed above in reference to
In STEP 215, the visual inferencers are applied to the ED to filter (i.e., refine) the candidate titles identified in STEP 210A while considering (i.e., based on) the candidate sections identified in STEP 210B. In STEP 220, the visual inferencers are applied to the ED to filter the candidate sections identified in STEP 210B while considering the candidate titles filtered in STEP 215 (i.e., the filtered candidate titles).
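The mutual dependence of STEPs 215 and 220 can be illustrated with a deliberately simple rule set. The rules here are assumptions made for the sketch, not the behavior of any particular visual inferencer: a candidate title survives only if a candidate section begins on the next line, and a candidate section survives only if a surviving title immediately precedes it. Titles are line indexes and sections are inclusive `(start, end)` line ranges.

```python
def filter_titles(titles, sections):
    """STEP 215 (sketch): keep titles directly followed by a section start."""
    starts = {start for start, _ in sections}
    return [t for t in titles if t + 1 in starts]


def filter_sections(sections, filtered_titles, total_lines):
    """STEP 220 (sketch): keep sections introduced by a surviving title.

    If nothing survives, fall back to one section spanning the whole ED,
    since an ED always has at least one candidate section.
    """
    titled = set(filtered_titles)
    kept = [s for s in sections if s[0] - 1 in titled]
    return kept or [(0, total_lines - 1)]
```

For titles at lines 0, 5, and 9 and sections `(1, 4)`, `(6, 8)`, and `(11, 12)`, the title at line 9 is dropped because no section starts at line 10, and the section `(11, 12)` is then dropped because no surviving title precedes it.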
In one or more embodiments, the degree of change (i.e., the number of new candidate titles and sections identified, the number of identified candidate titles and sections eliminated, the association between the identified candidate titles and sections, etc.) to the identified candidate titles and sections that may occur in STEPs 215 and 220 depends on the specificity of the analysis performed by the visual inferencers (i.e., depends on the capabilities of the visual inferencers). Use of different types of visual inferencers may produce different results in STEPs 215 and 220. This is exemplified in more detail below in
In STEP 225, using the semantic inferencers as discussed above in reference to
In STEP 230, the candidate titles and sections filtered in STEPs 215 and 220 (i.e., the filtered candidate titles and sections) are re-evaluated and refined, using a combination of the visual and semantic inferencers, based on the topics and portions identified in STEP 225.
In one or more embodiments, the filtered candidate titles and sections are refined based on the topics and portions by providing the visual inferencers with refined inputs based on only parts of the ED. For example, one refined input to the inferencers may be based on one of the portions identified in STEP 225 (e.g., visual analysis by the visual inferencers is performed only on that single portion). Employing these refined inputs narrows the focus of the visual inferencers, which causes certain visual features of the ED (i.e., the style and layout information of the ED or certain bits in the rendered bitmaps) to stand out more compared to applying visual analysis on the entire ED.
The focus of the visual inferencers may be narrowed to focus on parts with potential inconsistencies. For example, a potential inconsistency may be identified, with the help of the information identified by the semantic inferencers, between one or more candidate titles and a certain topic associated with the candidate titles (i.e., a candidate title seems less likely to be an actual title of the ED given the topic associated with the candidate title). The focus of the visual inferencers may then be narrowed to that part (i.e., one or more portions or candidate sections) around the potential inconsistency.
The focus of the visual inferencers may also be narrowed to focus on the non-text objects. For example, a non-text object may be associated with a caption (i.e., a title of a non-text object) that describes the non-text object. The caption may also be within a predetermined area of the non-text object in order for users to easily identify and comprehend the non-text object. The focus of the visual inferencers may then be narrowed to focus on this predetermined area in order to look for previously identified candidate titles that may potentially be the caption of the non-text object.
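A minimal geometric sketch of this caption search follows. It assumes that the non-text object and each previously identified candidate title carry a bounding box in page coordinates, and it models the "predetermined area" as the object's box grown by a fixed `margin`. The `Box` type, the `margin` default, and the function names are all hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def expand(box, margin):
    """Grow a box by `margin` on every side."""
    return Box(box.x0 - margin, box.y0 - margin,
               box.x1 + margin, box.y1 + margin)


def overlaps(a, b):
    """True when two boxes share at least one point."""
    return a.x0 <= b.x1 and b.x0 <= a.x1 and a.y0 <= b.y1 and b.y0 <= a.y1


def caption_candidates(object_box, titles, margin=20.0):
    """Return texts of candidate titles lying near a non-text object.

    `titles` is a list of (text, Box) pairs for candidate titles.
    """
    area = expand(object_box, margin)
    return [text for text, box in titles if overlaps(area, box)]
```

A candidate title sitting just below an image falls inside the expanded area and is returned as a possible caption, while a heading far up the page is ignored.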
In one or more embodiments, determining the refined inputs may also be based on masking out parts of the ED before further visual analysis is applied. These masked out parts may include candidate titles and sections that prior visual analysis in STEPs 210A to 220 deemed to be unlikely titles of the ED. Parts of the ED that are not masked out are then submitted as the refined inputs for further analysis.
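The two ways of narrowing the input described above (restricting analysis to a single portion and masking out previously rejected parts) can be combined in a short sketch. It assumes the ED is available as a list of lines, that a portion is an inclusive `(start, end)` line range, and that masked parts are given as a set of line indexes; these representations are illustrative only.

```python
def refined_input(lines, portion, masked):
    """Build a refined input for further visual analysis.

    Only lines inside `portion` are kept, and any line whose index is
    in `masked` is left out. Original indexes are preserved so results
    can be mapped back onto the full ED.
    """
    start, end = portion
    return [(i, lines[i]) for i in range(start, end + 1) if i not in masked]
```

Given five lines, a portion covering lines 1 to 4, and line 3 masked out, the refined input contains lines 1, 2, and 4 together with their original positions.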
In STEP 235, the topics and portions identified in STEP 225 are re-evaluated and refined, using a combination of the visual and semantic inferencers, based on the filtered candidate titles and sections that were re-evaluated and refined in STEP 230.
In STEP 240, the refined candidate titles and sections from STEP 230 are further re-evaluated and refined, using a combination of the visual and semantic inferencers, based on the topics and portions that were re-evaluated and refined in STEP 235.
In one or more embodiments, the degree of change to the filtered candidate titles and sections and to the topics and portions that may occur in STEPs 230 to 240 after the re-evaluation and refinement may depend on the specificity of the analysis performed by the visual and semantic inferencers (i.e., depends on the capabilities of the visual and semantic inferencers). Application of different types of visual and semantic inferencers may produce different results. This is discussed in more detail below in the description of
In STEP 245, a determination is made whether a point of convergence has been reached (i.e., a point where further refinement will no longer cause any changes and/or yield any different results). If the determination in STEP 245 is NO, the process returns to STEP 235 where the candidate titles and sections and the topics and portions are further refined based on one another.
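The loop formed by STEPs 235 to 245 is a fixed-point iteration, which can be sketched as below. The two refinement callables stand in for the combined visual and semantic inferencers and are supplied by the caller; `max_rounds` is an assumed safety limit added for the sketch and is not part of the described process.

```python
def iterate_to_convergence(titles_sections, topics_portions,
                           refine_ts, refine_tp, max_rounds=50):
    """Alternate the two refinements until neither result changes.

    refine_tp(topics_portions, titles_sections) models STEP 235.
    refine_ts(titles_sections, topics_portions) models STEP 240.
    Returns the final pair and the number of rounds executed.
    """
    for rounds in range(1, max_rounds + 1):
        new_tp = refine_tp(topics_portions, titles_sections)  # STEP 235
        new_ts = refine_ts(titles_sections, new_tp)           # STEP 240
        # STEP 245: convergence means another pass changes nothing.
        if new_ts == titles_sections and new_tp == topics_portions:
            return new_ts, new_tp, rounds
        titles_sections, topics_portions = new_ts, new_tp
    return titles_sections, topics_portions, max_rounds
```

As a toy usage, let the title refinement keep only titles that also appear among the topics and let the topic refinement return its input unchanged. Starting from titles `{"Intro", "Page 3", "Methods"}` and topics `{"Intro", "Methods", "Results"}`, the first round drops "Page 3" and the second round detects that nothing changes, so the loop stops after two rounds.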
If the determination in STEP 245 is YES, a marked-up version of the ED, as discussed above in reference to
Embodiments of the invention may be implemented on virtually any type of computing system, regardless of the platform being used. For example, the computing system may be one or more mobile devices (e.g., laptop computer, smart phone, personal digital assistant, tablet computer, or other mobile device), desktop computers, servers, blades in a server chassis, or any other type of computing device or devices that includes at least the minimum processing power, memory, and input and output device(s) to perform one or more embodiments of the invention. For example, as shown in
Software instructions in the form of computer readable program code to perform embodiments of the invention may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that, when executed by a processor(s), is configured to perform embodiments of the invention.
Further, one or more elements of the aforementioned computing system (400) may be located at a remote location and be connected to the other elements over a network (412). Further, one or more embodiments of the invention may be implemented on a distributed system having a plurality of nodes, where each portion of the invention may be located on a different node within the distributed system. In one embodiment of the invention, the node corresponds to a distinct computing device. Alternatively, the node may correspond to a computer processor with associated physical memory. The node may alternatively correspond to a computer processor or micro-core of a computer processor with shared memory and/or resources.
One or more embodiments of the invention may have one or more of the following advantages: the ability to accurately identify the titles and sections of one or more electronic documents that do not include tags; the ability to identify any incorrectly tagged titles and sections of electronic documents; the ability to execute the above identification without intervention by a user; etc.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.