The disclosed embodiments relate to image segmentation post-processing and entity extraction of images, in particular images of historical documents or records, such as newspapers.
Many existing genealogical research systems provide repositories of information stored in databases and are configured to allow connected devices to search for information stored within the repositories. As part of building genealogical data repositories and constructing genealogical trees of nodes connecting individuals and/or historical records, some genealogical research systems analyze and process information from many different data sources, including digitized versions of historical documents, such as newspapers. Extracting actionable data from digitized historical documents, such as newspaper images, is a technologically complicated task due to several factors, including the arrangement of digitized content, the condition of digitized historical content, and the variation of digitized content in historical records of different time periods. Consequently, many existing systems exhibit a number of deficiencies or drawbacks, particularly regarding accuracy and computational efficiency.
As just suggested, some existing genealogical research systems inaccurately extract information from historical records, particularly newspaper images. To elaborate, many existing systems rely on image analysis techniques and models that are prone to error, especially when processing newspaper images with wide variations in article size, location, style, and condition. For instance, the language style and the article placement of newspapers have evolved over different decades and locations, and conventional models struggle to distinguish between text of distinct articles across the wide variety of newspaper images. Consequently, even if a prior system correctly identifies an article within a newspaper image, the conventional models of existing systems cannot accurately distinguish the article from others within the newspaper image, nor can they accurately extract information from the identified article.
In addition to their inaccuracies, existing genealogical research systems can further suffer from computational inefficiencies. Indeed, the models used by many existing systems to process and analyze digitized historical records, such as newspaper images, consume excessive amounts of computing resources (e.g., processing power and memory) that could otherwise be preserved with a more efficient system. For example, some existing systems utilize conventional machine learning models or other conventional algorithms that, without more intelligent post-processing techniques, require significant computing resources when processing digitized content of historical records.
This disclosure describes one or more embodiments of systems, methods, and non-transitory computer-readable storage media that provide benefits and/or solve one or more of the foregoing and other problems in the art. In particular, the disclosed systems generate and provide actionable data from newspaper articles identified and segmented from digital newspaper images. For example, the disclosed systems segment articles of a newspaper image by using specially designed models to generate polygons defining article boundaries within the newspaper image. In some cases, the disclosed systems further determine article text from a polygon of an article for additional processing to determine an article topic, determine an article type, predict entity names within the article, and/or predict a location associated with the article.
The detailed description refers to the drawings briefly described below.
This disclosure describes one or more embodiments of a newspaper image system that can identify and segment individual articles within newspaper images for extracting further data, including article topics, entity names, and article locations. In many use cases, user accounts of genealogical content systems use client devices to search genealogical databases for genealogical content items (e.g., digitized newspaper articles, images, census records, obituaries, court documents, military records, immigration records, and other types of digitized historical documents) to identify content items associated with individuals within genealogical trees stored within one or more genealogical tree databases. As part of this process, the newspaper image system can process newspaper images to generate searchable newspaper articles within a database and to generate actionable data (e.g., article topics, entity names, and article localities) to use as a basis for searching through segmented newspaper articles.
As just mentioned, the newspaper image system can analyze newspaper images (or other images of digitized historical records) to determine or identify individual articles. For example, the newspaper image system utilizes an article prediction model to analyze the pixels of a newspaper image to generate polygons defining boundaries of individual articles. In some cases, the newspaper image system generates regular or irregular polygons, depending on the size and shape of the detected article in the newspaper image. Additionally, the newspaper image system can generate multiple polygons for a single article that spans (or is separated across) multiple newspaper image locations (e.g., in one or more columns). To generate the polygons, the newspaper image system can use one or more image correction techniques (each of which may require its own respective model) to detect article columns, fix image skew/rotation issues, resolve overlapping article predictions, and remove outlier columns.
From a detected newspaper article, the newspaper image system can further generate article text using one or more optical character recognition models/algorithms. Using the article text, the newspaper image system can perform additional processes, including topic prediction, entity extraction, and locality prediction. For example, the newspaper image system can utilize a topic prediction model to generate topic predictions for detected articles based on recognized text and/or other features. In some cases, the newspaper image system classifies articles into topics based (solely) on visual features of pixels within a corresponding polygon.
In one or more embodiments, the newspaper image system can also extract entities from detected articles. In particular, the newspaper image system can analyze recognized text from a polygon of a newspaper image to determine entity names and name parts (e.g., title, given name, surname) within the article polygon. In some cases, the newspaper image system can further determine or predict relationships between detected entities based on article text, such as familial relationships, residence relationships (e.g., between a person entity name and a location entity name), occupation relationships (e.g., between a person entity name and a business/government entity name), location relationships (e.g., between a business/government entity name and a location entity name), and/or other relationships.
As mentioned, in some embodiments, the newspaper image system generates locality predictions for detected articles. For example, the newspaper image system can analyze or process recognized text to determine a locality prediction (e.g., a locality classification) for an article of a polygon. In some cases, the newspaper image system can classify an article as a local article or a non-local article. In these or other cases, the newspaper image system can classify an article as a local article, a national article, an international article, an article specific to a particular city/state/region, or an unknown article. To determine a locality prediction for an article of a polygon, the newspaper image system can utilize a specialized article locality model fine-tuned to predict localities associated with article text.
As suggested above, the newspaper image system can provide improvements or advantages over existing historical content systems. For example, the newspaper image system can improve accuracy over prior systems in identifying (and extracting information from) articles within digitized historical records, especially newspaper articles. Indeed, while some prior systems inaccurately identify digitized content within low-quality historical records that vary over different locations and time periods, the newspaper image system utilizes a specialized article prediction model (robust to changing styles and qualities) to generate polygons defining article boundaries based on various techniques, such as detecting column width, determining column location, correcting image rotation, and detecting and removing outlier columns. For instance, by using the specialized processes described herein to detect and process newspaper images, the newspaper image system identifies newspaper articles and extracts data from newspaper articles more accurately than prior systems.
Relating to accuracy improvements, in one or more embodiments, the newspaper image system improves or refines image segmentation outputs (e.g., polygon coordinates and corresponding confidence scores) by applying one or more post-processing techniques. For instance, image segmentation in complex and hard-to-predict newspaper images is particularly challenging because the layout and content vary significantly from issue to issue and from publication to publication, making the task of automatically segmenting or discretizing articles within a page for downstream processing, such as optical character recognition (“OCR”), natural language processing (“NLP”), entity extraction, entity resolution, and/or others, exceedingly difficult. Accordingly, the newspaper image system 102 utilizes an article prediction model together with other models and techniques described herein to accurately predict newspaper columns, account for image rotations, and generate accurate polygons.
In some embodiments, the newspaper image system further improves computational efficiency over prior systems. For example, as opposed to prior systems that process newspaper images (or other digitized historical records) using brute force pixel analysis, the newspaper image system uses sophisticated models to more efficiently detect newspaper articles defined by polygons. Specifically, in some cases, the newspaper image system uses an article prediction model adapted from the building architecture domain to detect columns within a newspaper image (as opposed to edges of buildings in a city) to inform the process of article segmentation. Using the described models to segment newspaper articles, the newspaper image system consumes fewer computing resources than prior systems. To this point, researchers have demonstrated efficiency gains of 100× or more over the computational requirements of prior systems when testing the newspaper image system.
As illustrated by the foregoing discussion, the present disclosure utilizes a variety of terms to describe features and benefits of the newspaper image system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used herein, the term “newspaper image” refers to a digital image depicting digitized newspaper content. For example, a newspaper image includes a high-resolution image (e.g., 5000×7000 pixels) captured from a historical newspaper (e.g., from the 1700s or 1800s) and whose pixels depict digitized newspaper content from across the various pages of the newspaper in a single image. In some cases, a newspaper image has a columnar arrangement of newspaper content, where articles and advertisements are included in one or more columns of the image.
Relatedly, the term “article” refers to a discrete content body within a newspaper image, distinct from other bodies of content within the same image. For example, an article can have a particular topic or subject matter that separates it from other articles. In addition, an article can be spread across one or more columns within a newspaper image, reflecting how articles in printed newspapers are often continued on subsequent pages and sections. In certain cases, the newspaper image system determines or classifies an article type for an article based on visual features of the article (as opposed to text features). Example article types include: i) page number, ii) miscellaneous, iii) photo, iv) graphic/illustration, v) cartoon, vi) caption, vii) masthead, viii) advertisement, ix) crossword puzzle, x) title, xi) subtitle, and xii) reference text.
In some embodiments, an article includes or describes a particular topic. Example article topics include: i) arts and culture, ii) conflict and war, iii) economy and business, iv) education, v) environment, vi) health, vii) human interest, viii) labor, ix) politics, x) religion, xi) science and technology, xii) society (social issues), xiii) sports, xiv) weather, xv) birth, xvi) military, xvii) bad OCR, xviii) advertisement, xix) not an article, xx) club and association, xxi) recipes, xxii) horoscope, xxiii) miscellaneous lifestyle, xxiv) crime, xxv) law and justice, xxvi) disaster, xxvii) accident and emergency response, and xxviii) information wanted advertisement.
As mentioned, in some embodiments, the newspaper image system uses one or more models, including machine learning models, to perform various processes described herein. As used herein, the term “machine learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through iterative outputs or predictions based on use of data. For example, a machine learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine learning models include various types of neural networks, decision trees, support vector machines, linear regression models, and Bayesian networks. In some embodiments, the newspaper image system utilizes a large language machine learning model in the form of a neural network.
Relatedly, the term “neural network” refers to a machine learning model that can be trained and/or tuned based on inputs to determine classifications, scores, or approximate unknown functions. For example, a neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., predicted articles, article topics, article localities, and/or entity names) based on a plurality of inputs provided to the neural network. In some cases, a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions in data. A neural network can include various layers such as an input layer, one or more hidden layers, and an output layer that each perform tasks for processing data. For example, a neural network can include a deep neural network, a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a transformer neural network, or a generative adversarial neural network. Upon training, such a neural network may become a large language model.
Along these lines, as used herein, the term “article prediction model” refers to a model (e.g., a machine learning model or a combination of machine learning models) that predicts, detects, identifies, or determines articles within a newspaper image. For example, an article prediction model detects columns within a newspaper image and further segments individual articles within the newspaper columns by using polygons to define boundaries between articles. In some cases, an article prediction model includes a repurposed architectural model originally designed for detecting lines or boundaries between (or within) architectural buildings, where the model is adapted to the domain of newspaper images. In some embodiments, an article prediction model refers to a segmentation model as described by Masaki Stanley Fujimoto et al. in Systems and Methods for Identifying and Segmenting Objects from Images, U.S. Patent Application Publication No. 2021/0390704, published Dec. 16, 2021, which is hereby incorporated by reference in its entirety. An article prediction model can generate image segmentation output in the form of polygon coordinates and corresponding confidence scores for one or more polygons.
In addition, as used herein, the term “information extraction model” refers to a machine learning model that predicts or identifies entity names from an article in a newspaper image. For example, an information extraction model extracts text embeddings (e.g., latent vector representations of digital text within a polygon defining a newspaper article) from article text and predicts entity names (and name parts—e.g., title, given name, and surname) from the text embeddings. In some cases, an information extraction model takes the form of a generative large language model (e.g., ChatGPT or GPT-4) for extracting entity names from unstructured text (e.g., advertisements or other articles that do not have a paragraph form of text bodies).
Similarly, as used herein, the term “article locality model” refers to a model (e.g., a machine learning model such as a neural network) that determines or predicts article locations or localities (e.g., local or not local) for newspaper articles. For example, an article locality model extracts latent vectors from article text and generates predictions for localities of the article based on the latent vectors. In some cases, the article locality model generates a binary prediction (e.g., local or non-local) while in other cases the article locality model classifies an article into one of a plurality of locality classifications, such as: i) local, ii) national, iii) international, or iv) unknown (and/or other classes, such as state/region/city-specific classes).
Additional detail regarding the newspaper image system will now be provided with reference to the figures. For example,
As shown, the environment includes server(s) 104, a client device 108, a database 114, and a network 112. Each of the components of the environment can communicate via the network 112, and the network 112 may be any suitable network over which computing devices can communicate. Example networks are discussed in more detail below in relation to
As mentioned above, the example environment includes a client device 108. The client device 108 can be one of a variety of computing devices, including a smartphone, a tablet, a smart television, a desktop computer, a laptop computer, a virtual reality device, an augmented reality device, or another computing device as described in relation to
As shown, the client device 108 can include a client application 110. In particular, the client application 110 may be a web application, a native application installed on the client device 108 (e.g., a mobile application, a desktop application, etc.), or a cloud-based application where all or part of the functionality is performed by the server(s) 104. Based on instructions from the client application 110, the client device 108 can present or display information, including a user interface such as a newspaper analysis interface, a genealogy tree interface, a discover interface for additional genealogical content, or some other graphical user interface, as described herein.
As illustrated in
As shown in
As further illustrated in
Although
In some implementations, though not illustrated in
As mentioned above, the newspaper image system 102 can generate or detect articles within newspaper images and can further extract actionable data from the newspaper articles using special-purpose models designed to process newspaper image data. In particular, the newspaper image system 102 can utilize an article prediction model to generate polygons defining article boundaries within a newspaper image and can further use other models to extract other data from the polygons.
As illustrated in
As further illustrated in
As also shown in
As further illustrated in
For example, as shown, the newspaper image system 102 can extract or identify entity names 210. To elaborate, the newspaper image system 102 utilizes a domain-adapted information extraction model to extract text embeddings from the article text 208. From the text embeddings, the newspaper image system 102 further uses the information extraction model to predict entity names (and corresponding name parts) that appear within the article text 208. For example, the newspaper image system 102 extracts names of people, places, businesses, government bodies, groups, or other organizations. The newspaper image system 102 can further use the information extraction model to predict name parts for extracted names to indicate titles, given names, surnames, and/or entity types (e.g., person, place, business, government body, or organization) associated with the names. In some cases, the newspaper image system 102 adapts the information extraction model from a free text domain specifically to the newspaper text domain by fine tuning model parameters to account for phrasing and writing styles of newspapers in different eras (e.g., 1700s, 1800s, 1900s, and 2000s) and/or from different countries or regions.
As further illustrated in
As also shown, the newspaper image system 102 can determine a locality prediction 214 from the article text 208. In particular, the newspaper image system 102 can utilize an article locality model to predict a locality associated with the detected article 204. For instance, the newspaper image system 102 utilizes the article locality model to analyze the article text 208 from the detected article 204. From the article text 208, the article locality model generates a prediction whether the detected article 204 is local (e.g., originating from, or concerning, a newspaper within a particular geographic region or municipality) or non-local (e.g., originating from, or concerning, a newspaper outside of a particular geographic region or municipality). In some cases, the article locality model generates predictions for locality classifications beyond local and non-local, including national, international, unknown, specific to a particular state, specific to a particular city, or specific to a particular county (or some other region).
As further shown in
As mentioned above, in certain described embodiments, the newspaper image system 102 generates polygons to enclose detected articles in newspaper images. In particular, the newspaper image system 102 utilizes an article prediction model to identify distinct articles in a newspaper image and to generate polygons defining boundaries for the articles.
As illustrated in
As further illustrated in
In some circumstances, as shown in the box 310, predicted potential polygons for adjacent articles may overlap. If the newspaper image system 102 determines that some portion of the newspaper image 302 is overlapped by two potential polygons, the newspaper image system 102 can resolve the overlap by subtracting the overlap from the less probable of the two potential polygons. To elaborate, given N polygon predictions with associated probabilities, the newspaper image system 102 ranks the polygons from most probable to least probable (e.g., according to respective confidence scores). In some cases, the newspaper image system 102 compares the most probable (e.g., highest ranked) polygon against the next most probable polygon and subtracts any overlapping portion from the less probable of the two polygons. The newspaper image system 102 thus compares the most probable polygon against additional polygons of decreasing confidence scores to determine whether there is any overlap, subtracting any detected overlap from the less probable polygons. Accordingly, the newspaper image system 102 generates a set of non-overlapping polygons. As shown in the box 310, the newspaper image system 102 subtracts the overlapping portion from the polygon with the 0.6 probability to generate the polygon 308.
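The ranking-and-subtraction procedure can be sketched as follows. This is a minimal illustration that assumes axis-aligned rectangular polygons stacked within a single column; the function name and data layout are hypothetical and not part of the disclosed system.

```python
def resolve_overlaps(polygons):
    """Subtract any overlapping portion from the less probable of each
    pair of polygons, assuming axis-aligned boxes (x0, y0, x1, y1)."""
    # Rank polygons from most probable to least probable by confidence score.
    ranked = sorted(polygons, key=lambda p: p["score"], reverse=True)
    for i, high in enumerate(ranked):
        for low in ranked[i + 1:]:
            hx0, hy0, hx1, hy1 = high["box"]
            lx0, ly0, lx1, ly1 = low["box"]
            # Do the two boxes overlap at all?
            if lx0 < hx1 and hx0 < lx1 and ly0 < hy1 and hy0 < ly1:
                if ly0 < hy0:
                    # Less probable box extends above: trim its bottom edge.
                    low["box"] = (lx0, ly0, lx1, hy0)
                else:
                    # Less probable box extends below: trim its top edge.
                    low["box"] = (lx0, hy1, lx1, ly1)
    return ranked
```

For example, given a 0.9-confidence box spanning rows 0–60 and a 0.6-confidence box spanning rows 50–120 in the same column, the overlapping rows 50–60 are removed from the lower-confidence box, yielding non-overlapping polygons.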
As mentioned, in certain embodiments, the newspaper image system 102 accounts for rotations in newspaper images when predicting polygons for articles. In particular, the newspaper image system 102 avoids or prevents conflation or overlap of different article content that might result from inaccurate column predictions in a newspaper image with a rotation or a slant.
As illustrated in
To this point, as further illustrated in
As noted above, in certain described embodiments, the newspaper image system 102 generates polygons to designate or define article boundaries within a newspaper image. In particular, the newspaper image system 102 generates polygons and resolves polygon shapes by correcting overlaps to ensure that single polygons correspond to single articles.
As illustrated in
In some embodiments, the newspaper image system 102 ranks or sorts polygons by ordering them according to their respective sizes, from smallest to largest (or vice versa). For instance, the newspaper image system 102 ranks the polygons within the newspaper image 502 according to size by identifying each ranked polygon by the coordinates of its top-left vertex (or some other vertex). To elaborate, the newspaper image system 102 sorts polygons of the newspaper image 502 from smallest to largest by listing top-left coordinate values in order, from left to right across the newspaper image 502. Moving from left to right in the newspaper image 502, the newspaper image system 102 identifies all polygons with centroids (or centers of mass) between a left gridline (or column boundary) and a right gridline (or column boundary).
Based on identifying a polygon between gridlines (or within columns), the newspaper image system 102 snaps x-coordinate values of the identified polygon vertices to an x-coordinate value of the nearest gridline. As described in further detail below, by detecting or determining a suitable gridline or column corresponding to a polygon, the newspaper image system 102 can align widths of vertically adjacent polygons to better capture the digitized content of their respective articles. Indeed, the newspaper image system 102 aligns or snaps polygons by adjusting x-coordinate values of one or more vertices to align with a detected gridline or column.
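The snapping step can be illustrated with a short sketch; the function name and inputs are hypothetical, and gridlines are represented simply as x-coordinate values:

```python
def snap_to_gridlines(vertices, gridlines):
    """Snap each polygon vertex's x-coordinate to the nearest gridline,
    leaving y-coordinates unchanged."""
    return [(min(gridlines, key=lambda g: abs(g - x)), y)
            for x, y in vertices]
```

For example, with gridlines at x = 100, 400, and 700, a vertex at x = 412 snaps to x = 400, so vertically adjacent polygons in the same column end up with matching widths.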
In some embodiments, the newspaper image system 102 identifies or detects a gap between two vertically adjacent polygons with near-identical widths (e.g., for two articles, with one article on top and one article on bottom in the same column), where the gap falls between polygons (and therefore includes pixels not enclosed by any polygon). In some cases, gaps between polygons can result in losing content within downstream processes, erroneously splitting articles, incorrectly transcribing articles, and/or incorrectly categorizing articles.
Accordingly, the newspaper image system 102 can rectify or resolve gaps by determining an order (e.g., a top-to-bottom order) of vertically adjacent polygons, identifying the gaps, and extending a top edge of a bottom polygon (e.g., a polygon below a detected gap) until the gap is eliminated (or the y-coordinate value of the top edge of the bottom polygon is within a threshold distance, or number of vertical pixels, from a y-coordinate value of the bottom edge of the top polygon). Similarly, the newspaper image system 102 can resolve gaps between horizontally adjacent polygons as well. For instance, the newspaper image system 102 can adjust an x-coordinate value of the left edge of a polygon to the right of a gap to close the gap and abut the right edge of the left polygon. The newspaper image system 102 can also or alternatively adjust the right edge of the left polygon to close the gap.
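The gap-closing step for vertically adjacent polygons can be sketched as follows, assuming axis-aligned boxes within one column; the function name and the threshold parameter are illustrative assumptions:

```python
def close_vertical_gaps(boxes, max_gap):
    """Extend the top edge of a lower box up to the bottom edge of the
    box above it whenever the gap between them is within max_gap pixels.
    Each box is an axis-aligned tuple (x0, y0, x1, y1)."""
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[1])  # top-to-bottom order
    closed = [ordered[0]]
    for x0, y0, x1, y1 in ordered[1:]:
        prev_bottom = closed[-1][3]
        if 0 < y0 - prev_bottom <= max_gap:
            y0 = prev_bottom  # eliminate the gap between the two boxes
        closed.append((x0, y0, x1, y1))
    return closed
```

Closing gaps this way ensures that pixels between stacked article polygons are not excluded from downstream OCR and classification processes.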
As further illustrated in
Based on determining or detecting that the polygon 504 and the polygon 506 overlap, the newspaper image system 102 further compares the respective confidence scores of the polygon 504 and the polygon 506. Particularly, the newspaper image system 102 determines that the polygon 504 has a lower confidence score than the polygon 506. Consequently, the newspaper image system 102 subtracts or removes the overlapping portion from the polygon 504. By removing overlapping portions according to confidence scores, the newspaper image system 102 more accurately generates polygons enclosing pixels of individual articles, without overlapping content of other articles. Indeed, as shown in
In some embodiments, the newspaper image system 102 utilizes raytracing concepts to inform the overlap correction process. To elaborate, the newspaper image system 102 can identify words detected by an optical character recognition model and can determine whether the words belong in a particular polygon or article. More specifically, the newspaper image system 102 can identify pixels included as part of, or depicting at least a portion of, a detected word. The newspaper image system 102 can further determine whether a word pixel is colliding with or located within a particular polygon. Accordingly, the newspaper image system 102 tests pixel locations for extracted words of articles to verify polygon locations and/or to correct polygon shapes and sizes to ensure that the correct words are included in the correct article-specific polygons. To this point, experimenters have demonstrated that this raytracing approach (used in conjunction with the polygon generation techniques described herein) can reduce computational resource consumption by around 100× compared to previous processes of prior systems.
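The collision test described above is closely related to the classic ray-casting point-in-polygon test, which a sketch can illustrate: cast a ray from the query pixel toward positive x and count how many polygon edges it crosses. This is a general-purpose illustration, not the system's specific implementation.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: cast a horizontal ray to the right of (px, py)
    and toggle inside/outside at each polygon-edge crossing."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y-coordinate can cross it.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside
```

A word pixel that tests inside a given polygon is attributed to that polygon's article; pixels testing outside can flag a polygon whose shape or size needs correction.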
As mentioned above, in certain embodiments, the newspaper image system 102 determines or generates columns for newspaper images. In particular, the newspaper image system 102 generates columns (as a basis for generating polygons) by predicting coordinates for gridlines within a newspaper image.
As illustrated in
In addition, the newspaper image system compares pixel luminosities at different x-coordinate pixel values (e.g., averaged across each column of y-coordinate pixel values or for each x-coordinate pixel value along each row of y-coordinate pixel values) to determine frequencies (or overall numbers) of occurrences of various luminosity values (e.g., from left to right across pixel columns). To this point, in some embodiments, the newspaper image system converts the luminosity values to the frequency domain. In some cases, the newspaper image system can further model the luminosity values (and/or the frequencies of luminosity values) and can determine a dominant period (e.g., a most frequently occurring period) to use as a basis for determining column width.
Indeed, as shown in the graph 604, the newspaper image system 102 determines a dominant period of luminosity values for the newspaper image 602, where the dominant period corresponds to the greatest number of occurrences of the same distance (or distances within a threshold difference of one another) between luminosity values (or changes/toggles of luminosity value). Thus, as indicated by the graph 604, the newspaper image system 102 determines a max peak value of occurrences at 715 pixels, thereby denoting the column width for the newspaper image 602 as 715 pixels. Indeed, because the luminosity values between the many lines of text change much more (and with more regularity) than luminosity values at other portions of the newspaper image 602, the newspaper image system 102 can thus extrapolate the column width as described.
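One way to estimate such a dominant period from a one-dimensional luminosity profile (e.g., mean luminosity per pixel column) is autocorrelation: the lag at which the profile best matches a shifted copy of itself approximates the column width. The following sketch illustrates that idea under those assumptions and is not the disclosed implementation.

```python
def dominant_period(profile, min_lag=2):
    """Estimate the dominant period (in pixels) of a 1-D luminosity
    profile by finding the lag with the highest autocorrelation."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]  # remove the DC component
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, n // 2):
        # Correlate the profile against a copy shifted by `lag` pixels.
        corr = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```

In a production setting an FFT-based spectrum (the "max peak" in the graph 604) serves the same purpose more efficiently; the autocorrelation form above simply makes the periodicity logic explicit.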
As mentioned above, in certain described embodiments, the newspaper image system 102 determines coordinate locations for column lines or gridlines between text columns. In particular, the newspaper image system 102 utilizes one or more models (e.g., an article prediction model or a column prediction model) to predict gridline placement within a newspaper image as a basis for generating article polygons.
As illustrated in
As shown, the article prediction model 702 has a particular architecture of constituent layers and components. Specifically, the article prediction model 702 has a multi-layered encoder-decoder architecture that includes a coarse encoder, a coarse decoder, a fine encoder, and a fine decoder. More particularly, the article prediction model 702 retains the illustrated arrangement of layers and components, but with adjustments to internal parameters, such as weights and biases, that tune the layers and components for generating predicted line segments in the newspaper image domain. In some embodiments, the article prediction model 702 is adapted from the Line Segment Detection Using Transformers without Edges (“LETR”) model described at https://github.com/mlpc-ucsd/LETR.
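By way of example, one post-processing step for adapting a general line segment detector to the newspaper gridline domain is to retain only near-vertical predicted segments, since column gridlines run top to bottom. The filter below is an illustrative assumption (not part of the LETR model itself), and the slope threshold is an example value.

```python
# Sketch: keep only near-vertical predicted line segments as candidate
# column gridlines. A segment is (x1, y1, x2, y2) in pixel coordinates.

def near_vertical(segment, max_slope_ratio=0.05):
    """A segment is near-vertical when its horizontal extent is small
    relative to its vertical extent."""
    x1, y1, x2, y2 = segment
    dx, dy = abs(x2 - x1), abs(y2 - y1)
    return dy > 0 and dx / dy <= max_slope_ratio

def candidate_gridlines(segments, max_slope_ratio=0.05):
    return [s for s in segments if near_vertical(s, max_slope_ratio)]
```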
Using the article prediction model 702, the newspaper image system 102 can generate predictions of (x-coordinate) locations for gridlines delineating columns of a newspaper image. As illustrated in
As just mentioned, in certain described embodiments, the newspaper image system 102 removes predicted line segments when determining gridline placement for column boundaries. In particular, the newspaper image system 102 performs a cleanup process to remove line segment outliers for more accurate column generation.
As illustrated in
More specifically, the newspaper image system 102 utilizes the gridline correction model to cluster predicted line segments based on placement within the newspaper image 802. For example, the newspaper image system 102 generates predicted line segments at particular (x-coordinate) locations within the newspaper image 802, and the newspaper image system 102 further utilizes the gridline correction model to cluster predicted line segments. In some cases, the newspaper image system 102 clusters predicted line segments by, for example, grouping line segments within particular regions (e.g., a particular range of x-coordinate pixel values), or within a threshold distance of one another, into common clusters.
Using the gridline correction model, the newspaper image system 102 can further consolidate predicted line segments. To elaborate, the newspaper image system 102 can consolidate a plurality of predicted line segments within a single cluster into a single line segment (or gridline) that represents the cluster as a whole. For instance, the newspaper image system 102 determines a central line segment or an average coordinate location for a line segment within a cluster and designates the central/average location as a placement for a gridline.
Additionally, the newspaper image system 102 can utilize the gridline correction model to remove outlier predictions. In particular, the newspaper image system 102 can detect predicted line segments that fall outside any clusters and/or that are farther than a threshold distance from a cluster. In some cases, the newspaper image system 102 detects outliers as line segments at coordinate locations within predicted (or potential) polygons for articles of the newspaper image 802. As shown, the newspaper image system 102 identifies a line segment 806 and a line segment 808 as outliers. The newspaper image system 102 can further remove such outlier line segments to improve accuracy of gridline/column generation.
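For illustration, the cluster-consolidate-remove process described above can be sketched as follows. The distance threshold and minimum cluster size are example values, not limitations of the embodiments.

```python
# Sketch of the gridline cleanup: cluster predicted x-coordinates that
# fall within a threshold distance of one another, consolidate each
# cluster into its average location, and drop outlier clusters that are
# too small to be trusted.

def consolidate_gridlines(xs, distance_threshold=15, min_cluster_size=2):
    """xs: predicted x-coordinates of line segments.
    Returns one averaged x-coordinate per sufficiently large cluster."""
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= distance_threshold:
            clusters[-1].append(x)  # close enough: same cluster
        else:
            clusters.append([x])    # start a new cluster
    return [
        sum(c) / len(c)
        for c in clusters
        if len(c) >= min_cluster_size  # outlier clusters are removed
    ]
```

For example, predictions near x = 100 and x = 500 would each collapse to a single gridline, while an isolated stray prediction (such as the line segments 806 and 808 discussed above) would be removed.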
In one or more embodiments, the newspaper image system 102 can utilize a segment anything model (SAM), such as the SAM model developed by META. In certain cases, the newspaper image system 102 modifies a SAM model and/or a newspaper image for compatibility with a SAM model. For instance, the newspaper image system 102 generates zoomed-in versions of a newspaper image by subdividing a newspaper image into a number of sub-images at zoomed-in scales, each corresponding to a different portion of the overall newspaper image. The newspaper image system 102 can thus utilize a SAM model to analyze a zoomed-in newspaper sub-image to segment images, font characters, and other depicted objects. From the segmentation of the SAM model, the newspaper image system 102 can determine boundaries for newspaper articles and can generate polygons defining the boundaries.
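As an example of the sub-image generation described above, the following sketch computes overlapping tile coordinates covering a newspaper image, where each tile could then be passed to a segmentation model such as SAM. The tile size and overlap are illustrative assumptions.

```python
# Sketch: subdivide a newspaper image into overlapping zoomed-in tiles,
# returning (left, top, right, bottom) pixel boxes covering the image.

def tile_coordinates(width, height, tile=1024, overlap=128):
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes
```

The overlap ensures that an article boundary falling at a tile edge still appears whole in at least one neighboring tile.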
As mentioned above, in certain described embodiments, the newspaper image system 102 generates predictions for article types and article topics. In particular, the newspaper image system 102 determines or classifies an article type based on visual features of pixels within a polygon. In addition, the newspaper image system 102 determines or classifies an article topic based on text features extracted from digitized content within a polygon.
As illustrated in
As further illustrated in
In some cases, the newspaper image system 102 utilizes an optical character recognition model to generate searchable, actionable digital text within a polygon, and the newspaper image system 102 further utilizes an article topic model to extract text-based features from the text. From the text-based features, the newspaper image system 102 further generates an article topic indicating a subject matter described by the article/text within the polygon. In some embodiments, the newspaper image system 102 utilizes an online learning structure where the article topic model is updatable to learn and modify parameters according to feedback from client devices (e.g., a feedback loop that informs the model on good/bad predictions and/or indicates correct topics).
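By way of example and not limitation, the topic determination can be sketched with a deliberately simple keyword-scoring stand-in for the learned article topic model; the topics and keyword lists below are illustrative assumptions rather than the system's actual vocabulary or model.

```python
# Stand-in sketch for the article topic model: score OCR'd polygon text
# against per-topic keyword lists and return the best-scoring topic.

TOPIC_KEYWORDS = {
    "obituary": {"died", "funeral", "survived", "interment"},
    "marriage": {"married", "wedding", "bride", "groom"},
    "sports":   {"score", "team", "inning", "championship"},
}

def predict_topic(article_text, default="unknown"):
    tokens = set(article_text.lower().split())
    scores = {t: len(kw & tokens) for t, kw in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

In an online-learning embodiment as described above, the keyword lists (or, in the learned model, the parameters) would be updated according to client-device feedback.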
As mentioned above, in certain described embodiments, the newspaper image system 102 extracts entity names from an article. In particular, the newspaper image system 102 extracts entity names (and their name parts) from text within a polygon generated for a newspaper image.
As illustrated in
As shown, the newspaper image system 102 identifies entities indicated by the bounding box 1004 and the bounding box 1006. Specifically, the newspaper image system 102 determines the entity names 1008 using the information extraction model to extract and process text features, where the entity names include “Mrs. John F. Lyons” and “Mrs. Frank S. Naugle.” In addition to determining the entity names 1008, the newspaper image system 102 can also determine name parts associated with the entity names 1008 using the information extraction model. To elaborate, the newspaper image system 102 can determine a title, a given name, and a surname for each entity. Specifically, the newspaper image system 102 can use the information extraction model to predict a name part classification for each extracted entity name. In some cases, the name parts include designations of cities, states, regions, businesses, government bodies, or other entity types.
As illustrated in
In some embodiments, the newspaper image system 102 can also or alternatively access one or more public databases to determine relationships with entities in the public databases. For example, the newspaper image system 102 accesses a WIKIDATA repository to identify famous people or well-documented people with large amounts of digital content including data about them. The newspaper image system 102 can further compare the entity names and name parts with the data for the public individuals to determine whether the article 1010 mentions a famous person or some person in the public database. Indeed, the newspaper image system 102 can determine relationships or links between extracted entity names and publicly known individuals (e.g., based on names, times, and locations).
In certain embodiments, the newspaper image system 102 can link entity names to particular events. For example, the newspaper image system 102 can determine or extract contextual data from the article 1010 indicating a particular time period and/or geographic location pertaining to the article 1010. Based on the contextual data, the newspaper image system 102 can determine relationships between extracted entity names and historical events, such as World War II (or some other event). Additionally, if the article 1010 includes mentions of keywords that indicate particular events, the newspaper image system 102 determines a relationship between extracted entity names and the corresponding event. In some cases, the newspaper image system 102 determines the article topic and further extracts entity names to inform the prediction of a relationship between the article 1010 and a historical event.
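For illustration, one way to implement the event-linking step described above is to match article keywords and the article's time period against a table of historical events. The event table, year range, and keywords below are illustrative assumptions.

```python
# Sketch: link extracted entity names to historical events by matching
# article keywords against events active during the article's year.

EVENTS = [
    {"name": "World War II", "years": (1939, 1945),
     "keywords": {"draft", "enlisted", "front", "rationing"}},
]

def link_events(article_text, article_year):
    tokens = set(article_text.lower().split())
    linked = []
    for event in EVENTS:
        start, end = event["years"]
        if start <= article_year <= end and tokens & event["keywords"]:
            linked.append(event["name"])
    return linked
```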
As mentioned above, in certain embodiments, the newspaper image system 102 utilizes an information extraction model (e.g., an entity extraction model) to identify and extract entity names from an article. In particular, the newspaper image system 102 utilizes a domain-adapted information extraction model as an entity extraction model to extract entity names and name parts from text within a polygon of a newspaper image.
In some embodiments, the information extraction model includes multiple constituent models that make up its structure. Particularly, the information extraction model includes two bidirectional long short-term memory networks that use conditional random fields (e.g., two BiLSTM-CRF based named entity recognition models). More specifically, the information extraction model includes, as a first model, a coarse-grained entity extraction model that takes an input of article tokens (e.g., extracted from text within a polygon) and generates an output of a prediction whether each token corresponds to the name of a person. In addition, the information extraction model includes, as a second model, a name parsing model that takes an input of article tokens and the prediction(s) from the first model to generate an output of name parts for each token (e.g., given name, surname, title, or suffix).
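The two-stage data flow described above can be sketched as follows. The trivial rule-based taggers here are illustrative stand-ins (assumptions) for the BiLSTM-CRF networks; the point of the sketch is the interface between the first model's person-name predictions and the second model's name-part outputs, not the models themselves.

```python
# Sketch of the two-stage information extraction interface: a coarse
# person-name tagger feeds a name-parsing tagger. Both stages are
# rule-based stand-ins for the BiLSTM-CRF models, and this sketch
# handles a single name per token sequence for simplicity.

TITLES = {"Mr.", "Mrs.", "Dr."}

def coarse_person_tags(tokens):
    """Stage 1 stand-in: mark each token True/False for whether it is
    part of a person name (a title starts a name; a lowercase token
    ends it)."""
    tags, in_name = [], False
    for tok in tokens:
        if tok in TITLES:
            in_name = True
        elif in_name and not tok[:1].isupper():
            in_name = False
        tags.append(in_name)
    return tags

def parse_name_parts(tokens, tags):
    """Stage 2 stand-in: assign a name-part label (title, given,
    surname) to each token tagged as part of a person name."""
    name = [t for t, tagged in zip(tokens, tags) if tagged]
    parts = []
    for i, tok in enumerate(name):
        if tok in TITLES:
            parts.append((tok, "title"))
        elif i == len(name) - 1:
            parts.append((tok, "surname"))
        else:
            parts.append((tok, "given"))
    return parts
```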
As noted, the information extraction model further includes a name parsing model. Particularly, the name parsing model identifies name parts for entity names predicted by the coarse-grained entity extraction model. In some cases, the architecture for the name parsing model is similar to that of the coarse-grained entity extraction model, as illustrated in
In some embodiments, the newspaper image system 102 combines the coarse-grained entity extraction model and the name parsing model to form an information extraction model. Specifically, the information extraction model generates a JavaScript Object Notation (“JSON”) file that includes a list of entities for each article polygon. Each item in the list is a JSON object that indicates the name parts present for each entity name.
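By way of example, the per-polygon JSON output described above could take the following shape; the field names are illustrative assumptions.

```python
import json

# Sketch of the JSON output: one list of entity objects per article
# polygon, each object carrying the name parts that are present.

def entities_to_json(entities):
    """entities: list of dicts of name parts for one polygon."""
    return json.dumps({"entities": entities}, indent=2)

record = entities_to_json([
    {"title": "Mrs.", "given": "John F.", "surname": "Lyons"},
    {"title": "Mrs.", "given": "Frank S.", "surname": "Naugle"},
])
```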
In some embodiments, the newspaper image system 102 utilizes a tokenizer to generate article tokens. Specifically, the tokenizer generates article tokens from article text extracted using an optical character recognition model. In some cases, a tokenizer may include NLTK's punkt tokenizer, Stanford NLP toolkit's Java-based tokenizer, a transformer tokenizer, or any other suitable tokenizer model. In certain embodiments, the newspaper image system 102 utilizes a special token approach for article locality. For instance, the newspaper image system 102 utilizes a tokenizer to generate a special token (e.g., a classification token and/or a separator token) representing a locality and/or a particular portion of an article indicating locality.
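For illustration, the special-token approach can be sketched with a trivial whitespace tokenizer (a stand-in assumption for NLTK's punkt tokenizer or a transformer tokenizer) that prepends a classification token and wraps a dateline-style locality prefix with a separator token.

```python
# Sketch: tokenize article text with special tokens marking the
# portion of the article that indicates locality. The "--" dateline
# convention used here is an illustrative assumption.

CLS, SEP = "[CLS]", "[SEP]"

def tokenize_with_locality(text):
    tokens = [CLS]
    if "--" in text:  # dateline prefix, e.g. "CHICAGO -- ..."
        locality, body = text.split("--", 1)
        tokens += locality.split() + [SEP] + body.split()
    else:
        tokens += text.split()
    return tokens
```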
To train the information extraction model (including the coarse-grained entity extraction model and the name parsing model), the newspaper image system 102 accesses training images, such as newspaper images or images of articles extracted from newspaper images. For instance, the newspaper image system 102 randomly selects article text from approximately 10,000 newspaper images to ensure sufficient topical diversity. In some cases, the newspaper image system 102 selects approximately 150,000 articles from the 10,000 pages. The selected articles may be assessed for topical diversity using a suitable machine learning model, such as Universal Sentence Encoders, Sentence Transformers, Doc2Vec, combinations and/or modifications thereof, or any other suitable process, and the newspaper image system 102 may then extract random samples of articles pertaining to topics most likely to contain entities, such as military draft articles, politics, crime reports, obituaries and marriage announcements, social events and other lifestyle-related articles, or other articles. In some embodiments, the newspaper image system 102 utilizes the information extraction model to also extract attributes associated with entity names, including gender, age, relationship status, and other attributes.
As mentioned above, in certain described embodiments, the newspaper image system 102 utilizes an article locality model to predict a locality for an article designated by a polygon. In particular, the newspaper image system 102 utilizes an article locality model having a particular architecture to classify the text of an article polygon as local or non-local (or in a different location-based classification).
As illustrated in
To elaborate, the article locality model generates location embeddings in the form of latent vector representations that encode location information pertaining to the article text 1202 (e.g., based on words and phrasing used in the article text 1202 that may signify, inform, or indicate particular locations). In some cases, the article locality model uses an XLNet tokenizer and an XLNet layer in place of the BERT tokenizer 1204 and the BERT layer 1208, respectively. The article locality model can also or alternatively utilize a LONGFORMER model as described by Iz Beltagy, Matthew E. Peters, and Arman Cohan in Longformer: The Long-Document Transformer, arXiv:2004.05150 (2020). Additionally, the article locality model includes a linear classifier 1214 that generates a locality prediction 1216. Specifically, the linear classifier 1214 processes the location embeddings 1212 to generate probabilities that the article text 1202 belongs within various locality classifications, such as local, non-local, national, international, or unknown. In some cases, the linear classifier 1214 can classify articles into additional categories as well, including categories for individual cities, states, counties, regions, or other geographic areas.
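By way of example, the linear classification head can be sketched as a matrix-vector product followed by a softmax over the locality classes named above; the embedding dimensionality and weight values below are illustrative assumptions.

```python
import math

# Sketch of the linear classifier 1214: project a location embedding
# through per-class weights and a bias, then softmax into probabilities
# over the locality classes.

CLASSES = ["local", "non-local", "national", "international", "unknown"]

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def locality_probabilities(embedding, weights, bias):
    """weights: one row of coefficients per class; bias: one per class."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return dict(zip(CLASSES, softmax(logits)))
```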
The components of the newspaper image system 102 can include software, hardware, or both. For example, the components of the newspaper image system 102 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices. When executed by one or more processors, the computer-executable instructions of the newspaper image system 102 can cause a computing device to perform the methods described herein. Alternatively, the components of the newspaper image system 102 can comprise hardware, such as a special purpose processing device to perform a certain function or group of functions. Additionally or alternatively, the components of the newspaper image system 102 can include a combination of computer-executable instructions and hardware.
Furthermore, the components of the newspaper image system 102 performing the functions described herein may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications including content management applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the components of the newspaper image system 102 may be implemented as part of a stand-alone application on a personal computing device or a mobile device.
While
As illustrated in
The series of acts 1300 can include an act of extracting text features from the text within the polygon utilizing a domain-adapted information extraction model (e.g., an information extraction model adapted to a newspaper domain from a free text domain). In addition, the series of acts 1300 can include an act of predicting, from the text features, entity names (and name parts for the entity names) within the text of the polygon. Additionally, the series of acts 1300 can include an act of generating, using an article locality model, a locality prediction for the article based on the text within the polygon. The locality prediction indicates whether the article is a local article or a non-local article.
The series of acts 1300 can include detecting the article within the newspaper image by: determining text columns within the newspaper image using a column prediction model; and aligning the article within one or more of the text columns. Additionally, the series of acts 1300 can include an act of generating the polygon by generating an irregularly shaped polygon enclosing portions of multiple text columns within the newspaper image based on the article spanning the multiple text columns. The series of acts 1300 can include an act of extracting visual features from pixels of the newspaper image enclosed by the polygon and an act of classifying the article into an article topic based on the visual features.
In some embodiments, the series of acts 1300 includes an act of generating the polygon defining the boundaries of the article by utilizing a column prediction model repurposed from an architectural building detection model. The series of acts 1300 can include an act of generating, utilizing a column prediction model, a plurality of column predictions for the newspaper image. Additionally, the series of acts 1300 can include an act of determining, from the plurality of column predictions, text columns within the newspaper image according to densities of column predictions at respective locations. Further, the series of acts 1300 can include an act of aligning the article within one or more of the text columns.
In one or more embodiments, the series of acts 1300 includes an act of generating the polygon by generating a rectangular polygon enclosing one or more text columns within the newspaper image including text of the article. In some cases, the series of acts 1300 includes acts of extracting visual features from pixels of the newspaper image enclosed by the polygon and classifying the article into an article type based on the visual features.
Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Implementations within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium, (e.g., a memory, etc.), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, implementations of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
Non-transitory computer-readable storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives (“SSDs”) (e.g., based on RAM), Flash memory, phase-change memory (“PCM”), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmissions media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed by a processor, cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some implementations, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Implementations of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
In particular implementations, processor 1402 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, processor 1402 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 1404, or storage device 1406 and decode and execute them. In particular implementations, processor 1402 may include one or more internal caches for data, instructions, or addresses. As an example and not by way of limitation, processor 1402 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 1404 or storage device 1406.
Memory 1404 may be used for storing data, metadata, and programs for execution by the processor(s). Memory 1404 may include one or more of volatile and non-volatile memories, such as Random Access Memory (“RAM”), Read Only Memory (“ROM”), a solid state disk (“SSD”), Flash, Phase Change Memory (“PCM”), or other types of data storage. Memory 1404 may be internal or distributed memory.
Storage device 1406 includes storage for storing data or instructions. As an example and not by way of limitation, storage device 1406 can comprise a non-transitory storage medium described above. Storage device 1406 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Storage device 1406 may include removable or non-removable (or fixed) media, where appropriate. Storage device 1406 may be internal or external to computing device 1400. In particular implementations, storage device 1406 is non-volatile, solid-state memory. In other implementations, storage device 1406 includes read-only memory (ROM). Where appropriate, this ROM may be mask programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these.
I/O interface 1408 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1400. I/O interface 1408 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces. I/O interface 1408 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain implementations, I/O interface 1408 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
Communication interface 1410 can include hardware, software, or both. In any event, communication interface 1410 can provide one or more interfaces for communication (such as, for example, packet-based communication) between computing device 1400 and one or more other computing devices or networks. As an example and not by way of limitation, communication interface 1410 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
Additionally or alternatively, communication interface 1410 may facilitate communications with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, communication interface 1410 may facilitate communications with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network or a combination thereof.
Additionally, communication interface 1410 may facilitate communications using various communication protocols. Examples of communication protocols that may be used include, but are not limited to, data transmission media, communications devices, Transmission Control Protocol (“TCP”), Internet Protocol (“IP”), File Transfer Protocol (“FTP”), Telnet, Hypertext Transfer Protocol (“HTTP”), Hypertext Transfer Protocol Secure (“HTTPS”), Session Initiation Protocol (“SIP”), Simple Object Access Protocol (“SOAP”), Extensible Mark-up Language (“XML”) and variations thereof, Simple Mail Transfer Protocol (“SMTP”), Real-Time Transport Protocol (“RTP”), User Datagram Protocol (“UDP”), Global System for Mobile Communications (“GSM”) technologies, Code Division Multiple Access (“CDMA”) technologies, Time Division Multiple Access (“TDMA”) technologies, Short Message Service (“SMS”), Multimedia Message Service (“MMS”), radio frequency (“RF”) signaling technologies, Long Term Evolution (“LTE”) technologies, wireless communication technologies, in-band and out-of-band signaling technologies, and other suitable communications networks and technologies.
Communication infrastructure 1412 may include hardware, software, or both that couples components of computing device 1400 to each other. As an example and not by way of limitation, communication infrastructure 1412 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination thereof.
In particular, the genealogical data system 1502 can manage synchronizing digital content across multiple client devices 1506 associated with one or more user accounts. For example, a user may edit a digitized historical document or a node within a genealogy tree using client device 1506. The genealogical data system 1502 can cause client device 1506 to send the edited genealogical content to the genealogical data system 1502, whereupon the genealogical data system 1502 synchronizes the genealogical content on one or more additional computing devices.
As shown, the client device 1506 may be a desktop computer, a laptop computer, a tablet computer, an augmented reality device, a virtual reality device, a personal digital assistant (PDA), an in- or out-of-car navigation system, a handheld device, a smart phone or other cellular or mobile phone, or a mobile gaming device, other mobile device, or other suitable computing devices. The client device 1506 may execute one or more client applications, such as a web browser (e.g., Microsoft Windows Internet Explorer, Mozilla Firefox, Apple Safari, Google Chrome, Opera, etc.) or a native or special-purpose client application (e.g., Ancestry: Family History & DNA for iPhone or iPad, Ancestry: Family History & DNA for Android, etc.), to access and view content over the network 1504.
The network 1504 may represent a network or collection of networks (such as the Internet, a corporate intranet, a virtual private network (VPN), a local area network (LAN), a wireless local area network (WLAN), a cellular network, a wide area network (WAN), a metropolitan area network (MAN), or a combination of two or more such networks) over which client devices 1506 may access genealogical data system 1502.
In the foregoing specification, the present disclosure has been described with reference to specific exemplary implementations thereof. Various implementations and aspects of the present disclosure are described with reference to details discussed herein, and the accompanying drawings illustrate the various implementations. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various implementations of the present disclosure.
The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described implementations are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts, or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/357,725, filed Jul. 1, 2022, entitled IMAGE SEGMENTATION POST-PROCESSING AND ENTITY EXTRACTION, which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63/357,725 | Jul. 1, 2022 | US