Models used in machine learning applications are typically trained using data from a source domain, e.g., labeled data, resulting in a pre-trained model. However, when such models are tested on, for example, customer data (“test-time”) the model is typically required to adapt to unlabeled target data. More specifically, self-supervised pre-training has been able to produce transferable representations for various visual document understanding (“VDU”) tasks. VDU seeks to extract structured information from document pages represented in various visual formats. Once a pre-trained model is fine-tuned with labeled data in a source domain, performance may be impacted when such models are applied to a new unseen target domain. This phenomenon is generally referred to as “domain shift” or distribution shift. Domain shift is commonly encountered in real-world VDU applications where training and test-time distributions are different, e.g., a new layout, unseen data, different handwriting style, etc. Such real-world applications include financial services, insurance, healthcare, or legal, where document templates used by each customer oftentimes introduce domain shift. For instance, such applications may include tax/invoice/mortgage/claims processing, identity/risk/vaccine verification, medical records understanding, compliance management, as well as others. Adapting to unseen unlabeled documents at test-time is a challenging task in document understanding.
The disclosed technology may comprise one or more of a method, process, non-transitory computer readable medium, computing device, or system. For example, the method may comprise training, via a source domain, a machine learning model to use with one or more visual document understanding (“VDU”) tasks; determining a distribution shift when the machine learning model is applied in a target domain; applying masked visual language modeling (“MVLM”) to target domain data detected as associated with the distribution shift to produce model predictions; generating pseudo-labels using the model predictions; and adapting the machine learning model to include the pseudo-labels to produce an adapted model.
In accordance with this aspect of the disclosed technology, the method may comprise applying self-training to the machine learning model using the pseudo-labels. The method may also comprise processing the target domain data detected as associated with the distribution shift using the adapted model. The method may further comprise applying thresholding to the pseudo-labels to reduce the number of pseudo-labels by a given amount. In addition, applying the thresholding comprises applying an entropy-based uncertainty-aware pseudo-labeling selection mechanism to determine which of the pseudo-labels are reliable.
In accordance with this aspect of the disclosed technology, the method may comprise generating the pseudo-labels on a per-batch basis. The method may also comprise processing the target domain data using a visual encoder.
As another example, the disclosed technology may comprise a method for processing one or more electronic documents. The method may include receiving the one or more electronic documents as an input data stream; applying a machine learning model to the input data stream; determining that there is a domain shift associated with the input data stream; applying masked visual language modeling (“MVLM”) to target domain data determined as associated with the domain shift to produce model predictions; generating pseudo-labels using the model predictions; adapting the machine learning model to include the pseudo-labels to produce an adapted model; and processing the input data stream using the adapted model.
In accordance with this aspect of the disclosed technology, the machine learning model is trained on source domain data that does not account for the target domain data.
Further in accordance with this aspect of the disclosed technology, the method comprises applying self-training to the machine learning model using the pseudo-labels. The method may also comprise applying a threshold to the pseudo-labels to reduce the number of pseudo-labels by a given amount. In addition, applying the threshold comprises applying an entropy-based uncertainty-aware pseudo-labeling selection mechanism to determine which of the pseudo-labels are reliable.
In accordance with this aspect of the disclosed technology, the method may comprise generating the pseudo-labels on a per-batch basis. The method may also comprise processing the target domain data using a visual encoder. The method may further comprise processing the target domain data using an optical character recognition parser. In addition, the target domain data may comprise test-time data.
Another aspect of the disclosed technology may comprise a non-transitory computer readable medium having stored thereon instructions that, when executed by one or more computing devices, cause the one or more computing devices to: determine a distribution shift when a machine learning model is applied in a target domain; apply masked visual language modeling (“MVLM”) to target domain data detected as associated with the distribution shift to produce model predictions; generate pseudo-labels using the model predictions; and adapt the machine learning model to include the pseudo-labels to produce an adapted model. In accordance with this aspect of the disclosed technology, the instructions may cause the one or more computing devices to apply self-training to the machine learning model using the pseudo-labels. Further, the instructions may cause the one or more computing devices to process the target domain data detected as associated with the distribution shift using the adapted model.
An aspect of the disclosed technology comprises a test-time adaptation (“TTA”) technique for VDU tasks that uses self-supervised learning on different modalities (e.g., text and layout) by applying masked visual language modeling (“MVLM”) along with pseudo-labeling. The VDU tasks may include key-value extraction, entity recognition, and document visual question answering (“VQA”). MVLM is employed at test-time to make the model learn the language modality of the test data given two-dimensional positions and other text tokens, e.g., by intentionally masking out text and asking the model to make predictions. In one aspect, pseudo-labeling comprises an uncertainty-aware pseudo-labeling selection mechanism which more accurately predicts labels for new target samples. For instance, hard pseudo-labels may be generated on a per-batch basis using model predictions. The uncertainty-aware selection mechanism results in the selection of a subset of labels with low uncertainty. In this regard, the uncertainty technique is based on Shannon's Entropy. In addition, the technique makes use of class diversification to mitigate against blindly trusting the most probable class label during pseudo-label generation.
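The Shannon-entropy-based uncertainty scoring described above can be illustrated with a short sketch. The function names and the threshold value below are illustrative, not part of the disclosed system; a real implementation would score the model's per-sample softmax outputs on each batch.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_low_uncertainty(prob_batch, gamma):
    """Keep indices of predictions whose entropy is at or below gamma."""
    return [i for i, p in enumerate(prob_batch) if shannon_entropy(p) <= gamma]

# A peaked prediction passes the threshold; a near-uniform one does not.
batch = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33]]
kept = select_low_uncertainty(batch, gamma=0.5)
```

Only the low-entropy (high-confidence) predictions would then contribute pseudo-labels for self-training.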
The disclosed technology also introduces various benchmarks for VDU tasks including key-value extraction, entity recognition, and document visual question answering (“DocVQA”). These benchmarks are generated using publicly available datasets by modifying them to simulate real-world adaptation scenarios.
The disclosed technology may be implemented as a methodology, method, or process in a machine learning system. The methodology, method, or process may be instantiated during test-time when new unseen data is detected as part of a set of customer data and used to apply adaptation to such data. The disclosed methodology is referred to as document test-time adaptation (“DocTTA”). The disclosed methodology leverages cross-modality self-supervised learning via MVLM, as well as pseudo-labeling, to adapt models trained on a source domain to an unlabeled target domain at test-time. From a system perspective, the disclosed technology may comprise instructions in a software module.
Unsupervised domain adaptation (“UDA”) methods attempt to mitigate the adverse effect of domain or data shifts, often by training a joint model on labeled source and unlabeled target domains that map both domains into a common feature space. However, simultaneous access to data from source and target domains may not always be feasible in VDU tasks. In addition, the training and serving may be done in different computational environments, and thus, the training data and resources may not be available.
TTA methods have also been introduced to adapt a model that is trained on a source domain to unseen target data, without using any source data. Existing TTA methods have mainly focused on image classification tasks, while VDU remains unexplored, despite the clear motivation of distribution shift and the challenges of employing standard UDA. Current TTA approaches for image classification typically use entropy minimization or pseudo-labeling combined with self-supervised contrastive learning. However, VDU significantly differs from other computer vision tasks. In VDU, information is extracted from multiple modalities (including image, text, and layout), unlike other computer vision tasks. In addition, multiple outputs (e.g., entities or questions) are obtained from the same document, creating the scenario that their similarity in some aspects (e.g., in document format or context) can be utilized. Moreover, the popular self-supervised contrastive methods in computer vision that are known to increase generalizability using image augmentation techniques are not as effective in VDU.
Turning now to
As shown, the process 200 begins upon receipt of an input document or second data set at block 210. The input document or second data set comprises customer data that is to be tested on a document model, e.g., test-time data. As is discussed in more detail below, the input document or second data set will typically be received by a computing device that carries out the processing or method steps of process 200. The input document or second data set may comprise data associated with document pages that are provided by a customer and from which the customer expects certain structured data to be outputted after being processed by the computing device. The input document or second data set may be considered a target domain and may comprise, for example, data associated with documents provided by a corporation.
Assuming the documents comprise one or more W2 forms, the customer may want, for example, to extract certain financial and employee information recorded on the form. Further, assume that the stream of data is associated with two different types of W2 forms: a legacy form and an updated form, which includes data not provided in the legacy form. In addition, the document model used in processing the stream of data is assumed to be trained on data associated with the legacy form, e.g., data associated with a source domain, and typically comprises a machine learning model. As such, some of the data or information associated with the new form does not resemble past data associated with the legacy form. In accordance with the disclosed technology, such new data or target data comprises unlabeled data within the document model. As one skilled in the art may appreciate, in some systems the document model may be unable to continue processing the input data stream, or may fail, when a domain shift occurs.
The input document or second data set is fed to a model that is trained on a first data set, as shown at block 220. The first data set comprises data that is labeled in accordance with the model. As the model is trained on the first data set, any data within the second data set, or data associated with the input document, that is different from the first data set comprises unlabeled data. The unlabeled data comprises data that represents a domain or distribution shift with respect to the model.
Responsive to detecting a domain or distribution shift event, processing moves to block 230, where the model is adapted to account for the shift caused by the unlabeled data. The document model adapts automatically and may adapt on the fly or in real time, e.g., without any noticeable delay or performance impact. The model is adapted based on application of MVLM, self-training using pseudo-labels, and a diversity cost objective. In accordance with the disclosed technology, each of MVLM, self-training, and diversity cost comprises objective functions as part of the DocTTA methodology and system.
A framework in accordance with the disclosed methodology or system (e.g., the DocTTA framework) defines a domain as a pair of a distribution D over inputs X and a labeling function l: X→Y. In accordance with the disclosed technology, we consider source and target domains. In the source domain, denoted as (Ds, ls), we assume a model, denoted as fs and parameterized with θs, that is trained on labeled source data {(xs(i), ys(i))}, i=1, . . . , ns, where ns is the number of source examples.
Unlike single-modality inputs commonly used in computer vision, documents are images with rich textual information. To extract the text from the image, we assume optical character recognition (“OCR”) is performed and use its outputs, namely the characters and their corresponding bounding boxes, as shown for instance via the example in
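As a concrete illustration, OCR output for layout-aware models is often represented as words paired with bounding boxes normalized to a fixed grid. The class and helper below are a hypothetical sketch of that representation, not the disclosed parser; the 0-1000 grid is a common convention in layout-aware document models.

```python
from dataclasses import dataclass

@dataclass
class OCRToken:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) on a 0-1000 normalized grid

def normalize_bbox(bbox, page_width, page_height):
    """Scale pixel coordinates onto the 0-1000 layout grid."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# A word detected at pixels (50, 100)-(150, 120) on a 1000x2000 page.
token = OCRToken("Wages", normalize_bbox((50, 100, 150, 120), 1000, 2000))
```

Normalizing coordinates this way makes layouts comparable across pages of different sizes.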
In accordance with the disclosed technology, MVLM (Objective I) is employed at test-time to encourage the model to better learn the text representation of the test data given the 2D positions and other text tokens. The intuition behind using this objective for TTA is to enable the target model to learn the language modality of the new data given visual cues, thereby bridging the gap between the different modalities on the target domain. We randomly mask 15% of input text tokens, among which 80% are replaced by a special token [MASK] and the remaining tokens are replaced by a random word from the entire vocabulary. The model is then trained to recover the masked tokens while the layout information remains fixed. To do so, the output representations of the masked tokens from the encoder are fed into a classifier that outputs logits over the whole vocabulary, so as to minimize the negative log-likelihood of correctly recovering masked text tokens xT given masked image tokens xI and the masked layout xL:
LMVLM(θt)=−E x∈Xt [log pθt(xT|xI, xL)]  (1)
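The masking scheme described above can be sketched as follows. This is an illustrative token-level sketch under the stated 80%/20% replacement rule; the vocabulary, token strings, and helper names are hypothetical, and the bounding boxes would be carried alongside the tokens unchanged.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_ratio=0.15, seed=0):
    """Corrupt `mask_ratio` of the tokens: 80% become [MASK], the rest
    are swapped for a random vocabulary word. Returns the corrupted
    sequence and the positions the model must recover; the layout
    information associated with each position stays untouched."""
    rng = random.Random(seed)
    out = list(tokens)
    n_mask = max(1, int(len(out) * mask_ratio))
    positions = rng.sample(range(len(out)), n_mask)
    for i in positions:
        out[i] = MASK_TOKEN if rng.random() < 0.8 else rng.choice(vocab)
    return out, positions

tokens = ["form", "w2", "wages", "tips", "other", "compensation"] * 4  # 24 tokens
corrupted, positions = mask_tokens(tokens, vocab=["tax", "employer", "state"])
```

The model's objective is then to predict the original token at each corrupted position.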
The second objective function comprises self-training with pseudo-labels (Objective II). While optimizing MVLM loss during the adaptation, we also generate pseudo-labels for the unlabeled target data and treat them as ground truth labels to perform supervised learning on the target domain. We generate pseudo-labels per batch aiming to use the latest version of the model for predictions. We consider a full epoch to be one training loop where we iterate over the entire dataset, batch-by-batch. In addition, using a clustering mechanism to generate pseudo-labels may be computationally expensive for documents. As such, we directly use predictions by the model. However, simply using all the predictions would lead to noisy pseudo-labels.
As such, in accordance with processing block 240 of
ỹc(i)=1[u(pc(i))≤γ],  (2)

where pc(i) denotes the predicted probability of class c for sample i, u(·) denotes the Shannon entropy of the prediction, γ is the uncertainty threshold, and 1[·] is the indicator function. The selected pseudo-labels are then treated as ground truth in a cross-entropy loss over the target data:

LCE(θt)=−E x∈Xt Σc ỹc(i) log pc(i)  (3)
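A minimal sketch of the uncertainty-aware selection (Eq. 2) together with the pseudo-label cross-entropy term, using plain Python on softmax outputs; in practice the loss would be computed on the model's differentiable outputs rather than detached probabilities, and the threshold value here is illustrative.

```python
import math

def pseudo_label_ce(prob_batch, gamma):
    """Average cross-entropy over hard pseudo-labels, keeping only
    predictions whose Shannon entropy is at or below gamma."""
    total, kept = 0.0, 0
    for probs in prob_batch:
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        if entropy <= gamma:  # uncertainty-aware selection (Eq. 2)
            c = max(range(len(probs)), key=probs.__getitem__)  # hard label
            total -= math.log(probs[c])  # cross-entropy term for that label
            kept += 1
    return total / kept if kept else 0.0

# Only the confident first prediction contributes to the loss.
loss = pseudo_label_ce([[0.90, 0.05, 0.05], [0.34, 0.33, 0.33]], gamma=0.5)
```

Filtering by entropy before computing the loss keeps noisy pseudo-labels from dominating the update.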
Turning now to the diversity objective function (Objective III of
LDIV=E x∈Xt Σc p̄c log p̄c,  (4)

where p̄ denotes the average of the model's class predictions over the batch. Minimizing this term spreads the averaged prediction across classes, which mitigates against blindly trusting the most probable class during pseudo-label generation.
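The diversity objective described above can be sketched as follows; this assumes the regularizer is the negative entropy of the batch-averaged prediction, consistent with the description above, and the helper name is illustrative.

```python
import math

def diversity_loss(prob_batch):
    """Diversity term sketch: sum of p̄c·log(p̄c) over classes, where p̄ is
    the batch-averaged prediction. Lower values (higher entropy of the
    average) indicate that predictions are spread across more classes."""
    n, num_classes = len(prob_batch), len(prob_batch[0])
    mean = [sum(p[c] for p in prob_batch) / n for c in range(num_classes)]
    return sum(pc * math.log(pc) for pc in mean if pc > 0.0)

collapsed = diversity_loss([[1.0, 0.0], [1.0, 0.0]])  # all mass on one class
diverse = diversity_loss([[1.0, 0.0], [0.0, 1.0]])    # classes used evenly
```

A batch collapsed onto a single class scores 0, while an evenly spread batch scores lower, so minimizing this term discourages collapse.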
LDocTTA=LMVLM+LCE+LDIV  (5)
In accordance with the foregoing, the DocTTA procedure can be formulated as the following algorithm:
Input: unlabeled target data Xt; model parameters θt initialized from the source-trained parameters θs.
For each epoch, and for each batch of target data:
1. Mask input text tokens and compute the MVLM loss (Eq. 1).
2. Generate pseudo-labels from the current model predictions, select those satisfying Eq. 2, and compute the cross-entropy loss (Eq. 3).
3. Compute the diversity loss (Eq. 4).
4. Update θt via the total loss in Eq. 5.
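The per-batch procedure above can be sketched as a plain adaptation loop. The callables below (`model_update` and the three loss functions) are hypothetical stand-ins for an optimizer step and the MVLM, pseudo-label cross-entropy, and diversity objectives; this is a control-flow sketch, not the disclosed implementation.

```python
def doctta_adapt(batches, model_update, mvlm_loss, ce_loss, div_loss, epochs=1):
    """Per-batch test-time adaptation: combine the three objectives into
    the total loss (Eq. 5) and take one update step per target batch."""
    history = []
    for _ in range(epochs):
        for batch in batches:
            total = mvlm_loss(batch) + ce_loss(batch) + div_loss(batch)
            model_update(total)  # e.g., backprop and an optimizer step
            history.append(total)
    return history

# Toy run with constant stand-in losses over two batches.
steps = []
history = doctta_adapt(
    batches=["batch0", "batch1"],
    model_update=steps.append,
    mvlm_loss=lambda b: 1.0,
    ce_loss=lambda b: 2.0,
    div_loss=lambda b: -0.5,
)
```

Because pseudo-labels are regenerated from the current model each batch, every update step sees the latest predictions.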
As indicated above, predictions are used by the model to determine pseudo-labels that are correct, as indicated at processing block 240 of
At block 250 of
In accordance with the process 200, target streams containing unlabeled data may be processed seamlessly. This accounts for cases where a customer may have new data that was not accounted for during the training of the document model. In other cases, the amount of data available for training the model may be modest for certain customers, and such customers may therefore encounter new, unaccounted-for data more frequently. The capability to adapt the document model and continue processing the input data stream improves processing by mitigating against unlabeled or unaccounted-for data causing the model to fail and associated computing systems to crash. In addition, the model is adapted without having to retrain the document model offline, with or without human intervention.
As indicated above, an aspect of the disclosed technology is the introduction of new benchmarks for VDU. Our benchmark datasets are constructed from existing popular and publicly-available VDU data to mimic real-world challenges.
One benchmark is an entity recognition benchmark. We consider the Form Understanding in Noisy Scanned Documents (“FUNSD”) dataset for this benchmark, which is a noisy form understanding collection consisting of sparsely-filled forms, with sparsity varying across the use cases the forms are from. In addition, the scanned images are noisy, with different amounts of degradation due to the disparity in scanning processes, which can further exacerbate the sparsity issue as the limited information might be based on incorrect OCR outputs. As a representative distribution shift challenge on FUNSD, we split the source and target documents based on a measure of the sparsity of available information. The original dataset has 9,707 semantic entities and 31,485 words with 4 categories of entities: question, answer, header, and other, where each category (except other) is labeled as either the beginning or an intermediate word of a sentence. Therefore, in total, we have 7 classes. We first combine the original training and test splits and then manually divide them into two groups. We set aside the 149 forms that are filled with more text for the source domain and put the 50 forms that are sparsely filled in the target domain. We randomly choose 10 out of the 149 documents for validation, and use the remaining 139 for training.
Another benchmark is a key-value extraction adaptation benchmark. We use the Scanned Receipts OCR and Information Extraction (“SROIE”) dataset with 9 classes in total. Similar to FUNSD, we first combine the original training and test splits. Then, we manually divide them into two groups based on their visual appearance: the source domain, with 600 documents, contains standard-looking receipts with a proper angle of view and clear black ink color. We use 37 documents from this split for validation, which we use to tune the adaptation hyperparameters. Note that the validation split does not overlap with the target domain, which has 347 receipts with a slightly blurry look, rotated views, colored ink, and large empty margins.
Another benchmark is a document VQA benchmark. We use DocVQA, a large-scale VQA dataset with nearly 20 different types of documents including scientific reports, letters, notes, invoices, publications, tables, etc. The original training and validation splits contain questions from all of these document types. However, for the purpose of creating an adaptation benchmark, we select 4 domains of documents: i) Emails & Letters (E), ii) Tables & Lists (T), iii) Figures & Diagrams (F), and iv) Layout (L). Since DocVQA does not have public meta-data to easily sort all documents with their questions, we use a simple keyword search to find our desired categories of questions and their matching documents. We use the same words as the domains' names to search among the questions (i.e., we search for the words “email” and “letter” for the Emails & Letters domain). However, for the Layout domain, our list of keywords is [“top”, “bottom”, “right”, “left”, “header”, “page number”], which identifies questions that query information from a specific location in the document. Among the four domains, L and E have the smallest gap because emails/letters have structured layouts and extracting information from them requires understanding relational positions. For example, the name and signature of the sender usually appear at the bottom, while the date usually appears at the top left. However, the F and T domains seem to have larger gaps with the other domains, which we attribute to the fact that learning to answer questions on figures or tables requires understanding local information within the list or table.
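The keyword search used to form the Layout domain can be illustrated with a short sketch; the keyword list is the one given above, while the example questions and helper name are invented for illustration.

```python
LAYOUT_KEYWORDS = ["top", "bottom", "right", "left", "header", "page number"]

def matches_domain(question, keywords):
    """Assign a question to a domain if it contains any domain keyword."""
    q = question.lower()
    return any(kw in q for kw in keywords)

questions = [
    "What is mentioned at the top left of the page?",
    "Who signed this letter?",
]
layout_questions = [q for q in questions if matches_domain(q, LAYOUT_KEYWORDS)]
```

Substring matching on lowercased text keeps the filter simple; a production split would likely need to handle plurals and word boundaries as well.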
The computing device 700 can take on a variety of configurations, such as, for example, a controller or microcontroller, a processor, or an ASIC. In some instances, computing device 700 may comprise a server or host machine that carries out the operations discussed above. In other instances, such operations may be performed by one or more of the computing devices in a data center. The computing device may include memory 704, which includes data 708 and instructions 712, and a processing element 716, as well as other components typically present in computing devices (e.g., input/output interfaces for a keyboard, display, etc.; communication ports for connecting to different types of networks).
The memory 704 can store information accessible by the processing element 716, including instructions 712 that can be executed by processing element 716. Memory 704 can also include data 708 that can be retrieved, manipulated, or stored by the processing element 716. The memory 704 may be a type of non-transitory computer-readable medium capable of storing information accessible by the processing element 716, such as a hard drive, solid state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, write-capable, and read-only memories. The processing element 716 can be a well-known processor or other lesser-known types of processors. Alternatively, the processing element 716 can be a dedicated controller such as an ASIC.
The instructions 712 can be a set of instructions executed directly, such as machine code, or indirectly, such as scripts, by the processor 716. In this regard, the terms “instructions,” “steps,” and “programs” can be used interchangeably herein. The instructions 712 can be stored in object code format for direct processing by the processor 716, or can be stored in other types of computer language, including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. For example, the instructions 712 may include instructions to carry out the processes, methods, and functions discussed above in relation to
The data 708 can be retrieved, stored, or modified by the processor 716 in accordance with the instructions 712. For instance, although the system and method are not limited by a particular data structure, the data 708 can be stored in computer registers, in a relational database as a table having a plurality of different fields and records, or in XML documents. The data 708 can also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII, or Unicode. Moreover, the data 708 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories, including other network locations, or information that is used by a function to calculate relevant data.
The computing device 700 may also include one or more modules 720. Modules 720 may comprise software modules that include a set of instructions, data, and other components (e.g., libraries) used to operate computing device 700 so that it performs specific tasks. For example, the modules 720 may comprise scripts, programs, or instructions to implement one or more of the functions associated with the modules or components discussed in
Computing device 700 may also include one or more input/output ports 730. Each I/O port 730 may receive an input stream as discussed above and, after processing, output the data stream updated with pseudo-labels. Each output port may comprise an I/O interface that communicates with local and wide area networks.
In some examples, the disclosed technology may be implemented as a system 800 in a distributed computing environment as shown in
Computing device 810 may comprise a computing device as discussed in relation to
Computing device 810 may also include a display 820 (e.g., a monitor having a screen, a touch-screen, a projector, a television, or other device that is operable to display information) that provides a user interface that allows for controlling the computing device 810. Such control may include, for example, using a computing device to cause data to be uploaded through input system 828 to cloud system 850 for processing, causing accumulation of data on storage 836, or more generally, managing different aspects of a customer's computing system. While input system 828 may be used to upload data, e.g., a USB port, computing system 800 may also include a mouse, keyboard, touchscreen, or microphone that can be used to receive commands and/or data.
The network 840 may include various configurations and protocols, including short-range communication protocols such as Bluetooth™, Bluetooth LE, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi, HTTP, etc., and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces. Computing device 810 interfaces with network 840 through communication interface 824, which may include the hardware, drivers, and software necessary to support a given communications protocol.
Cloud computing systems 850 may comprise one or more data centers that may be linked via high speed communications or computing networks. A given data center within system 850 may comprise dedicated space within a building that houses computing systems and their associated components, e.g., storage systems and communication systems. Typically, a data center will include racks of communication equipment, servers/hosts, and disks. The servers/hosts and disks comprise physical computing resources that are used to provide virtual computing resources such as VMs. To the extent that a given cloud computing system includes more than one data center, those data centers may be at different geographic locations in relatively close proximity to each other, chosen to deliver services in a timely and economically efficient manner, as well as to provide redundancy and maintain high availability. Similarly, different cloud computing systems are typically provided at different geographic locations.
As shown in
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.
The present application claims the benefit of the filing date of U.S. Provisional Patent Application No. 63/343,211, filed May 18, 2022, the disclosure of which is hereby incorporated herein by reference.