Computerized document understanding typically involves computer vision, optical character recognition (OCR), and/or other processing techniques to comprehend the contents of documents without human intervention. Understanding or comprehending a document can include anything from the classification of the document type to the identification, extraction, and/or storage of relevant values or information from the document. In some fields, such as accounting, tax, and other fields that can be document-intensive, document comprehension can also include the presentation of relevant information to various applications in a structured way. For example, when a user utilizes accounting software to analyze his/her financials or utilizes tax software to prepare a tax return, it is common for the user to upload, via the Internet, various types of documents for processing such as, e.g., W-2s, 1099s, invoices, etc.
However, when going through the initial process of filling out documents (either online forms or a physical document), the user may have questions about a particular field or box. For example, for tax preparation, the user may be confused about what is considered to be his or her dependent. In other examples, the user may be confused by what a withholding is, or may not understand other terms and/or how to perform certain calculations. To get help, the user may contact customer service (e.g., by calling a customer service line or connecting with an online representative via chat) to ask questions related to his/her document. This can create various issues such as, e.g., overloading the customer service centers. More significantly, the customer service representative may not be knowledgeable enough to answer the user's question and may be forced to provide generic or unspecific answers, such as directing the user to an online search tool, instruction manual, or reference page. This can be potentially frustrating, tedious, and time-consuming for a user having trouble filling out and trying to upload a document.
Similarly, if the user is confused or is having trouble filling out a particular field or putting forth certain information in a document, he/she may decide to simply leave the field blank and upload the incomplete document for processing. While document processing platforms may be able to detect an error, the information provided by the platform may not be the most helpful for the user. In some cases, the only information provided may be an identification of the blank field (e.g., “SSN is blank”). In other cases, the information provided by the platform, if any, may not be any more informative than what the user would have obtained from a customer service representative, search engine, or reference manual. In yet other cases, a long and potentially complex and/or tedious list of steps may be provided to the user. Each of these situations is undesirable.
An example of the type of information displayed to the user that may not be preferred or the most efficient methodology of answering the user's question is shown in
Embodiments of the present disclosure relate to various systems and methods for providing contextual information for document understanding. The disclosed principles can be used to assist users in filling out documents by providing contextual information based on the deficiencies (or anomalies) identified in an uploaded document.
For example, a user may upload to a tax service a W-2 form missing its social security number (SSN). Without the disclosed principles, this would cause an error and force the user to receive help in the undesirable manners mentioned above. However, according to the disclosed principles, the disclosed methods and systems may identify the deficiency in the document and automatically generate a question related to the anomaly (herein referred to as a “query”). The query can be fed as an input to a trained question-answering (QA) model that may be specifically fine-tuned with keywords and/or other jargon related to the application (e.g., tax, accounting, and/or other financial services). The QA model can provide an answer (herein referred to as contextual information or contextual explanations), which can be forwarded for display on a device associated with the user. The contextual information may include various information such as the required format for the missing information and/or a specific action that should be taken to correct the error, although the contextual information may vary according to the anomaly and underlying service (i.e., accounting, taxes, financial management, etc.). It should be appreciated that, while the embodiments described herein are described as being utilized with accounting, tax, and/or financial documents, the disclosed principles are not so limited and may apply to any form-based document and its related service.
In some embodiments, the disclosed principles may perform various analyses to determine whether identified document deficiencies are indeed anomalies. For example, many tax or financial documents commonly contain blank spaces or blank fields, but are still considered complete. A blank field does not necessarily correlate to an anomaly in those documents and thus does not necessarily correlate to something a user should receive contextual information for. For example, a majority of users may leave a certain field blank in a certain document, which suggests that a value may not be necessary to complete the document and that it would be a waste of time and processing resources to provide an unwanted piece of information to the user. Accordingly, the disclosed principles may analyze a history of similar documents prior to providing contextual information to determine if the contextual information would be valued by the user.
A user device 202 can include one or more computing devices capable of receiving user input as well as transmitting and/or receiving data via network 204 or communicating with server device 206. In some embodiments, a user device 202 can include a conventional computer system, such as a desktop or laptop computer. Alternatively, a user device 202 may include a device having computer functionality, such as a personal digital assistant (PDA), a mobile telephone, a smartphone, or other suitable device. In some embodiments, a user device 202 may be the same as or similar to user device 1000 described below with respect to
Network 204 may include one or more wide area networks (WANs), metropolitan area networks (MANs), local area networks (LANs), personal area networks (PANs), or any combination of these networks. Network 204 may include a combination of one or more types of networks, such as the Internet, intranet, Ethernet, twisted-pair, coaxial cable, fiber optic, cellular, satellite, IEEE 802.11, terrestrial, and/or other types of wired or wireless networks. Network 204 can also use standard communication technologies and/or protocols.
As shown in
In one or more embodiments, the extraction module 208 may be configured to analyze a document or an image of a document received from a user device 202. For example, the extraction module 208 may perform image processing such as OCR in accordance with pre-defined models for extracting text from specific document types, as well as any other text extraction technique known in the art. The extraction module 208 may be configured to extract financial data written onto a tax form (e.g., handwritten or typed by a user) and use it to fill out and complete a tax return. In some embodiments, the extraction module 208 can be configured to detect the document type received, identify fields related to “boxes” or “spaces” that can be filled out on a form, and detect empty spaces. In some embodiments, the extraction module 208 can also be configured to extract values from fields within the document, such as income, number of dependents, or other such values. In one or more embodiments, the extraction module 208 provides an extracted output based on the above principles.
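By way of a non-limiting illustration, the extraction module's output described above may be sketched as follows. The form names, field names, and OCR result below are assumptions chosen for illustration only and do not limit the disclosure:

```python
# Hypothetical sketch of an extraction module's output: given a detected
# document type and OCR'd field text, map the text to known fields and
# flag empty spaces. Form templates and field names are illustrative.

KNOWN_FORMS = {
    "W-2": ["employee_ssn", "wages", "federal_tax_withheld"],
    "1099-INT": ["payer_tin", "interest_income"],
}

def extract(document_type, ocr_fields):
    """Return extracted values plus a list of fields left blank."""
    expected = KNOWN_FORMS[document_type]
    values = {f: ocr_fields.get(f, "").strip() for f in expected}
    empty = [f for f, v in values.items() if not v]
    return {"document_type": document_type, "values": values, "empty_fields": empty}

# A W-2 whose SSN field was blank in the scanned image:
output = extract("W-2", {"wages": "52000.00", "federal_tax_withheld": "4100.00"})
```

The extracted output (document type, values, and detected empty spaces) can then be passed downstream for anomaly analysis.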
The anomaly detection module 210 can be configured to detect anomalies in a document received from the user device 202, such as, e.g., in the extracted output of the extraction module 208. In some embodiments, the anomaly detection module 210 may be configured to receive the identified document type and the identified field related to a detected empty space from the extraction module 208. The anomaly detection module 210 may be configured to detect various types of anomalies in the document from the extracted output. In some embodiments, the detection of an anomaly in a document can include identifying an insufficiency in the document and then analyzing the insufficiency to determine if it should be classified as an anomaly. For example, insufficiencies can include blank spaces or blank fields, and the anomaly detection module 210 may be configured to determine whether the insufficiency is an anomaly by analyzing similar documents (e.g., documents stored in database 216) for statistics on that particular field. If a particular field is commonly left blank, then the insufficiency may not be determined to be an anomaly by the anomaly detection module 210. However, if a particular field is rarely left blank by other users, the empty space may be determined to be an anomaly by the anomaly detection module 210.
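The statistical blank-field analysis described above may be illustrated by the following non-limiting sketch. The 10% blank-rate threshold is an assumption for illustration; any suitable threshold may be used:

```python
# Illustrative sketch: decide whether a blank field is an anomaly by
# checking how often other users left that field blank in similar
# documents. The 10% threshold is an assumption, not from the disclosure.

BLANK_RATE_THRESHOLD = 0.10  # a field blank in <10% of documents is "rarely blank"

def is_anomaly(field, similar_documents):
    """similar_documents: list of dicts mapping field name -> extracted value."""
    blanks = sum(1 for doc in similar_documents if not doc.get(field, "").strip())
    blank_rate = blanks / len(similar_documents)
    # Rarely-blank fields left blank are anomalous; commonly-blank fields are not.
    return blank_rate < BLANK_RATE_THRESHOLD

# SSN is blank in only 1% of similar documents, so a blank SSN is anomalous:
history = [{"employee_ssn": "123-45-6789"}] * 99 + [{"employee_ssn": ""}]
```

Documents stored in a database such as database 216 would supply the history of similar documents in such an embodiment.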
In some embodiments, an anomaly may be detected based on the actual values extracted from the document by the extraction module 208. That is, the disclosed principles are not limited to finding anomalies based on blank fields and may instead detect anomalies based on incorrectly entered values or content. For example, the anomaly detection module 210 may determine that the wages entered by a user are less than the tax amount, which would be considered abnormal and thus an anomaly. It should be appreciated that these are merely examples and that the anomaly detection module 210 may include a list of rules to apply to each document or the extracted output of the document to determine if there are various types of anomalies.
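The rule-list approach described above may be sketched as follows; the specific rules and field names are illustrative assumptions only:

```python
# Sketch of a rule list applied to an extracted output: each rule is a
# named predicate that flags an anomaly when it evaluates true. The
# rules shown here are illustrative, not an exhaustive or required set.

RULES = [
    ("wages_less_than_tax", lambda d: d["wages"] < d["tax_owed"]),
    ("negative_wages", lambda d: d["wages"] < 0),
]

def detect_anomalies(extracted):
    """Return the names of all rules that flag the extracted values."""
    return [name for name, check in RULES if check(extracted)]

# Wages below the tax owed trigger the first rule:
flags = detect_anomalies({"wages": 1200.0, "tax_owed": 5000.0})
```

New document types or services could be supported by extending the rule list without altering the detection loop.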
The query generation module 212 can be configured to generate queries (e.g., questions) based on the anomalies detected by the anomaly detection module 210. In some embodiments, the query generation module 212 can be configured to compile text to form a phrase or question. It should be appreciated that the query is not required to be in the form of a question. Generating the query can include compiling text associated with the field in which the anomaly has been detected and associated with the document type. For example, if a document is missing a social security number, the query can be “What is a social security number and where do I find it?” In another example, if a document has a tax bill generated that is higher than the wages provided as being earned, the query can be “What do I do if my taxes owed are incorrect?” In addition, the user could ask questions such as “Why are my taxes more than last year?”
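One non-limiting way to compile such queries is from templates keyed on the document type and the anomalous field, with a generic fallback phrase. The template entries below are assumptions for illustration:

```python
# Sketch: compile a natural-language query from the anomalous field and
# document type. Template keys and wording are illustrative assumptions.

QUERY_TEMPLATES = {
    ("W-2", "employee_ssn"): "What is a social security number and where do I find it?",
    ("1040", "tax_owed"): "What do I do if my taxes owed are incorrect?",
}

def generate_query(document_type, field):
    """Return a templated query, falling back to a generic phrasing."""
    default = f"How do I fill out the {field} field on a {document_type}?"
    return QUERY_TEMPLATES.get((document_type, field), default)
```

The resulting text can then be embedded and fed to the QA model as described below in connection with the explanation module 214.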
The explanation module 214 can include a trained QA model and can be configured to receive a textual query from the query generation module 212 and feed the query into the QA model. In some embodiments, the QA model can include a Bidirectional Encoder Representations from Transformers (BERT) model. As is known in the art, a BERT model is a language processing model that can include various transformer encoder blocks trained to understand contextual relations between words. A BERT model can analyze text bidirectionally instead of strictly left to right or right to left. The transformer architecture on which BERT is based includes two mechanisms: an encoder that reads input text and a decoder that predicts text to follow the input text; a BERT model employs the encoder mechanism. A BERT model may operate on and process word or text vectors (e.g., text that has been embedded into a vector space). A neural network with layers (e.g., various transformer blocks, self-attention heads) then analyzes the word vectors for prediction or classification.
In some embodiments, the BERT model may be converted to a QA model (e.g., where the BERT model predicts an answer for input text that is in the form of a question), fine-tuned with various tax-related and/or finance-related keywords, and trained to identify relevant answers within specifically defined areas of text. For example, the BERT model may be fine-tuned with instructions and references from the Internal Revenue Service (IRS), internal documents of an organization related to FAQs and other help resources that a user would normally have to sift through, online tax- or finance-related publications, and tax documents. After receiving a question from the query generation module 212, the fine-tuned BERT model can receive an embedded question (e.g., in vector format or a query vector) and predict an answer from within a pre-defined sequence or passage of text or a body of text. As described herein, the pre-defined sequence or passage of text can include the IRS instructions and references, other tax documents, FAQs and other resources related to taxes and finance, and online tax- or finance-related publications. In some embodiments, fine-tuning the BERT model for tax- or finance-specific purposes can include altering parameters in the self-attention head mechanisms. The BERT model can be trained using annotated examples of various question-answering situations pertaining to the tax and accounting domain, while fine-tuning can be done on tax and accounting taxonomy including pre/post processing and annotation. For example, the explanation module 214 can receive a query from query generation module 212, embed the query into a vector format (herein referred to as a query vector), and feed the query vector to the fine-tuned BERT model. The BERT model can predict an answer from the pre-trained references and output the answer, which can herein be referred to as contextual information or a contextual explanation.
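The extractive answer-prediction step described above may be illustrated with a greatly simplified, non-limiting stand-in. A practical embodiment would use a fine-tuned BERT model to predict an answer span; the toy sketch below merely selects the reference sentence with the greatest keyword overlap with the query, and its reference text is invented for illustration (it is not IRS language):

```python
# Toy stand-in for the extractive QA step: a fine-tuned BERT model would
# predict an answer span inside a reference passage; here we approximate
# that by scoring each sentence of the passage by keyword overlap with
# the query. Passage text below is illustrative only.

def best_span(query, passage_sentences):
    """Return the passage sentence sharing the most tokens with the query."""
    q_tokens = set(query.lower().split())
    def score(sentence):
        return len(q_tokens & set(sentence.lower().split()))
    return max(passage_sentences, key=score)

passage = [
    "A social security number is a nine-digit identifier.",
    "Wages are reported in box 1 of the form.",
]
answer = best_span("what is a social security number", passage)
```

In a trained model, this selection would instead be made by start- and end-position predictions over the embedded passage.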
The explanation module 214 can then be configured to cause this output to be displayed on a user device 202.
At block 303, in response to an anomaly being detected in the received document, the query generation module 212 can generate a query based on the detected anomaly. For example, generating a query may include compiling a textual phrase or question based on the anomaly (e.g., the query may be based on the field associated with the anomaly). At block 304, the explanation module 214 can feed the query to a QA model (e.g., a fine-tuned BERT model as described in relation to
At block 305, the explanation module 214 can receive an answer (herein referred to as contextual information or a contextual explanation) from the QA model. The contextual information can be the identified span of text from block 304. In some embodiments, the identified span of text can be de-embedded from the vector format into a textual format. At block 306, the contextual information can be sent to and displayed on the user device 202 associated with the user. As can be appreciated, the contextual information may help the user fix or correct any anomalies or other issues in his/her submitted document.
At block 404, the anomaly detection module 210 can compare the document to a database (e.g., database 216 of
In one or more embodiments, the anomaly detection module 210 can analyze values extracted by extraction module 208 to determine anomalies (i.e., one or more anomalies can be detected for non-blank spaces). For example, an income value may be analyzed and compared to a determined amount of tax owed for the user. If the amount of tax owed is greater than the income value provided in the received document, anomaly detection module 210 can determine that this is an anomaly. In some embodiments, another example of an anomaly can be found in a 1099-INT form: if the total interest does not match the sum of the itemized values, an anomaly can be flagged. In yet another example, if the code entered in box 12 of a W-2 is not a valid code, an anomaly can also be flagged. For example, ZZ is not a valid code according to IRS instructions.
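The two document-specific checks described above may be sketched as follows. The set of valid box 12 codes shown is abbreviated for illustration and is not the complete IRS list:

```python
# Sketch of the 1099-INT totals check and the W-2 box 12 code check
# described above. The code set is abbreviated, not the full IRS list.

VALID_BOX_12_CODES = {"A", "B", "C", "D", "DD", "E", "W"}  # abbreviated, illustrative

def check_1099_int(total_interest, itemized):
    """Flag an anomaly if the itemized values do not sum to the total."""
    return abs(total_interest - sum(itemized)) > 0.005  # half-cent tolerance

def check_w2_box_12(code):
    """Flag an anomaly if the box 12 code is not a recognized code."""
    return code.upper() not in VALID_BOX_12_CODES
```

Each check returns True when the extracted values should be flagged as an anomaly, so both can be folded into the module's rule list.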
In some embodiments, the data and/or passage/body of text that the QA model searches and parses for relevant passages can be embedded into a vector format (e.g., a body vector); the query can also be embedded into a vector format, and relevant passages can be identified in the vector space. As used herein, unstructured data can refer to data that does not have a pre-defined model or is not organized in a pre-defined manner. The identified passage in the unstructured data can be de-embedded from vector format back to a text format if desired. At 508, the contextual information (e.g., the “answer”) can be provided for display to the user in the software used to submit the original document. For example, as shown in
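The vector-space identification of relevant passages may be illustrated by the following non-limiting sketch, which uses a toy bag-of-words embedding and cosine similarity in place of a learned embedding model:

```python
# Sketch of vector-space retrieval over unstructured reference text:
# embed the query and each passage, then return the nearest passage by
# cosine similarity. The bag-of-words embedding is a toy stand-in for a
# learned embedding model; passage text is illustrative only.

import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding into a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def retrieve(query, passages):
    """Return the passage closest to the query in the vector space."""
    q = embed(query)
    return max(passages, key=lambda p: cosine(q, embed(p)))
```

The identified passage can then be de-embedded (here, it is already text) and returned as the contextual information for display.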
Processor(s) 902 may use any known processor technology, including but not limited to graphics processors and multi-core processors. Suitable processors for the execution of a program of instructions may include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Bus 910 may be any known internal or external bus technology, including but not limited to ISA, EISA, PCI, PCI Express, USB, Serial ATA, or FireWire. Volatile memory 904 may include, for example, SDRAM. Processor 902 may receive instructions and data from a read-only memory or a random access memory or both. Essential elements of a computer may include a processor for executing instructions and one or more memories for storing instructions and data.
Non-volatile memory 906 may include by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. Non-volatile memory 906 may store various computer instructions including operating system instructions 912, communication instructions 914, application instructions 916, and application data 917. Operating system instructions 912 may include instructions for implementing an operating system (e.g., Mac OS®, Windows®, or Linux). The operating system may be multi-user, multiprocessing, multitasking, multithreading, real-time, and the like. Communication instructions 914 may include network communications instructions, for example, software for implementing communication protocols, such as TCP/IP, HTTP, Ethernet, telephony, etc. Application instructions 916 may include instructions for providing contextual information in conjunction with document understanding according to the systems and methods disclosed herein. For example, application instructions 916 may include instructions for components 208-214 described above in conjunction with
Peripherals 908 may be included within server device 900 or operatively coupled to communicate with server device 900. Peripherals 908 may include, for example, network subsystem 918, input controller 920, and disk controller 922. Network subsystem 918 may include, for example, an Ethernet or WiFi adapter. Input controller 920 may be any known input device technology, including but not limited to a keyboard (including a virtual keyboard), mouse, track ball, and touch-sensitive pad or display. Disk controller 922 may include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks.
Sensors, devices, and subsystems may be coupled to peripherals subsystem 1006 to facilitate multiple functionalities. For example, motion sensor 1010, light sensor 1012, and proximity sensor 1014 may be coupled to peripherals subsystem 1006 to facilitate orientation, lighting, and proximity functions. Other sensors 1016 may also be connected to peripherals subsystem 1006, such as a global navigation satellite system (GNSS) (e.g., GPS receiver), a temperature sensor, a biometric sensor, magnetometer, or other sensing device, to facilitate related functionalities.
Camera subsystem 1020 and optical sensor 1022, e.g., a charged coupled device (CCD) or a complementary metal-oxide semiconductor (CMOS) optical sensor, may be utilized to facilitate camera functions, such as recording photographs and video clips. Camera subsystem 1020 and optical sensor 1022 may be used to collect images of a user to be used during authentication of a user, e.g., by performing facial recognition analysis.
Communication functions may be facilitated through one or more wired and/or wireless communication subsystems 1024, which may include radio frequency receivers and transmitters and/or optical (e.g., infrared) receivers and transmitters. For example, the Bluetooth (e.g., Bluetooth low energy (BTLE)) and/or WiFi communications described herein may be handled by wireless communication subsystems 1024. The specific design and implementation of communication subsystems 1024 may depend on the communication network(s) over which the user device 1000 is intended to operate. For example, user device 1000 may include communication subsystems 1024 designed to operate over a GSM network, a GPRS network, an EDGE network, a WiFi or WiMax network, and a Bluetooth™ network. For example, wireless communication subsystems 1024 may include hosting protocols such that device 1000 may be configured as a base station for other wireless devices and/or to provide a WiFi service.
Audio subsystem 1026 may be coupled to speaker 1028 and microphone 1030 to facilitate voice-enabled functions, such as speaker recognition, voice replication, digital recording, and telephony functions. Audio subsystem 1026 may be configured to facilitate processing voice commands, voice-printing, and voice authentication, for example.
I/O subsystem 1040 may include a touch-surface controller 1042 and/or other input controller(s) 1044. Touch-surface controller 1042 may be coupled to a touch-surface 1046. Touch-surface 1046 and touch-surface controller 1042 may, for example, detect contact and movement or break thereof using any of a plurality of touch sensitivity technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies, as well as other proximity sensor arrays or other elements for determining one or more points of contact with touch-surface 1046.
The other input controller(s) 1044 may be coupled to other input/control devices 1048, such as one or more buttons, rocker switches, thumb-wheel, infrared port, USB port, and/or a pointer device such as a stylus. The one or more buttons (not shown) may include an up/down button for volume control of speaker 1028 and/or microphone 1030.
In some implementations, a pressing of the button for a first duration may disengage a lock of touch-surface 1046; and a pressing of the button for a second duration that is longer than the first duration may turn power to user device 1000 on or off. Pressing the button for a third duration may activate a voice control, or voice command, module that enables the user to speak commands into microphone 1030 to cause the device to execute the spoken command. The user may customize a functionality of one or more of the buttons. Touch-surface 1046 may, for example, also be used to implement virtual or soft buttons and/or a keyboard.
In some implementations, user device 1000 may present recorded audio and/or video files, such as MP3, AAC, and MPEG files. In some implementations, user device 1000 may include the functionality of an MP3 player, such as an iPod™. User device 1000 may, therefore, include a 36-pin connector and/or 8-pin connector that is compatible with the iPod. Other input/output and control devices may also be used.
Memory interface 1002 may be coupled to memory 1050. Memory 1050 may include high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g., NAND, NOR). Memory 1050 may store an operating system 1052, such as Darwin, RTXC, LINUX, UNIX, OS X, Windows, or an embedded operating system such as VxWorks.
Operating system 1052 may include instructions for handling basic system services and for performing hardware dependent tasks. In some implementations, operating system 1052 may be a kernel (e.g., UNIX kernel). In some implementations, operating system 1052 may include instructions for performing voice authentication.
Memory 1050 may also store communication instructions 1054 to facilitate communicating with one or more additional devices, one or more computers, and/or one or more servers. Memory 1050 may include graphical user interface instructions 1056 to facilitate graphical user interface processing; sensor processing instructions 1058 to facilitate sensor-related processing and functions; phone instructions 1060 to facilitate phone-related processes and functions; electronic messaging instructions 1062 to facilitate electronic messaging-related processes and functions; web browsing instructions 1064 to facilitate web browsing-related processes and functions; media processing instructions 1066 to facilitate media processing-related functions and processes; GNSS/Navigation instructions 1068 to facilitate GNSS and navigation-related processes and functions; and/or camera instructions 1070 to facilitate camera-related processes and functions.
Memory 1050 may store application (or “app”) instructions and data 1072, such as instructions for the apps described above in the context of
The described features may be implemented in one or more computer programs that may be executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that may be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program may be written in any form of programming language (e.g., Objective-C, Java), including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
Suitable processors for the execution of a program of instructions may include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Generally, a processor may receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer may include a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer may also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data may include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
To provide for interaction with a user, the features may be implemented on a computer having a display device such as an LED or LCD monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user may provide input to the computer.
The features may be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination thereof. The components of the system may be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a telephone network, a LAN, a WAN, and the computers and networks forming the Internet.
The computer system may include clients and servers. A client and server may generally be remote from each other and may typically interact through a network. The relationship of client and server may arise by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
One or more features or steps of the disclosed embodiments may be implemented using an API. An API may define one or more parameters that are passed between a calling application and other software code (e.g., an operating system, library routine, function) that provides a service, that provides data, or that performs an operation or a computation.
The API may be implemented as one or more calls in program code that send or receive one or more parameters through a parameter list or other structure based on a call convention defined in an API specification document. A parameter may be a constant, a key, a data structure, an object, an object class, a variable, a data type, a pointer, an array, a list, or another call. API calls and parameters may be implemented in any programming language. The programming language may define the vocabulary and calling convention that a programmer will employ to access functions supporting the API.
In some implementations, an API call may report to an application the capabilities of a device running the application, such as input capability, output capability, processing capability, power capability, communications capability, etc.
While various embodiments have been described above, it should be understood that they have been presented by way of example and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail may be made therein without departing from the spirit and scope. In fact, after reading the above description, it will be apparent to one skilled in the relevant art(s) how to implement alternative embodiments. For example, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
In addition, it should be understood that any figures which highlight the functionality and advantages are presented for example purposes only. The disclosed methodology and system are each sufficiently flexible and configurable such that they may be utilized in ways other than that shown.
Although the term “at least one” may often be used in the specification, claims and drawings, the terms “a”, “an”, “the”, “said”, etc. also signify “at least one” or “the at least one” in the specification, claims and drawings.
Finally, it is the applicant's intent that only claims that include the express language “means for” or “step for” be interpreted under 35 U.S.C. 112(f). Claims that do not expressly include the phrase “means for” or “step for” are not to be interpreted under 35 U.S.C. 112(f).