1. Field
Embodiments of the present invention are directed towards implementations and functionalities related to an electronic file storage and management system operable on a variety of devices and across many storage mediums. At least some embodiments are directed towards optical character recognition (OCR) and intelligent character recognition (ICR) that is capable of processing documents.
2. Description of the Related Art
Computer users generate a great number of electronic files that are stored on personal computers, mobile devices, portable storage devices, and cloud services. As the volume of files expands, the ability to find and sort the files efficiently becomes both more important and more difficult. One option for locating files is to give each file a text-based name that briefly describes the form and substance of the document contained within the file. Usually, users must create names manually when saving or sending documents and often fail to do so for all but the most important documents. At best, currently available applications provide a name for a document based on text found in the first line of a yet-to-be-named document or on information or default text available to the application naming the document.
Often, users save documents into a single store and over time accumulate documents with names such as “image_0001.jpg” and “21082008.pdf”, making recollection of their contents and searches for particular or important documents almost impossible. For example, when processing groups or batches of documents with an OCR application, the output is typically a batch or group of documents with recognized data, with files named according to a generic pattern, for example: “Document0001,” “Document0002,” etc. The resulting documents may be sent to the user by e-mail or placed in a pre-defined folder.
When a user regularly accumulates a large number of unnamed documents, the result is a multitude of files with similar-looking meaningless names in the user's mailbox or pre-defined folder. Opening and checking the contents of these files and renaming them involve a significant amount of repetitive manual work and substantial loss of time. Therefore, there is substantial opportunity to automate this process and provide meaningful names to such files.
The invention provides methods for determining one or more document types associated with a document or electronic file and its unique features. The method comprises generating at least one document type hypothesis corresponding to the type of the document. The method further comprises verifying each document type hypothesis, selecting the best document type hypothesis, and forming a document name based on the best type hypothesis and one or more unique features of the document. This method can be repeated for a batch of electronic files.
Further, this application describes methods of automatically naming documents or electronic files. Each electronic file processed through this system receives a unique or semi-unique name that describes some of the electronic file's contents, attributes and/or characteristics. One exemplary output of the described technology is one or more possible names for each of the user's electronic files which allow one to understand the contents and significance of each electronic file.
While the appended claims set forth the features of the present invention with particularity, the invention, together with its objects and advantages, will be more readily appreciated from the following detailed description, taken in conjunction with the accompanying drawings.
The present invention automates the process of file name generation by intelligently naming documents or electronic files based on their form and content, making electronic files easier to sort and find. The terms “document” and “electronic file” are used interchangeably herein. Examples of electronic files that could be named automatically include document images in the form of portable document format (PDF) files, scanned forms, email messages, attachments, websites, photos, and others.
One particular class of electronic files is document images originating from scanners, mobile phone photographs, cameras, and email messages, either as attachments or as embedded images. Generally, these images are stored as electronic files for future access and accumulate over time on personal computers and in cloud storage locations such as email accounts. As document images accumulate without proper file naming, they become particularly difficult to locate.
One option for locating a document image or electronic file is to perform a key word search on the text portions of documents. This requires applying an OCR system that is capable of transforming document images into a computer-readable, computer-editable, and searchable form. OCR systems may also be used to extract data from document images. Typically, OCR systems output a plain text file, with simplified layout and formatting. These files retain simplified properties of the source document such as paragraphs, fonts, font styles, font sizes, and some other simple features. Often these OCR files are not as useful to the user as the original document image. For example, when applying OCR to newspaper or magazine pages with several articles on a page, it may be difficult or impossible to separate one story from another, resulting in an unacceptable text file. Therefore, to facilitate searches for the original document image, OCR systems may enable keyword searches of the previously recognized text in the source document image. The system enables a keyword search by recognizing and indexing the text of each new document image.
The contents of electronic files and images can serve as input information to generate an automatic name for the electronic file. The term “electronic file” herein shall include any collection of electronic data that is capable of having at least one tag extracted from it and that may be saved in electronic form. The text of a document or a scanned or photographed image of a document contained within a file may be used. A document image can be one or more pages. An image that includes “vector” or “vector-based” information about the disposition and content of text and graphic elements can also be used as input information. For example, a document image could include a portable document format (PDF) file with a text layer, a vector-based PDF file, an XPS-type file, a DOCX-type file, an XLSX-type file, a plain text (TXT) file, etc. An electronic text document could include text files, emails, websites, social media posts, and annotations.
For example, a document like a newspaper or magazine page may include several different articles with separate titles, inserts, and pictures. In accordance with embodiments of the present invention, a result of performing optical character recognition (OCR) or intelligent character recognition (ICR) is an editable text-encoded document that replicates the logical structure, layout, and formatting of the original paper document or document image that was fed to the system.
A text string briefly reflecting the content of a document can be used as the file name of the document. Such a useful file name is a result of the methods described herein. “File name string” is the term used for this string herein. Certain structural elements of a document or electronic file, their order and spatial relationships, and certain keywords or unique features in titles or in other parts of the electronic file may sometimes be used to compose a file name string. For example, the file name string can include information about a type or category describing the document (e.g., letter, business card). The file name string also can include information from “tags” inside the document (e.g., date, address, names).
“Tags” is the term used herein to describe keywords and unique features of a document as described more fully below. Tags are small parts of a text reflecting a document's properties. For example, the title of the document, the name of the document's author, the date of writing, and the header can be used as a tag.
Each tag may comprise a type (for example: Author) and a value (for example: “Mark Twain”). Several examples of types are illustrative: a header, a running title, a page number, a quotation, a date of purchase (such as from a bank statement or receipt), a date that a contract was executed, a URL, and an e-mail address.
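The type and value pairing described above can be sketched as a small data structure. This is only an illustration: the class name `Tag`, the helper `tags_of_type`, and the sample values other than the author example are assumptions, not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A keyword or unique feature extracted from a document."""
    type: str   # e.g. "Author", "Date", "URL" (type names are illustrative)
    value: str  # the text found in, or derived for, the document


def tags_of_type(tags, tag_type):
    """Return the values of all tags whose type equals tag_type."""
    return [t.value for t in tags if t.type == tag_type]


# Hypothetical tags for a single document.
tags = [
    Tag("Author", "Mark Twain"),
    Tag("Page number", "12"),
]
```

Here `tags_of_type(tags, "Author")` yields the author value, and a type that was not extracted yields an empty list.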
The tags result from an analysis of a document. At its simplest, the tag can be found in the text (e.g., text string, body of text) of a document. In more sophisticated cases, one or more tags for a document can be calculated or generated on the basis of data contained in the document—hidden data, metadata, format data and data in the content of the document. Also, a tag can be generated from data received or queried from additional sources of knowledge outside of the document. For example, one can find the name of a book's author by performing an internet search. In another case, the name of a book can be recognized from a barcode in an image or a photo of the book's cover.
Also shown in
Features of the document, text-based and non-text-based, can serve as tags or portions of tags—or may be used to identify elements that can serve as tags. For example, font size and relative text location may be used. Running titles, such as the one shown in
Sources external to the text or text elements may be used to generate or locate text or data that may be used as the source of a tag or portion thereof. For example, a QR barcode with an encoded URL may be found in an image of a document. A tag generation algorithm may include recognizing this QR barcode, decoding the associated URL, accessing a Web page at the URL, and retrieving information from a header of the accessed Web page.
In yet another embodiment, a telephone number may be found in the document. A tag or portion thereof may be generated by using an external phone book or database of telephone numbers, searching for and locating the number, and retrieving a name of a company associated with the telephone number. In another embodiment, a telephone number may be called and a recording made once the call reaches its destination (e.g., an automated greeting message of a company); subsequently, a voice-to-text procedure may be performed on the recording, and the resulting text, or text derived from it, may be used as a tag or portion thereof.
In yet another example, when a quotation appears in an image of a document, the quotation may be used to derive the name, birthdate, etc. of its author to be used as a tag or portion thereof.
In another example, if a postal ZIP code appears in a document, the ZIP code may be used to derive an associated city, state or other information that may then be used as a tag or portion thereof. For example, the United States ZIP code 10118 could be used to derive “Empire State Building” in New York City.
In another example, suppose a URL and other text appear at the top of a page in a document (such as when creating a PDF document from a Web browser). In this example, suppose that “http://www.ibsen.net/?id=1430” and “30.09.2008” appear along the top of one or more pages of a document. A tag extracting function or functionality may identify the date (i.e., 30.09.2008) and domain name (i.e., ibsen.net). These two tags could be processed together or independently to form the name of the document, e.g., “ibsen_net_1430_2008-09-30.”
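A minimal sketch of this example follows, assuming the header text has already been recognized as a string. The function name `name_from_page_header`, the regular expressions, the use of the `id` query parameter, and the underscore separator are illustrative assumptions rather than the described implementation.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def name_from_page_header(header_text):
    """Build a file name from a URL and a DD.MM.YYYY date in header text."""
    url_match = re.search(r"https?://\S+", header_text)
    date_match = re.search(r"\b(\d{2}\.\d{2}\.\d{4})\b", header_text)
    if not url_match or not date_match:
        return None  # one of the two tags is missing

    parsed = urlparse(url_match.group(0))
    host = parsed.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    parts = [host.replace(".", "_")]  # domain tag, e.g. "ibsen_net"

    # Assumption: a page identifier, when present, is carried in "?id=...".
    page_id = parse_qs(parsed.query).get("id")
    if page_id:
        parts.append(page_id[0])

    # Date tag, normalized to ISO order so that names sort chronologically.
    date = datetime.strptime(date_match.group(1), "%d.%m.%Y")
    parts.append(date.strftime("%Y-%m-%d"))
    return "_".join(parts)
```

Called with the header text "http://www.ibsen.net/?id=1430 30.09.2008", the function combines the domain, identifier, and date tags into a single name; when either tag is absent it returns `None` so that another naming strategy can be tried.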
A file name string can also be generated at the time of document conversion (e.g., renaming; subjected to OCR, saved and renamed). Such generation may be embedded in or may operate in conjunction with functions of the operating system or file browser (e.g., file explorer). For example, suppose a file has the name “picture_001.jpg”; this file can be saved as “Letter_from_John_30.Aug.12.pdf” when processing is completed. A file browser may facilitate or offer a function titled or named “intelligent renaming.” A user may, for example, right-click on a file and trigger “intelligent renaming,” and the function may then, without further input or action from the user, rename the file based upon tags derived from the document according to one or more of the functions and examples described herein. For example, an “intelligent renaming” function may use information obtained or derived from the EXIF data of a JPG image file to rename the file from, for example, “img0701.jpg” to “2012_04_28_2041_2240_x_1680”, which includes information about the date and time at which the image was taken, and the width and height (dimensions) of the image. Such renaming could be automated such that a batch of documents (irrespective of file type) may be renamed. For example, a batch of documents that includes rich text format (RTF) documents, JPG images and TIFF images may be processed as a batch. Such renaming allows for more useful names of files with a minimal amount of effort required by a user.
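The EXIF-based renaming example can be sketched as follows. The sketch assumes the capture time and pixel dimensions have already been read from the image metadata by some other component (reading EXIF data is not shown), and the function names and the choice to keep the original extension are assumptions for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath


def exif_based_name(taken_at, width, height):
    """Compose a name from capture date, capture time, and pixel dimensions."""
    return f"{taken_at:%Y_%m_%d_%H%M}_{width}_x_{height}"


def renamed_path(path, taken_at, width, height):
    """Return the path with its base name replaced, keeping folder and extension."""
    p = PurePosixPath(path)
    return str(p.with_name(exif_based_name(taken_at, width, height) + p.suffix))
```

For an image taken on 28 April 2012 at 20:41 with dimensions 2240 by 1680, `exif_based_name` reproduces the name given in the example above. Applying `renamed_path` to every file in a folder would be one way to approach the batch case.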
In another embodiment, the file naming process could be open to tuning by the user so that the file name reflects the information most relevant to the user. This is accomplished by user input on tag selection and file name formatting. For example, the user can adjust the settings such that the author name always appears in the file name or, alternatively, such that the company name appears in the file name.
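One possible way to express such user tuning is a setting that lists the tag types to include, in order. The sketch below is hypothetical: the function `build_name`, the dictionary layout, and the tag type names are assumptions, not the described configuration mechanism.

```python
def build_name(tags, preferred_types, separator="_"):
    """Join the values of the tag types the user asked for, in the given order.

    tags maps a tag type to its extracted value; types that were not
    extracted from the document are silently skipped.
    """
    values = [tags[t] for t in preferred_types if t in tags]
    return separator.join(v.replace(" ", separator) for v in values)


# Hypothetical tags extracted from one document.
tags = {"type": "Letter", "author": "John Smith", "company": "ABBYY"}
```

A user who prefers author names would configure `["type", "author"]`, while a user who prefers company names would configure `["type", "company"]`; the same extracted tags then produce different file names.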
In another embodiment, the file naming process could be implemented in software on a hard disk drive (HDD) or solid-state drive (SSD). Such a drive could be part of a RAID system. The process would operate when a user saves electronic documents onto the drive. The system could perform smart naming automatically or only after approval by the user. Such a function could be turned off via a hardware option (such as an additional switch on the drive) or with a software setting. Further, the function could be activated or deactivated with the Master/Slave pins on the HDD or SSD.
Another exemplary function that could be implemented is to use the generated file name and the tags to send files into folders designated to receive certain file types. For instance, folders could be designated as letters, photos, checks, invoices, etc. Additionally the functionality could run on a server and send uploaded files to designated shared folders. Such functionality could operate automatically or upon a prompt to the user.
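The folder routing described above might be sketched as a lookup from document type to destination folder. The mapping `FOLDER_BY_TYPE`, the root folder name, and the fallback folder are assumptions chosen for illustration; only the folder categories themselves come from the description.

```python
from pathlib import PurePosixPath

# Designated folders per document type (categories taken from the text above).
FOLDER_BY_TYPE = {
    "letter": "letters",
    "photo": "photos",
    "check": "checks",
    "invoice": "invoices",
}


def destination(file_name, doc_type, root="sorted", default_folder="unsorted"):
    """Return the path a named file would be sent to, based on its type."""
    folder = FOLDER_BY_TYPE.get(doc_type.lower(), default_folder)
    return PurePosixPath(root) / folder / file_name
```

The function only computes the destination; whether the move then happens automatically or after prompting the user is left to the caller, matching the two modes mentioned above.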
Another exemplary function that could be implemented using the derived or generated tags is adjusting file properties. File properties can be updated based upon tags derived from the document contained in the file according to one or more of the functions and examples described herein. For example, one or more of the following properties may be specified with data: title, subject, categories, and author name. Such file property categories are dependent upon the file system used (e.g., Linux, Microsoft® Windows®).
In another embodiment, the file naming could be implemented as part of a web browser or search engine. For instance, when a user finds a page or file she would like to save or bookmark, the naming function would operate when saving the file or bookmark to give it a meaningful name. This function does not replace the traditional ‘Save as’, but can be an alternative to it. Alternatively, the process could be implemented as part of an electronic messaging system, such as email, whereby a smart name is generated for the subject line of an email when the subject is left blank by the user. This feature could operate automatically or as a user-selected option.
In another embodiment, the file naming system could be integrated into a cloud-based service like Google Docs, Facebook, or Dropbox, and provide an option for file naming. The file naming feature could operate on all new files uploaded to the service or on files already stored in the service. The file names and tag information could subsequently be used to send the files to specific folders in the cloud service.
Selection of tags may include ranking of tags for subsequent file name generation. In a preferred implementation, all extracted tags are ranked. An assigned rank can depend on one or more factors such as a tag type, a document type, presence of other similar tags in the document, presence of other different tags in the document, and a tag's location in the document. One or more tags with a maximal rank are selected. A file name is formed using the selected tags. In one embodiment, an optimal file name is a combination of a group of tags. This group may include two parts. The first part is a “descriptive” part and corresponds to a document type description. The second part is a unique or semi-unique part, such as a serial number, or some text that can likely distinguish the file name from hundreds or thousands of other file names. Examples of a two-part file name are “invoice 20_march” or “Business card John Smith, ABBYY”. Several extracted tags (or parts thereof) may be combined when creating a “part” for a two-part file name. In another embodiment, a file name can include only one of the two parts from a two-part file name. For example, a file name may be “20_march” (no descriptive part) or “invoice” (no unique part). The exact parts used may be automatically determined, or may be based on configurations or preferences available to the name generation algorithms, routines, software, etc.
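The ranking and two-part naming described above can be illustrated with a short sketch. How ranks are computed is not specified here, so the numeric ranks below are made-up inputs; the names `RankedTag` and `two_part_name` and the space separator are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class RankedTag:
    type: str
    value: str
    rank: float  # higher means more suitable for the file name (assumed scale)


def two_part_name(doc_type, tags, max_unique_tags=1):
    """Combine a descriptive part (document type) with the top-ranked tag values.

    Either part may be absent: pass doc_type=None for a name without a
    descriptive part, or an empty tag list for a name without a unique part.
    """
    best = sorted(tags, key=lambda t: t.rank, reverse=True)[:max_unique_tags]
    parts = ([doc_type] if doc_type else []) + [t.value for t in best]
    return " ".join(parts)


# Hypothetical ranked tags for an invoice.
tags = [
    RankedTag("page number", "3", 0.1),
    RankedTag("date", "20_march", 0.9),
]
```

With these inputs the date tag outranks the page number, so the two-part name pairs the document type with the date, and omitting either part yields the one-part variants mentioned above.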
Returning to
Optionally, the process may involve performing a document classification 438 from the image 420. Document classification 438 is described in further detail herein. Document classification 438 yields one or more document type hypotheses 440. These document type hypotheses, either verified or non-verified, may serve to inform or affect tag extraction 426, tag preprocessing 428, and selection of tags 430. For example, if a tag for a particular image includes the text “recipe” but the document classification returns a high probability (through a document type hypothesis 440) that the image 420 is that of a letter, then during tag selection 430, the method can discard or omit the tag for “recipe” as a candidate for renaming the file (image or document) as a “recipe.”
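The “recipe” versus letter example can be sketched as a filter applied during tag selection. The function name, the representation of hypotheses as a mapping from type to probability, and the 0.8 confidence threshold are all assumptions made for illustration.

```python
def filter_type_tags(candidate_type_tags, hypotheses, threshold=0.8):
    """Drop type-like tags that contradict a confident classification.

    candidate_type_tags: tag values that look like document types.
    hypotheses: mapping of document type to probability from classification.
    """
    if not hypotheses:
        return list(candidate_type_tags)  # no classification available
    best_type, best_prob = max(hypotheses.items(), key=lambda kv: kv[1])
    if best_prob < threshold:
        return list(candidate_type_tags)  # too uncertain to overrule the tags
    return [t for t in candidate_type_tags if t == best_type]
```

When classification strongly favors “letter”, a “recipe” tag is discarded as a naming candidate; when no hypothesis is confident enough, the tags are left untouched.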
In one embodiment, the system comprises an imaging device connected to a computer programmed with specially designed OCR (ICR) software, functionality, algorithms or the like. The system is used to scan a paper-based document (source document) or to make a digital photo of it so as to produce a document image thereof. In another embodiment, such document image may be made with a digital camera (or mobile phone, smart phone, tablet computer and the like), received through a medium such as e-mail, captured from or with a software application, or obtained from an online OCR Web-based service.
Any given document may have several specific fields and form elements. For example, a document may have several titles, subtitles, headers and footers, an address, a registration number, an issue date field, a reception date field, page numbering, etc. Some of the titles may have one of several pre-defined specified values, for example: Invoice, Credit Note, Agreement, Assignment, Declaration, Curriculum Vitae, Business Card, etc. Other documents may include such identifying words as “Dear . . . ”, “Sincerely yours” or “Best regards.” The presence of these words coupled with their characteristic location on a page will often allow the system to classify the document as belonging to a particular type (e.g., personal letter, business letter).
Apart from the unique features typical of the given document type, the document may include unique values corresponding to respective unique features, for example: invoice number, credit note number, a date of the agreement, signatories to the assignment, the name of the person submitting the curriculum vitae, or the name of the business card holder, etc. In one embodiment, the OCR software compares a value with descriptions of possible types available to the software in order to generate a hypothesis about the type of the source document. Then the hypothesis is verified and the recognized text is transformed to reproduce the native formatting of the source document. After processing, recognized text may be exported into an extended editable document format, for example, Microsoft Word format, rich text format (RTF), or Tagged PDF, and may be given a unique name based on the identified document type and its unique features. For example, "Invoice_#880," "Credit Note_888," "Agreement_543," "Agreement_543_page_1," "Agreement_543_page_2," "Agreement_12.03.2009," "Curriculum Vitae Nicole Bishop," "Business Card of Ingerlei Renata," "Letter to Mr Juan Valdez," "Letter from Mr. Willy Wonka," etc.
In another embodiment, the logical structure of the document is recognized and is used to arrive at conclusions about the style and a possible name for the recognized document. For example, the system may determine whether it is a business letter, a contract, a legal document, a certificate, an application, etc. The system recognizes the document and checks how well each of the generated hypotheses corresponds to the actual properties of the document. The system evaluates each hypothesis based on a degree of correspondence between the hypothesis and the information, properties or tags extracted from the document. The hypothesis with the highest correlation with the actual properties of the document is selected.
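Hypothesis evaluation of this kind could be sketched with a simple overlap score. The scoring formula (fraction of a type's expected features found in the document) is an assumption standing in for the unspecified degree of correspondence, and the invoice feature list is invented for illustration; only the letter phrases come from the description.

```python
# Expected features per document type (illustrative, not exhaustive).
TYPE_FEATURES = {
    "business letter": {"Dear", "Sincerely yours", "Best regards"},
    "invoice": {"Invoice", "Total", "Due date"},
}


def correspondence(expected, found):
    """Fraction of a type's expected features that were found in the document."""
    return len(expected & found) / len(expected)


def select_hypothesis(found_features, type_features=TYPE_FEATURES):
    """Score every type hypothesis and return the best one with its score."""
    scores = {
        doc_type: correspondence(expected, found_features)
        for doc_type, expected in type_features.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A document containing “Dear” and “Best regards” plus a stray “Total” matches two of three letter features and one of three invoice features, so the business letter hypothesis is selected.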
In order to process a document image, in one embodiment, the system is provisioned with information about specific words which may be found and the possible mutual arrangement of form elements. As noted above, the form elements include elements such as columns (main text), headers and footers, endnotes and footnotes, an abstract (text fragment below the title), headings (together with their hierarchy and numbering), a table of contents, a list of figures, bibliography, the document's title, the numbers and captions of figures and tables, etc.
Some embodiments of the invention include integrating automatic file naming into equipment and processes including scanners, digital cameras, hard drives, flash memory drives, servers, personal computers, cloud services like email, operating systems, and internet search engines.
The hardware 800 also typically receives a number of inputs and outputs for communicating information externally. For interface with a user or operator, the hardware 800 may include one or more user input devices 806 (e.g., a keyboard, a mouse, imaging device, scanner, etc.) and one or more output devices 808 (e.g., a Liquid Crystal Display (LCD) panel, a sound playback device (speaker)).
For additional storage, the hardware 800 may also include one or more mass storage devices 810, e.g., a floppy or other removable disk drive, a hard disk drive, a Direct Access Storage Device (DASD), an optical drive (e.g. a Compact Disk (CD) drive, a Digital Versatile Disk (DVD) drive, etc.) and/or a tape drive, among others. Furthermore, the hardware 800 may include an interface with one or more networks 812 (e.g., a local area network (LAN), a wide area network (WAN), a wireless network, and/or the Internet among others) to permit the communication of information with other computers coupled to the networks. It should be appreciated that the hardware 800 typically includes suitable analog and/or digital interfaces between the processor 802 and each of the components 804, 806, 808, and 812 as is well known in the art.
The hardware 800 operates under the control of an operating system 814, and executes various computer software applications, components, programs, objects, modules, etc. to implement the techniques described above. Moreover, various applications, components, programs, objects, etc., collectively indicated by reference 816 in
In general, the routines executed to implement the embodiments of the invention may be implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions referred to as “computer programs.” The computer programs typically comprise one or more instructions set at various times in various memory and storage devices in a computer, and that, when read and executed by one or more processors in a computer, cause the computer to perform operations necessary to execute elements involving the various aspects of the invention. Moreover, while the invention has been described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally regardless of the particular type of computer-readable media used to actually effect the distribution. Examples of computer-readable media include but are not limited to recordable type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMs), Digital Versatile Disks, (DVDs), etc.).
In the previous description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown only in block diagram form in order to avoid obscuring the invention.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative and not restrictive of the broad invention and that this invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modifiable in arrangement and detail as facilitated by enabling technological advancements without departing from the principles of the present disclosure.
For purposes of the USPTO extra-statutory requirements, the present application constitutes a continuation-in-part of U.S. patent application Ser. No. 13/662,044 filed on 26 Oct. 2012 initially titled “Automated File Name Generation,” which is a continuation-in-part of U.S. patent application Ser. No. 12/749,525 filed on 30 Mar. 2010 initially titled “Automatic File Name Generation In OCR Systems,” which (in turn) is a continuation-in-part of U.S. patent application Ser. No. 12/236,054 titled “Model-Based Method of Document Logical Structure Recognition in OCR Systems” that was filed on 23 Sep. 2008, which is currently co-pending, or is an application of which a currently co-pending application is entitled to the benefit of the filing date. Patent application Ser. No. 12/236,054 claims the benefit of priority to U.S. 60/976,348 which was filed on 28 Sep. 2007. The United States Patent Office (USPTO) has published a notice effectively stating that the USPTO's computer programs require that patent applicants reference both a serial number and indicate whether an application is a continuation or continuation-in-part. See Stephen G. Kunin, Benefit of Prior-Filed Application, USPTO Official Gazette 18 Mar. 2003. The present Applicant Entity (hereinafter “Applicant”) has provided above a specific reference to the application(s) from which priority is being claimed as recited by statute. Applicant understands that the statute is unambiguous in its specific reference language and does not require either a serial number or any characterization, such as “continuation” or “continuation-in-part,” for claiming priority to U.S. patent applications. 
Notwithstanding the foregoing, Applicant understands that the USPTO's computer programs have certain data entry requirements, and hence Applicant is designating the present application as a continuation-in-part of its parent applications as set forth above, but expressly points out that such designations are not to be construed in any way as any type of commentary and/or admission as to whether or not the present application contains any new matter in addition to the matter of its parent application(s). All subject matter of the Related Applications and of any and all parent, grandparent, great-grandparent, etc. applications of the Related Applications is incorporated herein by reference to the extent such subject matter is not inconsistent herewith.
 | Number | Date | Country
---|---|---|---
Parent | 13662044 | Oct 2012 | US
Child | 13712962 | | US