This application is based on and claims priority under 35 U.S.C. §119 from Japanese Patent Application No. 2007-118957 filed Apr. 27, 2007.
The invention relates to a computer-readable medium storing a document processing program, a document processing apparatus and a document processing system.
According to an aspect of the invention, a computer-readable medium stores a program causing a computer to execute document processing. The document processing includes: acquiring document data including one or more pieces of attribute information; and acquiring attribute extraction information of each attribute information. Each attribute extraction information includes (i) extraction method information indicating an extraction method for extracting the corresponding attribute information from the document data, and (ii) position information that indicates a position of the corresponding attribute information in the document data, and corresponds to the extraction method indicated by the extraction method information for the corresponding attribute information. The document processing further includes registering attribute information that is extracted from the document data based on the attribute extraction information, as the attribute information of the document data.
Exemplary embodiments of the invention will be described in detail below with reference to the accompanying drawings, wherein:
The “attribute information” included in a document means information for classifying a plurality of documents and for easily retrieving a specific document from the plurality of documents. For example, the attribute information may be a date, a place, a person's name and the like. Also, one document may include plural pieces of attribute information. Appellations such as ‘date,’ ‘place,’ and ‘person's name’, which are used to distinguish the respective pieces of attribute information from each other, may be called “attribute names”. For example, if “Mar. 1, 2007” is written in a document, the date “Mar. 1, 2007” is the attribute information corresponding to the attribute name “date” of the document. Furthermore, a “document” may have any desired contents. That is, a document may include, for example, any of a deed of contract, specifications, drawings, tables, illustrations and pictures.
The attribute instruction sheet describes attribute extraction information, each piece of which is used to extract corresponding attribute information from a document. Each “attribute extraction information” includes (i) extraction method information indicating an extraction method for extracting corresponding attribute information from document data, and (ii) position information that indicates a position of the corresponding attribute information in the document data and corresponds to the extraction method indicated by the extraction method information for the corresponding attribute information. The extraction method may be selected from a plurality of methods, and in such a case, the attribute extraction information may include selection information that indicates the one extraction method selected from among the plurality of methods.
The “extraction method” designates a method for specifying a position where attribute information is written in a document. For example, the extraction method may be a coordinate designation method that specifies a rectangular area containing attribute information using (i) X and Y coordinates of the upper left point of the rectangle, with the upper left point of the document being defined as the origin point, and (ii) a width and a height indicating the X-direction length and the Y-direction length, each starting from the upper left point of the rectangle.
Further, the “position information” corresponding to the extraction method is information that designates a position, an area, a page and the like where the attribute information included in a document is written in the document. In the case of the coordinate designation method described above, the X and Y coordinates, the width and the height correspond to the position information.
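As a minimal sketch of how attribute extraction information for the coordinate designation method might be represented in software, the following Python fragment pairs an attribute name and an extraction method with rectangular position information. The class names, field names, the method label and the coordinate values are assumptions made for illustration only; none of them appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectPosition:
    """Position information for the coordinate designation method."""
    x: int       # X coordinate of the rectangle's upper left point
    y: int       # Y coordinate of the rectangle's upper left point
    width: int   # X-direction length from the upper left point
    height: int  # Y-direction length from the upper left point

    def lower_right(self):
        """Return the corner diagonally opposite the upper left point."""
        return (self.x + self.width, self.y + self.height)


@dataclass(frozen=True)
class AttributeExtractionInfo:
    attribute_name: str      # e.g. "date"
    extraction_method: str   # hypothetical label for the selected method
    position: RectPosition   # position information matching the method


# Made-up coordinates, for illustration only.
date_info = AttributeExtractionInfo(
    attribute_name="date",
    extraction_method="coordinate designation",
    position=RectPosition(x=120, y=40, width=200, height=24),
)
print(date_info.position.lower_right())
```

Because the origin is the upper left point of the document, the lower right corner of the rectangle is obtained by simply adding the width and the height to the X and Y coordinates.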
The network 10 is a local area network such as a wired LAN and/or a wireless LAN. It may also be a network connected to the Internet.
Each of the scanners 2A, 2B includes a reading unit that optically reads originals of documents and attribute instruction sheets as image data using a photoelectric converting device, and a transmitting unit that transmits the image data to the document processing server 3A via the network 10. Although
The computing unit 30 functions as an acquiring unit 300, an extracting unit 301 and a registering unit 302 by operating in accordance with the document processing program 310 and the first to fourth attribute extraction programs 311A to 311D, which are stored in the storage device 31.
The acquiring unit 300 acquires document data including attribute information from the scanners 2A, 2B, and receives attribute-instruction-sheet data including attribute extraction information for extracting the attribute information from the document data. The acquiring unit 300 executes a character recognition process so as to acquire, from the attribute-instruction-sheet data, the attribute extraction information for extracting the attribute information. The character recognition process includes: extracting a character pattern in an area that is determined in advance, based on the attribute-instruction-sheet data; comparing the character pattern with a character recognition dictionary by a pattern matching method or the like; and determining the candidate having the highest similarity as the recognition result.
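The pattern matching step of such a character recognition process can be illustrated with a toy example. The sketch below assumes 3x3 binary glyphs, a three-entry dictionary and a similarity measure equal to the fraction of agreeing pixels; all of these are hypothetical simplifications and are not taken from the specification.

```python
from typing import Dict, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (ink) and '.' (background)

# Toy "character recognition dictionary" with 3x3 reference patterns.
DICTIONARY: Dict[str, Glyph] = {
    "I": (".#.", ".#.", ".#."),
    "L": ("#..", "#..", "###"),
    "T": ("###", ".#.", ".#."),
}


def similarity(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels on which two equally sized glyphs agree."""
    pairs = [(pa, pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(pa == pb for pa, pb in pairs) / len(pairs)


def recognize(pattern: Glyph) -> str:
    """Return the dictionary character with the highest similarity."""
    return max(DICTIONARY, key=lambda ch: similarity(pattern, DICTIONARY[ch]))


noisy_t = ("###", ".#.", "##.")  # a "T" with one stray pixel
print(recognize(noisy_t))
```

A practical recognizer would normalize the size and position of each character pattern before comparison; the sketch omits this to keep the selection of the highest-similarity candidate visible.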
The extracting unit 301 selects, from among the first to fourth attribute extraction programs 311A to 311D, an attribute extraction program corresponding to the extraction method included in the attribute extraction information acquired by the acquiring unit 300. The extracting unit 301 extracts attribute information from the document data by sending document data and position information to the selected attribute extraction program and receiving an attribute extraction result obtained by the attribute extraction program.
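The selection performed by the extracting unit can be sketched as a lookup table from extraction method to extraction program. In the following Python fragment, the two stand-in programs operate on plain text rather than on image data, and the method labels `"span"` and `"line"` and the dictionary-shaped position information are assumptions chosen only to keep the example short.

```python
from typing import Callable, Dict, List

# An extraction program receives document data and position information
# and returns the extracted attribute values.
ExtractionProgram = Callable[[str, dict], List[str]]


def extract_span(document: str, position: dict) -> List[str]:
    # Stand-in program: position information is a character span.
    start = position["start"]
    return [document[start:start + position["length"]]]


def extract_line(document: str, position: dict) -> List[str]:
    # Stand-in program: position information is a line index.
    return [document.splitlines()[position["line"]]]


PROGRAMS: Dict[str, ExtractionProgram] = {
    "span": extract_span,
    "line": extract_line,
}


def extract(document: str, method: str, position: dict) -> List[str]:
    """Select the program for the given method and run it."""
    if method not in PROGRAMS:
        raise ValueError(f"unknown extraction method: {method!r}")
    return PROGRAMS[method](document, position)


document = "CONTRACT OF SALE\nEffective date: Jun. 7, 2005"
title = extract(document, "span", {"start": 0, "length": 16})
date_line = extract(document, "line", {"line": 1})
```

With this arrangement, supporting a further extraction method only requires registering another entry in the table; the selection logic itself does not change.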
The registering unit 302 generates the attribute-containing document data 312 to which the attribute information extracted by the extracting unit 301 from the document data is attached as attribute information of the document data, and registers the generated attribute-containing document data 312 in the storage device 31. The registering unit 302 may register the document data and the extracted attribute information, in association with each other, in a database which manages plural pieces of document data. The registering unit 302 may register, in the storage device 31, the attribute-containing document data 312 in a certain file format that application software such as word-processing software can edit.
The first to fourth attribute extraction programs 311A to 311D are programs to extract attribute information by receiving document data and position information via the extracting unit 301 and by executing the character recognition for the document data based on the position information.
The first attribute extraction program 311A is a program to execute the character recognition for an area that is in a document and that is designated by the coordinate designation method, that is, an area designated by the four parameters, i.e. X coordinate, Y coordinate, width and height.
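As an illustration of how the four parameters delimit the area passed to character recognition, the sketch below crops a rectangle from an image held as a list of pixel rows. The image representation, the function name `crop` and the sample values are assumptions; the character recognition itself is not shown.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values, indexed as image[y][x]


def crop(image: Image, x: int, y: int, width: int, height: int) -> Image:
    """Return the width-by-height area whose upper left point is (x, y)."""
    if x < 0 or y < 0 or width <= 0 or height <= 0:
        raise ValueError("coordinates must be non-negative and sizes positive")
    if y + height > len(image) or x + width > len(image[0]):
        raise ValueError("rectangle extends beyond the page")
    return [row[x:x + width] for row in image[y:y + height]]


# A 5-pixel-wide, 4-pixel-high page whose pixel value encodes its position.
page = [[10 * r + c for c in range(5)] for r in range(4)]
region = crop(page, x=1, y=2, width=3, height=2)
```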
The second attribute extraction program 311B is a program to implement an invisible-pen mark method for executing character recognition for an area that is in a document and that is marked with an invisible pen whose ink is invisible to the human eye but appears in image data read by the scanners 2A, 2B. The marking may be made to surround a character string to be extracted, underline the character string to be extracted, or trace the character string to be extracted. It should be noted that the marking is not limited to these examples.
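One way to turn such a marking into an area for character recognition is to take the bounding rectangle of the marked pixels. The sketch below assumes that the marking has already been separated from the page into a binary mask (1 for marked pixels); how that separation is done is not described here, and the function name and mask values are hypothetical.

```python
from typing import List, Optional, Tuple


def bounding_box(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) enclosing all marked pixels, or None."""
    marked = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not marked:
        return None
    xs = [x for x, _ in marked]
    ys = [y for _, y in marked]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


# A marking that surrounds a one-pixel "character string".
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
]
box = bounding_box(mask)
```

The resulting rectangle has the same four-parameter form as the position information of the coordinate designation method, so the same cropping step could be reused. For an underline marking, the area above the rectangle rather than the rectangle itself would need to be examined.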
The third attribute extraction program 311C is a program to execute the character recognition process for an area that is sandwiched between (i) a start keyword representing a separator provided at the head of a character string to be extracted, such as (, ┌, {, and (ii) an end keyword representing a separator provided at the end of the character string to be extracted, such as ), ┘, }. Each of the start keyword and the end keyword may be a character string of two or more characters.
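The keyword designation can be sketched on text that has already been recognized. The following Python fragment assumes the page text is available as a string and uses a regular expression to collect every character string sandwiched between the start keyword and the end keyword; the function name and the sample text are illustrative assumptions, and the specification itself describes performing character recognition on the sandwiched area rather than searching recognized text.

```python
import re
from typing import List


def extract_between(text: str, start_keyword: str, end_keyword: str) -> List[str]:
    """Return every string found between the start and end keywords."""
    # re.escape lets separators such as "(" or "{" be used literally,
    # and also supports keywords of two or more characters.
    pattern = re.escape(start_keyword) + r"(.*?)" + re.escape(end_keyword)
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]


text = (
    "Article 1 (designation of goods) The seller shall deliver the goods. "
    "Article 2 (unit price and total trading value) The price shall be paid."
)
article_names = extract_between(text, "(", ")")
```

The non-greedy group `(.*?)` stops at the first end keyword after each start keyword, so several sandwiched strings on one page are returned separately.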
The fourth attribute extraction program 311D is a program to extract a page, to which a sticky note is attached, from a document having a plurality of pages, according to whether or not the page has a protruding part (a part corresponding to the attached sticky note), and to execute the character recognition process for the entire extracted page. Position information is designated by a sticky-note ID indicating the number of attached sticky notes.
The attribute extraction program is not limited to the four programs. The attribute extraction program may be another attribute extraction program employing another extraction method, or may be selected from among more than four attribute extraction programs. Furthermore, the attribute extraction program may also be selected from two or three attribute extraction programs.
Next, an example of the operation of the document processing system 1A according to the first exemplary embodiment will be described with reference to
The attribute instruction sheet 11 includes: a plurality of attribute name entry boxes 110A to 110E in which the plurality of attribute names are entered; check boxes 111 used to indicate an extraction method selected from among the four extraction methods, that is, the coordinate designation method, the invisible-pen mark method, the keyword designation method and the sticky note designation method, for designating position information indicating attribute information corresponding to the attribute name entered in the attribute name entry boxes 110A to 110E; and a plurality of underlines 112 on which the position information corresponding to the selected extraction method is written.
The document 12 includes a title 120 of the document, a plurality of articles 121A to 121C relating to this contract, effective date 122 of this contract, and address 123 and name 124 of a seller defined as A in the contract.
An explanation will be given about the case where the title 120, the articles 121A to 121C, the effective date 122, the A's address 123 and the A's name 124 are extracted as attribute information of the document 12, and these pieces of extracted attribute information are registered as the attribute information of the document. The number of pieces of attribute information may be one or plural.
First, a user writes necessary items in the attribute instruction sheet 11. Namely, in order to extract the title 120 as attribute information, the user writes “title” in the attribute name entry box 110A of the attribute instruction sheet 11 as shown in
Next, in order to extract the article names 121A to 121C as attribute information, the user writes “article name” in the attribute name entry box 110B of the attribute instruction sheet 11 as shown in
Next, in order to extract the effective date 122, A's address 123 and A's name 124 as attribute information, the user writes “effective date”, “A's name” and “A's address,” respectively, in the attribute name entry boxes 110E, 110C and 110D of the attribute instruction sheet as shown in
Furthermore, as shown in
Here, the values entered in the mark IDs 115A to 115C of the attribute instruction sheet shown in
Next, the user reads the completed attribute instruction sheet 11 and the document 12 shown in
The scanner 2A generates attribute-instruction-sheet data and document data which are, for example, formed of bitmap data from the read-out attribute instruction sheet 11 and the read-out document 12. The scanner 2A transmits the document data and the attribute-instruction-sheet data to the document processing server 3A via the network 10.
In the document processing server 3A, upon receiving the document data and the attribute-instruction-sheet data from the scanner 2A, the acquiring unit 300 executes the character recognition process for the attribute-instruction-sheet data to acquire attribute extraction information (S1).
Next, the extracting unit 301 selects, from among the attribute extraction programs 311A to 311D, an attribute extraction program that corresponds to an extraction method of the attribute extraction information acquired by the acquiring unit 300 (S2). For example, in the attribute instruction sheet 11 shown in
Next, the document data and position information are transmitted to the selected attribute extraction programs (S3). For example, integers of the X coordinate 113A, the Y coordinate 113B, the width 113C and the height 113D, which are written in the attribute instruction sheet 11, are transmitted as the position information to the first attribute extraction program 311A, which corresponds to the attribute name “title”. The document data 12 in which the first to third markings 125A to 125C and the round marks 126 are written is transmitted as the position information to the second attribute extraction program 311B, which corresponds to the attribute names “effective date”, “A's address” and “A's name”. Furthermore, the character strings of the start keyword 114A and the end keyword 114B, which are written in the attribute instruction sheet 11, are transmitted as the position information to the third attribute extraction program 311C, which corresponds to the attribute name “article name”.
The selected first to third attribute extraction programs 311A to 311C each operate to extract an area corresponding to the position information from the document data, and execute the character recognition for the extracted area to extract the attribute information. For example, the first attribute extraction program 311A executes the character recognition for an area of the document data designated by the X coordinate 113A, the Y coordinate 113B, the width 113C and the height 113D, and extracts a character string of “contract of sale of goods”. The second attribute extraction program 311B extracts areas in which the respective first to third markings 125A to 125C are written, and executes the character recognition for the respective extracted areas to extract character strings of “Jun. 7, 2005”, “1-2-3, X-cho, X-ku, Tokyo” and “Taro X” as well as the numbers of round marks 126 for the respective character strings. Also, the third attribute extraction program 311C searches for an area surrounded by the start keyword 114A and the end keyword 114B, and executes the character recognition for the found area to extract character strings of “designation of goods”, “unit price and total trading value” and “agreed jurisdiction”.
Next, the extracting unit 301 receives the attribute information extracted from the document data by the selected attribute extraction program (S4). For example, the extracting unit 301 receives, from the first attribute extraction program 311A, the character string “contract of sale of goods” as the attribute information of the attribute name “title”. Also, the extracting unit 301 receives, from the second attribute extraction program 311B, the character strings of “Jun. 7, 2005”, “1-2-3, X-cho, X-ku, Tokyo” and “Taro X” as well as the numbers of round marks 126 corresponding to the respective character strings, and associates these character strings with the attribute names “effective date”, “A's address” and “A's name” such that the integer entered as each of the mark IDs 115A to 115C is identical to the number of round marks 126 of the associated character string. Also, the extracting unit 301 receives, from the third attribute extraction program 311C, the character strings “designation of goods”, “unit price and total trading value” and “agreed jurisdiction” as the attribute information of the attribute name “article name”.
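The association between mark IDs and round-mark counts can be sketched as a simple lookup. In the Python fragment below, the function name and the specific mark ID values (1, 2 and 3) are assumptions made for illustration; only the extracted character strings come from the example above.

```python
from typing import Dict, List, Tuple


def assign_by_mark_id(
    mark_ids: Dict[str, int],
    extracted: List[Tuple[str, int]],
) -> Dict[str, str]:
    """Map each attribute name to the string whose round-mark count
    equals the mark ID entered for that attribute name."""
    by_count = {count: text for text, count in extracted}
    return {
        name: by_count[mark_id]
        for name, mark_id in mark_ids.items()
        if mark_id in by_count
    }


# Mark IDs as they might be entered on the attribute instruction sheet
# (hypothetical values).
mark_ids = {"effective date": 1, "A's address": 2, "A's name": 3}

# (character string, number of round marks) pairs, in arbitrary order.
extracted = [
    ("Taro X", 3),
    ("Jun. 7, 2005", 1),
    ("1-2-3, X-cho, X-ku, Tokyo", 2),
]
attributes = assign_by_mark_id(mark_ids, extracted)
```

Because the pairing relies only on the counts, the order in which the marked areas are found in the document does not matter.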
Next, the registering unit 302 generates attribute-containing document data 312 to which plural pieces of attribute information extracted from the document data by the extracting unit 301 are added as attributes of the document data. For example, the registering unit 302 adds, to the document data, (i) the attribute information “contract of sale of goods” for the attribute name “title”, (ii) the attribute information “Taro X” for the attribute name “A's name”, (iii) the attribute information “1-2-3, X-cho, X-ku, Tokyo” for the attribute name “A's address”, (iv) the attribute information “Jun. 7, 2005” for the attribute name “effective date”, and (v) the attribute information “designation of goods”, “unit price and total trading value” and “agreed jurisdiction” for the attribute name “article name”. Then, the registering unit 302 registers the generated attribute-containing document data 312 in the storage device 31 (S5).
Thereafter, the user inputs a search key via the input unit 33 of the document processing server 3A, for example, attribute information alone, or an attribute name together with the attribute information corresponding to the attribute name, and browses the attribute-containing document data 312 corresponding to the search key via the display unit 34.
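As one possible concrete form of attribute-containing document data, the sketch below attaches the attribute information of this example to a reference to the document data and serializes the result as JSON. The record layout, the file name `contract.png` and the choice of JSON are all assumptions; the specification does not prescribe a storage format.

```python
import json
from typing import Dict, List


def build_record(document_file: str, attributes: Dict[str, List[str]]) -> str:
    """Serialize a document reference together with its attributes."""
    record = {"document": document_file, "attributes": attributes}
    return json.dumps(record, ensure_ascii=False, indent=2)


record_json = build_record(
    "contract.png",  # hypothetical file holding the scanned document data
    {
        "title": ["contract of sale of goods"],
        "A's name": ["Taro X"],
        "A's address": ["1-2-3, X-cho, X-ku, Tokyo"],
        "effective date": ["Jun. 7, 2005"],
        "article name": [
            "designation of goods",
            "unit price and total trading value",
            "agreed jurisdiction",
        ],
    },
)
```

Each attribute name maps to a list so that an attribute such as “article name” can carry several pieces of attribute information.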
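A search of this kind can be sketched as a filter over registered records. The fragment below assumes each record is a dictionary with a document reference and a mapping from attribute name to a list of values; the record layout, file names and function name are hypothetical.

```python
from typing import Dict, List, Optional


def search(
    records: List[dict],
    value: str,
    attribute_name: Optional[str] = None,
) -> List[dict]:
    """Return records holding `value`, either under the given attribute
    name or, when no name is given, under any attribute name."""
    hits = []
    for record in records:
        attributes: Dict[str, List[str]] = record["attributes"]
        if attribute_name is None:
            candidates = [v for values in attributes.values() for v in values]
        else:
            candidates = attributes.get(attribute_name, [])
        if value in candidates:
            hits.append(record)
    return hits


records = [
    {"document": "contract.png",
     "attributes": {"title": ["contract of sale of goods"],
                    "A's name": ["Taro X"]}},
    {"document": "memo.png",
     "attributes": {"title": ["memo"], "author": ["Taro X"]}},
]
by_value = search(records, "Taro X")
by_name_and_value = search(records, "Taro X", attribute_name="A's name")
```

Searching by attribute information alone returns both records, while adding the attribute name narrows the result to the contract.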
As compared with the document processing server 3A of the first exemplary embodiment, the document processing server 3B is different in that the acquiring unit 300 receives attribute extraction information from the terminal 4 via the network 10. The remaining configuration is the same.
In addition to the input unit and the display unit, the terminal 4 includes a CPU for controlling the terminal 4; a storage unit having ROM, RAM and/or a hard disk for storing an attribute-extraction-information input program for inputting and editing attribute extraction information, to be executed by the CPU, as well as various kinds of data; and a communication unit (for example, a network interface card) connected to the network 10. The terminal 4 is, for example, a personal computer (PC) or a personal digital assistant (PDA).
Next, an example of an operation of the document processing system 1B according to the second exemplary embodiment will be described with reference to
A user executes the attribute-extraction-information input program on the terminal 4, and displays the attribute-instruction-sheet input screen 13 on the display unit of the terminal 4. Then, the user inputs an attribute name in a text box 130 on the attribute-instruction-sheet input screen 13, designates an extraction method corresponding to the input attribute name by checking a check box 131, and inputs position information corresponding to the extraction method in an integer input box 132 and a character string input box 133.
Next, when the user inputs attribute extraction information and presses an “OK” button 134A, the terminal 4 transmits the input attribute extraction information to the document processing server 3B via the network 10. If the user presses a “cancel” button 134B, the terminal 4 interrupts the input of the attribute extraction information.
Furthermore, when the user reads, with the scanner 2, a document from which attribute information is to be extracted according to the attribute extraction information, the scanner 2 transmits the read document data to the document processing server 3B via the network 10.
The document processing server 3B receives the attribute extraction information from the terminal 4, receives the document data from the scanner 2, and transmits the document data and the attribute extraction information to the acquiring unit 300.
Thereafter, in the same manner as in the first exemplary embodiment, attribute information is extracted, attribute-containing document data 312 is generated, and the generated attribute-containing document data 312 is registered in the storage device 31.
As compared with the document processing server 3B of the second exemplary embodiment, the document processing server 3C is different only in that the registering unit 302 registers the attribute-containing document data 312 in the storage unit of the document storage server 5 via the network 10. The remaining configuration is the same.
As compared with the terminal 4 of the second exemplary embodiment, the terminal 4 of this exemplary embodiment is different only in that the attribute-containing document data 312 stored in the document storage server 5 is searched and browsed via the network 10. The remaining configuration is the same.
In addition to the storage unit and the communication unit, the document storage server 5 includes: a CPU for controlling respective portions of the document storage server 5; an input unit having a keyboard and a mouse, each for accepting data input and operational instructions; and a display unit having an LCD (liquid crystal display) for displaying input screens thereon. The document storage server 5 may be a personal computer (PC), a workstation (WS) or the like, in place of a server.
The CPU 60 operates according to the document processing program 610 and the first to fourth attribute extraction programs 611A to 611D, which are stored in the storage device 61, so as to function as an acquiring unit 600, an extracting unit 601 and a registering unit 602 in the same manner as the document processing server 3A in the first exemplary embodiment.
Next, a description will be made of an example of an operation of the document processing system 1D according to the fourth exemplary embodiment.
First, a completed attribute instruction sheet 11 and a document 12, which are the same as those in the first exemplary embodiment, are read out by a user with the reading unit 62 of the multifunction device 6. Instead of reading out the completed attribute instruction sheet 11, the user may input attribute extraction information in an attribute-instruction-sheet input screen 13 displayed on the display unit of the terminal 4 or the operation display unit 64 of the multifunction device 6.
The multifunction device 6 transmits, to the acquiring unit 600, the document data and the attribute-instruction-sheet data read out by the reading unit 62.
Next, the acquiring unit 600 performs the character recognition process for the attribute-instruction-sheet data to acquire attribute extraction information for extracting attribute information from the document data.
Next, the extracting unit 601 selects, from among the first to fourth attribute extraction programs 611A to 611D, an attribute extraction program corresponding to an extraction method designated by the attribute extraction information acquired by the acquiring unit 600.
Subsequently, the extracting unit 601 transmits the document data and position information to the selected attribute extraction program, and receives attribute information extracted from the document data by the selected attribute extraction program.
Next, the registering unit 602 generates attribute-containing document data 612 to which the attribute information is attached as attributes of the document data, and registers the generated attribute-containing document data 612 in the storage device 61.
Thereafter, using the attribute information or the attribute name and other attribute information corresponding thereto as a search key, the user searches for document data through the terminal 4, and browses the attribute-containing document data 612 corresponding to the search key. Alternatively, the operation display unit 64 of the multifunction device 6 may be used for search and browsing.
The invention is not limited to the foregoing exemplary embodiments, and may be modified without departing from the scope of the invention. For example, in the first to third exemplary embodiments, the document processing servers 3A to 3C receive the document data and the attribute-instruction-sheet data read out by the scanners 2A, 2B via the network 10. However, the document processing servers 3A to 3C may receive image data via a telephone line network 14, or may receive a part of the image data via the network 10 and then the remainder of the image data via the telephone line network 14.
Furthermore, in each of the foregoing exemplary embodiments, the acquiring unit, the extracting unit and the registering unit of the document processing servers 3A to 3C and of the multifunction device 6 are implemented by the computing unit or CPU together with the document processing program and the attribute extraction programs. However, a part or all of them may be implemented by hardware such as an application specific integrated circuit (ASIC).
The document processing program used in each of the foregoing exemplary embodiments may be read from a storage medium such as a CD-ROM into the storage unit within the apparatus, or may be downloaded from a server connected to a network such as the Internet into the storage unit of the apparatus.
Furthermore, the document processing program used in each of the foregoing exemplary embodiments may include some or all of the first to fourth attribute extraction programs 311A to 311D.
Still further, the component elements of the foregoing exemplary embodiments may be optionally combined without departing from the scope of the invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2007-118957 | Apr 2007 | JP | national |