The present invention relates to the extraction of information from documents, and in particular to systems that aid a user to extract information from documents that they are reading.
It has been demonstrated that when a reader reads a document they take in the information in the document more effectively if they read interactively. This includes marking the document as it is read, for example by underlining relevant words or passages or highlighting them in other ways. This also means that the marked document, when referred to again, will be easier to read as the words or passages of interest will be highlighted.
The present invention therefore provides a system for assisting a user in extracting information from a document set including at least one original document having content, the system comprising: a pen arranged to be moved over a representation of the original document to define pen strokes, a recording system arranged to record the position of the pen strokes on the representation, and a processor arranged to interpret the pen strokes as identifying selected parts of the content and to produce a reference document relating to the document set, the content of the reference document being dependent on the selected content.
The processor can comprise any suitable processing system, and may comprise a number of processing units arranged to operate together to process the pen strokes. The pen may be arranged to mark the representation of the original document, or it may comprise a simple pointing device such as a stylus. It may comprise part of a more complex system, for example being a light pen.
The content can be in any of a number of forms. For example, it may comprise text, images, drawings, or tables of figures or symbols.
The reference document may be human readable, either directly or by being representable or reproducible in a human readable form. For example the reference document may be an electronic document that can be displayed on screen or printed, or it may be a hard copy document.
The representation may comprise a hard copy of the document, or it may comprise a display of the document, for example on a display screen.
The reference document may include a copy of the document set with additional content, or links to additional content, or an index or summary added to aid re-reading of the document. Alternatively it may comprise a separate document, such as a summary or index of the original document set.
The processor may be arranged to search for other documents using a search strategy determined by the selected content, and to include the other documents in the set. In this case the reference document may simply identify the documents in the set, or it may include an indication of the relevance of at least one of the documents in the set.
The system may be arranged for use by a single user, or it may be arranged to identify a plurality of users, and to produce one reference document for each of the users, using pen strokes made by the respective user. The system may be arranged to identify each user on the basis of the identity of the pen that made the pen strokes, or by other methods such as the use of user names.
The present invention further provides a system for extracting information from a document set including at least one original document having content, the system comprising: a position determining means arranged to receive data defining the position of pen strokes made on a representation of the document by a pen, and processing means arranged to interpret the pen strokes as identifying selected parts of the text on the document and to produce a reference document relating to the document set, the content of the reference document being dependent on the selected content.
The present invention further provides a system for assisting a user in extracting information from an original document having content, the system comprising a manually operable selecting device, operable in conjunction with a representation of the original document to select portions of the content, and a processor arranged to produce a reference document relating to the original document, the content of the reference document being dependent on the selected portions.
The selecting device may be a hand held device. It may also be arranged to be placed in contact with, or close to, the representation in order to select the content. In this case the selecting device may be arranged either to make marks on the representation or simply to move over it. Alternatively the selecting device may be arranged to interact with the representation in some other way, for example by directing a light beam at the representation such that the light beam can be detected. Where the representation is a display, for example on a display screen, the selecting device may be arranged to operate by moving a cursor or other highlighting or selecting device on the screen.
The present invention further provides corresponding methods, and also a data carrier carrying data arranged to control relevant systems to operate as a system according to the invention and to perform the methods of the invention. The data carrier can comprise, for example, a floppy disk, a CDROM, a DVD ROM/RAM (including +RW, −RW), a hard drive, a non-volatile memory, any form of magneto optical disk, a wire, a transmitted signal (which may comprise an internet download, an ftp transfer, or the like), or any other form of computer readable medium.
Preferred embodiments of the present invention will now be described by way of example only with reference to the accompanying drawings.
Referring to
The position identifying pattern 6 can be detected by a sensing system mounted in a pen, as will be described below, so that the position of marks made on the document 2 by the pen can be detected.
Referring to
In order to produce the printed documents 2 the processor 210 retrieves an electronic document 218 from the memory 216 and sends it to the printer driver 214. The printer driver 214 allocates a unique document identification code to the document to be printed and requests the required pattern area from the pattern allocation module 212, which communicates the details of the pattern including the positions of all the required dots, back to the printer driver 214. The printer driver 214 then adds the pattern 6 to the electronic document to form an image which includes the pattern 6 and the content 4, converts the document including the pattern 6 to a format suitable for the printer 202, and sends it to the printer 202 which prints the document 2 including the pattern area 6. The exact position of the text on the printed document can change each time the document is printed out. The pattern allocation module 212 therefore stores details of each printed instance of the document including the position on the printed document of all of the content features of the document.
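By way of illustration only, the record keeping performed by the pattern allocation module 212 might be sketched as follows. The class names, the one-dimensional allocation of pattern space, and the `locate` helper are assumptions made for this sketch and are not taken from the description; the sketch shows only that each printed instance receives its own identification code, pattern area and stored layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (left, top, right, bottom) of a content feature on the printed page
Box = Tuple[float, float, float, float]


@dataclass
class PrintedInstance:
    document_id: int          # unique code allocated for this print-out
    x_offset: float           # start of the allocated area in pattern space
    layout: Dict[str, Box]    # position of each content feature on the page


class PatternAllocator:
    """Hypothetical allocator: pages are laid side by side in pattern space."""

    def __init__(self, page_width: float) -> None:
        self.page_width = page_width
        self.instances: List[PrintedInstance] = []

    def allocate(self, layout: Dict[str, Box]) -> PrintedInstance:
        # Every print-out gets a fresh area, because the position of the
        # content can change each time the document is printed.
        document_id = len(self.instances) + 1
        instance = PrintedInstance(
            document_id, (document_id - 1) * self.page_width, dict(layout)
        )
        self.instances.append(instance)
        return instance

    def locate(
        self, x: float, y: float
    ) -> Optional[Tuple[PrintedInstance, float, float]]:
        # Map a pattern-space coordinate back to a printed instance and a
        # position on that page; None if the area was never allocated.
        if x < 0:
            return None
        index = int(x // self.page_width)
        if index >= len(self.instances):
            return None
        instance = self.instances[index]
        return instance, x - instance.x_offset, y
```

In such a sketch, a pen stroke reported at a given pattern-space coordinate can be resolved to the printed instance on which it was made by calling `locate`.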
In practice the various components of the system can be spread out over a local network or the internet. For example the pattern allocation module 212 can be provided on a separate internet connected server so that it can be accessed by a number of users.
Referring still to
Referring back to
In use, a user creates one or more documents 218 in electronic form using the application 226, and these are stored in the PC's memory 216. These electronic documents 218 include definitions of written content 4, and may also comprise definitions of other forms of content such as drawings. The documents 218 can be displayed on the screen 204 of the PC and read directly from the screen. However, in this case, a hard copy of the document 2 is printed together with the position identifying pattern 6 as described above. When printing, the printer driver 214 identifies the layout of the printed document, and communicates that layout information to the pattern allocation module 212.
As the user reads the document 2, he can mark it in various ways using the tip of the pen nib 231 to select or highlight various parts of the text. These might be individual words, passages, sentences, paragraphs or sections. In the example shown in
It will be appreciated that the pen strokes can be made in a number of different ways depending on the nature of the pen. For example the pen could be arranged to act as a highlighter pen so that simply passing it over a word or part of the content would select that word or part.
As the marks 20, 22, 24, 26 are made, the pen 201 identifies the position and shape of the marks in pattern space and records this information as pen stroke data. When the document 2 has been read and marked by the user, the pen 201 is arranged to transmit the pen stroke data defining the marks 20, 22, 24, 26 to the PC. The transmitting of the data can be initiated in a number of ways, for example by marking a specific area of the document 2 that can be recognised by the pen 201 as a ‘send box’ causing the transmission of the data, or by making a mark of a particular shape that is recognised by the pen as an instruction to transmit the data.
When the PC receives the pen stroke data, the pattern allocation module 212 determines from the position in pattern space of the marks 20, 22, 24, 26, which document they have been made on, in this case the document 2, and the position on that document in which the marks have been made. The application 226 then retrieves the electronic copy 218 of the document 2 from the memory 216, and the definition stored in the pattern allocation module 212 of the printed document. This definition includes data defining all of the text and other content on the document and its position on the document. By combining the content data and the pen stroke data, the application 226 can determine which words, phrases, sentences, paragraphs or passages, or which drawings, diagrams or tables, of the document 2 have been highlighted, and in what manner.
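The combination of content data and pen stroke data can be sketched, purely as an example, by comparing the bounding box of each stroke with the stored bounding box of each word. The `Word` type, the three mark kinds, the closure threshold of 0.25 and the tolerance of 3.0 page units are all assumptions chosen for this sketch; the description does not specify how marks are classified.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) on the page, y increasing downwards


@dataclass
class Word:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def stroke_bounds(stroke: List[Point]) -> Tuple[float, float, float, float]:
    xs = [x for x, _ in stroke]
    ys = [y for _, y in stroke]
    return min(xs), min(ys), max(xs), max(ys)


def classify_stroke(stroke: List[Point]) -> str:
    # Assumed heuristic: a stroke that ends near where it began is a circle;
    # otherwise a wide stroke is an underline and a tall one a margin line.
    left, top, right, bottom = stroke_bounds(stroke)
    width, height = right - left, bottom - top
    gap = math.dist(stroke[0], stroke[-1])
    if gap < 0.25 * max(width, height):
        return "circle"
    return "underline" if width >= height else "margin_line"


def select_words(
    stroke: List[Point], words: List[Word], tolerance: float = 3.0
) -> Tuple[str, List[str]]:
    kind = classify_stroke(stroke)
    left, top, right, bottom = stroke_bounds(stroke)
    if kind == "circle":
        # Words whose centre lies inside the circled area.
        hits = [
            w for w in words
            if left <= (w.left + w.right) / 2 <= right
            and top <= (w.top + w.bottom) / 2 <= bottom
        ]
    elif kind == "underline":
        # Words overlapping the stroke horizontally, with the stroke lying
        # close to the bottom edge of the word.
        mean_y = sum(y for _, y in stroke) / len(stroke)
        hits = [
            w for w in words
            if w.left < right and w.right > left
            and abs(mean_y - w.bottom) <= tolerance
        ]
    else:
        # Margin line: every word on the lines spanned vertically.
        hits = [w for w in words if w.top < bottom and w.bottom > top]
    return kind, [w.text for w in hits]
```

The returned mark kind corresponds to the "manner" in which the content was highlighted, which later processing may take into account.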
When the highlighted content of the document has been identified, the application 226 can use this information in a number of ways, which can be selected by the user from a suitable menu. One option is for the application 226 to produce a modified electronic version of the document 2 in which the selected content is highlighted. The highlighting can be selected to correspond to the marks made on the original document 2, being made up of lines underlining, circling, or marking in the margin the selected text or drawings. Alternatively the highlighting can be selected to take a different form. For example highlighted text can be converted to a different font, having a different font size, being underlined or in bold, having a different colour, and highlighted drawings or diagrams can be shrunk or simplified. This modified document can then be saved and either viewed on the screen 204 of the PC 200, or printed again for re-reading.
Another option that can be selected is for a summary of the document 2 to be produced, taking into account the selected content. Referring to
In the modification to the weightings, any word, phrase or sentence that has been selected is given a higher weighting in the summarising process, so that it is more likely to appear in the summary. Where a whole paragraph is selected, then each sentence and each word in it is given a higher weighting. Where a sentence or phrase is selected, the weighting of both the whole of that sentence or phrase and of each word in it is increased.
Where a single word is selected, its weighting is increased by a greater factor than if it is just part of a selected phrase or sentence. The weighting accorded to each word, sentence or paragraph is also dependent on the manner in which it has been selected by the pen 201. For example, where a word is circled it is given a higher weighting than if it is only underlined, and a double underlining or a double line in the margin results in a higher weighting than a corresponding single mark. When the summary has been produced, it can either be saved as a separate document, with or without links to the original document, or appended to the original document, with or without navigation links back to the original position of the selected text.
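The modification of the weightings can be illustrated by the following sketch. The numerical factors, the mark names and the simple sentence scoring are hypothetical values chosen only to respect the ordering stated above (circled above double underlined above underlined, and a singly selected word above a word within a selected phrase); the underlying summarising process is assumed to supply the base weights.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed multipliers; only their relative order reflects the description.
MARK_FACTOR = {"underline": 1.5, "double_underline": 2.0, "circle": 3.0}
SINGLE_WORD_BONUS = 2.0  # extra factor when one word alone is selected


def adjust_weights(
    base: Dict[str, float], selections: Sequence[Tuple[List[str], str]]
) -> Dict[str, float]:
    """Raise the weight of every selected word according to its mark."""
    weights = dict(base)
    for words, mark in selections:
        factor = MARK_FACTOR[mark]
        if len(words) == 1:
            factor *= SINGLE_WORD_BONUS
        for word in words:
            weights[word] = weights.get(word, 1.0) * factor
    return weights


def summarise(
    sentences: List[str], weights: Dict[str, float], limit: int
) -> List[str]:
    """Keep the `limit` highest-scoring sentences, in their original order."""
    scored = []
    for position, sentence in enumerate(sentences):
        tokens = [t.strip(".,;:").lower() for t in sentence.split()]
        score = sum(weights.get(t, 0.0) for t in tokens)
        scored.append((score, position))
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:limit]
    return [sentences[position] for _, position in sorted(best, key=lambda i: i[1])]
```

A sentence containing selected words therefore scores more highly and is more likely to appear in the summary, which is the effect described above.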
The content can include features other than text, and the summary may also include copies of, or simplified or modified versions of, selected drawings, diagrams or tables, or any other selected content. For example, the original document may contain drawings of a large number of items, for example in the form of a catalogue, together with the name of each item and a description of each item. In this case, if the title or a part of the description is selected, then the drawing, either alone or with the title or part of the description, can be incorporated into the summary. Alternatively if the drawing is selected, then part of the description or the title, either with or without the drawing, can be incorporated into the summary. Another example of an original document including drawings is a technical description that includes graphs, drawings and tables. In this case, where the reference document includes a summary of a section of the description then it can be arranged also to include any graphs, drawings or tables associated with that section.
A further option that can be selected is the production of a modified document in which definitions or translations of the selected terms are added to the document. In this case the PC 200 needs access to suitable dictionaries, either single language dictionaries giving definitions of words in the language in which the document is written, or foreign language dictionaries giving translations from the language of the document 2 into another language. These dictionaries may be available on the PC 200 or a local network, but in this example, as shown in
Another option that can be selected is the creation of an index to the selected terms. In this case, referring to
The indexed terms are then ordered in the required manner at step 608, for example alphabetically to form the final index. This index can either be appended to the original document 218 or saved as a separate document.
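A minimal sketch of the final ordering step is given below. It assumes, for illustration only, that each selected term arrives as a (term, page, line) tuple and that terms are compared without regard to case; the function name `build_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Reference = Tuple[int, int]  # (page, line) at which the term was selected


def build_index(
    selections: Iterable[Tuple[str, int, int]]
) -> List[Tuple[str, List[Reference]]]:
    """Group references by term and order the terms alphabetically."""
    references: Dict[str, Set[Reference]] = defaultdict(set)
    for term, page, line in selections:
        # A set removes duplicates when the same occurrence is marked twice.
        references[term.lower()].add((page, line))
    return [(term, sorted(locations)) for term, locations in sorted(references.items())]
```

The resulting list can then be rendered either as an appendix to the original document or as a separate document.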
Another option that can be selected is for the selected text to be interpreted as defining a purchase list indicating parts of the document 218 of which the user would like to purchase one or more electronic copies. This is particularly relevant where a user can obtain hard copies of a document free of charge, but can only obtain electronic copies for payment. The selected text can be identified, for example, by highlighting one or more headings, which selects the sections or chapters under the headings. Alternatively the selected text can be identified by simply marking the required text in the margin. In either case the ordering can be completed by making payment to the owner of the document and downloading the required electronic copy.
A further option that can be selected is based on the indexing process described above, but is extended to form an information summary covering many documents that the user has read and marked with the pen 201. In this case the summary also acts as an aid to the retrieval of information from all the documents that have been read. As the index is built up it includes not only the page and line references of the selected text, but also the identity of the document in which it was selected. The summarising function described above is also included in this option, so that the index includes, for some of the indexed terms selected by the user, a summary of the passage in which they originally occurred. The extent of the passage that is summarised can also be selected by the user, for example using a line in the margin similar to the line 24 in
An extension to the multiple document summary described above is also provided whereby the summary is extended to cover not only documents that the user has read, but also documents that they have not read. Referring to
It will be appreciated that in the example just described, the index or summary serves not only as an index but also as a summary of documents read by the user and as a search tool to enable the user to find and read further documents that may be of interest. A further option which is available is for the application 226 to carry out an advanced search function. If the advanced search is selected, the search is carried out not on each selected term individually, but on a combination of a number of selected terms. In this case the documents identified by the search are listed in a search results document and ranked in order of the number of the selected terms that occur in them. A summary of each of the selected documents, or passages from them, can also be included in the search results document.
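The ranking used in the advanced search can be sketched as follows. The sketch assumes that the candidate documents are available as plain text keyed by a name and that each selected term is a single word; the function name and the tie-breaking by document name are assumptions of the example.

```python
import re
from typing import Dict, List, Tuple


def rank_documents(
    documents: Dict[str, str], selected_terms: List[str]
) -> List[Tuple[str, int]]:
    """Rank documents by how many distinct selected terms occur in them."""
    wanted = {term.lower() for term in selected_terms}
    results = []
    for name, text in documents.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        hits = len(wanted & words)
        if hits:  # documents containing none of the terms are omitted
            results.append((name, hits))
    # Most matching terms first; ties are broken by document name.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results
```

The ordered list returned here corresponds to the listing in the search results document, to which summaries or passages could then be attached.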
Referring to
For each user, as identified by the user ID or by the pen 301 that they use, the server 303 can provide a summary, index, or searching facility as in the first embodiment of the invention described above. However, the server 303 is also arranged to produce similar summaries, indexes and search facilities jointly for groups of two or more of the users, or indeed all of the users. For example, where all of the users are working on a joint project and therefore reading documents relating to that project, a single index is built up based on the pen strokes recorded by all of the users. As described above, this index can include a list of relevant terms, summaries of passages and documents read, and lists and summaries of further documents that have not been read but that might be relevant or of interest.
It will be appreciated that the different users can be identified in a number of different ways, for example using writing style analysis or using a biometric identification system linked to the network, such as a fingerprint or iris recognition system.
A further option that is available in the multiple-user system is for the pen stroke data from all of the users to be combined to form a record, stored on the server 303, of which documents, and which parts of which documents, have been read by which users, and at what times. This data can be combined to produce a summary of the levels of reader interest in each of the documents, indicating for example which are the documents of most interest, which are the documents of least interest, and which groups of readers have shown the most and least interest in any particular document or group of documents. This summary acts as an aid to the users to help them identify the most relevant documents and to extract the most relevant information from those documents.
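One possible way of deriving such an interest summary from the combined record is sketched below. It assumes that the record is a sequence of (user, document, part, time) tuples and that the number of recorded pen strokes is an adequate measure of interest; both the record format and that measure are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (user, document, part of document, time of the pen stroke)
Record = Tuple[str, str, str, float]


def interest_summary(records: Iterable[Record]) -> List[Tuple[str, int, List[str]]]:
    """List documents from most to least marked, with their readers."""
    strokes: Counter = Counter()
    readers: Dict[str, Set[str]] = defaultdict(set)
    for user, document, _part, _time in records:
        strokes[document] += 1
        readers[document].add(user)
    ranked = sorted(strokes, key=lambda doc: (-strokes[doc], doc))
    return [(doc, strokes[doc], sorted(readers[doc])) for doc in ranked]
```

The first entry of the result indicates the document of most interest and the last entry the document of least interest among those that have been marked.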
It will be appreciated that, in the embodiments described above, the position of the pen strokes on the printed copy of the document can be determined in any of a number of ways. For example the printed document can be placed on a detection system that is arranged to track movements of the pen relative to sensors within the detection system, such as infra-red or magnetic sensors.
In a further modification to the embodiments described above, the document is not printed out at all, but is viewed on a screen, and the pen is replaced by a light pen. The light pen includes a photo sensor, and when it is held at a point on the cathode ray tube (CRT) screen, it detects when light is emitted from that point. This information is transmitted to the CRT controller, which controls the position of the CRT electron beam and hence can determine when light will be emitted from each point on the screen.
This enables the CRT controller to determine the position of the pen on the screen. This system therefore enables the user to read the document on screen and make pen strokes on the screen using the light pen. These pen strokes are then interpreted in the same way as the pen strokes in the embodiments described above, using data in the CRT controller that indicates the position on the screen of the content features of the document. In such a system the pen, like that in the previous embodiments, has a tip that can be brought into contact with the representation of the document, and moved over the representation of the document to make the pen strokes. This allows the user to interact closely and directly with the document, in a manner that is familiar to users of conventional pen and paper.
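The determination of the light pen position from the timing of the detected light can be illustrated by the following simplified sketch. It assumes an idealised raster scan in which the electron beam sweeps each line at a constant rate and blanking intervals are ignored; the parameter names and the timing values used with it are hypothetical.

```python
from typing import Tuple


def light_pen_position(
    elapsed_us: float,
    line_period_us: float,
    pixels_per_line: int,
    lines_per_frame: int,
) -> Tuple[int, int]:
    """Convert the time since the start of the frame into (column, line).

    elapsed_us is the time, in microseconds, between the start of the
    frame and the moment the photo sensor detected light.
    """
    if elapsed_us < 0:
        raise ValueError("light detected before the frame started")
    line = int(elapsed_us // line_period_us)
    if line >= lines_per_frame:
        raise ValueError("light detected after the end of the frame")
    # Fraction of the current line that the beam had swept when detected.
    fraction = (elapsed_us % line_period_us) / line_period_us
    column = int(fraction * pixels_per_line)
    return column, line
```

For instance, with an assumed line period of 64 microseconds and 640 pixels per line, a detection 6432 microseconds into the frame would correspond to the middle of line 100.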
In a further modification, the document is displayed on a tablet PC or other device having a touch sensitive screen. In this case the pen comprises a simple pointer or stylus that can be brought into contact with, and moved across the surface of, the touch sensitive screen, to make the pen strokes. The pen stroke data is then captured by the touch sensitive screen and processed as in the previous embodiments.
Number | Date | Country | Kind |
---|---|---|---|
0409073.4 | Apr 2004 | GB | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US05/51779 | 4/21/2005 | WO | 00 | 10/18/2010 |