Embodiments described herein relate generally to a handwritten document processing apparatus and method.
A technique has been proposed that allows the user of a handwritten document processing apparatus, such as a tablet computer including a pen input interface, to record voice simultaneously with handwriting input, so as to create a note, conference minutes, or the like accompanied by voice data.
In general, according to one embodiment, a handwritten document processing apparatus includes a stroke input unit, a voice recording unit, a stroke structuration unit, a cue time calculation unit, and a playback control unit. The stroke input unit inputs stroke information indicating strokes and the times of the strokes. The voice recording unit records voice information whose playback can be started from a designated time. The stroke structuration unit structures the stroke information into row structures by combining a plurality of strokes in a row direction. The cue time calculation unit calculates a cue time of the voice information associated with each row structure. The playback control unit plays back the voice information from the cue time in response to an instruction given to the row structure.
Embodiments will be described hereinafter with reference to the drawings.
A handwritten document processing apparatus according to this embodiment is applied to a notebook application of, for example, a tablet computer including a pen input interface and a voice input interface. This application allows the user to input note contents by handwriting and to collect and record the voices of speakers, including the user, via a microphone. By reading out note data that associates handwriting-input strokes with recorded voice data, the application can display the handwritten document and play back the recorded voices. This embodiment is directed to improving the operability of cue playback of voice data associated with a handwritten document.
The stroke input unit 1 inputs stroke information via the pen input interface. A "stroke" is a handwriting-input stroke image; more specifically, it represents the locus from when a pen or the like is brought into contact with an input surface until it is released. For example, stroke information is associated with each stroke image drawn from pen-down to pen-up on a touch panel. The stroke information includes identification information for identifying the stroke, a start time T, which is the time of the initial point at which the pen contacted the touch panel, and a time series of coordinates of the points that define the locus traced while the pen moved in contact with the touch panel.
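By way of illustration only, the stroke information described above might be represented as in the following sketch; the names StrokeInfo, stroke_id, start_time, and points are assumptions of this illustration, not part of the embodiment.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StrokeInfo:
    """One handwriting-input stroke, from pen-down to pen-up (illustrative format)."""
    stroke_id: int                     # identification information of the stroke
    start_time: float                  # start time T of the initial pen-contact point
    points: List[Tuple[float, float]]  # time series of (x, y) coordinates along the locus
```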
The voice recording unit 2 records voice information via the voice input interface. The voice information may have any format that allows control of its playback: at minimum, it must support starting, pausing, and ending playback, as well as starting playback from a designated playback start time (referred to as "cue playback" hereinafter). The voice information may also be structured by voice interval detection, speaker recognition, and keyword extraction; the structuration of voice information is explained in the second embodiment.
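A minimal sketch of the playback control the voice information must allow is given below; the class and method names are assumptions of this illustration, and any concrete audio format supporting these operations would do.

```python
from abc import ABC, abstractmethod

class VoicePlayback(ABC):
    """Minimum playback control required of the recorded voice information."""

    @abstractmethod
    def start(self, cue_time: float = 0.0) -> None:
        """Start playback; a nonzero cue_time performs cue playback from that time (seconds)."""

    @abstractmethod
    def pause(self) -> None:
        """Pause playback."""

    @abstractmethod
    def stop(self) -> None:
        """End playback."""
```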
The stroke structuration unit 3 structures stroke information into row structures by combining a plurality of strokes in a row direction. A cue playback start time (referred to as a "cue time" hereinafter) is associated with each row structure as a unit.
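The embodiment leaves the concrete grouping criterion open. One plausible sketch, assuming strokes are processed in input order and that a stroke joins the current row while its vertical position stays close to the row's running baseline (the mean-y test and the y_gap threshold are assumptions of this illustration):

```python
def structure_rows(strokes, y_gap=20.0):
    """Combine strokes into row structures by scanning them in input order.

    A stroke joins the current row while its mean y position stays within
    y_gap of the row's running baseline; otherwise a new row is started.
    """
    rows, current, baseline = [], [], None
    for s in strokes:  # strokes assumed sorted by start_time
        y = sum(p[1] for p in s.points) / len(s.points)  # mean y of the stroke
        if baseline is None or abs(y - baseline) <= y_gap:
            current.append(s)
            baseline = y if baseline is None else (baseline + y) / 2.0
        else:
            rows.append(current)
            current, baseline = [s], y
    if current:
        rows.append(current)
    return rows
```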
The cue time calculation unit 4 calculates the cue time of the voice information to be associated with each row structure of the stroke information. The display unit 5 displays the handwriting-input strokes on the touch panel. The voice playback unit 6 plays back the voice information from the cue time calculated by the cue time calculation unit 4 in response to an instruction operation on a row structure of strokes displayed on the touch panel.
After launching the notebook application, the user starts to create and record a new note with voice data, making handwriting inputs by operating the pen on the touch panel. When the user taps a recording button, voice recording starts, and the user can write into the note in parallel with recording. After the user ends the recording, handwriting input remains possible, but no cue position of the voice data can be associated with strokes input after the end of recording.
The stroke input unit 1 inputs stroke information to the handwritten document processing apparatus according to this embodiment via the pen input interface, and the voice recording unit 2 acquires voice information recorded via the voice input interface.
The stroke structuration unit 3 structures stroke information into a row structure by combining a plurality of already input strokes in a row direction.
As shown in the drawing, the input strokes are combined into three row structures in this example: row structures 1 to 3.
The cue time calculation unit 4 calculates a cue time of the voice information recorded together with the stroke information for each of the row structures 1 to 3. For example, the stroke having the earliest input time among the strokes included in a row structure, that is, the start time of the first stroke in that row structure, is set as the cue time. As shown in the drawing, the start times T1, T8, and T16 of the first strokes are thus set as the cue times of row structures 1, 2, and 3, respectively.
Note that the cue times of the respective row structures may be adjusted. For example, a time an offset α earlier than the cue time derived from the stroke information may be set as the cue time (T1-α, T8-α, and T16-α are set, respectively). This absorbs the delay between the user hearing a voice and starting a handwriting input in response to it; in other words, playing back from the adjusted cue time prevents the opening words of the voice contents from being cut off.
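Combining the two rules (the earliest stroke start time in the row, optionally shifted earlier by α), the cue time of a row structure can be sketched as follows; the function name is illustrative.

```python
def cue_time(row, alpha=0.0):
    """Cue time of a row structure: the start time of its first stroke, minus an
    optional offset alpha that absorbs the user's reaction delay."""
    t = min(s.start_time for s in row)  # earliest input time among the row's strokes
    return max(0.0, t - alpha)          # never cue before the start of the recording

# e.g. cue_time(row_1) yields T1, and cue_time(row_1, alpha) yields T1 - alpha
```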
After the cue times are calculated for the respective row structures as described above, playback of the recorded voice contents can be started from the corresponding cue position when the user gives an instruction by tapping a desired row structure with the pen.
For example, when the user taps a position P1 or P2, the playback operation is started from the cue time of the row structure corresponding to the tapped position.
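One way to resolve such a tap, sketched under the assumption that an instruction to a row structure is resolved by bounding-box hit testing (the embodiment only requires that the tapped row structure be identified); this reuses the StrokeInfo and VoicePlayback sketches above.

```python
def on_tap(x, y, rows, player, alpha=0.0):
    """Start cue playback for the row structure containing the tapped point, if any."""
    for row in rows:
        xs = [px for s in row for px, _ in s.points]
        ys = [py for s in row for _, py in s.points]
        if min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys):
            cue = max(0.0, min(s.start_time for s in row) - alpha)
            player.start(cue)  # cue playback from the row's cue time
            return row         # returned e.g. so the row can be identifiably displayed
    return None                # no voice information is associated with this position
```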
Note that a symbol mark indicating that a cue of voice information is associated may be displayed in the vicinity of a stroke, and an instruction may be given via this cue mark (step S4).
According to the aforementioned first embodiment, cue playback of voice information can be performed in association with a row structure of strokes. Note that the display mode may be changed so that the user can identify the corresponding row structure of strokes when cue playback is started by tapping; for example, the display color of the corresponding row structure may be changed, or that row structure may be highlighted.
Also, a time bar indicating the progress of voice playback may be displayed, or the display color of strokes may be changed according to the voice playback time period between row structures. The user may also be allowed to set an end point of cue playback; in this case, the cue time of the next row structure may be set as the end time. It is also preferable to identifiably display (the row structure of) strokes with which no voice information is associated, that is, strokes for which no cue position of voice information is available even when they are tapped.
Since the voice structure includes time information as described above, it can be used to calculate the cue time described in the first embodiment. In this embodiment, a cue time is calculated by comparing the cue time of a row structure with the respective times of detected voice intervals. For example, assume that, as a result of interval detection of the voice information, voice structures between times T101 and T102, between times T102 and T103, between times T103 and T104, and between times T104 and T105 are obtained.
The cue time calculation unit 4 sets, as the cue time of each row structure, the voice-structure time that precedes and is closest to the time of that row structure. For row structure 1, the closest time T101 before time T1 is set as the cue time; for row structure 2, the closest time T102 before time T8; and for row structure 3, the closest time T104 before time T16.
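Since the boundary times obtained by interval detection are in ascending order, finding "the closest time before the row's time" is a predecessor search; a minimal sketch using Python's bisect module:

```python
import bisect

def refined_cue_time(row_time, boundaries):
    """Snap a row structure's time to the closest voice-structure boundary before it.

    boundaries: ascending list of interval start times,
    e.g. [T101, T102, T103, T104, T105] from the example above.
    """
    i = bisect.bisect_right(boundaries, row_time) - 1
    return boundaries[i] if i >= 0 else row_time  # fall back if no earlier boundary exists

# e.g. with T102 <= T8 < T103, refined_cue_time(T8, boundaries) yields T102
```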
Note that this embodiment has exemplified structuration of the voice information by voice interval detection. However, the embodiment is not limited to this; structuration may also be attained by, for example, equal division of time, and various structuration methods may be combined.
According to the second embodiment, the same effects as in the first embodiment can be provided, and the cue precision can be improved based on the structuration of the voice information.
Note that the voice interval detection may use the two-threshold method described in Nimi, "Speech Recognition" (KYORITSU SHUPPAN CO., LTD.), pp. 68-69. Alternatively, the method described in Japanese Patent No. 2989219 may be used.
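As a rough, simplified reading of a two-threshold approach (this sketch is an assumption about the general outline of such methods, not a reproduction of the cited one): a candidate interval opens when the per-frame energy exceeds a low threshold, and is kept only if the energy also crosses a high threshold before falling back below the low one.

```python
def detect_intervals(energies, high, low, frame_sec=0.01):
    """Energy-based voice interval detection with two thresholds (simplified sketch).

    energies: short-time energy per frame; returns (start, end) times in seconds.
    """
    intervals, start, triggered = [], None, False
    for i, e in enumerate(energies):
        if start is None:
            if e > low:                  # candidate interval opens at the low threshold
                start, triggered = i, e > high
        else:
            triggered = triggered or e > high
            if e < low:                  # candidate interval closes at the low threshold
                if triggered:            # keep only intervals that crossed the high threshold
                    intervals.append((start * frame_sec, i * frame_sec))
                start, triggered = None, False
    if start is not None and triggered:  # interval still open at the end of the signal
        intervals.append((start * frame_sec, len(energies) * frame_sec))
    return intervals
```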
Visual information of a voice structure may be displayed before a cue position is selected (before the start of cue playback), or visual information of the corresponding voice structure may be displayed when a cue position is selected. Visual information may also be displayed progressively as playback of the voice information proceeds from the selected cue position.
As in the second embodiment, a cue time may be calculated using information of a voice structure (step S3). However, in this embodiment, step S3 may be omitted.
When the voice playback operation progresses further and reaches the next row structure 41 (screen 33), the row structure 41 is identifiably displayed, and a voice structure time bar 61 corresponding to row structure 41 is displayed below it (screen 34). Note that by tapping the cue mark 50 or 51 during playback, the playback operation can be repeated from that cue position.
The playback time bar is extended according to the granularity of visualization. A time bar 90 is displayed in the case of one cue mark 80 and indicates that playback is about 60% complete. A time bar 91 is displayed in the case of four cue marks 81 to 84 and indicates that playback is nearly complete and is about to transition to the next row structure. By tapping any of the cue marks 81 to 84, playback can be started from the tapped position.
Note that a symbol mark which visualizes a keyword extracted from voice information may be used in place of a cue mark.
How the contents of the visual information of a voice structure are decided according to the number of cue marks (the granularity) will be described below. For example, when the number of cue marks is one, visual information at the intermediate time of the period between the playback start and end times may be displayed; in the case of keyword extraction, the keyword with the highest frequency of occurrence may be displayed. When the number of cue marks is two, pieces of visual information close to the two times obtained by dividing the period between the playback start and end times into three equal parts may be selected.
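Generalizing this rule, n cue marks select visual information close to the n interior times that divide the period between the playback start and end times into n + 1 equal parts; a minimal sketch (the function name is illustrative):

```python
def representative_times(start, end, n_marks):
    """Times at which visual information is sampled for n_marks cue marks:
    the interior points dividing [start, end] into n_marks + 1 equal parts."""
    step = (end - start) / (n_marks + 1)
    return [start + k * step for k in range(1, n_marks + 1)]

# n_marks=1 -> the intermediate time; n_marks=2 -> the two one-third points
```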
According to the third embodiment, a voice structure can be visualized and displayed, and a cue playback operation for a time period (voice interval) in which no stroke input is made can also be performed. Therefore, operability of a cue playback operation can be further improved.
Note that there are two basic types of speaker recognition using voice information: speaker identification and speaker verification. J. P. Campbell, "Speaker Recognition: A Tutorial," Proc. IEEE, Vol. 85, No. 9, pp. 1437-1462, 1997, may be used as a reference. As for keyword extraction from voice information, NEC Corporation, "Keyword extraction by optimization of degree of keyword matching" (CiNii), www.nec.jp/press/ja/1110/0603.html, may be used as a reference.
For example, some components of the above embodiments may be arranged on a server, and the remaining components may be arranged on clients connected to the server via a network.
For example, clients 301 and 302 and a server 303 may be connected via a network 300.
Note that in this example, the client 301 is connected to the network 300 via wireless communications, and the client 302 is connected to the network 300 via wired communications.
The clients 301 and 302 are normally user apparatuses. The server 303 may be arranged on, for example, a LAN such as an office LAN, or may be managed by, for example, an Internet service provider. Alternatively, the server 303 may be a user apparatus, so that a certain user provides functions to other users.
Various methods of distributing the components between the clients 301 and 302 and the server 303 are available.
The instructions of the processing sequences described in the aforementioned embodiments can be executed based on a program, i.e., software. A general-purpose computer system that pre-stores and loads this program obtains the same effects as the handwritten document processing apparatus of the aforementioned embodiments. The instructions described in the aforementioned embodiments are recorded, as a program executable by a computer, in a recording medium such as a magnetic disk (flexible disk, hard disk, etc.), an optical disk (CD-ROM, CD-R, CD-RW, DVD-ROM, DVD±R, DVD±RW, etc.), or a semiconductor memory. The storage format of the recording medium is not particularly limited as long as it is readable by a computer or embedded system. The computer loads the program from the recording medium and causes a CPU to execute the instructions described in the program, thereby implementing the same operations as the handwritten document processing apparatus of the aforementioned embodiments. Of course, the computer may also acquire or load the program via a network.
Also, an OS (operating system) or MW (middleware) such as database management software or a network application, running on the computer, may execute some of the processes required to implement this embodiment, based on the instructions of the program installed from the recording medium into the computer or embedded system.
Furthermore, the recording medium of this embodiment is not limited to a medium separate from the computer or embedded system; it also includes a recording medium that stores, or temporarily stores, a program downloaded via a LAN or the Internet.
The number of recording media is not limited to one; the recording medium of this embodiment also covers the case in which the processes of this embodiment are executed from a plurality of media, and the medium may have any configuration.
Note that the computer or embedded system of this embodiment executes the respective processes of this embodiment based on the program, and may adopt any arrangement, such as a single apparatus (for example, a personal computer or microcomputer) or a system in which a plurality of apparatuses are connected via a network.
The computer of this embodiment is not limited to a personal computer; it includes arithmetic processing devices, microcomputers, and the like included in information processing apparatuses, and collectively refers to any device or apparatus that can implement the functions of this embodiment based on the program.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
This application is a Continuation Application of PCT Application No. PCT/JP2013/076458, filed Sep. 24, 2013 and based upon and claiming the benefit of priority from Japanese Patent Application No. 2012-210874, filed Sep. 25, 2012, the entire contents of all of which are incorporated herein by reference.