The present invention relates generally to manual transcription of sound, in particular human speech. More specifically, the present invention relates to synchronization of sound data and text data, the text data being obtained by manual transcription of the sound data during playback of the latter, in view of subsequent synchronous playback of sound and text data, e.g. for correction purposes.
When sound, e.g. human speech, is transcribed to text automatically by means of a speech recognition system, it is generally and easily possible to associate each word or even smaller lexical subunit, referred to as a text datum in the following, with the corresponding sound segment (also referred to as a sound datum), for instance by automatically including timing data derived from the sound data in the text file which is produced by the speech recognition system. The timing data can then be used to directly access a text datum corresponding to a given sound datum, and vice versa. Such an association is in particular required for commonly known features such as synchronous playback, wherein a segment of text (text datum) such as a word or a syllable corresponding to a currently played sound segment is shown to a user, for instance by highlighting the text segment in question on a display. Such a feature is especially useful for correction of transcriptions established through speech recognition as well as for review and quality assurance.
However, when sound is transcribed manually, which is frequently the case owing to the well-known imperfections of current speech recognition systems, e.g. when dealing with sound data of poor quality or with highly specialized jargon, such an association is generally not available automatically. Therefore, in the prior art, synchronization of text and sound has to be done manually by marking sound segments with a precision of the order of a few milliseconds and subsequently entering the corresponding text. Such an approach is very time consuming and thus represents a significant expense. Nevertheless, such synchronization constitutes an important feature of transcription for further analysis, e.g. in the fields of psychology, marketing, etc. A similar approach has been published by Bainbridge, D., and Cunningham, S. J.: “Making oral history accessible over the World Wide Web”, History and Computing, vol. 10, no. 1-3, pp. 73-81 (1998).
Thus, there is a need in the art to be able to cost-effectively synchronize sound and text in connection with the manual transcription of sound data.
It is the object of the present invention to provide a method for synchronizing sound data and text data, said text data being obtained by manual transcription of said sound data during playback of the latter, which obviates the above-mentioned disadvantages. It is also an object of the present invention to provide a method for synchronized playback of sound data and corresponding text data, which incorporates the inventive method for synchronizing sound data and text data, thus obviating the common disadvantage of the prior art of synchronous playback being exclusively reserved to systems using speech recognition. Furthermore, the present invention has for its object to provide a system adapted to translate into action the respective inventive methods mentioned above.
According to a first aspect of the invention there is provided a method for synchronizing sound data and text data, said text data being obtained by manual transcription of said sound data during playback of the latter, comprising the steps of repeatedly querying said sound data and said text data to obtain a current time position corresponding to a currently played sound datum and a currently transcribed text datum, correcting said current time position by applying a time correction value in accordance with a transcription delay, and generating at least one association datum indicative of a synchronization association between said corrected time position and said currently transcribed text datum.
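The three steps of the method according to this first aspect (repeated querying, delay correction, association generation) can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the sample query data, the 1.5 s delay, and all names are assumed for the example.

```python
from dataclasses import dataclass

@dataclass
class AssociationDatum:
    """Links a corrected sound time position to a transcribed text datum."""
    time_position_s: float
    text_datum: str

def synchronize(query_samples, transcription_delay_s=1.5):
    """Turn a stream of (playback_position_s, newly_typed_text) query results
    into association data: each sample is one repeated query of the sound
    data and the text data, and the queried time position is corrected by
    subtracting the transcription delay."""
    associations = []
    for position_s, text in query_samples:
        corrected = max(0.0, position_s - transcription_delay_s)
        if text:  # only queries during which text was entered yield a datum
            associations.append(AssociationDatum(corrected, text))
    return associations

# One query every 0.5 s; the typist lags about 1.5 s behind the audio.
samples = [(0.5, ""), (1.0, "This "), (1.5, "is "), (2.0, ""), (2.5, "a test.")]
assocs = synchronize(samples, transcription_delay_s=1.5)
```

Queries during which no text was entered simply produce no association datum, while each text-bearing query is stamped with the corrected, i.e. delay-shifted, time position.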
Here and in the following specification, the term “sound data” refers to audio data, e.g. human speech, that has been recorded and subsequently stored, preferably as a data file in a suitable digital data format, for subsequent manual transcription by a user (transcriptionist), in particular a secretary, who listens to the sound re-generated from the sound data and enters (types) corresponding text in the form of a stream of characters, typically by means of a keyboard. In this context, the term “sound datum” designates a segment of the sound data, the smallest possible sound datum being a single sound data bit.
Correspondingly, the term “text data” refers to text entered during the transcription session, i.e. a succession of characters, which is also preferably stored as a data file in a suitable digital data format. In this context, the term “text datum” designates a segment of the text data, the smallest possible text datum being a single text character.
The term “playback” refers to the act of generating a respective output corresponding to any one of the above-mentioned types of data, e.g. generating audible physical sound perceptible by the user from the sound data by means of a suitable output system, e.g. a sound card and an associated speaker system, or by displaying text corresponding to the text data on a display screen. During playback of the sound data, a given sound datum to be played corresponds to a “time position” in the sound data, i.e. the audio stream derived therefrom. For instance, the time position of a given sound datum could indicate the start time of said sound datum measured from the beginning of the audio stream.
Within the scope of the invention, the term “repeatedly” designates an action that is carried out a number of times on an essentially regular basis, e.g. with a repetition frequency of 1 to 10 Hz (one query every 0.1-1 s). For instance, “querying” the sound data and the text data, respectively, is an exemplary action which is carried out repeatedly within the scope of the present invention, i.e. the sound data and the text data are frequently addressed during a transcription session to obtain, as query values, a current audio time position and the most recently entered text datum, respectively, wherein the actual length of the text datum depends on the querying frequency.
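The dependence of the text datum's length on the querying frequency can be illustrated with a small input buffer that is flushed at every query; the class and its method names below are hypothetical, not part of the specification.

```python
class TextInputBuffer:
    """Collects characters as the transcriptionist types; each repeated query
    flushes the buffer, so the returned text datum covers exactly the
    interval since the previous query, and its length therefore depends on
    the querying frequency."""

    def __init__(self):
        self._chars = []

    def on_key(self, ch):
        self._chars.append(ch)

    def flush(self):
        """Return the text datum for the elapsed query interval."""
        datum, self._chars = "".join(self._chars), []
        return datum

buf = TextInputBuffer()
for ch in "is a":
    buf.on_key(ch)
first = buf.flush()   # text typed since the previous query
second = buf.flush()  # nothing typed since the last flush
```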
Further in this context, the term “correcting” refers to changing a query value by applying a predetermined correction value, e.g. subtracting a delay time value from a time position query value to obtain a corrected time position.
Finally, the term “association datum” in the present specification designates a segment of data, which contains/establishes an association between sound and text. Such a datum preferably contains information as to the sound time position at which a particular text datum should have been entered by the user to match the sound data perfectly thus creating a synchronization association between said text datum, e.g. a word or any succession of characters, and the sound data.
In this way, the inventive method automatically achieves a close association between sound and text in the case that sound is transcribed manually. In other words: during manual transcription, an association can according to the invention be created between sound that has already been played back and text that is currently being entered, e.g. typed, based on the assumption that the text segment being entered at a given moment is closely related, as far as timing is concerned, to the played sound. This timing relation between played sound and entered text is governed predominantly by the transcription delay, which is due to the finite reaction speed of the transcriptionist.
By predetermining the transcription delay, i.e. the time correction value, in accordance with the transcription skill and/or typing speed of a user, i.e. the transcriptionist, the inventive method can easily be customized to provide satisfactory results for any user who transcribes sound manually by means of the method.
Advantageously, the time correction value takes the form of a time delay, i.e. a duration in time, corresponding to a “lag” of the user behind the sound when entering the transcription text. Such a delay time can easily be determined by playing back known sound data to the user prior to a transcription session and subsequently measuring and statistically processing the time it takes the user to enter the corresponding text data. The result can be stored in a system using the inventive method as a user profile for later utilization. As an alternative to the above-described approach, which requires user enrollment, it is conceivable to allow adjustment of the delay during synchronous playback and/or to allow manual correction of the association at the beginning and at the end of a transcribed section by the user, while interpolating the delay over the rest of the section.
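A minimal sketch of both alternatives described above, assuming per-word entry times are available from an enrollment session and that the delay is interpolated linearly between two manually corrected anchor points; the function names and sample timings are illustrative.

```python
from statistics import median

def estimate_transcription_delay(word_onsets_s, entry_times_s):
    """Enrollment variant: play back known sound data, record when the user
    enters each corresponding word, and take the median per-word lag as the
    user's transcription delay (robust against occasional hesitations)."""
    return median(k - w for w, k in zip(word_onsets_s, entry_times_s))

def interpolated_delay(t, t_start, delay_start, t_end, delay_end):
    """Manual-correction variant: the user fixes the association at the start
    and end of a section; the delay for positions in between is interpolated
    linearly."""
    frac = (t - t_start) / (t_end - t_start)
    return delay_start + frac * (delay_end - delay_start)

# Words started at 1.0 s, 2.0 s, 3.5 s in the known audio; user typed later.
delay = estimate_transcription_delay([1.0, 2.0, 3.5], [2.4, 3.6, 4.9])
```

The estimated delay would be stored in the user profile; the interpolation variant avoids enrollment entirely at the cost of two manual corrections per section.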
In order to further improve synchronicity between the sound and the text, according to a further development of the inventive method, characteristic speech-related information in said sound data, in particular pauses in said sound data corresponding to punctuation in said text data, is used for generating additional association data between time positions in said sound data corresponding to said speech-related information and related text data. A typical example of such characteristic information would be speech pauses at the end of sentences, which correspond to a full stop or other punctuation characters in the transcribed text. In a preferred embodiment of the inventive method, said approach is part of the transcription delay computation logic: pauses in the sound data are used to adjust the transcription delay and, based on this, to compute corrected sound time positions related to the corresponding text.
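One conceivable way to use pause/punctuation pairs as fresh delay measurements is sketched below; the 3 s pairing window and all names are assumptions for the example, not taken from the specification.

```python
def refine_delays(pause_times_s, punct_entry_times_s, max_pairing_gap_s=3.0):
    """Pair each detected speech pause with the first punctuation character
    entered after it (within a pairing window); each pair is a direct
    measurement of the transcription delay at that point, from which
    corrected time positions for the surrounding text can be computed."""
    anchors = []
    for pause in pause_times_s:
        candidates = [t for t in punct_entry_times_s
                      if 0.0 <= t - pause <= max_pairing_gap_s]
        if candidates:
            anchors.append((pause, min(candidates) - pause))
    return anchors

# Pauses detected at 4.2 s and 9.8 s; punctuation typed at 5.9 s and 11.1 s
# (the entry at 20.0 s is too far from any pause to be paired).
anchors = refine_delays([4.2, 9.8], [5.9, 11.1, 20.0])
```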
According to a variant of the inventive method, said association data are stored together with said text data in a common synchronized text data file. In this way, storage of the association data, which may advantageously take the form of time stamps, i.e. numbers indicative of a time position in the sound data, e.g. elapsed time measured from the beginning of a corresponding sound data file, is achieved in analogy with transcriptions generated by means of a speech recognizer, such that in principle known synchronous playback methods/systems can be used to provide synchronous playback of the associated sound-text data, which are obtained in accordance with the inventive method.
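As an illustration of this common-file variant, the sketch below stores each text datum together with its time stamp (seconds elapsed from the beginning of the sound data file) in a single file; the JSON layout is an assumption made for the example, not the format used by the invention.

```python
import json
import os
import tempfile

def write_synchronized_text(path, associations):
    """Store the text data together with its association data (time stamps in
    seconds from the start of the sound data file) in one synchronized text
    data file."""
    payload = [{"t": t, "text": text} for t, text in associations]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

def read_synchronized_text(path):
    """Restore the (time stamp, text datum) pairs from the file."""
    with open(path, encoding="utf-8") as f:
        return [(entry["t"], entry["text"]) for entry in json.load(f)]

# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "transcript.sync.json")
write_synchronized_text(path, [(0.0, "This "), (0.8, "is "), (1.6, "a test.")])
restored = read_synchronized_text(path)
```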
Alternatively, if suitable for further data processing, said association data can be stored separately from said text data in a synchronization file.
According to a second aspect of the invention there is provided a method for synchronized playback of sound data and corresponding text data, comprising the steps of repeatedly playing back a respective sound datum at a given point of time, and showing a text datum associated with that sound datum at substantially said same point of time, said associated text datum being obtained in accordance with association data obtained according to any one of the above-described variants of the inventive synchronizing method. In this way, synchronous playback is readily available even when transcribing sound manually, e.g. for correction purposes.
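Given association data of the kind described above, the text datum to show at a given playback position can be found with a binary search over the stored time stamps. A minimal sketch with illustrative sample data:

```python
import bisect

def text_datum_at(associations, playback_position_s):
    """Return the text datum to highlight while the sound datum at
    `playback_position_s` is being played; `associations` is a list of
    (corrected_time_s, text_datum) pairs sorted by time."""
    times = [t for t, _ in associations]
    i = bisect.bisect_right(times, playback_position_s) - 1
    return associations[i][1] if i >= 0 else None

assocs = [(0.0, "This "), (0.8, "is "), (1.6, "a test.")]
```

A playback loop would call this at each refresh tick and highlight the returned datum, e.g. on a display screen.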
According to a third aspect of the invention there is provided a system for synchronizing sound data and text data, comprising query means for repeatedly querying said sound data and said text data to obtain a current time position corresponding to a currently played sound datum and a currently transcribed text datum, correcting means for correcting said current time position by applying a time correction value in accordance with a transcription delay, and data generating means for generating at least one association datum indicative of a synchronization association between said corrected time position and said currently transcribed text datum.
Such a system is particularly suited for translating into action the inventive method according to the first aspect of the invention as described above.
In another embodiment of the inventive system said data processing means is adapted for identifying characteristic speech related information in said sound data, in particular pauses in said sound data corresponding to punctuation in said text data, and for improving the time correction value in accordance with corresponding time positions in said sound data and related text data. This helps to further improve synchronicity between the sound and the text, for instance by generating an additional association datum linking a speech pause at the end of a phrase to a corresponding punctuation character entered in temporal vicinity thereto, e.g. a full stop or a comma.
In order to be compatible with known realizations of synchronous playback the inventive system may be devised such that said association data are stored together with said text data in a common synchronized text data file, as is generally the case in known systems, which rely on speech recognition for generating text data. This is of particular interest since it allows mixing of recognized and transcribed text in a single document, e.g. if a speech recognizer could not process a longer section of sound data which therefore had to be transcribed manually. However, for full flexibility of realization said association data can alternatively be stored separately from said text data in a synchronization file.
According to a fourth aspect of the invention there is provided a system for synchronized playback of sound data and corresponding text data, comprising playback means for playing back a respective sound datum at a given point of time, and showing means for showing a text datum associated with that sound datum at substantially said same point of time, wherein the system further comprises a sub-system according to any one of the above-described variants of the inventive system for synchronizing sound data and text data. By this means, the inventive system according to said fourth aspect of the invention can easily incorporate a synchronous playback feature even when transcribing sound manually, e.g. for correction purposes.
The synchronized playback system according to said fourth aspect of the invention is particularly useful for the transcription of sound to text, in particular for medical transcription.
More generally, the synchronized playback system according to said fourth aspect of the invention is particularly useful as part of a correction stage in a document creation workflow, the latter comprising at least the stages of dictation, speech recognition, and correction, and optionally a further stage of review/quality assurance. Further advantages and characteristics of the present invention can be gathered from the following description of a preferred embodiment with reference to the enclosed drawings. The features mentioned above as well as below can be used in accordance with the invention either individually or in conjunction. The embodiments mentioned are not to be understood as an exhaustive enumeration but rather as examples of the underlying concept of the present invention.
The following detailed description of the invention refers to the accompanying drawings. The same reference numerals may be used in different drawings to identify the same or similar elements.
In order to be able to perform the specific actions defined above, all of the aforementioned system components 2-5 are connected to a central control unit in the form of data processing means 6, e.g. a microprocessor, including at least one timer unit 6a. In this way, the inventive system 1 is preferably devised as a PC-based system 7 as indicated by the dash-dotted box in
According to the basic concept of the invention, for creating association data indicative of a synchronization association between said sound data and said text data, said data processing means 6 comprise query means 8 for repeatedly querying said sound data SD and said text data to obtain a current time position corresponding to a currently played sound datum and a currently entered text datum. Furthermore, the data processing means 6 comprise correcting means 9 for correcting said current time position by applying a time correction value in accordance with a transcription delay, and data generating means 10 for generating an association datum indicative of a synchronization association between said corrected time position and said currently entered text datum. The aforementioned components 8-10 of the data processing means 6 are preferably implemented in software form. In this context the data processing means with reference to
Text data TD entered by means of the input means 5 can also be stored in the storage means 3, preferably together with said association data (see below), as a text data file TDF. An exemplary file format will be explained below with reference to
For synchronized playback of sound data SD and corresponding text data TD, the system 1, in addition to the audio playback means 4 for playing back the sound data SD, i.e. a respective sound datum at a given point of time, also comprises showing means 14 in connection with the data processing means 6 for showing a text datum associated with a sound datum that is being played back at substantially the same point of time, thus achieving said synchronous playback. Said showing means 14 advantageously take the form of a standard PC display screen, on which said text datum can be shown either by simply writing it on the screen or by highlighting it, or the like. In this way, by successively playing back the whole contents of the sound data file, the entire corresponding transcription text data TD is displayed in synchronous fashion.
Optionally, the system 1 according to the invention, i.e. the data processing means 6, further comprises monitoring means 15, 16, e.g. a sound data level monitor 15 comprising a timer 17, and a text input monitor 16, illustrated with dashed lines in
According to the invention, both the sound data SD and the entered text data TD are repeatedly queried, preferably on a regular temporal basis, by means of the query means 8 (
Due to the finite hearing and reaction speed of the transcriptionist, the text data TD generally lags behind the sound data SD as illustrated in
In order to create an association between the sound data SD and the entered text data TD for later synchronous playback by the inventive system 1 (
According to the invention, a currently entered text datum TDj, e.g. text datum TD7=“is . . . a” at Q7, is stored in the data buffer 12 (
In order to further improve the synchronicity between sound data and text data, characteristic speech-related information in the sound data, in particular pauses in the sound data which correspond to punctuation in the text data, is used for generating additional association data between time positions in the sound data corresponding to said speech-related information and related text data. To this end, according to a variant of the inventive system 1, the sound data level monitor 15 (
In this way, the level monitor 15 (
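A simple level-based pause detector of the kind such a monitor might implement can be sketched as follows; the frame length, threshold, and minimum pause duration are assumed values, not taken from the specification.

```python
def detect_pauses(levels, frame_s=0.1, threshold=0.05, min_pause_s=0.4):
    """Scan per-frame audio levels (e.g. RMS over 100 ms frames) and return
    (start_s, end_s) of every stretch that stays below the threshold for at
    least `min_pause_s`, i.e. a candidate speech pause."""
    pauses, start = [], None
    for i, level in enumerate(levels):
        t = i * frame_s
        if level < threshold:
            if start is None:
                start = t          # silence begins
        else:
            if start is not None and t - start >= min_pause_s:
                pauses.append((start, t))
            start = None
    end = len(levels) * frame_s
    if start is not None and end - start >= min_pause_s:
        pauses.append((start, end))
    return pauses

# Two loud stretches around a 0.5 s silent gap (frames of 100 ms each).
levels = [0.3, 0.4, 0.01, 0.02, 0.01, 0.02, 0.01, 0.35, 0.3]
pauses = detect_pauses(levels)
```

The end time of each detected pause could then be paired with a punctuation character entered in temporal vicinity, as described above.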
The inventive system 1 (
Optionally a further stage of review/quality assurance can be provided, which may also make use of the inventive methods described in detail above.
Foreign Application Priority Data

| Number | Date | Country | Kind |
|---|---|---|---|
| 05107861 | Aug 2005 | EP | regional |

PCT Filing Data

| Filing Document | Filing Date | Country | Kind | 371(c) Date |
|---|---|---|---|---|
| PCT/IB2006/052865 | 8/18/2006 | WO | 00 | 2/25/2008 |

PCT Publication Data

| Publishing Document | Publishing Date | Country | Kind |
|---|---|---|---|
| WO2007/023436 | 3/1/2007 | WO | A |

References Cited: U.S. Patent Documents

| Number | Name | Date | Kind |
|---|---|---|---|
| 6260011 | Heckerman et al. | Jul 2001 | B1 |
| 6282510 | Bennett et al. | Aug 2001 | B1 |
| 6338038 | Hanson | Jan 2002 | B1 |
| 6360237 | Schulz et al. | Mar 2002 | B1 |
| 6636238 | Amir et al. | Oct 2003 | B1 |
| 7010489 | Lewis et al. | Mar 2006 | B1 |
| 7058889 | Trovato et al. | Jun 2006 | B2 |
| 20020010916 | Thong et al. | Jan 2002 | A1 |
| 20020049595 | Bennett et al. | Apr 2002 | A1 |
| 20020143534 | Hol | Oct 2002 | A1 |
| 20020163533 | Trovato et al. | Nov 2002 | A1 |
| 20040006481 | Kiecza et al. | Jan 2004 | A1 |
| 20040111265 | Forbes | Jun 2004 | A1 |
| 20040138894 | Kiecza et al. | Jul 2004 | A1 |

References Cited: Other References

Bainbridge, D., and Cunningham, S. J.: “Making Oral History Accessible Over the World Wide Web”, History and Computing, vol. 10, no. 1-3, pp. 73-81 (1998).

Office Action dated Sep. 8, 2010 from corresponding Chinese patent application No. 200680031279.3.

Publication Data

| Number | Date | Country |
|---|---|---|
| 20080195370 A1 | Aug 2008 | US |