The present application is based upon and claims the benefit of priority of European patent application No. 15161466.6, filed on Mar. 27, 2015, the contents of which are incorporated herein by reference in their entirety.
1. Field of the Invention
The present invention relates to a computer-implemented method, device and system for converting text data into speech data.
2. Description of the Related Art
Text-to-speech technology enables text data to be converted into synthesized speech data. An example of such technology is the BrightVoice technology developed by IVONA Software of Gdansk, Poland.
One use of text-to-speech technology is disclosed in EP 0 457 830 B1. This document describes a computer system that is able to receive and store graphical images from a remote facsimile machine. The system includes software for transforming graphical images of text into an ASCII encoded file, which is then converted into speech data. This allows the user to review incoming faxes from a remote telephone.
The inventors of the present invention have developed a use of text-to-speech technology that involves scanning a document, extracting the text from the document and converting the text to speech data (scan-to-voice). The speech data produced from the scanned document can then be sent (in the form of an audio file, for example) to a particular location by email or other methods via a network, or to external storage means such as an SD card or USB drive, for example. However, the size of speech data is typically large (approximately 3-5 MB per 1000 characters of text) and a problem arises in that a user may face difficulty in sending the data to a particular location. This is because email services usually limit the size of file attachments, and large speech data will increase network load and require more storage space on a server or other storage means.
It is an aim of the present invention to at least partially solve the above problem and provide more information and control over speech data produced from text data.
According to an embodiment of the present invention, there is provided a computer-implemented method for converting text data into speech data, the method including: obtaining a predetermined speech data size limit; determining whether or not converting the text data into speech data will produce speech data with a size greater than the speech data size limit; and converting the text data into speech data such that the size of the speech data is equal to or lower than the speech data size limit.
According to an embodiment of the present invention, there is provided a device for converting text data into speech data including: a processor configured to obtain a predetermined speech data size limit and determine whether or not converting text data into speech data will produce speech data with a size greater than the speech data size limit; and a text-to-speech controller configured to convert the text data into speech data such that the size of the speech data is equal to or lower than the speech data size limit.
According to an embodiment of the present invention, there is provided a system including: a scanner configured to scan a document to produce a scanned document image; a service configured to extract text data from the scanned document image; the above device for converting the text data into speech data; and a distribution controller configured to transmit the speech data, optionally with at least one of the scanned document image and the text data, to a particular location.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
Exemplary embodiments of the invention are described below with reference to the accompanying drawings.
A system according to an embodiment of the invention is depicted in
After converting the text data 107 into speech data 108, the speech data 108 may then be conveniently transmitted to a particular location. This includes sending the speech data 108 (e.g. in the form of an audio file 109) to a user or another recipient via email, storing the speech data 108 on a document server, or storing the speech data 108 on external storage means.
The speech data 108 may be transmitted on its own, but may also be transmitted together with the original scanned document image 106 and/or the text data 107 extracted from the scanned document image 106.
An example of an application for sending speech data 108 in the form of an audio file 109 produced from a scanned document 105 is shown schematically in
The present invention is not limited to transmitting the scanned document image 106 and/or the speech data 108 to a recipient via email. A user 103 may also transmit the scanned document image 106 and/or the speech data 108 to a recipient using another communication application, such as an instant messaging application that is capable of transferring files between the user 103 and the recipient. Furthermore, any combination of the scanned document image 106, the text data 107 and the speech data 108 can be sent to the recipient.
A method according to the present invention is depicted as a process diagram in
In step S103, the scan-to-voice controller 307 determines whether or not converting the extracted text data 107 into speech data 108 will produce speech data 108 with a size greater than a predetermined speech data size limit 115. The predetermined speech data size limit 115 may be manually set by the user 103 or a system administrator. If a speech data size limit 115 is not manually set, then a default value may be automatically set by the application 301. The user 103 may change the speech data size limit 115 as and when required, by changing the value of the speech data size limit 115 in a settings menu of the application 301, or by setting a speech data size limit 115 at the beginning of a scanning job.
There are multiple different approaches to determining whether or not converting the extracted text data 107 into speech data 108 will produce speech data 108 with a size greater than a predetermined speech data size limit 115, which are discussed below.
Table 1 shows an example of some parameters that are stored in the application. For a specified language and speech speed, the length of time required for a voice generated by the text-to-speech engine to speak a specified number of a type of text unit is stored as a speech duration parameter. The term “type of text unit” encompasses characters, words and paragraphs. The term “characters” includes at least one of the following: letters of an alphabet (such as the Latin or Cyrillic alphabet), Japanese hiragana and katakana, Chinese characters (hanzi), numerical digits, punctuation marks and whitespace. Because some types of characters, such as punctuation marks, are not necessarily voiced in the same way as letters, certain character types may be excluded from the character count. In Table 1, and the following examples, the type of text unit that will be used is characters.
In an embodiment of the present invention, the determining step S103 comprises estimating the size of speech data 108 that would be produced by converting text data 107.
In an example of the present embodiment, the text data 107 contains 1500 characters and the text-to-speech engine is set to the English language at normal speed. The text-to-speech engine is also set to output the speech data 108 as a WAV file (44.1 kHz sample rate, 16 bits per sample, stereo). Using the parameters in Table 1, the speech duration of the generated voice can be estimated in the following manner:
An estimated file size of the output WAV file can then be determined based on the data rate (kB/s) of the WAV file and the estimated speech duration as follows:
The estimated file size can then be compared to the predetermined speech data size limit 115. If the estimated file size is greater than the speech data size limit 115, step S103 determines that converting text data 107 would result in speech data 108 having a size greater than that of the predetermined speech data size limit 115.
In an alternative embodiment, the determining step S103 comprises estimating the number of characters that can be converted into speech data 108 within the predetermined size limit 115.
The estimated number of characters that can be converted to speech data 108 within the speech data size limit 115 can be determined based on an estimated speech duration per character and the duration of a WAV file with a file size equal to the speech data size limit 115.
In an example of the present embodiment, the text-to-speech engine is set to the English language at normal speed and is set to output the speech data 108 as a WAV file (44.1 kHz sample rate, 16 bits per sample, stereo). The speech data size limit 115 has been set as 3 MB. The estimated number of characters that can be converted to speech data 108 within the speech data size limit 115 can be calculated in the following manner:
The estimated number of characters can then be compared to the actual number of characters in the text data 107 extracted from the scanned document image 106. If the estimated number of characters is less than the actual number of characters, then step S103 determines that converting text data 107 would result in speech data 108 having a size greater than that of the predetermined speech data size limit 115.
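The reverse calculation of this alternative embodiment can be sketched as follows, again using an assumed 70 ms-per-character duration parameter in place of the Table 1 value:

```python
# Estimate how many characters fit within the speech data size limit.
SECONDS_PER_CHAR = 0.070    # assumed Table 1 parameter (illustrative)
DATA_RATE = 44_100 * 2 * 2  # bytes/s for a 44.1 kHz, 16-bit, stereo WAV

def max_characters(size_limit_bytes: int) -> int:
    """Estimated number of characters convertible within the size limit."""
    max_duration = size_limit_bytes / DATA_RATE       # seconds of audio that fit
    return int(max_duration / SECONDS_PER_CHAR)       # characters voiced in that time

limit = 3_000_000                # a 3 MB speech data size limit
n = max_characters(limit)        # 242 characters under these assumptions
exceeds = 1500 > n               # text of 1500 characters would exceed the limit
```

Here the 3 MB limit corresponds to roughly 17 seconds of audio at the stated data rate, or about 242 characters, so 1500 characters of extracted text would be determined to exceed the limit.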
The present invention is not limited to using regular types of text units (e.g. characters, words, paragraphs) to determine whether or not converting the text data into speech data will produce speech data with a size greater than the speech data size limit. For example, a text buffer size may be used instead, with an associated speech duration.
The calculations described above may be performed in real time by the application 301.
Alternatively, the calculations may be performed in advance and the results stored in a lookup table. For example, estimated file sizes can be stored in association with particular numbers of characters or ranges of numbers of characters. For a given number of characters, an estimated file size can be retrieved from the lookup table.
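A lookup-table variant of the estimate might be sketched as below; the range boundaries and precomputed sizes are illustrative (each stored size is the estimate at the range's upper bound, computed offline with the same assumed parameters as above).

```python
import bisect

# Precomputed lookup table: character-count ranges mapped to estimated
# WAV sizes in bytes. Values are illustrative, computed offline for the
# upper bound of each range (conservative estimates).
RANGE_STARTS = [0, 500, 1000, 1500, 2000]
ESTIMATED_SIZES = [6_174_000, 12_348_000, 18_522_000, 24_696_000, 30_870_000]

def lookup_estimated_size(num_chars: int) -> int:
    """Return the precomputed size estimate for the range containing num_chars."""
    idx = bisect.bisect_right(RANGE_STARTS, num_chars) - 1
    return ESTIMATED_SIZES[idx]
```

For a given character count, the estimate is then retrieved in constant time rather than recomputed for each scanning job.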
In step S104, if it was determined in step S103 that converting the text data 107 would result in speech data 108 having a size greater than that of the predetermined speech data size limit 115, then the method proceeds to step S105.
If it was instead determined in step S103 that converting text data 107 would result in speech data 108 under the speech data size limit 115, then step S104 proceeds to step S106 without carrying out step S105.
In step S105, the text data 107 extracted from the scanned document image 106 may be modified such that the text-to-speech engine produces speech data 108 with a size equal to or lower than the speech data size limit 115.
In one embodiment, the user 103 is shown an alert 116 on the user interface that informs the user 103 that the text data 107 will result in speech data 108 over the predetermined speech data size limit 115.
The alert 116 shown in
The application can automatically cut the text data 107 in a variety of ways. For example, the application may delete characters from the end of the text data until the text data 107 contains the maximum number of characters. Preferably, the application adjusts the cutting of the text data 107 so that the text data 107 ends at a whole word, rather than in the middle of a word. Other ways of modifying the text data 107 include deleting whole words, deleting punctuation marks, and abbreviating or contracting certain words. The application may also use a combination of different ways to cut the text data 107.
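The preferred whole-word cutting behaviour might be sketched as follows (a minimal illustration, not the application's actual implementation):

```python
def cut_to_whole_words(text: str, max_chars: int) -> str:
    """Truncate text to at most max_chars, cutting back to the last whole word."""
    if len(text) <= max_chars:
        return text
    truncated = text[:max_chars]
    # If the cut falls mid-word, drop the partial word at the end.
    if not text[max_chars].isspace() and " " in truncated:
        truncated = truncated.rsplit(None, 1)[0]
    return truncated.rstrip()

cut_to_whole_words("the quick brown fox", 13)   # yields "the quick"
```

The hard cut at 13 characters would end mid-word ("the quick bro"), so the partial word is dropped and the result ends at a whole word.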
In another embodiment, the text data 107 may be modified by the user 103 before converting the text data 107 into speech data 108.
In another embodiment, if it was determined in step S103 that converting the text data 107 would result in speech data 108 having a size greater than that of the predetermined speech data size limit 115, then the conversion produces speech data 108 as several files, each file having a size lower than the speech data size limit 115.
This is achieved by dividing the text data 107 into blocks before conversion, such that the conversion of each block will produce separate speech data files, each having a size lower than the speech data size limit 115. Division of the text data 107 into appropriate blocks is achieved by dividing the text data 107 such that each block contains a number of characters equal to or less than the maximum number of characters, for example.
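The block-division step above might be sketched as follows (an illustrative helper that also prefers breaking at whitespace, in the spirit of the whole-word cutting described earlier):

```python
def split_into_blocks(text: str, max_chars: int) -> list:
    """Split text into blocks of at most max_chars, breaking at whitespace
    where possible so each block ends at a whole word."""
    blocks = []
    remaining = text
    while len(remaining) > max_chars:
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:                 # no space found: hard cut mid-word
            cut = max_chars
        blocks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        blocks.append(remaining)
    return blocks

split_into_blocks("one two three four", 9)   # ["one two", "three", "four"]
```

Each resulting block can then be converted independently, producing separate speech data files that each stay within the speech data size limit 115.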
The user 103 can choose to carry out this processing through an alert or prompt similar to alert 116, where the user 103 is provided with the option to divide the speech data 108 (by dividing the text data 107, as described above). If the user 103 chooses to proceed, then the application may carry out the dividing process automatically, or the user 103 may be presented with an interface that allows the user 103 to manually select how the text data 107 is divided into each block.
In a further embodiment, in step S105, a conversion parameter 118 of the text-to-speech engine is changed before converting the text data 107 into speech data 108. For example, a ‘speech sound quality’ parameter which determines the sound quality of the speech data produced by the text-to-speech engine can be changed to a lower quality to reduce the size of the speech data 108 produced from the text data 107. A ‘speech speed’ parameter of the text-to-speech engine could also be changed to allow more characters/words to be voiced as speech within the speech data size limit 115.
A parameter of the audio file 109 output by the text-to-speech engine, such as bitrate, sampling rate or audio file format, may also be changed in order to produce an audio file with a lower size.
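The effect of such audio parameter changes on file size can be illustrated for the uncompressed WAV case (for a compressed format such as MP3, the size would instead be governed by the encoding bitrate):

```python
def wav_size(duration_s: float, sample_rate: int, bits: int, channels: int) -> int:
    """Uncompressed WAV payload size in bytes for the given audio parameters."""
    return round(duration_s * sample_rate * (bits // 8) * channels)

full = wav_size(60, 44_100, 16, 2)   # 60 s of 44.1 kHz, 16-bit stereo audio
low = wav_size(60, 22_050, 16, 1)    # same duration after halving the sample
                                     # rate and switching to mono
# low is one quarter of full: 10,584,000 vs 2,646,000 bytes
```

Halving the sample rate and using mono output thus reduces the file size by a factor of four without shortening the text that is voiced.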
The application may change any of the conversion parameters 118 or audio file parameters automatically after alerting the user in a similar manner to alert 116. Alternatively, the user 103 may change a conversion parameter 118 manually, through a screen prompt, for example.
Once the text data 107 has been modified or a conversion parameter 118 has been changed so that the speech data 108 produced by the text-to-speech engine will have a size equal to or lower than the speech data size limit 115, the method proceeds to step S106.
In step S106, the text data 107 is converted into speech data 108 having a size equal to or lower than the speech data size limit 115. The conversion is carried out using known text-to-speech technology. The text-to-speech engine is configurable by changing conversion parameters 118 such as speech sound quality and speech speed. The text-to-speech engine preferably outputs the speech data 108 as an audio file 109.
After the conversion of the text data 107 into speech data having a size equal to or lower than the speech data size limit 115, the method proceeds to step S107.
In step S107, the speech data 108 is transmitted with the scanned document image 106 to a particular location. The location and method of transmission are not limited and include, for example, sending to a recipient via email, to a folder on a document server, or to external memory (e.g. an SD card or USB drive). Furthermore, the invention is not limited to sending the speech data 108 with the scanned document image 106. Instead, the speech data 108 may be sent on its own, or with the text data 107, or with both the text data 107 and the scanned document image 106.
For example, in the case of transmitting via email, the speech data 108, the scanned document image 106 and/or the text data 107 can be sent as separate files attached to the same email. In the case of storing on a document server, the speech data 108, the scanned document image 106 and/or the text data 107 can be saved together as separate files within the same folder or saved together in a single archive file. In addition, the files may be associated with one another using metadata. In a specific embodiment, the files are handled by an application which organizes the files together in a “digital binder” interface. An example of such an application is the gDoc Inspired Digital Binder software by Global Graphics Software Ltd of Cambridge, United Kingdom.
The present invention is not limited to the arrangement of the image processing device 101, server 102 and user 103 described thus far.
Although in each of the above-described embodiments the extraction of text data 107 from the scanned document image 106 is performed by the image processing device 101, the text extraction could also be performed by an OCR engine at a remote server.
Furthermore, the smart device 119 may replace the image processing device 101 for the steps of scanning and/or extraction of text data in any of the above-described embodiments. For example, if the smart device 119 has a camera, an image 106 of a paper document 105 can be obtained and image processed to improve clarity if necessary (“scanning”), and then text data 107 may be extracted from the document image 106 using an OCR engine contained in the smart device 119.
The embodiments of the invention thus allow a speech data size limit 115 to be specified and text data 107 to be converted into speech data 108 such that the size of the speech data is equal to or lower than the speech data size limit 115. The user 103 therefore does not waste time waiting for a text-to-speech conversion that will produce speech data 108 that the user 103 cannot send.
In some embodiments of the invention the user 103 is also informed, in advance of a text-to-speech conversion, whether or not converting the text data 107 into speech data 108 will produce speech data 108 with a size greater than the speech data size limit 115. The user 103 is therefore provided with useful information relating to the size of the speech data 108 that will be produced.
Furthermore, some embodiments of the invention allow the text data 107 to be automatically modified so that a text-to-speech conversion of the text data 107 will result in speech data 108 with a size equal to or lower than the speech data size limit 115. The user 103 therefore is able to quickly and conveniently obtain speech data 108 with a size equal to or below the speech data size limit 115 from a paper document 105. The user 103 does not have to spend time inconveniently modifying and rescanning the paper document 105 itself to obtain speech data 108 with a size equal to or below the speech data size limit 115.
Other embodiments of the invention allow the text data 107 to be modified by the user 103 so that a text-to-speech conversion of the text data 107 will result in speech data 108 with a size equal to or lower than the speech data size limit 115. This conveniently gives the user 103 more control over the speech data 108 to be produced from the text data 107. The user 103 also does not have to spend time inconveniently modifying and rescanning the paper document 105 itself to obtain speech data 108 with a size equal to or below the speech data size limit 115.
Some embodiments of the invention allow separate speech data files to be produced from the text data 107, each file having a size equal to or below the speech data size limit 115. In this way, all of the text data 107 can be converted to speech data 108 in the same session without abandoning any of the text content.
Some embodiments of the invention also allow conversion parameters 118 to be changed automatically or manually by the user 103 before text-to-speech conversion takes place, so that a text-to-speech conversion of the text data 107 will result in speech data 108 with a size equal to or lower than the speech data size limit 115. This allows speech data 108 of a suitable size to be produced, without needing to modify the text data. This also provides similar advantages to those identified above, namely saving the user 103 time and providing convenience, as the user 103 does not have to modify and rescan the paper document 105 itself multiple times in order to obtain speech data 108 with a size equal to or below the speech data size limit 115.
Having described specific embodiments of the present invention, it will be appreciated that variations and modifications of the above-described embodiments can be made. The scope of the present invention is not to be limited by the above description but only by the terms of the appended claims.