The present invention relates to a voice data playback speed conversion method and a voice data playback speed conversion device.
When voice signals recorded on a recording medium, e.g., a CD, cassette tape, or video tape, are played back, the playback speed is sometimes changed from the standard playback speed. For example, the playback speed is increased to listen to a prescribed amount of voice in a short time, and it is reduced when the voice is hard to follow due to, for example, rapid speech. To change the playback speed, the revolution speed of the CD or the running speed of the tape is increased or reduced. However, in this playback method, the frequency of the voice signals read from the recording medium, e.g., CD, changes with the playback speed, so the tone of the voice changes and the voice becomes hard to listen to.
Thus, a method for converting a playback speed without changing tone has been proposed, which comprises a step of dividing original voice signals into a plurality of voice blocks An (n is a natural number) having a predetermined time length and a step of changing the combination of the voice blocks. For example, for double-speed playback, every other voice block An is played back (e.g., A1-A3-A5 . . . ), so that the playback time is halved, and the voice can be played back without substantially changing tone because the frequency of the original voice signals is maintained to some extent.
Note that the voice signals are divided into voice blocks by a basic cycle, which is the reciprocal of a basic frequency, i.e., the lowest frequency among the frequency components included in the voice block of the original voice signals. Since the voice signals vary constantly, the basic frequency also varies, and the time lengths of adjacent voice blocks usually differ.
However, if the original voice signals are divided into the voice blocks An by an improper time length, the signals become discontinuous at the joints between voice blocks when the combination of the voice blocks is changed to convert the playback speed, so rasping noises are generated.
In another method, suitable dividing points of the voice blocks An of the original voice signals are defined on the basis of zero cross points of the original voice signals, and connecting points of the voice blocks are the zero cross points, so that noises can be reduced. Technologies for dividing voice signals at zero cross points are disclosed in, for example, Patent Documents 1-3.
To perform the technologies for converting voice data playback speed disclosed in Patent Documents 1-3, a huge amount of calculation is required to extract voice blocks having a suitable time length from the original voice data. Thus, the process of converting voice data playback speed is usually performed by a high-performance personal computer. A portable dedicated playback device other than a personal computer is desired, but such a device is difficult to realize in terms of battery capacity and the thermal design of a high-performance CPU of the kind used in a personal computer, etc. A low-performance CPU, on the other hand, takes a long time to perform the process of converting voice data playback speed, so real time processing cannot be performed.
Further, basic frequencies of human voices of men and women of all ages vary widely over 70-350 Hz, so it is difficult to calculate a basic frequency for defining the time length of voice blocks by simply processing the original voice signals uniformly. Complicated calculation is therefore required, which makes processing the voice data even more difficult.
Thus, a first object of the present invention is to provide a voice data playback speed conversion method and a voice data playback speed conversion device, which are capable of enabling a process of converting voice data playback speed even in a voice data playback device having a low-performance CPU.
A second object of the present invention is to provide a voice data playback speed conversion method and a voice data playback speed conversion device, which are capable of greatly reducing deterioration of voice data, by suitably calculating basic cycles of the voice data, when the voice data playback speed is converted.
The inventor of the present invention has studied and conceived the following structures.
Namely, the voice data playback speed conversion method for converting voice data playback speed comprises: a step of removing DC components, wherein DC components of original voice data being a playback object are removed; a step of extracting basic voice signals constituted by a basic frequency of the voice data, from which DC components have been removed, by setting a cutoff frequency at an intermediate value of the basic frequency and low-pass filtering so as to extract the basic frequency; a step of extracting rising zero cross points of the basic voice signals; a step of setting a reference zero cross point, which is an arbitrary reference zero cross point selected from the rising zero cross points; a step of selecting a plurality of the rising zero cross points temporally after the reference zero cross point within a first predetermined time range; a step of selecting a reference waveform temporally after the reference zero cross point until a second predetermined time; a step of selecting comparison object waveforms from each of the zero cross points, which has been selected in said step of selecting the rising zero cross points, until the second predetermined time; a step of calculating an autocorrelation value between the reference waveform and the reference waveform by using a correlation function; a step of calculating correlation values between the reference waveform and the comparison object waveforms by using a correlation function; a step of calculating voice blocks each of which is segmented by a start point of the voice data and an end point thereof, wherein the autocorrelation value is compared with the correlation values, the zero cross point of the comparison object waveform which is used for calculating the correlation value whose concordance rate with respect to the autocorrelation value is highest is defined as a second reference zero cross point, the start point of the voice data corresponds to the reference zero cross point, and the end point of the voice data corresponds to the second reference zero cross point; and a step of expanding and contracting the voice data in basic cycle units so as to convert the playback speed of the voice data.
With this method, the calculation amount for converting the playback speed of the voice data can be greatly reduced, so that the process of converting voice data playback speed can be performed even by a voice data playback device alone. Further, the voice blocks, which are basic units of the voice data, can always be correctly extracted when the process of converting the playback speed of the voice data is performed, so that playback quality of the voice data after converting the playback speed can be made significantly higher than before.
The voice data playback speed conversion device for converting voice data playback speed comprises: means for removing DC components, wherein DC components of original voice data being a playback object are removed; means for extracting basic voice signals constituted by a basic frequency of the original voice data, from which DC components have been removed, by setting a cutoff frequency at an intermediate value of the basic frequency and low-pass filtering so as to extract the basic frequency; means for extracting rising zero cross points of the basic voice signals; means for setting a reference zero cross point, which is an arbitrary reference zero cross point selected from the rising zero cross points; means for selecting a plurality of the rising zero cross points temporally after the reference zero cross point within a first predetermined time range; means for selecting a reference waveform from the reference zero cross point until a second predetermined time; means for selecting comparison object waveforms from each of the zero cross points, which has been selected by the means for selecting the rising zero cross points, until the second predetermined time; means for calculating an autocorrelation value between the reference waveform and the reference waveform by using a correlation function; means for calculating correlation values between the reference waveform and the comparison object waveforms by using a correlation function; means for calculating voice blocks each of which is segmented by a start point of the voice data and an end point thereof, wherein the autocorrelation value is compared with the correlation values, the zero cross point of the comparison object waveform which is used for calculating the correlation value whose concordance rate with respect to the autocorrelation value is highest is defined as a second reference zero cross point, the start point of the voice data corresponds to the reference zero cross point, and the end point of the voice data corresponds to the second reference zero cross point; and means for expanding and contracting the voice data in basic cycle units so as to convert the playback speed of the voice data.
With this structure, the calculation amount for converting the playback speed of the voice data can be greatly reduced, so that the process of converting voice data playback speed can be performed even by a voice data playback device alone. Further, the voice blocks, which are basic units of the voice data, can always be correctly extracted when the process of converting the playback speed of the voice data is performed, so that playback quality of the voice data after converting the playback speed can be made significantly higher than that of the conventional technologies.
In the present invention, the calculation amount for converting the playback speed of the voice data can be greatly reduced, so that the process of converting the voice data playback speed can be performed even by the voice data playback device alone. Further, the voice blocks, which are the basic units of the voice data, can always be correctly extracted when the process of converting the playback speed of the voice data is performed, so that the conversion of the playback speed of the voice data can be performed, without deteriorating playback quality of the voice data, even in a voice playback device whose performance is much lower than that of a conventional device.
Embodiments of a voice data playback device and a voice data playback speed conversion method of the present invention will now be described in detail with reference to the accompanying drawings.
As shown in
A structure of each of the sections of the voice data playback device 10 and a processing flow of the method for converting playback speed of the voice data collected by the data input/output section 20 will be explained, in parallel, with reference to
First, the voice data playback device 10 collects 100 msec of voice data by voice data collecting means 22 of the data input/output section 20 and stores the collected data in the data memory section 30 (a step of collecting voice data). Namely, the data are buffered and inputted in unit times of 100 msec. In the second or later voice data collecting step, if a part of the voice data collected in the previous data collecting step is left unprocessed in holdover data storing means 39, the left voice data are added to the head of the voice data collected in the current data collecting step and stored together. The voice data may be collected from a recording medium, e.g., an optical disk or semiconductor memory, or through a network, etc.
The voice data are stored in voice data storing means 31 of the data memory section 30, as original data (original voice data), together with elapsed time data from the head of the voice data.
A waveform of the original voice data stored in the voice data storing means 31 is shown in
The original voice data D00 shown in
The primary-processed voice data D01, which has been obtained by the above described manner, are stored in primary-processed voice data storing means 32 of the data memory section 30 in a state where the processed voice data are stored with lapsed time data from the beginning of the data collection.
Since the primary-processed voice data D01 shown in
The waveform (graph) of the secondary-processed voice data D02, in which the DC components have been removed by the low-pass filter, is shown in
By performing the low-pass filtering, the secondary-processed voice data D02, which have been processed (filtered) by removing the exempted frequency components, are stored, in secondary-processed voice data storing means 33 of the data memory section 30, with the lapsed time data from the head of the voice data. In this step, the primary-processed voice data stored in the primary-processed voice data storing means 32 may be deleted after applying a high-pass filter.
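The two preprocessing stages described above (DC removal giving D01, then low-pass filtering giving D02) can be sketched as follows. This is only an illustration under stated assumptions: the mean-subtraction DC removal, the single-pole filter design, the 44.1 kHz sampling rate, and the 210 Hz cutoff (the midpoint of the 70-350 Hz range mentioned earlier) are choices made for the example and are not specified in the text.

```python
import math

SAMPLE_RATE = 44_100   # Hz; assumed for this example
CUTOFF_HZ = 210.0      # assumed "intermediate" value: midpoint of 70-350 Hz


def remove_dc(samples):
    """Subtract the mean so the waveform is centred on zero (assumed method)."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def low_pass(samples, cutoff_hz=CUTOFF_HZ, sample_rate=SAMPLE_RATE):
    """Single-pole IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


# 100 msec test signal: DC offset + 150 Hz "voice" tone + 3 kHz overtone.
d00 = [
    0.3
    + 0.5 * math.sin(2.0 * math.pi * 150.0 * i / SAMPLE_RATE)
    + 0.2 * math.sin(2.0 * math.pi * 3000.0 * i / SAMPLE_RATE)
    for i in range(4410)
]
d01 = remove_dc(d00)   # primary-processed data (DC removed)
d02 = low_pass(d01)    # secondary-processed data (high frequencies attenuated)
```

A single-pole filter is used here only because it is cheap to compute; a steeper filter would isolate the basic frequency more cleanly at a higher calculation cost.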
Next, rising zero cross points, at each of which a value of the graph shown in
First, the graph of the secondary-processed voice data D02 always begins from a zero cross point, so the position of the head of the data is basically extracted as the first zero cross point.
In the graph of the secondary-processed voice data D02 shown in
Further, if an amplitude of −42 dB or less continues for 10 msec (441 samples) in the graph, that part is regarded as a silent block and segmented at its end point even if the end point is not a zero cross point. By segmenting the silent part every 10 msec, the length of the sound blocks is approximately equal to that of the silent blocks, so that block combination processing, etc. of the voice data can be easily performed.
If amplitudes of more than −42 dB are present in the graph but no zero cross point exists within 20 msec (882 samples), the part is likewise regarded as a silent block and segmented at an end point even if the end point is not a zero cross point. Even if sound exists in the graph, a waveform having a cycle of 20 msec or more is considered to be background noise which could not be removed by the filtering process.
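The figures in the silence rule can be checked with a short calculation. The sketch below assumes a 44.1 kHz sampling rate (which is consistent with 10 msec being 441 samples) and assumes amplitudes normalised so that full scale is 1.0; the function names are illustrative only.

```python
SAMPLE_RATE = 44_100  # Hz; assumed, consistent with 10 msec = 441 samples


def db_to_linear(db):
    """Convert a level in dB (relative to full scale 1.0) to linear amplitude."""
    return 10.0 ** (db / 20.0)


def msec_to_samples(msec, sample_rate=SAMPLE_RATE):
    """Number of samples covering the given duration."""
    return round(msec * sample_rate / 1000)


def is_silent_run(samples, start, threshold_db=-42.0, run_msec=10):
    """True if every sample in the run starting at `start` is at or below the threshold."""
    limit = db_to_linear(threshold_db)
    n = msec_to_samples(run_msec)
    window = samples[start:start + n]
    return len(window) == n and all(abs(s) <= limit for s in window)


print(msec_to_samples(10), msec_to_samples(20))  # sample counts for 10 and 20 msec
print(db_to_linear(-42.0))                       # linear amplitude of the threshold
```

Under these assumptions, −42 dB corresponds to a linear amplitude of roughly 0.8 % of full scale.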
In the present embodiment, the sound blocks and the silent blocks basically have the same time length, so the above described voice data of the exceptional waveform are also segmented by 20 msec and regarded as the silent block. A first zero cross point found after such silent block is regarded as a rest part of the segmented block for easily treating data and extracted as a zero cross point of the silent block. Namely, in this case, even if the amplitude of the graph is −42 dB or less, the zero cross point is exceptionally extracted.
Further, when a zero cross point is found, immediately after the silent block, in a waveform whose amplitude is more than −42 dB, zero cross points are extracted so as to separate the silent block from the sound block. This rule is applied when the following two conditions are satisfied: a new zero cross point having an amplitude of −42 dB is found after the zero cross point segmenting the silent block; and at least one zero cross point, which is not extracted due to an amplitude of less than −42 dB, exists between the two zero cross points. More precisely, the silent block is terminated at the previously extracted zero cross point, and the currently found zero cross point is extracted as the start of the sound block. Namely, two zero cross points are extracted. This is done so that the sound block always starts from a zero cross point.
In the present embodiment, the threshold value of the amplitude of the graph, e.g., −42 dB, is set so as not to incorrectly extract zero cross points of the waveform in the silent parts, i.e., silent blocks, but the threshold value is not limited to −42 dB. Other threshold values may be used according to characteristics of voice data.
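A minimal sketch of extracting rising zero cross points with an amplitude gate is given below. How the amplitude belonging to a zero cross point is measured is not stated in the text; the example assumes it is the peak absolute value between that crossing and the next one, and it omits the exceptional silent-block rules described above.

```python
def rising_zero_crossings(samples):
    """Indices i where the waveform goes from negative to non-negative."""
    return [
        i for i in range(1, len(samples))
        if samples[i - 1] < 0.0 <= samples[i]
    ]


def gated_crossings(samples, threshold_db=-42.0):
    """Keep only crossings whose following cycle exceeds the threshold.

    Assumption: the amplitude of a crossing is the peak |sample| between
    this crossing and the next one (or the end of the data).
    """
    limit = 10.0 ** (threshold_db / 20.0)
    crossings = rising_zero_crossings(samples)
    kept = []
    for k, start in enumerate(crossings):
        end = crossings[k + 1] if k + 1 < len(crossings) else len(samples)
        peak = max(abs(s) for s in samples[start:end])
        if peak > limit:
            kept.append(start)
    return kept


# Loud cycle, then a near-silent cycle, then a loud sample again.
wave = [-0.5, 0.5, -0.001, 0.001, 0.002, -0.001, 0.5]
print(rising_zero_crossings(wave))
print(gated_crossings(wave))
```

In this toy waveform the crossing inside the near-silent stretch is found by the first function and dropped by the second, which is the behaviour the threshold is intended to produce.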
The extracted rising zero cross points are shown in
Tertiary-processed voice data D03, in which the DC components have been removed, the low-pass filtering has been performed and the rising zero cross points have been extracted, are stored in tertiary-processed voice data storing means 34 of the data memory section 30 in a state where the voice data are stored with lapsed time data from the head of the voice data. In the tertiary-processed voice data storing means 34, the time data at the arrowed points shown in
In this step, the voice data (the primary-processed voice data and/or the secondary-processed voice data) stored in the primary-processed voice data storing means 32 and/or the secondary-processed voice data storing means 33 may be deleted.
Next, the first zero cross point of the rising zero cross points shown in
After setting the reference zero cross point KZ, a plurality of the rising zero cross points are selected temporally after the reference zero cross point KZ, within a first predetermined time range, by zero cross point selecting means 53 of the calculating section 50 (a step of selecting zero cross points).
Considering the amount of data to be calculated and the reliability of the calculation results, the first predetermined time range is defined as, for example, 2-20 msec. As described above, the ordinary basic frequency of human voice is 70-350 Hz, and one cycle corresponding to said frequency is about 2.86-14.29 msec. Since zero cross points within at least one cycle must be searched, the first predetermined time range, including a safety margin, is set to 2-20 msec.
In the present embodiment, three rising zero cross points which meet the above described conditions are detected. The detected rising zero cross points are stored in the zero cross point storing means 35 of the data memory section 30, as comparative zero cross points MZ1, MZ2 and MZ3 which are start point candidate positions of a second reference zero cross point, with time data as well as the reference zero cross point KZ.
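The relationship between the 70-350 Hz range and the 2-20 msec search window, and the selection of comparative zero cross points inside that window, can be sketched as follows. The 44.1 kHz sampling rate and the list of crossing positions are assumptions made up for the example.

```python
SAMPLE_RATE = 44_100  # Hz; assumed for this example


def period_msec(freq_hz):
    """Length of one cycle in milliseconds."""
    return 1000.0 / freq_hz


def candidates_after(reference, crossings, min_msec=2.0, max_msec=20.0,
                     sample_rate=SAMPLE_RATE):
    """Rising zero cross points lying 2-20 msec after the reference point."""
    lo = reference + min_msec * sample_rate / 1000.0
    hi = reference + max_msec * sample_rate / 1000.0
    return [z for z in crossings if lo <= z <= hi]


print(period_msec(350.0), period_msec(70.0))  # shortest and longest cycle

# Hypothetical crossing positions (sample indices); the first one is KZ.
crossings = [0, 50, 300, 600, 880, 1200]
kz = crossings[0]
mz = candidates_after(kz, crossings)
print(mz)
```

With these made-up positions, the crossing at 50 samples is too close to KZ and the one at 1200 samples is too far, so three candidates remain.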
Successively, the waveform extending temporally after the reference zero cross point KZ for a second predetermined time is selected, as a reference waveform of the voice data, by reference waveform selecting means 54 of the calculating section 50 (a step of selecting the reference waveform). In the present embodiment, the second predetermined time is 10 msec. A time of at least a half cycle is required to capture the characteristics of the waveform used in a waveform comparing process described later. Since the first predetermined time range is defined as 2-20 msec on the basis of the basic frequency of human voice as described above, the second predetermined time is defined as 10 msec, which is half of the maximum value of 20 msec.
The selected reference waveform is stored in reference waveform storing means 36 of the data memory section 30.
Next, comparison object waveform selecting means 55 of the calculating section 50 selects waveform data temporally after each of the comparative zero cross points MZ1-MZ3 within the second predetermined time range (a step of selecting the comparison object waveforms). The comparison object waveforms selected by the comparison object waveform selecting means 55 are stored, in the order in which they were selected, in comparison object waveform storing means 37 of the data memory section 30.
Next, autocorrelation value calculating means 56 and correlation value calculating means 57 of the calculating section 50 calculate concordance rates of values of functions in which time is variable (concordance rates of correlation values) between the reference waveform and the comparison object waveforms respectively stored in the reference waveform storing means 36 and the comparison object waveform storing means 37, and select the comparison object waveform whose concordance rate is highest. A concrete manner for calculating the concordance rates of function values will be explained.
The autocorrelation value calculating means 56 segments a time axis of the reference waveform into prescribed time ranges by using the reference waveforms (i.e., functions in which time is variable) and performs product-sum calculation of amplitudes corresponding to the segmented time throughout the entire time axis. The result of the product-sum calculation is stored in autocorrelation value storing means 38 of the data memory section 30 as an autocorrelation value (a step of calculating the autocorrelation value and a step of storing the same).
Next, the correlation value calculating means 57 segments time axes of the reference waveform and the comparison object waveforms into prescribed time ranges by using the reference waveform and the comparison object waveforms (i.e., functions in which time is variable) and performs product-sum calculation of amplitudes corresponding to the segmented time ranges throughout the entire time axis. The result of the product-sum calculation is stored in correlation value storing means 39 of the data memory section 30 as a correlation value (a step of calculating the correlation value and a step of storing the same).
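The product-sum calculations can be illustrated with a small sketch. The text does not give the formula for the concordance rate; the example assumes it is the correlation value expressed as a percentage of the autocorrelation value, and assumes that the "highest" rate is simply the largest such percentage. All function names are illustrative.

```python
import math


def product_sum(a, b):
    """Sum of products of corresponding amplitudes of two waveforms."""
    n = min(len(a), len(b))
    return sum(a[i] * b[i] for i in range(n))


def concordance_rates(reference, candidates):
    """Correlation of each candidate as a percentage of the autocorrelation (assumed definition)."""
    auto = product_sum(reference, reference)
    if auto == 0.0:
        raise ValueError("reference waveform has no energy")
    return [100.0 * product_sum(reference, c) / auto for c in candidates]


def best_candidate(reference, candidates):
    """Index of the candidate with the highest concordance rate."""
    rates = concordance_rates(reference, candidates)
    return max(range(len(rates)), key=rates.__getitem__)


# Toy waveform with a 200-sample cycle; candidates start at 100, 200, 300.
wave = [math.sin(2.0 * math.pi * i / 200.0) for i in range(1000)]
length = 441  # 10 msec at an assumed 44.1 kHz
reference = wave[0:length]
candidates = [wave[s:s + length] for s in (100, 200, 300)]
rates = concordance_rates(reference, candidates)
winner = best_candidate(reference, candidates)
```

In this toy case the candidate starting exactly one cycle after the reference matches it and wins, while the candidates starting at half-cycle offsets are inverted copies and score negatively.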
Second zero cross point selecting means, not shown, is provided in the calculating section 50, and it calculates percentage of the concordance rate of the correlation values by using the correlation value stored in the correlation value storing means 39 and the autocorrelation value stored in the autocorrelation value storing means 38, and selects the comparison object waveform whose concordance rate is highest from the comparison object waveform storing means 37. In the present embodiment, as shown in
As described above, the start point of the reference waveform is limited to a zero cross point, so it is only necessary to calculate the correlation values of the waveforms whose start points are zero cross points, whereas correlation values would ordinarily have to be calculated for waveforms starting from every sample position within the first predetermined time range. Therefore, the number of executions of the correlation function, and hence the calculation amount, can be significantly reduced. Further, the waveform data whose correlation values are calculated have been low-pass filtered, so the waveforms vary smoothly. Therefore, even if the time length for segmenting the waveforms for calculating the correlation values is set relatively long with respect to the time length of one sample, so that the points for the product-sum calculation are decimated, the correlation values between the waveforms are hardly affected. In the present embodiment, said time length is set to about 0.2 msec so as to perform the calculation once every 10 samples, so that the calculation amount can be further reduced.
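The two savings described above can be made concrete with a rough count. The sketch assumes a 44.1 kHz sampling rate, a 10 msec (441-sample) comparison length, and three candidate zero cross points as in the embodiment; the toy sine waveform is an assumption used only to show that decimation preserves the ranking in a simple case.

```python
import math


def decimated_product_sum(a, b, step=10):
    """Product-sum using only every `step`-th sample of the two waveforms."""
    n = min(len(a), len(b))
    return sum(a[i] * b[i] for i in range(0, n, step))


LENGTH = 441                      # 10 msec at an assumed 44.1 kHz
offsets_exhaustive = 882 - 88 + 1  # every sample offset in roughly 2-20 msec
offsets_zero_cross = 3             # candidates MZ1-MZ3 in the embodiment
terms_full = len(range(0, LENGTH, 1))
terms_decimated = len(range(0, LENGTH, 10))

# Multiplications per search: exhaustive vs. zero-cross-limited and decimated.
cost_exhaustive = offsets_exhaustive * terms_full
cost_reduced = offsets_zero_cross * terms_decimated

# Ranking check on a smooth toy waveform with a 200-sample cycle.
wave = [math.sin(2.0 * math.pi * i / 200.0) for i in range(1000)]
ref = wave[:LENGTH]
cands = [wave[s:s + LENGTH] for s in (100, 200, 300)]
full = [decimated_product_sum(ref, c, step=1) for c in cands]
fast = [decimated_product_sum(ref, c, step=10) for c in cands]
same_winner = full.index(max(full)) == fast.index(max(fast))
print(cost_exhaustive, cost_reduced, same_winner)
```

The counts are illustrative rather than measured figures, but they indicate why restricting start points to zero cross points and decimating the product-sum together reduce the work by several orders of magnitude.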
Next, as shown in
Note that, by consecutively segmenting the voice data into voice blocks, odd data which cannot constitute one voice block are sometimes left at the end. A handling manner of the odd data will be explained in a step of carrying over termination data described later.
The calculated voice blocks are applied to the graph of
After the voice blocks of the voice data are calculated as described above, playback speed converting means 59 of the calculating section 50 converts a playback speed by using the original voice data stored in the data memory section 30 (a step of converting the playback speed).
A concrete method for converting a playback speed of voice data will be explained.
The concrete method for converting a playback speed of voice data will be explained with reference to
In case of converting the playback speed to the half-speed, as shown in
In case of converting the playback speed to the double-speed, the voice blocks are combined as shown in
As to the silent parts, in case of increasing the playback speed, data whose length is defined according to a speech speed are respectively retrieved from a head side and a rear side of data in the silent part as voice blocks. On the other hand, in case of reducing the playback speed, the voice data are segmented, by a constant minute unit time, into a plurality of minute voice blocks, and the minute voice blocks are combined so as to extend the silent part.
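For the sound parts, expansion and contraction in voice block units can be sketched as below. Dropping every other block for double speed follows the A1-A3-A5 example in the background; producing half speed by repeating each block is an assumption made for this illustration, as is the `convert_speed` helper itself.

```python
def convert_speed(blocks, speed):
    """Re-sequence voice blocks for the given speed factor.

    speed = 2.0 keeps every other block; speed = 0.5 repeats each block
    twice (assumed behaviour for slow playback).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    out = []
    position = 0.0
    while int(position) < len(blocks):
        out.append(blocks[int(position)])
        position += speed
    return out


blocks = ["A1", "A2", "A3", "A4", "A5", "A6"]
print(convert_speed(blocks, 2.0))   # contracted sequence
print(convert_speed(blocks, 0.5))   # expanded sequence
```

Because each block starts and ends at a zero cross point and spans one basic cycle, joining blocks in a new order keeps the waveform continuous and leaves the pitch unchanged.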
In the present embodiment, the voice data stored in the voice data storing means 31 of the data memory section 30 of the voice data playback device 10 are segmented every 100 msec, but the voice blocks of the voice data do not always add up to a time length of exactly 100 msec. Therefore, in each of the segments of the voice data, voice data having an insufficient time length, which cannot constitute one voice block, will exist in an end part of the voice data.
Thus, in the present embodiment, termination data carry-over means 500 of the calculating section 50 retrieves termination data TD, which are included in the end part of each of the voice data and which have insufficient time lengths being not capable of constituting one voice block, and stores said data in the termination data carry-over means 500 (a step of carrying over the termination data).
The termination data TD carried over as described above are added to the head of the 100 msec of voice data to be inputted next time. Since it is clear that the head of the voice data is the start point (zero cross point) of a voice block, the reference zero cross point setting means 52 can unconditionally select the head of the voice data as the new reference zero cross point.
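The carry-over of termination data between successive 100 msec inputs can be sketched as follows. The `segment` callback, which returns the end positions of the complete voice blocks found in a buffer, is a hypothetical stand-in for the zero cross and correlation steps; the fixed-length toy segmenter is used only to make the example runnable.

```python
def process_stream(chunks, segment):
    """Split a stream of input chunks into voice blocks with carry-over.

    `segment(buffer)` is a hypothetical function returning the end indices
    of complete blocks in `buffer`, in ascending order.
    """
    holdover = []   # termination data TD left from the previous chunk
    blocks = []
    for chunk in chunks:
        buffer = holdover + list(chunk)   # TD is added to the head
        start = 0
        for end in segment(buffer):
            blocks.append(buffer[start:end])
            start = end
        holdover = buffer[start:]         # too short to form a block
    return blocks, holdover


def toy_segment(buffer):
    """Toy segmenter: every 3 samples form one block."""
    return list(range(3, len(buffer) + 1, 3))


blocks, leftover = process_stream([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], toy_segment)
print(blocks)
print(leftover)
```

Because the holdover always begins at the end of the last complete block, the head of each combined buffer is a block start point, which is why it can serve directly as the next reference point.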
The above described steps from the reference zero cross setting step to the voice block calculating step are repeatedly performed until no voice data for calculating a next voice block exist in the data memory section 30, so that calculating the voice blocks included in the voice data stored in the data memory section 30 can be continuously performed.
In case that the next voice data of 100 msec, to which the termination data TD would be carried over, are not inputted, the termination data TD retrieved by the termination data carry-over means 500 are deleted, and the playback speed conversion process by the voice data playback device 10 is terminated.
Even when the above described process is employed, the termination data TD are in most cases a silent part or, even if they are a sound part, a very small amount of voice data shorter than a basic cycle of the voice data, so that the playback quality after the playback speed conversion process is hardly affected by deleting the termination data TD.
When the playback speed conversion of voice data was performed and the processed voice data were played back according to the method of the above described embodiment, the playback speed of the voice data could be suitably converted without changing the pitch of the reader's voice. Further, no unnatural noises were included in the processed voice data, and the voice data could be listened to comfortably.
By employing the above described voice data playback device 10 and the voice data playback speed conversion method, the voice data playback speed conversion can be suitably performed even in the voice data playback device 10 having a low-performance CPU (the calculating means).
Namely, in the present invention, a high-performance CPU, which is mounted in a personal computer and used for the conventional technologies, need not be mounted in the voice data playback device. Therefore, the present invention is a very useful technology for reducing the production cost of the voice data playback device.
In the above described embodiment, voice data, especially human voice data, are the object voice data whose playback speed will be converted, so the voice data playback speed conversion can be performed by the voice data playback device 10 alone. Voice data having a complicated basic cycle, for example, a recitation with background music, are not targeted, but such voice data are scarcely included in the DAISY book data processed in the above described embodiment, so that no practical problems will occur.
In the above described embodiment, the voice data are segmented by 100 msec and stored, and the reference correlation function and the correlation functions to be compared are set every 10 msec, but the unit time for collecting voice data and the first and second predetermined time ranges used for setting the reference waveform and the comparison object waveforms are not limited to those used in the above described embodiment.
A time range for collecting voice data and the first and second predetermined time ranges used for setting the reference waveform and the comparison object waveforms may be values which are inputted as appropriate by a user through input means, not shown, of the data input/output section 20. In this case, if a maximum and/or a minimum value of the input value is defined in advance, the voice blocks which are basic units of voice data can be correctly calculated, an increase in the amount of data to be calculated can be prevented, and a situation where the voice data playback device 10 cannot perform the processing alone can be suitably prevented.
Number | Date | Country | Kind |
---|---|---|---|
2013-013628 | Jan 2013 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2014/051042 | 1/21/2014 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2014/115696 | 7/31/2014 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20110224990 | Hosokawa | Sep 2011 | A1 |
Number | Date | Country |
---|---|---|
S60-158500 | Aug 1985 | JP |
S62-183500 | Aug 1987 | JP |
9-198092 | Jul 1997 | JP |
10-187188 | Jul 1998 | JP |
2002-313015 | Oct 2002 | JP |
2006-227110 | Aug 2006 | JP |
2007-94004 | Apr 2007 | JP |
2008-20870 | Jan 2008 | JP |
2009042573 | Feb 2009 | JP |
Entry |
---|
International Search Report for PCT/JP2014/051042 dated Apr. 17, 2014. |
Number | Date | Country | |
---|---|---|---|
20150371660 A1 | Dec 2015 | US |