The present invention relates to a music extracting apparatus that extracts a music portion from broadcast signals, such as radio or television broadcasts, and to a music recording apparatus that records the extracted music portion.
Most music programs on radio or TV broadcasts consist of talk sections, such as those by an MC (Master of Ceremonies) or a DJ (Disc Jockey), and music sections. In these programs, talk sections usually occur between music sections. Sometimes the DJ's voice overlaps the beginning or end of a music section.
JP 2005-518560 A1 discloses an apparatus that extracts a music portion from broadcast waves. In that apparatus, the starting and ending positions of a music section are detected using stereophonic information alone. Specifically, the starting position is determined to have been detected when the difference value between the left-channel and right-channel audio signals exceeds a first predetermined value, and the ending position is determined to have been detected when the difference value falls below a second predetermined value (1).
However, the conventional method sometimes mistakenly determines that the ending position of a music section has been detected when the music section contains a non-stereo-like portion partway through.
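The stereo-only rule of the prior art, and the failure just described, can be illustrated with a minimal sketch. It assumes one left/right difference value per frame and uses hypothetical threshold names `start_th` and `end_th` for the first and second predetermined values; it is not the disclosed apparatus itself.

```python
def stereo_only_sections(diffs, start_th, end_th):
    """Detect (start_index, end_index) pairs from L/R difference values alone.

    A section starts when a difference value exceeds start_th (first
    predetermined value) and ends when one falls below end_th (second
    predetermined value). Both names are hypothetical.
    """
    sections = []
    start = None
    for i, d in enumerate(diffs):
        if start is None and d > start_th:
            start = i                      # stereo-like: treat as music start
        elif start is not None and d < end_th:
            sections.append((start, i))    # non-stereo-like: treat as music end
            start = None
    if start is not None:                  # still "in music" at the end of input
        sections.append((start, len(diffs)))
    return sections


# One song (indices 2..6) with a brief mono-like passage at index 4:
print(stereo_only_sections([0.0, 0.0, 0.5, 0.6, 0.05, 0.6, 0.5, 0.0], 0.3, 0.1))
```

Because the single frame at index 4 drops below `end_th`, the song is reported as two sections, `[(2, 4), (5, 7)]`, which is the mistaken ending detection the present invention addresses.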
A first music extracting apparatus of the present invention comprises: a receiving unit that receives a broadcast signal having a plurality of channels of audio signals; a detecting unit that detects a variation in voice power from the audio signals; a computing unit that computes a difference in amplitude or power between the audio signals of the respective channels; and a specifying unit that specifies the starting or ending position of a music section based on the variation detected by the detecting unit and the difference computed by the computing unit.
A second music extracting apparatus of the present invention comprises: a receiving unit that receives a broadcast signal having left-channel and right-channel audio signals; a detecting unit that detects a transition point at which the variation in voice power of the audio signal exceeds a predetermined value; a computing unit that computes an amplitude difference between the audio signals of the respective channels; and a specifying unit that specifies the starting or ending position of a music section based on the amplitude difference in the vicinity of the transition point.
A music recording apparatus of the present invention comprises: a receiving unit that receives a broadcast signal having a plurality of channels of audio signals; a detecting unit that detects a variation in voice power from the audio signals; a computing unit that computes a difference in amplitude or power between the audio signals of the respective channels; a specifying unit that specifies the starting and ending positions of a music section based on the variation detected by the detecting unit and the difference computed by the computing unit; and a recording unit that records the music section specified by the specifying unit.
The present invention, embodied as a music extracting apparatus or a music recording apparatus, is described in detail below with reference to the drawings.
The FM tuner unit 2 tunes to the broadcast wave selected by the user from among the FM broadcast waves input from the antenna 1. The unit 2 then demodulates the tuned wave and outputs analog audio signals (i.e., the left-channel and right-channel audio signals). The A/D conversion unit 3 converts the analog signals obtained by the unit 2 into digital audio signals. The MP3 codec 4 encodes the digital audio signal into data compressed in MP3 format. The codec 4 also decodes MP3-compressed data read out from the HDD 8 into a digital audio signal. The HDD-IF 7 interfaces with the HDD 8, which is, for example, a mass storage device.
The DSP 9 detects transition points in the input audio data and also computes the stereo likelihood. Here, a transition point is a point at which the variation in the power of the audio signal is larger than a predetermined value; the DSP 9 computes this power variation in order to detect transition points. The stereo likelihood is expressed by the difference value between the left-channel and right-channel audio data.
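The two quantities handled by the DSP 9 can be sketched as follows, purely for illustration. The sketch assumes frame-based processing, measures power in dB on a mono frame, and takes the mean absolute sample difference as the left/right difference value; the function names, the dB scale, and the `th1_db` parameter are all hypothetical choices, since the specification does not fix them.

```python
import math


def frame_power_db(samples):
    """Mean power of one frame in dB; a tiny floor avoids log10(0)."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq + 1e-12)


def amplitude_difference(left, right):
    """Mean absolute L-R difference of one frame (assumed stereo measure)."""
    return sum(abs(l - r) for l, r in zip(left, right)) / len(left)


def transition_points(frames, th1_db):
    """Indices of frames whose power differs from the previous frame by more than th1_db."""
    powers = [frame_power_db(f) for f in frames]
    return [i for i in range(1, len(powers))
            if abs(powers[i] - powers[i - 1]) > th1_db]


# A quiet passage (about -20 dB) followed by a loud one (about -2 dB):
frames = [[0.1] * 4, [0.1] * 4, [0.8] * 4, [0.8] * 4]
print(transition_points(frames, 6.0))               # [2]
print(amplitude_difference([1.0, 0.5], [0.0, 0.5])) # 0.5
```

A larger `th1_db` yields fewer, more pronounced transition points; a smaller one yields more candidates, which is the trade-off discussed later for the threshold Th1.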
The CPU 10 controls each part of the music recording and reproducing apparatus. The memory 11 serves as a work memory for the CPU 10. A program for the CPU 10 is stored in a ROM (not illustrated). The HDD 8 records data compressed and encoded in MP3 format by the MP3 codec 4. The D/A conversion unit 5 converts the digital audio signal obtained by the decoding function of the codec 4 into an analog audio signal. The speaker unit 6 outputs the analog audio signal obtained by the D/A conversion unit 5.
Further, during the recording process, the DSP 9 continuously computes the amplitude difference value between the left-channel and right-channel audio data. The computed values are stored in the third predetermined area of the memory 11. The third area stores, for example, the amplitude difference values for the most recent 10 seconds.
The CPU 10 starts the recording process in response to a user instruction. When the process starts, the CPU 10 activates the FM tuner unit 2 and controls it to tune to the broadcast station selected by the user. The CPU 10 also controls the DSP 9 to compute the amplitude difference between the left and right channels and to store the computed values in the third area of the memory 11 (step S1). The output of the FM tuner unit 2 is sent to the A/D conversion unit 3 and converted into digital audio data. This audio data is then sent both to the DSP 9 and to the memory 11, whereupon storage of the audio data into the first and second areas of the memory 11 begins.
Once the amount of data stored in the first area reaches a first predetermined amount, the oldest stored data is deleted from the first area as each new piece of data is stored. Similarly, once the amount of data stored in the second area reaches a second predetermined amount, the oldest stored data is deleted from the second area as each new piece of data is stored.
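The first-in, first-out behavior of each memory area can be modeled, as a minimal sketch, with a bounded `collections.deque`, which discards its oldest element when a new one is appended to a full buffer. The class name `MemoryArea` and the capacities used are hypothetical; the specification gives only the "predetermined amounts", not their values.

```python
from collections import deque


class MemoryArea:
    """Fixed-capacity FIFO: once full, storing a new item discards the oldest one."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)  # deque drops from the left when full

    def store(self, item):
        self._buf.append(item)

    def contents(self):
        """Stored items, oldest first."""
        return list(self._buf)


# Hypothetical capacities, e.g. in audio frames:
first_area = MemoryArea(capacity=3)
for frame in range(5):
    first_area.store(frame)
print(first_area.contents())  # [2, 3, 4]
```

The same structure would serve for the third area holding roughly the most recent 10 seconds of amplitude difference values.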
The DSP 9 starts computing the amplitude difference between the left-channel and right-channel audio data input to it and stores the result in the third area of the memory 11. The DSP 9 and the CPU 10 then perform the transition point detection process and the process of computing the stereo likelihood in the vicinity of the transition point (step S2).
The CPU 10 determines whether the target audio data corresponds to a transition point, based on the variation in the power information of the audio signal input from the DSP 9 (step S23). When the variation is larger than a threshold value Th1, the target audio data is determined to correspond to a transition point. When it is determined not to correspond to a transition point, the process returns to step S21, and steps S21 to S23 are repeated.
When the target audio data is determined in step S23 to correspond to a transition point, the amplitude difference values stored in the third area of the memory 11 are read out. Specifically, the values corresponding to the 10 seconds of audio data centered on the transition point are read out. The average of these values is then computed as the stereo likelihood evaluation value. The stereo likelihood computation is thereby performed.
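The stereo likelihood evaluation value can be sketched as a windowed average. This assumes one stored amplitude difference value per frame at a known `frame_rate` (frames per second) and clips the 10-second window at the edges of the stored values; the function and parameter names are hypothetical, and edge clipping is an assumption the specification does not address.

```python
def stereo_likelihood(diffs, frame_rate, t_index, window_s=10.0):
    """Average of per-frame L/R amplitude differences over a window of
    about window_s seconds centred on frame t_index (clipped at the edges)."""
    if not 0 <= t_index < len(diffs):
        raise ValueError("t_index must point into diffs")
    half = int(round(window_s * frame_rate / 2))   # frames on each side
    lo = max(0, t_index - half)
    hi = min(len(diffs), t_index + half + 1)       # +1 keeps the centre frame
    window = diffs[lo:hi]
    return sum(window) / len(window)


# One value per second: talk-like (0.0) before, music-like (1.0) after the transition.
diffs = [0.0] * 5 + [0.5] + [1.0] * 5
print(stereo_likelihood(diffs, 1.0, 5))  # 0.5
```

Averaging over the whole window, rather than testing individual values, is what keeps a short mono-like passage from dominating the evaluation value.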
Again referring to
When the evaluation value is less than Th2 in step S3, the target audio data is determined to belong to a talk section, such as MC or DJ speech. Since a music section may follow, the time stamp information of the target audio data is stored as the music starting time Ps (step S4). The process then proceeds to step S5, in which a transition point is detected and the stereo likelihood in its vicinity is computed in the same manner as in step S2.
When the computation of step S5 is finished, it is determined whether the evaluation value computed in step S5 is less than Th2 (step S6). When the evaluation value is equal to or greater than Th2, the target audio data is determined to belong to a music section, and the process returns to step S5.
When the evaluation value is less than Th2 in step S6, the target audio data is determined to belong to a talk section, such as MC or DJ speech, rather than a music section. It is then determined whether the interval between the music starting time Ps and the target audio data is equal to or longer than a predetermined time ΔT (step S7). In other words, it is determined whether the interval between the transition point currently determined to be in a talk section and the transition point previously determined to be in a talk section is equal to or longer than ΔT.
When the interval is shorter than ΔT, the section is determined to be too short to be a music section, and the music starting time Ps is updated to the time of the target audio data (step S8). The process then returns to step S5. When the interval is equal to or longer than ΔT, the time of the target audio data is stored as the music ending time Pe (step S9). The audio data between times Ps and Pe is then extracted as music data from the audio data stored in the first area of the memory 11, compressed by the MP3 codec 4, and recorded on the HDD 8 (step S10). Ps is then updated to the time stored as Pe (step S11), and the process returns to step S5.
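The decision flow of steps S3 to S11 can be sketched as a small state machine. This sketch assumes that transition detection and the stereo likelihood computation have already produced a time-ordered list of `(time_s, evaluation_value)` pairs, one per transition point; `extract_music_sections`, `th2`, and `delta_t` are hypothetical names. The behavior when step S3 finds a high evaluation value is not stated in the text above, so the sketch assumes it simply waits for the next transition point.

```python
def extract_music_sections(transitions, th2, delta_t):
    """Return (Ps, Pe) pairs from time-ordered (time_s, evaluation_value) pairs."""
    sections = []
    ps = None  # music starting time Ps; None until step S4 has run
    for time_s, value in transitions:
        if ps is None:
            # Step S3 (assumed): ignore points until a talk-like one appears.
            if value < th2:
                ps = time_s                   # Step S4: candidate music start
            continue
        if value >= th2:
            continue                          # Step S6 "no": music continues (S5)
        if time_s - ps < delta_t:
            ps = time_s                       # Step S7 "no" -> S8: move Ps forward
        else:
            sections.append((ps, time_s))     # Steps S9/S10: Pe found, extract
            ps = time_s                       # Step S11: Ps <- Pe
    return sections


# Music 100, DJ 101 (200-260 s), music 102, DJ 103 (from 500 s); times are hypothetical.
events = [(50, 0.8), (205, 0.05), (230, 0.04), (255, 0.06), (350, 0.7), (505, 0.05)]
print(extract_music_sections(events, th2=0.3, delta_t=60))  # [(255, 505)]
```

The repeated updates of Ps inside the DJ section leave Ps at the last talk-like transition point, close to where the next song begins.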
The music recording process is terminated by a user operation. Here, it is assumed that a music section 100, a first DJ section 101, a music section 102, and a second DJ section 103 appear in this order, as shown in
Next, when audio data of the first DJ section 101 is read out from the second area of the memory 11, a transition point is detected in step S2. Since the stereo likelihood evaluation value at this transition point will be less than Th2, the determination in step S3 is “yes”. The time of this transition point is therefore stored as the music starting time Ps in step S4, and the process proceeds to step S5.
When a transition point is detected in step S5 within the first DJ section 101, the evaluation value is likely to be less than Th2, so the process proceeds through step S6 to step S7. However, since the interval between the time stored as Ps and the target audio data is shorter than ΔT, the determination in step S7 is “no”, and Ps is updated in step S8. Steps S5 to S8 are thus repeated.
Next, when audio data of the music section 102 is read out from the second area of the memory 11, a transition point may not be detected in step S5. Even if a transition point is detected, the stereo likelihood evaluation value will be equal to or greater than Th2, so the determination in step S6 is “no”. Thus, either step S5 continues or steps S5 and S6 are repeated.
Next, when audio data of the second DJ section 103 is read out from the second area of the memory 11, a transition point may be detected in step S5. Since the stereo likelihood evaluation value at this transition point will be less than Th2, the determination in step S6 is “yes”, and the process proceeds to step S7. Since the interval between the time stored as Ps and the target audio data is equal to or longer than ΔT, the determination in step S7 is “yes”, and the process proceeds to step S9. In step S9, the time corresponding to the target audio data is stored as Pe. The audio data in the period between Ps and Pe is then extracted as music section data from the data stored in the first area of the memory 11, compressed, and recorded on the HDD 8.
To improve the accuracy of detecting the starting and ending positions of a music section, it is desirable to set the threshold Th1 low so that many transition points are detected. However, if the threshold is set too low, the number of transition points detected inside a music section tends to increase. In that case, an ending position may be mistakenly detected when the music section contains a portion with low stereo likelihood. It is therefore desirable to detect the starting and ending positions of a music section by also considering a frequency characteristic in the vicinity of each transition point.
In the above embodiment, whether audio data belongs to a talk section or a music section is first determined based on the average of the difference between the left-channel and right-channel signals, and the starting and ending positions are then specified. However, frequency characteristics may also be taken into account in this determination.
One example of a frequency characteristic is the MFCC (Mel-Frequency Cepstral Coefficients). Specifically, the likelihood between the MFCCs obtained in the vicinity of the transition point and the MFCCs of prepared standard data is computed. The audio data in the vicinity of the transition point is then determined to belong to a music section when this likelihood is equal to or greater than Th3 and the stereo likelihood evaluation value is equal to or greater than Th2.
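The combined decision can be sketched as below. The MFCC extraction itself is not shown: the sketch assumes MFCC vectors have already been computed for the vicinity of the transition point and for the standard data. The specification does not say how the likelihood is measured, so cosine similarity is swapped in here as a stand-in; `cosine_similarity`, `is_music`, and their parameters are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Stand-in MFCC likelihood (an assumption; the measure is not specified)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_music(mfcc, standard_mfcc, evaluation_value, th2, th3):
    """Music only if the MFCC likelihood reaches Th3 AND the stereo
    likelihood evaluation value reaches Th2."""
    return (cosine_similarity(mfcc, standard_mfcc) >= th3
            and evaluation_value >= th2)


print(is_music([1.0, 0.0], [1.0, 0.0], 0.5, th2=0.3, th3=0.8))  # True
print(is_music([1.0, 0.0], [1.0, 0.0], 0.1, th2=0.3, th3=0.8))  # False: low stereo likelihood
```

Requiring both conditions means a transition point is classified as music only when its spectral shape and its stereo character agree.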
The present invention is not limited to the foregoing embodiment but can be modified variously by one skilled in the art without departing from the spirit of the invention as set forth in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2008-032067 | Feb 2008 | JP | national |
This application is a continuation-in-part of Patent Cooperation Treaty Patent Application No. PCT/JP2009/000556 (filed on Feb. 12, 2009), which claims priority from Japanese Patent Application No. 2008-032067 (filed on Feb. 13, 2008), both of which are hereby incorporated by reference herein.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2009/000556 | Feb 2009 | US |
Child | 12855995 | | US |