This application claims the benefit under 35 U.S.C. § 371 as a U.S. National Stage Entry of International Application No. PCT/JP2020/028423, filed in the Japanese Patent Office as a Receiving Office on Jul. 22, 2020, which claims priority to Japanese Patent Application Number JP2019-172688, filed in the Japanese Patent Office on Sep. 24, 2019, each of which is hereby incorporated by reference in its entirety.
The present disclosure relates to a signal processing apparatus, a signal processing method, and a program.
A sound source separation technology is known in which a signal for a sound of a target sound source is extracted from a mixed sound signal including sounds from a plurality of sound sources (see, for example, PTL 1). Additionally, a frequency band extension (expansion) technology has been proposed in which high frequency components are generated from a signal with low frequency components and in which the resultant high frequency components are added to the signal with the low frequency components to generate a signal with a wider frequency band (see, for example, PTL 2).
In this field, appropriate frequency band extension processing or the like is desired to be executed.
An object of the present disclosure is to provide a signal processing apparatus, a signal processing method, and a program that execute appropriate frequency band extension processing or the like.
The present disclosure provides, for example, a signal processing apparatus including a sound source separation section configured to apply sound source separation processing to a mixed sound signal including a mixture of signals of a plurality of sound sources, and band extension sections configured to apply frequency band extension processing to respective sound source separation signals obtained by separation by the sound source separation section.
The present disclosure provides, for example, a signal processing method including, by a sound source separation section, applying sound source separation processing to a mixed sound signal including a mixture of signals of a plurality of sound sources and, by band extension sections, applying frequency band extension processing to respective sound source separation signals obtained by separation by the sound source separation section.
The present disclosure provides, for example, a program causing a computer to execute a signal processing method including, by a sound source separation section, applying sound source separation processing to a mixed sound signal including a mixture of signals of a plurality of sound sources and, by band extension sections, applying frequency band extension processing to respective sound source separation signals obtained by separation by the sound source separation section.
Embodiments and the like of the present disclosure will be described below with reference to the drawings. Note that the description is made in the following order.
The embodiments and the like described below are suitable specific examples of the present disclosure, and the contents of the present disclosure are not limited to the embodiments and the like.
First, to facilitate understanding of the present disclosure, problems to be considered in the embodiments will be described. As described above, an apparatus is known in which frequency band extension processing (hereinafter simply referred to as band extension processing) is executed. When a limited band of a sound source is to be extended, correctly executing band extension processing is difficult because the frequency envelope (spectrum envelope) varies depending on the type of the sound source, such as the musical instrument. For example, cymbals and other percussion instruments, and traditional Japanese musical instruments such as a shakuhachi, a shamisen, and a koto, make sounds containing components up to extremely high frequencies, whereas musical instruments such as a piano and a violin have a property that attenuation increases with increasing frequency. In a case where sound sources do not temporally overlap one another, the type of the sound source can be estimated at each point of time, and the behavior of the band extension processing (contents of the processing) can be varied depending on the type. However, for music or the like, a plurality of types of sound sources typically makes sounds simultaneously, and thus it is difficult to execute appropriate band extension processing depending on the type of the sound source.
Additionally, in recent years, high-resolution audio having a sampling rate of more than 48 kHz (hereinafter referred to as a high-resolution sound source as appropriate) has spread. When high-resolution sound sources are produced, some sounds such as vocals are recorded as high-resolution sound sources, but the sounds of many musical instruments may be recorded as standard-resolution audio having a sampling rate of 48 kHz or less (hereinafter referred to as standard-resolution sound sources as appropriate). In such a case, there is a demand to give the sounds of all the musical instruments a high resolution during a repeated mastering step (remastering). At this time, band extension processing is preferably applied only to the sound sources not recorded at a high resolution, without editing the sound sources recorded at a high resolution. However, the sounds of all the sound sources are mixed during a mixing step, posing a problem in that whether or not to execute the band extension processing cannot be selected for each sound source during the remastering step. The present disclosure has been developed in view of these circumstances. The present disclosure will be described below in detail.
The sound source separation section 11 applies sound source separation processing to the mixed sound signal x to generate sound source separation signals s1, s2, . . . , and sN corresponding to the types of the respective sound sources. The sound source separation signal s1 is supplied to the band extension section 121. The sound source separation signal s2 is supplied to the band extension section 122. The sound source separation signal sN is supplied to the band extension section 12N.
The sound source separation processing executed by the sound source separation section 11 is not limited to particular processing. For example, in addition to MWF (Multichannel Wiener Filter) based sound source separation processing using a DNN (Deep Neural Network), the sound source separation processing described in PTL 1 listed above can be applied. The sound source separation processing described in PTL 1 is, roughly speaking, processing in which amplitude spectra are estimated using different sound source separation schemes having outputs with temporally different properties (specifically, a DNN and an LSTM (Long Short-Term Memory) network) and in which the estimation results are concatenated using a predetermined concatenation parameter to generate sound source separation signals. Needless to say, the sound source separation section 11 may execute sound source separation processing different from the sound source separation processing described above.
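Although the separation schemes above differ in how the spectra are estimated, the output stage of most DNN-based separators reduces to weighting the mixture spectrogram with per-source time-frequency masks. The sketch below illustrates only that generic masking step with toy masks; it is not the scheme of PTL 1 or an MWF implementation, and the mask values are assumptions for illustration.

```python
import numpy as np

def separate_by_masks(mix_spec, masks):
    # Recover each source by weighting the mixed complex spectrogram with
    # that source's time-frequency mask (the mixture phase is reused).
    return [m * mix_spec for m in masks]

# Toy example: two "sources" occupying disjoint frequency halves.
rng = np.random.default_rng(0)
mix = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
low_mask = np.zeros((8, 4))
low_mask[:4] = 1.0           # lower half of the frequency bins
high_mask = 1.0 - low_mask   # upper half
s1, s2 = separate_by_masks(mix, [low_mask, high_mask])
```

Because the two masks sum to one at every time-frequency point, the separated spectrograms add back up to the mixture exactly.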
The band extension section 12 applies band extension processing to each of the sound source separation signals s obtained by separation by the sound source separation section 11. The band extension section 12 uses, as input signals, for example, sound source separation signals s corresponding to low frequency signal components, applies the band extension processing to the sound source separation signals s, and outputs resultant output signals as output signals j containing low frequency components and also containing high frequency components with extended bands (output signal j1, output signal j2, . . . , and output signal jN). The band extension section 12 applies, to the sound source separation signals s, well-known band extension processing, for example, band extension processing described in PTL 2 listed above. Note that the individual band extension sections 12 are associated with the respective types of the sound source separation signals s to be input to the corresponding band extension sections 12.
Note that an extension start band hereinafter refers to a lowest-frequency-side end of frequency components to be extended by the band extension processing and that high frequency components refer to signals with frequency bands higher than the extension start band, whereas low frequency components refer to signals with frequency bands lower than the extension start band.
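The split defined above can be illustrated with a brick-wall FFT split at the extension start band f1. The function name and the full-signal FFT are illustrative assumptions; a real implementation would operate on STFT frames or use filter banks.

```python
import numpy as np

def split_at_extension_start(signal, sample_rate, f1):
    # Low frequency components: bands below the extension start band f1.
    # High frequency components: bands at or above f1.
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    low_spec = np.where(freqs < f1, spec, 0.0)
    high_spec = spec - low_spec
    low = np.fft.irfft(low_spec, n=len(signal))
    high = np.fft.irfft(high_spec, n=len(signal))
    return low, high

# A 1 kHz tone lies entirely below an extension start band of 4 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
low, high = split_at_extension_start(tone, sr, 4000.0)
```

The two components reconstruct the original signal, and the high component of the 1 kHz tone is (numerically) zero.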
The addition section 13 adds together the output signals j output from the band extension sections 12 (specifically, the output signal j1, the output signal j2, . . . , and the output signal jN) to generate a synthesized output signal S, and outputs the synthesized output signal S. In the present embodiment, a band extended sound source signal corresponding to an output of the signal processing apparatus 1 is assumed to be the synthesized output signal S.
Now, an example of operations performed by the signal processing apparatus 1 will be described. The mixed sound signal x is input to the sound source separation section 11. The sound source separation section 11 applies the sound source separation processing to the mixed sound signal x to generate sound source separation signals s, and outputs the sound source separation signals s. The band extension sections 12 apply the band extension processing to the sound source separation signals s to generate output signals j, and output the output signals j. The addition section 13 adds the output signals j together to generate a synthesized output signal S, and outputs the synthesized output signal S.
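The operation sequence just described can be sketched as follows. Here `separate` and the entries of `band_extenders` are placeholders standing in for the sound source separation section 11 and the band extension sections 12, not implementations of them.

```python
import numpy as np

def process(mixed, separate, band_extenders):
    # Separate the mixture, band-extend each separated signal with the
    # extender associated with that source type, then add the results
    # to form the synthesized output signal S.
    sources = separate(mixed)                                      # s1 ... sN
    outputs = [ext(s) for ext, s in zip(band_extenders, sources)]  # j1 ... jN
    return np.sum(outputs, axis=0)                                 # S

# Toy stand-ins: "separation" halves the mixture into two copies, and each
# "band extension" is the identity, so the pipeline reproduces its input.
mixed = np.array([1.0, -2.0, 3.0])
S = process(mixed, lambda x: [x / 2, x / 2], [lambda s: s, lambda s: s])
```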
Incidentally, the band extension processing described in PTL 2 listed above is based on a mixed sound, and does not take into account execution of the optimum band extension processing depending on attributes of a sound source, specifically, the type of the sound source. For example, cymbals as percussion instruments and the like involve an envelope extending up to high frequencies without attenuation. Thus, in the present embodiment, for execution of the optimum band extension processing for each type of sound source, a frequency envelope of the high frequency components (high frequency band) to be estimated is set for each type of sound source. Specifically, a parameter for the band extension processing corresponding to the type of the sound source is set, and the band extension processing is executed using the parameter. As the band extension section, an estimator of the high frequency band that has been trained using only sounds of the corresponding sound source type (for example, cymbal sounds) as training data may be applied.
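As one way of illustrating such per-type parameters, the estimated high frequency envelope could be parameterized by a decay per octave above the extension start band. The source types and decay values in the dictionary below are assumptions chosen for illustration (percussion keeps its level; piano and violin attenuate), not values from the disclosure.

```python
import numpy as np

# Hypothetical per-source-type parameter: envelope decay, in dB per octave
# above the extension start band f1.
ENVELOPE_DECAY_DB_PER_OCT = {
    "cymbal": 0.0,   # extends up to high frequencies without attenuation
    "piano": 12.0,   # attenuation increases with frequency
    "violin": 9.0,
}

def high_band_gain(freqs, f1, source_type):
    # Gain applied to the estimated spectrum above f1, shaped by the
    # envelope decay selected for the source type.
    decay = ENVELOPE_DECAY_DB_PER_OCT[source_type]
    octaves = np.log2(np.maximum(freqs, f1) / f1)
    return 10.0 ** (-decay * octaves / 20.0)

freqs = np.array([8000.0, 16000.0, 32000.0])
# Cymbals keep full level; a piano loses 12 dB per octave above f1 = 8 kHz.
cymbal_gain = high_band_gain(freqs, 8000.0, "cymbal")
piano_gain = high_band_gain(freqs, 8000.0, "piano")
```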
Now, a second embodiment of the present disclosure will be described. Note that the matters described in the first embodiment can also be applied to the second embodiment unless otherwise noted. Additionally, components identical or equivalent to the corresponding components in the first embodiment are denoted by identical reference symbols, and duplicate descriptions are omitted as appropriate.
In a case where the band extension processing is executed independently for each sound source separation signal, the high frequency components of the synthesized output signal S may be unnaturally emphasized depending on the algorithm for the band extension processing. For example, suppose that the algorithm for the band extension processing estimates only amplitude spectra or envelopes of the amplitude spectra and duplicates a phase in a certain manner (for example, reuses the phase of the low frequency components (low frequency band)), and that the sound source separation algorithm likewise yields phases that do not vary significantly among the separated sound sources. Then the high frequency signals of the sound source separation signals with extended bands all have similar phases. Thus, even with the amplitude spectrum of each sound source separation signal or the envelope of the amplitude spectrum correctly estimated, the high frequency components of the synthesized output signal S may be unnaturally emphasized because all the high frequency signals have similar phases. The present embodiment is a signal processing apparatus having a configuration addressing the matters described above.
The frequency envelope shaping section 21 shapes the frequency envelope of the synthesized output signal S output from the addition section 13. For example, in a case where predetermined discontinuity is detected between a portion of the frequency envelope preceding the extension start band (lower limit of the frequencies extended by the band extension processing) f1 and a portion of the frequency envelope succeeding the extension start band f1, the frequency envelope of the synthesized output signal S is shaped. In the present embodiment, the predetermined discontinuity is detected by the frequency envelope shaping section 21. However, the detection may be performed by another functional block. When the frequency envelope shaping section 21 shapes the frequency envelope, the amplitudes of the extended high frequency components are suppressed, allowing the high frequency components to be prevented from being unnaturally emphasized.
In the present embodiment, the discontinuity is detected in a case where a difference between a signal energy preceding the extension start band f1 and a signal energy succeeding the extension start band f1 is equal to or greater than a predetermined value. A specific example will be described with reference to
In
For example, as depicted in
(eH/eL) > Th (1)
In the example illustrated in
On the other hand, in the example illustrated in FIG. 4, in a case where the high frequency components of the synthesized output signal S form one of the frequency envelopes FE4 to FE6, Formula 1 is not satisfied, and it is determined that no discontinuity is present. In this case, the high frequency components are unlikely to be unnaturally emphasized, and thus the frequency envelope shaping section 21 executes no processing, and the synthesized output signal S is output from the frequency envelope shaping section 21 as is.
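The discontinuity test of Formula 1 can be sketched as follows. The text only specifies comparing the signal energy preceding and succeeding the extension start band f1, so the width of the comparison bands around f1 used below is an illustrative assumption.

```python
import numpy as np

def is_discontinuous(spec, freqs, f1, th):
    # Formula (1): discontinuity is detected when the energy eH just above
    # the extension start band f1, relative to the energy eL just below it,
    # is at least the threshold Th.
    band = f1 * 0.25  # assumed width of the neighborhoods compared around f1
    e_low = np.sum(np.abs(spec[(freqs >= f1 - band) & (freqs < f1)]) ** 2)
    e_high = np.sum(np.abs(spec[(freqs >= f1) & (freqs < f1 + band)]) ** 2)
    return e_high / max(e_low, 1e-12) >= th

# A flat envelope is continuous; a +12 dB step at f1 = 12 kHz is not.
freqs = np.linspace(0, 24000, 241)
flat = np.ones_like(freqs)
boosted = np.where(freqs >= 12000, 4.0, 1.0)
flat_disc = is_discontinuous(flat, freqs, 12000.0, th=2.0)
boost_disc = is_discontinuous(boosted, freqs, 12000.0, th=2.0)
```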
According to the second embodiment described above, in a case where the band extension processing is executed, the high frequency components succeeding the extension start band can be prevented from being unnaturally emphasized.
Now, a modified example of the signal processing apparatus according to the second embodiment will be described.
The signal processing apparatus 2A does not include the frequency envelope shaping section 21 but instead includes a phase rotation section 22. The phase rotation section 22 is provided between the band extension section 12 and the addition section 13. Specifically, the signal processing apparatus 2A includes phase rotation sections 22 (phase rotation sections 221, 222, . . . , and 22N) the number of which corresponds to the number of the band extension sections 12. Output signals from the phase rotation sections 22 are added together by the addition section 13.
The phase rotation sections 22 rotate (change) phases of the high frequency components of the output signals j with the bands extended by the band extension sections 12 such that the high frequency components of the output signals j have different phases depending on the sound sources. The phase rotation sections 22 each include, for example, a filter that can shift the phase without affecting the amplitude, specifically, an all-pass filter.
The phase rotation sections 22, for example, randomly rotate the phases, thus allowing the high frequency components of the band extended sound source signal to be prevented from being unnaturally emphasized. Additionally, human auditory characteristics are insensitive to changes in phase at high frequencies, and thus the high frequency components of the band extended sound source signal can be prevented from being unnaturally emphasized without causing the user auditory discomfort.
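An all-pass filter is one realization; the sketch below instead rotates the phases of the high frequency bins directly in the frequency domain, which makes the magnitude-preserving property explicit. The brick-wall split at f1 and the random rotation are illustrative simplifications, not the filter design of the embodiment.

```python
import numpy as np

def rotate_high_phase(signal, sample_rate, f1, seed):
    # Rotate the phase of components at or above f1 by random angles while
    # leaving all magnitudes (and the low band) untouched. The Nyquist bin
    # is excluded so that the inverse transform stays consistent.
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    high = (freqs >= f1) & (freqs < sample_rate / 2)
    spec[high] *= np.exp(1j * rng.uniform(0, 2 * np.pi, high.sum()))
    return np.fft.irfft(spec, n=len(signal))

sr = 8000
x = np.random.default_rng(1).standard_normal(sr)
y = rotate_high_phase(x, sr, 2000.0, seed=2)
```

The waveform changes, but the magnitude spectrum does not, so summing several such signals avoids coherent (in-phase) addition of the extended bands.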
Now, a third embodiment of the present disclosure will be described. Note that the matters described in the first and second embodiments can also be applied to the third embodiment unless otherwise noted. Additionally, components identical or equivalent to the corresponding components in the first and second embodiments are denoted by identical reference symbols, and duplicate descriptions are omitted as appropriate.
As described above, for a sound source (hereinafter referred to as a mixed sound source as appropriate) including both high-resolution sound sources (for example, sound sources containing high frequency components succeeding the extension start band f1) and standard-resolution sound sources (for example, sound sources containing no high frequency components succeeding the extension start band f1), there is a demand to apply the band extension processing only to the standard-resolution sound sources. The present embodiment addresses such a demand. Note that the band of the mixed sound source includes high frequencies succeeding the extension start band f1.
Now, an operation example of the signal processing apparatus 3 will be described. The mixed sound source signal x1 is separated into signals for the respective sound source types by the sound source separation section 11, thus generating sound source separation signals s. Among the sound source separation signals s for the respective sound source types, only the sound source separation signals not recorded at a high resolution (sound source separation signals s1 and s2 in the present example) are respectively supplied to the corresponding band extension sections 121 and 122. The band extension section 121 executes the band extension processing to extend the band of the sound source separation signal s1. Further, the band extension section 122 executes the band extension processing to extend the band of the sound source separation signal s2.
For the output signal obtained by applying the band extension processing, the band extension section 121 outputs, to the addition section 13, an extended band signal p1 included in the output signal and containing only the high frequency components succeeding the extension start band f1. Further, for the output signal obtained by applying the band extension processing, the band extension section 122 outputs, to the addition section 13, an extended band signal p2 included in the output signal and containing only the high frequency components succeeding the extension start band f1. In this regard, the band extension sections 121 and 122 output only the extended band signals to the addition section 13 because the low frequency components of the sound source separation signals s1 and s2 are included in the mixed sound source signal x1 input to the addition section 13.
The addition section 13 adds the extended band signals p1 and p2 and the mixed sound source signal x1 together to generate a band extended sound source signal, and outputs the band extended sound source signal.
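The addition step of this embodiment can be sketched as follows. Since the mixed sound source signal x1 already carries the low band of every source, only the parts of the band-extended outputs above f1 (the extended band signals p1, p2, . . .) are added to it. Extracting the extended band signal is modeled here as a brick-wall high-pass at f1, which is an illustrative simplification.

```python
import numpy as np

def highpass(signal, sample_rate, f1):
    # Keep only components at or above f1: the "extended band signal" part
    # of a band-extended output.
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spec[freqs < f1] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def remaster(mixed_signal, extended_outputs, sample_rate, f1):
    # Add only the extended band signals p1, p2, ... to the mixed signal,
    # so the low band common to both is not doubled.
    p = [highpass(j, sample_rate, f1) for j in extended_outputs]
    return mixed_signal + np.sum(p, axis=0)

# Toy example: the band-extended source shares its low band with the mix.
sr = 16000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 1000 * t)
high_tone = 0.5 * np.sin(2 * np.pi * 6000 * t)
mixed = low_tone
j1 = low_tone + high_tone  # band-extended separated source
out = remaster(mixed, [j1], sr, 4000.0)
```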
According to the third embodiment described above, the sound source signals not recorded at a high resolution can exclusively be subjected to the band extension with no change in the high frequency components of the sound source signals recorded at a high resolution. Note that, in the above description, the sound source separation signals s1 and s2 are illustrated as sound source separation signals not recorded at a high resolution, but that the mixed sound source signal x1 may include more sound source separation signals not recorded at a high resolution.
In this case, as illustrated in
In the signal processing apparatus 3B, the mixed sound source signal x1 is supplied only to the sound source separation section 11 and not to the addition section 13. The sound source separation section 11 executes sound source separation processing on the mixed sound source signal x1 to generate sound source separation signals s1 and s2 and a sound source separation signal hm corresponding to the sound source signals recorded at a high resolution. The determination section 11B determines whether or not to apply, in a succeeding stage, the band extension processing to each sound source separation signal. In a case where the sound source separation signal contains high frequency components, the determination section 11B determines that the band extension processing need not be applied to the sound source separation signal, and outputs the sound source separation signal to the addition section 13. In the present modified example, the determination section 11B determines that the band extension processing need not be applied to the sound source separation signal hm, and the sound source separation section 11 supplies the sound source separation signal hm to the addition section 13.
Further, in a case where the sound source separation signal contains no high frequency components, the determination section 11B determines that the band extension processing needs to be applied to the sound source separation signal, and outputs the sound source separation signal to the band extension section 12. In the present modified example, the determination section 11B determines that the band extension processing needs to be applied to the sound source separation signals s1 and s2, and the sound source separation signals s1 and s2 are respectively supplied to the band extension sections 121 and 122.
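One simple way to realize the determination described above is to compare the energy above the extension start band with the total signal energy. The relative threshold below, and the use of a full-signal FFT, are assumptions for illustration.

```python
import numpy as np

def needs_band_extension(signal, sample_rate, f1, rel_threshold=1e-4):
    # A separated signal that already has meaningful energy above the
    # extension start band f1 is treated as high resolution and left alone;
    # otherwise it is routed to a band extension section.
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spec.sum()
    high = spec[freqs >= f1].sum()
    return high < rel_threshold * max(total, 1e-12)

sr = 48000
t = np.arange(sr) / sr
standard_res = np.sin(2 * np.pi * 440 * t)                   # nothing above f1
hi_res = standard_res + 0.1 * np.sin(2 * np.pi * 20000 * t)  # content above f1
std_needs = needs_band_extension(standard_res, sr, 16000.0)
hi_needs = needs_band_extension(hi_res, sr, 16000.0)
```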
The band extension section 121 applies the band extension processing to the sound source separation signal s1 to generate an output signal j1. In the configuration according to the signal processing apparatus 3B, the mixed sound source signal x1 is not supplied to the addition section 13, and thus the band extension section 121 outputs, to the addition section 13, the output signal j1 containing low frequency components, instead of an extended band signal. Further, the band extension section 122 applies the band extension processing to the sound source separation signal s2 to generate an output signal j2. In the configuration according to the signal processing apparatus 3B, the mixed sound source signal x1 is not supplied to the addition section 13, and thus the band extension section 122 outputs, to the addition section 13, the output signal j2 containing low frequency components, instead of an extended band signal. The addition section 13 adds the sound source separation signal hm, the output signal j1, and the output signal j2 together.
The signal processing apparatus 3B according to the present modified example can produce effects similar to those obtained with the configuration of the signal processing apparatus 3 described above. Additionally, according to the signal processing apparatus 3B according to the present modified example, whether or not to apply the band extension processing is automatically determined, thus, for example, eliminating the need for the user to know in advance which of the sound source separation signals the band extension processing is to be applied to and to select whether or not to apply it during the remastering step.
The plurality of embodiments of the present disclosure has been described. However, the present disclosure is not limited to the embodiments described above, and various modifications can be made to the embodiments without departing from the scope of the present disclosure.
In the embodiments described above, the type of the sound source is used as an attribute of the sound source. However, another attribute, such as a signal property of the sound source, may be used.
In a case where DNN or LSTM is applied as the sound source separation section, typically, an input to a network is considered to be an amplitude spectrum of a mixed sound signal, and training data is considered to be an amplitude spectrum of a sound of a target sound source. However, sound source separation signals obtained by sound source separation may be used as the training data in learning.
The present disclosure can also adopt a configuration of cloud computing in which a plurality of apparatuses executes processing of one function in a shared and cooperative manner via a network.
The present disclosure can also be implemented in any form such as an apparatus, a method, a program, or a system. For example, by providing a downloadable program that executes the functions described above in the embodiments and downloading and installing the program in an apparatus not having those functions, the control described in the embodiments can be performed in the apparatus. The present disclosure can also be implemented by a server that distributes such a program. Further, the matters described in the embodiments and the modified examples can be combined as appropriate. In addition, the effects described herein are merely examples and are not to be interpreted as limiting the contents of the disclosure.
The present disclosure can adopt the following configurations.
(1)
A signal processing apparatus including:
The signal processing apparatus according to (1), in which
The signal processing apparatus according to (1) or (2), including:
The signal processing apparatus according to (3), in which,
The signal processing apparatus according to (4), in which
The signal processing apparatus according to (1) or (2), including:
The signal processing apparatus according to (6), in which
The signal processing apparatus according to (1), in which
The signal processing apparatus according to (8), including:
The signal processing apparatus according to (1), including:
The signal processing apparatus according to (10), including:
The signal processing apparatus according to (11), in which
A signal processing method including:
A program causing a computer to execute a signal processing method including:
Number | Date | Country | Kind |
---|---|---|---
2019-172688 | Sep 2019 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---
PCT/JP2020/028423 | 7/22/2020 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---
WO2021/059718 | 4/1/2021 | WO | A |
Number | Name | Date | Kind |
---|---|---|---
10347258 | Sasaki | Jul 2019 | B2 |
20110075832 | Tashiro | Mar 2011 | A1 |
20120099741 | Gotoh | Apr 2012 | A1 |
20160249138 | Van der Werf | Aug 2016 | A1 |
20170374478 | Jones | Dec 2017 | A1 |
20190110135 | Jensen | Apr 2019 | A1 |
Number | Date | Country |
---|---|---
2011-075728 | Apr 2011 | JP |
WO 2015079946 | Jun 2015 | WO |
WO 2018047643 | Mar 2018 | WO |
WO 2018177611 | Oct 2018 | WO |
Entry |
---
International Search Report and English translation thereof mailed Sep. 1, 2020 in connection with International Application No. PCT/JP2020/028423. |
Number | Date | Country | Kind
---|---|---|---
20220375485 | Nov 2022 | US | A1