This application claims the benefit of Japanese Priority Patent Application JP 2013-239187 filed Nov. 19, 2013, the entire contents of which are incorporated herein by reference.
The present disclosure relates to a signal processing apparatus, a signal processing method, and a program.
Signal processing technologies for audio signals have been disclosed in recent years. For example, a technology has been disclosed by which an audio signal is analyzed to calculate a speech score indicating a similarity to speech signal characteristics and a music score indicating a similarity to music signal characteristics, and by which the sound quality is adjusted based on the speech and music scores (see JP 2011-150143A, for example).
However, it is desirable to provide a technology capable of providing a listener with such high presence that the listener feels as if directly listening to audio emitted in a live music venue.
According to an embodiment of the present disclosure, there is provided a signal processing apparatus including a feature detection unit configured to detect, from an input signal, a detection signal including at least one of audience-generated-sound likelihood and music likelihood, and a vicinity-sound generation unit configured to generate vicinity sound based on the detection signal.
According to another embodiment of the present disclosure, there is provided a signal processing method including detecting, from an input signal, a detection signal including at least one of audience-generated-sound likelihood and music likelihood, and causing a processor to generate vicinity sound based on the detection signal.
According to another embodiment of the present disclosure, there is provided a program for causing a computer to function as a signal processing apparatus including a feature detection unit configured to detect, from an input signal, a detection signal including at least one of audience-generated-sound likelihood and music likelihood, and a vicinity-sound generation unit configured to generate vicinity sound based on the detection signal.
According to the embodiments of the present disclosure described above, it is possible to provide a listener with such high presence that the listener feels as if directly listening to audio emitted in a live music venue. Note that the advantageous effects described above are not necessarily limitative; any of the advantageous effects described in this specification, or other advantageous effects apparent from this specification, may be exerted in addition to or instead of the advantageous effects described above.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the appended drawings. Note that, in this specification and the appended drawings, structural elements that have substantially the same function and structure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
In addition, in this specification and the appended drawings, a plurality of structural elements that have substantially the same function and structure might be denoted with the same reference numerals suffixed with different letters or numbers to be discriminated from each other. However, when not having to be particularly discriminated from each other, the plurality of structural elements that have substantially the same function and structure are denoted with the same reference numerals only.
Note that description will be provided in the following order.
1. First Embodiment
2. Second Embodiment
3. Third Embodiment
4. Fourth Embodiment
5. Combination of Embodiments
6. Hardware Configuration Example of Signal Processing Apparatus
7. Conclusion
As to be described below, a signal processing apparatus 1 is supplied with an input signal. The input signal can include an audio input signal detected in a live music venue. A person (such as a vocalist) who utters a voice toward the audience (this voice is hereinafter also referred to as “center sound”) is present in the live music venue. Meanwhile, sounds uttered by the audience in the live music venue are hereinafter collectively referred to as audience-generated sound. The audience-generated sound may include voices uttered by the audience, applause sounds, whistle sounds, and the like. Firstly, a first embodiment of the present disclosure will be described.
The sound-quality adjustment unit 200 adaptively adjusts the sound quality based on the detection signal supplied from the feature detection unit 100A.
Subsequently, description is given of a detailed configuration example of the feature detection unit 100A according to the first embodiment of the present disclosure.
The audience-generated-sound detection unit 110 detects audience-generated-sound likelihood indicating the degree to which an input signal includes audience-generated sound, and outputs the detected audience-generated-sound likelihood. The music detection unit 120 also detects music likelihood indicating the degree to which the input signal includes music, and outputs the detected music likelihood. The tone detection unit 130 further detects a tone of music in the input signal, and outputs the detected tone.
Note that when the feature detection unit 100A includes both the music detection unit 120 and the tone detection unit 130, the tone detection unit 130 may detect the tone only in the case where the music detection unit 120 judges the likelihood as music likelihood.
Subsequently, description is given of a detailed configuration example of the audience-generated-sound detection unit 110 according to the first embodiment of the present disclosure.
The spectral analysis unit 111 performs a spectral analysis on an input signal and supplies the feature-amount extraction unit 112 with a spectrum obtained as the analysis result. A method for the spectral analysis is not particularly limited, and may be based on a time domain or a frequency domain. The feature-amount extraction unit 112 extracts a feature amount (such as a spectral shape or the degree of a spectral peak) based on the spectrum supplied from the spectral analysis unit 111, and supplies the discrimination unit 113 with the extracted feature amount.
Subsequently, description is further given of a detailed configuration example of the feature-amount extraction unit 112 according to the first embodiment of the present disclosure.
Note that a scene in which music is played is hereinafter simply referred to as a “music scene”. Also, a scene in which audience-generated sound is uttered between one music scene and another music scene is simply referred to as a “cheer scene”.
Firstly, the low-band level of the spectrum supplied from the spectral analysis unit 111 is denoted by LV0. The low-band feature-amount extraction unit 112-1 can calculate a low-band feature amount FV0 as an example of the spectral shape in accordance with the following Formula (1).
FV0=w0(LV0−th0) (1)
Here, th0 may be a threshold defined by preliminary learning. Specifically, the learning may be performed in such a manner that LV0 exceeds th0 in a non-cheer scene such as a music scene and does not exceed th0 in a cheer scene.
Likewise, the high-band level of the spectrum supplied from the spectral analysis unit 111 is denoted by LV1. The high-band feature-amount extraction unit 112-2 can calculate a high-band feature amount FV1 as an example of the spectral shape in accordance with the following Formula (2).
FV1=w1(LV1−th1) (2)
Likewise, the middle-band level of the spectrum supplied from the spectral analysis unit 111 is denoted by LV2. The middle-band feature-amount extraction unit 112-3 can calculate a middle-band feature amount FV2 as an example of the spectral shape in accordance with the following Formula (3).
FV2=w2(LV2−th2) (3)
Here, th2 may be a threshold defined by preliminary learning. Specifically, the learning may be performed in such a manner that LV2 exceeds th2 in a cheer scene and does not exceed th2 in a non-cheer scene such as a music scene.
The peak-level feature-amount extraction unit 112-4 may also calculate a peak-level feature amount FV3 as an example of the degree of spectral peaks, by using the sum of spectral peak levels (differences each between a maximum-value level and a minimum-value level adjacent to the maximum-value level). For example, when the spectral analysis unit 111 supplies a spectrum, the sum of the spectral peak levels of the spectrum is denoted by LV3, and the peak-level feature-amount extraction unit 112-4 can calculate FV3 in accordance with the following Formula (4).
FV3=w3(LV3−th3) (4)
Here, th3 may be a threshold defined by preliminary learning. Specifically, the learning may be performed in such a manner that LV3 exceeds th3 in a non-cheer scene such as a music scene and does not exceed th3 in a cheer scene.
Note that w0, w1, w2, and w3 are weighting factors depending on the reliability of the respective feature amounts, and may be learned so that the discrimination unit 113 produces the most appropriate result. For example, the sign of each of w0 to w3 may be determined in the following manner: when the audience-generated-sound likelihood Chrlh to be described later takes on a positive value, the discrimination unit 113 judges the likelihood as audience-generated-sound likelihood, and when Chrlh takes on a negative value, the discrimination unit 113 judges the likelihood as not audience-generated-sound likelihood.
The discrimination unit 113 discriminates the audience-generated-sound likelihood based on the feature amount supplied from the feature-amount extraction unit 112. For example, the discrimination unit 113 discriminates the audience-generated-sound likelihood by using the following conditions based on the spectral shape. The conditions are: the low-band level is lower than a threshold; the high-band level is lower than a threshold; and the middle-band (voice-band) level is higher than a threshold. If at least one of the conditions is satisfied, it can be judged that the sound of low-tone musical instruments (such as a bass and a bass drum) and of high-tone musical instruments (such as cymbals) is fainter than other sounds and that sound in the middle band is louder. Accordingly, the discrimination unit 113 may judge the likelihood as audience-generated-sound likelihood in this case.
Meanwhile, audience-generated sound is considered to have lower spectral peak density than music. Hence, when the spectral peak density is lower than a threshold, the discrimination unit 113 may judge the likelihood as audience-generated-sound likelihood. For example, the discrimination unit 113 can calculate the audience-generated-sound likelihood Chrlh by using the feature amounts FV0 to FV3 in accordance with the following Formula (5).

Chrlh=FV0+FV1+FV2+FV3 (5)
For example, when the audience-generated-sound likelihood Chrlh takes on a positive value, the discrimination unit 113 may judge the likelihood as audience-generated-sound likelihood. In contrast, when the audience-generated-sound likelihood Chrlh takes on a negative value, the discrimination unit 113 may judge the likelihood as not audience-generated-sound likelihood.
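Purely as an illustration of Formulas (1) to (5) and the sign test above, the following Python sketch computes the four feature amounts from precomputed band levels and a spectrum. It is not part of the disclosed embodiments: the function names are hypothetical, the weights and thresholds are placeholders for values that the preliminary learning described above would provide, and the peak search is a crude approximation.

```python
def peak_level_sum(spectrum):
    """Sum of spectral peak levels: for each local maximum, add its
    difference to a neighboring level (a crude stand-in for the adjacent
    local minimum described in the text). Corresponds to LV3."""
    total = 0.0
    for i in range(1, len(spectrum) - 1):
        if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]:
            total += spectrum[i] - min(spectrum[i - 1], spectrum[i + 1])
    return total

def audience_sound_likelihood(lv0, lv1, lv2, lv3,
                              w=(1.0, 1.0, 1.0, 1.0),
                              th=(0.5, 0.5, 0.3, 0.4)):
    """Formulas (1) to (5) with the signs of the learned weights written out:
    a positive Chrlh is judged as audience-generated-sound likelihood."""
    fv0 = -w[0] * (lv0 - th[0])  # loud lows suggest music, not a cheer scene
    fv1 = -w[1] * (lv1 - th[1])  # loud highs (cymbals and the like) likewise
    fv2 = w[2] * (lv2 - th[2])   # a loud voice band suggests audience sound
    fv3 = -w[3] * (lv3 - th[3])  # tall spectral peaks suggest music
    return fv0 + fv1 + fv2 + fv3  # Chrlh
```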
Subsequently, description is given of a detailed configuration example of the music detection unit 120 according to the first embodiment of the present disclosure.
The spectral analysis unit 121 performs a spectral analysis on the input signal and supplies the feature-amount extraction unit 122 with a spectrum obtained as the analysis result. A method for the spectral analysis is not particularly limited, and may be based on a time domain or a frequency domain. The feature-amount extraction unit 122 extracts a feature amount (such as a spectral shape, the degree of a spectral peak, the density of large time variations of the low-band level, or the density of zero crosses of a ramp of the low-band level) based on the spectrum supplied from the spectral analysis unit 121, and supplies the discrimination unit 123 with the extracted feature amount.
Subsequently, description is further given of a detailed configuration example of the feature-amount extraction unit 122 according to the first embodiment of the present disclosure.
Firstly, the low-band level of the spectrum supplied from the spectral analysis unit 121 is denoted by LV0. The low-band feature-amount extraction unit 122-1 can calculate a low-band feature amount FVm0 as an example of the spectral shape in accordance with the following Formula (6).
FVm0=wm0(LV0−thm0) (6)
Here, thm0 may be a threshold defined by preliminary learning. Specifically, the learning may be performed in such a manner that LV0 exceeds thm0 in a music scene and does not exceed thm0 in a non-music scene such as a cheer scene.
Likewise, the high-band level of the spectrum supplied from the spectral analysis unit 121 is denoted by LV1. The high-band feature-amount extraction unit 122-2 can calculate a high-band feature amount FVm1 as an example of the spectral shape in accordance with the following Formula (7).
FVm1=wm1(LV1−thm1) (7)
Likewise, the middle-band level of the spectrum supplied from the spectral analysis unit 121 is denoted by LV2. The middle-band feature-amount extraction unit 122-3 can calculate a middle-band feature amount FVm2 as an example of the spectral shape in accordance with the following Formula (8).
FVm2=wm2(LV2−thm2) (8)
Here, thm2 may be a threshold defined by preliminary learning. Specifically, the learning may be performed in such a manner that LV2 exceeds thm2 in a music scene and does not exceed thm2 in a non-music scene such as a cheer scene.
In addition, the peak-level feature-amount extraction unit 122-4 may calculate a peak-level feature amount FVm3 as an example of the degree of spectral peaks, by using the sum of spectral peak levels (differences each between a maximum-value level and a minimum-value level adjacent to the maximum-value level). For example, when the spectral analysis unit 121 supplies a spectrum, the sum of the spectral peak levels of the spectrum is denoted by LV3, and the peak-level feature-amount extraction unit 122-4 can calculate FVm3 in accordance with the following Formula (9).
FVm3=wm3(LV3−thm3) (9)
Here, thm3 may be a threshold defined by preliminary learning. Specifically, the learning may be performed in such a manner that LV3 exceeds thm3 in a non-music scene such as a cheer scene and does not exceed thm3 in a music scene.
In addition, the low-band-level change-amount extraction unit 122-5 can calculate the density of large time variations of the low-band level in the following manner. Firstly, the low-band level at time t is denoted by LV0(t), and the low-band level at time t−Δt is denoted by LV0(t−Δt). The low-band-level change-amount extraction unit 122-5 can calculate a flag flg in accordance with the following Formulae (10) and (11).
when LV0(t)−LV0(t−Δt)>th, flg(t)=1 (10)

otherwise, flg(t)=0 (11)
However, th is a threshold, and may be set so that LV0(t)−LV0(t−Δt) exceeds th, for example, when the input signal includes the sound of a bass drum being beaten. The low-band-level change-amount extraction unit 122-5 can calculate a time average f# of flg(t) as an example of the density of large time variations of the low-band level in accordance with the following Formula (12).

f#=(1/T)Σflg(t) (12)
Here, T denotes the averaging time. The low-band-level change-amount extraction unit 122-5 can calculate a low-band-level variation amount FVm4 by using the time average f# of flg(t) in accordance with the following Formula (13).
FVm4=wm4(f#−thm4) (13)
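As a minimal sketch of Formulas (10) to (13), assuming the low-band level is available as a per-frame sequence, the density computation might look as follows; the function name and the values of th, wm4, and thm4 are hypothetical placeholders:

```python
import numpy as np

def low_band_variation_feature(lv0_series, th=0.2, wm4=1.0, thm4=0.1):
    """Flag large frame-to-frame jumps of the low-band level (e.g. bass-drum
    hits), average the flags over the window, and form FVm4."""
    lv0 = np.asarray(lv0_series, dtype=float)
    flg = (lv0[1:] - lv0[:-1] > th).astype(float)  # Formulas (10) and (11)
    f_sharp = flg.mean()                           # time average f#, Formula (12)
    return wm4 * (f_sharp - thm4)                  # FVm4, Formula (13)
```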
Note that wm0, wm1, wm2, wm3, and wm4 are weighting factors depending on the reliability of the respective feature amounts, and learning may be performed so that the discrimination unit 123 produces the most appropriate result. For example, the sign of each of wm0 to wm4 may be determined in the following manner: when the music likelihood Msclh to be described later takes on a positive value, the discrimination unit 123 judges the likelihood as music likelihood, and when Msclh takes on a negative value, the discrimination unit 123 judges the likelihood as not music likelihood.
The discrimination unit 123 discriminates the music likelihood based on the feature amount supplied from the feature-amount extraction unit 122. For example, the discrimination unit 123 judges the music likelihood by using the following conditions based on the spectral shape. The conditions are: the low-band level is higher than the threshold; the high-band level is higher than the threshold; and the middle-band (voice-band) level is lower than the threshold. If at least one of the conditions is satisfied, it can be judged that the sound of the low-tone musical instruments (such as the bass and the bass drum) and of high-tone musical instruments such as the cymbals is louder than other sounds and that sound in the middle band is fainter. Accordingly, the discrimination unit 123 may judge the likelihood as music likelihood in this case.
In addition, music is considered to have higher spectral peak density than audience-generated sound. Hence, when the spectral peak density is higher than the threshold, the discrimination unit 123 may judge the likelihood as music likelihood.
Meanwhile, when the input signal includes the sound of a bass drum being beaten, the low-band level changes sharply and largely. Accordingly, when the low-band-level change amount per unit time is larger than a threshold, the discrimination unit 123 can judge that the input signal is highly likely to include bass-drum sound. For this reason, when the frequency at which the low-band-level change amount exceeds the threshold is higher than an upper limit value, the discrimination unit 123 can judge that music including bass-drum sound is continuously played, and thus may judge the likelihood as music likelihood.
For example, the discrimination unit 123 can calculate the music likelihood Msclh by using the feature amounts FVm0 to FVm4 in accordance with the following Formula (14).

Msclh=FVm0+FVm1+FVm2+FVm3+FVm4 (14)
For example, when the music likelihood Msclh takes on a positive value, the discrimination unit 123 may judge the likelihood as music likelihood. In contrast, when the music likelihood Msclh takes on a negative value, the discrimination unit 123 may judge the likelihood as not music likelihood. Note that since a music scene generally lasts for a relatively long time, the discrimination unit 123 may use a time average of the music likelihood Msclh for the discrimination.
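The combination step can be sketched in the same spirit; the summation of Formula (14) and the time averaging noted above are shown below with a hypothetical window length:

```python
def music_likelihood(fvm, history, avg_frames=50):
    """Sum the feature amounts FVm0..FVm4 into Msclh (Formula (14)) and
    smooth over recent frames, since music scenes last relatively long.
    A positive smoothed value is judged as music likelihood."""
    history.append(sum(fvm))
    recent = history[-avg_frames:]
    return sum(recent) / len(recent)
```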
Subsequently, description is further given of a detailed configuration example of the tone detection unit 130 according to the first embodiment of the present disclosure.
The spectral analysis unit 131 performs a spectral analysis on the input signal, and supplies the feature-amount extraction unit 132 with a spectrum obtained as the analysis result. A method for the spectral analysis is not particularly limited, and may be based on a time domain or a frequency domain. The feature-amount extraction unit 132 extracts a feature amount (such as a long-time average of the low-band level or the density of zero crosses of a ramp of the low-band level) based on the spectrum supplied from the spectral analysis unit 131, and supplies the discrimination unit 133 with the extracted feature amount.
The discrimination unit 133 discriminates a tone based on the feature amount. Examples of tones include a moderate tone (such as a ballad or singing to the singer's own accompaniment) that includes almost no sound of low-tone musical instruments such as the bass or the bass drum, a tone having distorted bass sound, other ordinary tones (such as rock and pop), and a not-music-like tone. For example, the moderate tone generally has a low low-band level. However, an ordinary tone might also have a temporarily low low-band level because sound of the low-tone musical instruments is momentarily missing. Thus, a long-time average may be used for the low-band level.
Hence, when the long-time average of the low-band level falls below a threshold, the discrimination unit 133 may judge the tone as a moderate tone. In contrast, when the long-time average of the low-band level exceeds the threshold, the discrimination unit 133 may judge the tone as an aggressive tone. At this time, for example, when the tone switches quickly between the moderate tone and the aggressive tone, simply using the long-time average of the low-band level might cause a delay in following the change of the tone.
Hence, when the audience-generated-sound likelihood exceeds the threshold, the discrimination unit 133 can shorten the time over which the low-band level is averaged, so as to follow the change of the tone quickly. As described above, the time for averaging the low-band level is not particularly limited.
Meanwhile, distorted bass sound tends to lower the density of zero crosses of a ramp of the low-band level, compared with undistorted bass sound.
Hence, when the density of zero crosses of a ramp of the low-band level exceeds a threshold, the discrimination unit 133 may discriminate a tone having undistorted bass sound. In contrast, when the density of zero crosses of a ramp of the low-band level falls below the threshold, the discrimination unit 133 may discriminate a tone having distorted bass sound.
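A compact, illustrative version of this tone discrimination, with hypothetical thresholds and window length (the disclosure leaves the averaging time open), might read:

```python
import numpy as np

def discriminate_tone(lv0_series, avg_frames=200, level_th=0.3, zc_th=0.05):
    """Long-time low-band average below a threshold -> moderate tone;
    otherwise, a low density of zero crosses in the ramp (frame-to-frame
    slope) of the low-band level -> distorted bass sound."""
    lv0 = np.asarray(lv0_series[-avg_frames:], dtype=float)
    if lv0.mean() < level_th:
        return "moderate"
    ramp = np.diff(lv0)
    crosses = np.sum(np.signbit(ramp[:-1]) != np.signbit(ramp[1:]))
    return "distorted bass" if crosses / len(ramp) < zc_th else "ordinary"
```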
Subsequently, description is given of a detailed configuration example of the sound-quality adjustment unit 200 according to the first embodiment of the present disclosure.
The sound-quality adjustment unit 200 may adjust the sound quality based on the detection signal at least by controlling a dynamic range. More specifically, each bandsplitting filter 220 extracts a signal in the corresponding band from the input signal. The gain-curve calculation unit 210 calculates, based on a tone, the change (a gain curve) of a coefficient by which each band level is multiplied. Each dynamic-range controller 230 adjusts the sound quality by multiplying the band signal output by the bandsplitting filter 220 by the coefficient. The adder 240 adds up the signals from the dynamic-range controllers 230 and outputs the resultant signal.
Each dynamic-range controller 230 can operate as a compressor for generating signals having such a high sound-volume impression (a narrow dynamic range) as is experienced in a live music venue. The dynamic-range controller 230 may be a multiband compressor or a single-band compressor. When it is a multiband compressor, the dynamic-range controller 230 can also boost the low-band and high-band levels to thereby generate signals having frequency characteristics like those of music heard in a live music venue.
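The band-split/compress/sum structure of the bandsplitting filters 220, the dynamic-range controllers 230, and the adder 240 can be sketched as below. This is a simplified static compressor with no attack or release smoothing; the band edges, ratios, and thresholds are illustrative choices, not values from the disclosure:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def multiband_compress(x, fs, edges=(200.0, 4000.0),
                       ratios=(2.0, 1.5, 2.0),
                       thresholds_db=(-30.0, -24.0, -30.0)):
    """Split into low/mid/high bands, apply a static gain curve per band
    (compression above the threshold), and sum the bands."""
    bands = [
        sosfilt(butter(4, edges[0], "lowpass", fs=fs, output="sos"), x),
        sosfilt(butter(4, edges, "bandpass", fs=fs, output="sos"), x),
        sosfilt(butter(4, edges[1], "highpass", fs=fs, output="sos"), x),
    ]
    out = np.zeros(len(x))
    for band, ratio, th_db in zip(bands, ratios, thresholds_db):
        level_db = 20.0 * np.log10(np.abs(band) + 1e-12)
        over = np.maximum(level_db - th_db, 0.0)
        gain_db = -over * (1.0 - 1.0 / ratio)  # reduce level above threshold
        out += band * 10.0 ** (gain_db / 20.0)
    return out
```

In this sketch, lowering the compressor setting for a moderate tone or for audience-generated sound would correspond to ratios closer to 1.0.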
Meanwhile, in a live music venue, the compressor is often set low for a moderate tone to produce a free and easy sound. Accordingly, when the tone detection unit 130 discriminates a moderate tone, the gain-curve calculation unit 210 can reproduce the sound produced in the live music venue by calculating such a gain curve that causes lower setting of the compressor.
In addition, when the tone detection unit 130 discriminates a tone having a largely distorted low-band level, the gain-curve calculation unit 210 can prevent generation of an unpleasant sound with emphasized distortion, by calculating such a gain curve that causes lower setting of the compressor. Moreover, the audience-generated sound does not pass through a public address (PA) system, and thus does not have to be subjected to the compressor processing. Thus, when the audience-generated-sound detection unit 110 judges the likelihood as audience-generated-sound likelihood, the gain-curve calculation unit 210 can prevent the sound quality of the audience-generated sound from being changed, by calculating such a gain curve that causes lower setting of the compressor.
However, when the bass sound is largely distorted, increasing the gain as shown by the gain curve 1 further amplifies the distortion and thus might produce unpleasant sound. For this reason, when the tone detection unit 130 judges the tone as one having largely distorted bass sound, control may be performed to prevent the distortion from being emphasized, by calculating such a gain curve as the gain curve 2 and by changing the boost amount.
In contrast, when the tone detection unit 130 discriminates a moderate tone, the gain-curve calculation unit 210 may change the setting to that of the gain curve 2 and thus perform control to prevent excessive sound-quality change. This is because priority might be given to sound quality over a sound-volume impression.
Further, more advanced signal processing may be performed in cooperation with servers.
For example, a case is assumed in which the feature detection unit 100A of the reproducer 20 detects tune information (such as information for identifying a tune or tune genre information) from the content. In this case, the sound-quality adjustment unit 200 may acquire sound-quality adjustment parameters for the tune information from the parameter-delivery server 30 and adjust the sound quality according to the acquired sound-quality adjustment parameters.
Another case is also assumed in which a server has the functions of the feature detection unit 100A and the sound-quality adjustment unit 200. In this case, the reproducer 20 may provide the server with content, acquire the content having undergone the sound-quality adjustment from the server, and reproduce the content. At this time, the reproducer 20 may transmit, to the server, performance information of the reproducer 20 (such as a supported frequency range or a supported sound pressure) together with the content, and may cause the server to adjust the sound quality so that content meeting the performance information of the reproducer 20 is obtained.
According to the first embodiment of the present disclosure as described above, it is possible to detect a tone and to adaptively change the degree of the compressor setting according to the tone. For this reason, sound of many tunes such as rock and pop can be adjusted to sound with such a large-sound-volume impression as is heard in a live music venue. In contrast, for a tune desired to be moderate, free, and easy, it is possible to automatically lower the compressor setting and thereby prevent distortion from causing loss of the easiness. Moreover, when bass sound recorded in content is originally distorted, it is possible to prevent the compressor from further increasing the distortion, thereby preventing unpleasant sound generation.
Subsequently, description is given of a second embodiment of the present disclosure. Structural elements in the second embodiment of the present disclosure that have substantially the same function and structure as in the first embodiment of the present disclosure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
The signal extraction unit 300 adaptively extracts predetermined sound as extracted sound based on the detection signal supplied from the feature detection unit 100A. The predetermined sound as extracted sound may include at least one of surround sound and center sound. The surround sound is a signal obtained by reducing sound localized mainly in the center in the input signal.
Subsequently, description is given of a detailed configuration example of the signal extraction unit 300 according to the second embodiment of the present disclosure.
The center-sound extraction unit 310 adaptively extracts center sound from an input signal according to the detection signal. The center-sound extraction unit 310 may add the extracted center sound to the input signal. The center sound made unclear due to reverberation addition, the sound-quality adjustment, or the like can thereby be made clear.
For example, the center-sound extraction unit 310 may be configured to extract the center sound when the music detection unit 120 judges the likelihood as music likelihood, and not to extract the center sound when the music detection unit 120 judges the likelihood as not music likelihood. The center sound is extracted according to the music likelihood in this way. In the case of not music likelihood (that is, in a cheer scene), the extraction of the center sound is suppressed, and thus deterioration of the spreading feeling can be prevented.
In addition, for example, the center-sound extraction unit 310 may be configured not to extract the center sound when the audience-generated-sound detection unit 110 judges the likelihood as audience-generated-sound likelihood, and configured to extract the center sound when the audience-generated-sound detection unit 110 judges the likelihood as not audience-generated-sound likelihood. As described above, also when the center sound is extracted according to the audience-generated-sound likelihood, the same function can be implemented.
The surround-sound extraction unit 320 adaptively extracts surround sound from the input signal according to the detection signal. The surround-sound extraction unit 320 may add the extracted surround sound to the input signal (a surround channel of the input signal). This can further enhance the presence in a cheer scene or the spreading feeling.
For example, when the music detection unit 120 judges the likelihood as music likelihood, the surround-sound extraction unit 320 may extract surround sound to such an extent that the clearness of the music is not deteriorated, so that the presence can be provided. When the music detection unit 120 judges the likelihood as not music likelihood, the surround-sound extraction unit 320 may extract the surround sound to a larger extent. The surround sound is extracted in this way according to the music likelihood. In the case of the music likelihood (in a music scene), the extraction of the surround sound is reduced, and thus deterioration of the clearness of the music can be prevented.
In addition, for example, when the audience-generated-sound detection unit 110 judges the likelihood as not audience-generated-sound likelihood, the surround-sound extraction unit 320 may extract surround sound to such an extent that the clearness of the music is not deteriorated, so that the presence can be provided. When the audience-generated-sound detection unit 110 judges the likelihood as audience-generated-sound likelihood, the surround-sound extraction unit 320 may extract the surround sound to a larger extent. As described above, also when the surround sound is extracted according to the audience-generated-sound likelihood, the same function can be implemented.
Subsequently, description is given of a detailed configuration example of the center-sound extraction unit 310 according to the second embodiment of the present disclosure.
The adder 311 adds up the input signals of the L channel and the R channel. The bandpass filter 312 extracts a voice-band signal by passing the summed signal through the voice band. The gain calculation unit 313 calculates a gain, by which the signal extracted by the bandpass filter 312 is multiplied, based on at least one of the music likelihood and the audience-generated-sound likelihood. The amplifier 314 outputs, as the center sound, the result of multiplying the extracted signal by the gain.
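Assuming stereo input as floating-point arrays and a placeholder voice band, the chain of the adder 311, the bandpass filter 312, the gain calculation unit 313, and the amplifier 314 might be sketched as:

```python
from scipy.signal import butter, sosfilt

def extract_center_sound(left, right, fs, voice_band=(200.0, 4000.0), gain=0.5):
    """Sum L and R (center-localized sound adds coherently), keep the voice
    band, and scale by a gain that would be derived from the music and
    audience-generated-sound likelihoods."""
    mono = left + right                                          # adder 311
    sos = butter(4, voice_band, "bandpass", fs=fs, output="sos")
    return gain * sosfilt(sos, mono)                             # units 312-314
```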
Subsequently, description is given of a detailed configuration example of the surround-sound extraction unit 320 according to the second embodiment of the present disclosure.
The surround sound can correspond to a signal obtained by subtracting one of the L-channel and R-channel input signals from the other by the corresponding one of the subtractors 323 and 324. However, a low-band component is often localized mainly in the center and gives a weak localization impression to the ear. For this reason, the low-band component of the signal to be subtracted is removed by the highpass filter 321 before the subtraction. This enables the surround sound to be generated without deteriorating the low-band component of the signal from which the other is subtracted.
The gain calculation unit 322 calculates a gain based on at least one of the music likelihood and the audience-generated-sound likelihood. Each of the amplifiers 325 and 326 outputs, as the extracted sound, the result of multiplying the corresponding subtraction result by the gain.
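A sketch of the subtraction path follows; the cutoff and gain are placeholders, and applying the highpass filter symmetrically to both channels is an illustrative choice (the disclosure shows a single highpass filter 321):

```python
from scipy.signal import butter, sosfilt

def extract_surround(left, right, fs, cutoff=200.0, gain=0.5):
    """Highpass the channel being subtracted so the center-localized low
    band of the remaining channel survives, then difference and scale."""
    sos = butter(4, cutoff, "highpass", fs=fs, output="sos")
    hp_l, hp_r = sosfilt(sos, left), sosfilt(sos, right)
    return gain * (left - hp_r), gain * (right - hp_l)  # subtractors 323/324
```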
According to the second embodiment of the present disclosure as described above, presence appropriate for the scene of content and clear center sound are obtained. Since music arrives mainly from the front of the audience in a live music venue, sound supplied to a surround speaker in a music scene may be relatively faint, on the order of reflected sound. However, since the audience can be present in any direction, relatively loud sound is preferably supplied to the surround speaker in a cheer scene. According to the second embodiment of the present disclosure, the amount of the surround component supplied can be increased in a cheer scene, and thus such presence that the listener feels as if surrounded by cheers in a live music venue can be obtained.
Meanwhile, processing for enhancing the presence such as reverberation addition or sound-quality adjustment might make the center sound unclear. For this reason, the center sound is extracted in the music scene, and is not extracted in the cheer scene. It is thereby possible to enhance the clearness of the center sound without deteriorating the spreading feeling in the cheer scene.
Subsequently, description is given of a third embodiment of the present disclosure. Structural elements in the third embodiment of the present disclosure that have substantially the same function and structure as in the first and second embodiments of the present disclosure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
Based on the detection signal supplied from the feature detection unit 100B, the vicinity-sound generation unit 400 generates sound uttered by the audience near an audio-input-signal detection location in the live music venue (such as voices, whistling sounds, and applause sounds). Hereinafter, the sound uttered by the neighboring audience is also referred to as vicinity sound.
Subsequently, description is given of a detailed configuration example of the feature detection unit 100B according to the third embodiment of the present disclosure.
Subsequently, description is given of a detailed configuration example of the audience-generated-sound analysis unit 140 according to the third embodiment of the present disclosure.
The spectral analysis unit 141 performs a spectral analysis on the input signal and supplies the feature-amount extraction unit 142 with a spectrum obtained as the analysis result. A method for the spectral analysis is not particularly limited, and may be based on a time domain or a frequency domain. The feature-amount extraction unit 142 extracts a feature amount (such as a voice-band spectral shape) based on the spectrum supplied from the spectral analysis unit 141, and supplies the discrimination unit 143 with the extracted feature amount.
The discrimination unit 143 discriminates a type of audience-generated sound based on the feature amount (such as a voice-band spectral shape) extracted by the feature-amount extraction unit 142. The following describes a specific example. For example, when a spectral peak in the voice band is present in a male-voice band (about 700 to 800 Hz) as in a spectrum 1, the discrimination unit 143 may judge the type of audience-generated sound as a male cheer.
In contrast, when a spectral peak in the voice band is present in a female-voice band (about 1.1 to 1.3 kHz) as in a spectrum 2, the discrimination unit 143 may judge the type of audience-generated sound as a female cheer.
In addition, when a peak has a sharper shape than a threshold shape as in the spectrum 1, the discrimination unit 143 may judge that the audience-generated sound is voice-like; in contrast, when no peak sharper than the threshold shape is present, the discrimination unit 143 may judge the type of audience-generated sound as applause sound.
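Purely as an illustration of this type discrimination (the band limits follow the approximate values above, while the sharpness measure and its threshold are hypothetical):

```python
import numpy as np

def classify_audience_sound(freqs, spectrum, sharpness_th=5.0):
    """Locate the dominant voice-band peak; a flat spectrum with no sharp
    peak suggests applause, while the peak frequency separates male and
    female cheers (approximate bands from the text)."""
    band = (freqs >= 300.0) & (freqs <= 3000.0)
    f, s = freqs[band], spectrum[band]
    i = int(np.argmax(s))
    sharpness = s[i] / (s.mean() + 1e-12)  # peak level vs. band average
    if sharpness < sharpness_th:
        return "applause"
    if 700.0 <= f[i] <= 800.0:
        return "male cheer"
    if 1100.0 <= f[i] <= 1300.0:
        return "female cheer"
    return "other voice"
```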
The vicinity-sound generation unit 400 generates vicinity sound based on the detection signal. For example, suppose a condition that audience-generated-sound likelihood is higher than a threshold and a condition that music likelihood is lower than a threshold. When at least one of the conditions is satisfied, the vicinity-sound generation unit 400 may generate vicinity sound. In contrast, suppose a condition that the audience-generated-sound likelihood is lower than the threshold and a condition that the music likelihood is higher than the threshold. When at least one of the conditions is satisfied, the vicinity-sound generation unit 400 does not have to generate vicinity sound to avoid unnatural addition of the vicinity sound to a tune (or may generate fainter vicinity sound).
When the type of audience-generated sound is a male cheer (or a dominant male cheer), the vicinity-sound generation unit 400 may generate vicinity sound including a male voice. In contrast, when the type of audience-generated sound is a female cheer (or a dominant female cheer), the vicinity-sound generation unit 400 may generate vicinity sound including a female voice. When the type of audience-generated sound is applause sound (or a dominant applause sound), the vicinity-sound generation unit 400 may generate vicinity sound including applause sound. In this way, it is possible to generate such vicinity sound that naturally fits in an input signal.
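The generation policy above reduces to a small decision rule. In the following illustrative sketch, the thresholds and the mapping from sound type to prerecorded material (including the file names) are hypothetical:

```python
def should_add_vicinity_sound(chrlh, msclh, chr_th=0.0, msc_th=0.0):
    """Add vicinity sound when at least one cheer-scene condition holds
    (high audience-sound likelihood or low music likelihood); otherwise
    suppress it to avoid unnatural addition to a tune."""
    return chrlh > chr_th or msclh < msc_th

def pick_vicinity_source(audience_type):
    """Choose prerecorded vicinity material matching the detected type so
    the added sound fits the input signal naturally."""
    return {
        "male cheer": "male_voices.wav",
        "female cheer": "female_voices.wav",
        "applause": "applause.wav",
    }.get(audience_type, "crowd_ambience.wav")
```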
The vicinity sound may be added to the input signal by the vicinity-sound generation unit 400. This makes it possible to enjoy a sound field having the vicinity sound added thereto. Note that a method for generating vicinity sound used by the vicinity-sound generation unit 400 is not limited. For example, the vicinity-sound generation unit 400 may generate vicinity sound by reproducing vicinity sound recorded in advance. The vicinity-sound generation unit 400 may also generate vicinity sound in a pseudo manner, like a synthesizer. Alternatively, the vicinity-sound generation unit 400 may generate vicinity sound by removing a reverberation component from the input signal.
According to the third embodiment of the present disclosure as described above, sounds (such as voices, whistle sounds, and applause sounds) that are uttered by the neighboring audience and are difficult to record in content are generated, and thereby it is possible to provide such an absorption feeling and presence that a listener feels as if directly listening to music played in the live music venue. Analyzing the content and adding easy-to-fit vicinity sound matching a cheer scene enables a natural sound field to be generated, without abruptly adding vicinity sound to a non-cheer scene.
Subsequently, description is given of a fourth embodiment of the present disclosure. Structural elements in the fourth embodiment of the present disclosure that have substantially the same function and structure as in the first to third embodiments of the present disclosure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
The reverberation adding unit 500 adaptively adds reverberation to an input signal based on the detection signal.
The reverberation adding unit 500 may add reverberation according to a tone detected by the tone detection unit 130. For example, when a moderate tone is discriminated, the reverberation adding unit 500 may set a longer reverberation time. This makes it possible to generate a more spreading and dynamic sound field. In contrast, when an ordinary tone (such as rock or pop) is discriminated, the reverberation adding unit 500 may set a shorter reverberation time. This makes it possible to avoid loss of clearness of fast passage or the like.
In addition, when audience-generated-sound likelihood is discriminated, the reverberation adding unit 500 may set a longer reverberation time. This can generate a sound field having higher presence and thus can liven up the content.
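As a minimal sketch of tone-dependent reverberation, the following uses a single feedback comb filter whose decay is set from a target reverberation time; the delay and mix values are placeholders:

```python
import numpy as np

def add_reverb(x, fs, rt60, delay_ms=50.0, mix=0.3):
    """Feedback comb filter: the feedback gain is chosen so the tail decays
    by 60 dB over rt60 seconds."""
    d = int(fs * delay_ms / 1000.0)
    g = 10.0 ** (-3.0 * (delay_ms / 1000.0) / rt60)
    y = np.array(x, dtype=float)
    for n in range(d, len(y)):
        y[n] += g * y[n - d]
    return (1.0 - mix) * x + mix * y
```

Choosing a larger rt60 when the tone detection unit 130 discriminates a moderate tone, or when audience-generated-sound likelihood is discriminated, would match the behavior described above.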
According to the fourth embodiment of the present disclosure as described above, appropriately adjusting a reverberation characteristic according to a tone or a scene makes it possible to generate a clear sound field having a more spreading feeling. A characteristic having a relatively short reverberation time is set for a tune in quick tempo to prevent short passages from becoming unclear, while a characteristic having a relatively long reverberation time is set for a slow tune or a cheer scene. It is thereby possible to generate a sound field having dynamic presence.
Subsequently, description is given of a fifth embodiment of the present disclosure. Two or more of the first to fourth embodiments described above can be appropriately combined in the fifth embodiment of the present disclosure. It is thereby expected that a listener can be provided with even higher presence, as if directly listening to audio emitted in the live music venue. An example of combining all of the first to fourth embodiments will be described as the fifth embodiment of the present disclosure.
The feature detection unit 100C detects a feature amount from an input signal and supplies the detected feature amount to the center-sound extraction unit 310, the surround-sound extraction unit 320, the sound-quality adjustment unit 200, the vicinity-sound generation unit 400, and the reverberation adding unit 500. The center-sound extraction unit 310 extracts center sound according to music likelihood supplied from the feature detection unit 100C, and supplies the sound-quality adjustment unit 200 with the extracted center sound. The sound-quality adjustment unit 200 adjusts the sound quality of each of the input signal and the center sound based on a tone supplied from the feature detection unit 100C, and supplies the surround-sound extraction unit 320 and the reverberation adding unit 500 with the input signal and the center sound that have undergone the sound-quality adjustment.
The surround-sound extraction unit 320 extracts surround sound from the input signal having undergone the sound-quality adjustment, according to audience-generated-sound likelihood supplied from the feature detection unit 100C, and supplies the reverberation adding unit 500 with the surround sound. The vicinity-sound generation unit 400 generates vicinity sound according to the feature amount (such as audience-generated-sound likelihood, the type of audience-generated sound, or music likelihood) supplied from the feature detection unit 100C, and supplies the reverberation adding unit 500 with the generated vicinity sound.
According to a tone supplied from the feature detection unit 100C, the reverberation adding unit 500 adds reverberation to an input signal supplied from each of the sound-quality adjustment unit 200, the surround-sound extraction unit 320, and the vicinity-sound generation unit 400. The adder 600 adds the vicinity sound generated by the vicinity-sound generation unit 400 to an output signal from the reverberation adding unit 500.
Subsequently, description is given of a hardware configuration example of the signal processing apparatus 1 according to the embodiments of the present disclosure.
As illustrated in the figure, the signal processing apparatus 1 includes a CPU 801, a ROM 802, a RAM 803, an input device 808, an output device 810, a storage device 811, a drive 812, and a communication device 815.
The CPU 801 functions as an arithmetic processing unit and a control unit, and controls overall operation of the signal processing apparatus 1 according to a variety of programs. The CPU 801 may also be a microprocessor. The ROM 802 stores therein the programs, operational parameters, and the like that are used by the CPU 801. The RAM 803 temporarily stores therein the programs used and executed by the CPU 801, parameters appropriately varying in executing the programs, and the like. These are connected to each other through a host bus configured of a CPU bus or the like.
The input device 808 includes: an operation unit for inputting information by a user, such as a mouse, a keyboard, a touch panel, buttons, a microphone, a switch, or a lever; an input control circuit that generates input signals based on input by the user and outputs the signals to the CPU 801; and the like. By operating the input device 808, the user of the signal processing apparatus 1 can input various data and give the signal processing apparatus 1 instructions for processing operation.
The output device 810 may include a display device such as a liquid crystal display (LCD) device, an organic light emitting diode (OLED) device, or a lamp. The output device 810 may further include an audio output device such as a speaker or a headphone. For example, the display device displays a captured image, a generated image, and the like, while the audio output device converts audio data and the like into audio and outputs the audio.
The storage device 811 is a device for storing data configured as an example of a storage unit of the signal processing apparatus 1. The storage device 811 may include a storage medium, a recorder that records data in the storage medium, a reader that reads data from the storage medium, a deletion device that deletes data recorded in the storage medium, and the like. The storage device 811 stores therein the programs executed by the CPU 801 and various data.
The drive 812 is a reader/writer and is built in or externally connected to the signal processing apparatus 1. The drive 812 reads information recorded in a removable storage medium loaded in the drive 812, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs the information to the RAM 803. The drive 812 can also write information to the removable storage medium.
The communication device 815 is a communication interface for connecting to, for example, a network. The communication device 815 may be a communication device supporting a wireless local area network (LAN), a communication device supporting long term evolution (LTE), or a wired communication device that performs wired communication. The communication device 815 can communicate with another device, for example, through a network. The description has heretofore been given of the hardware configuration example of the signal processing apparatus 1 according to the embodiments of the present disclosure.
According to each of the first to fourth embodiments of the present disclosure as described above, it is possible to provide a listener with such high presence that the listener feels as if directly listening to audio emitted in a live music venue. According to the fifth embodiment of the present disclosure, it is expected that even higher presence can be provided to the listener by appropriately combining two or more of the first to fourth embodiments of the present disclosure.
It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and alterations may occur depending on design requirements and other factors insofar as they are within the scope of the appended claims or the equivalents thereof.
It is also possible to generate a program for causing the hardware such as the CPU, the ROM, and the RAM which are built in a computer to exert functions equivalent to those of the aforementioned signal processing apparatus 1. There can also be provided a computer-readable storage medium storing the program.
In addition, the advantageous effects described in this specification are merely explanatory or illustrative, and are not limitative. In other words, the technology according to the present disclosure can exert other advantageous effects that are apparent to those skilled in the art from the description of this specification, in addition to or instead of the advantageous effects described above.
Additionally, the present technology may also be configured as below.
(1)
A signal processing apparatus including:
a feature detection unit configured to detect, from an input signal, a detection signal including at least one of audience-generated-sound likelihood and music likelihood; and
a vicinity-sound generation unit configured to generate vicinity sound based on the detection signal.
(2)
The signal processing apparatus according to (1),
wherein the feature detection unit further detects a type of audience-generated sound from the input signal, and
wherein the vicinity-sound generation unit generates vicinity sound appropriate for the type of audience-generated sound.
(3)
The signal processing apparatus according to (2),
wherein the type of audience-generated sound includes at least one of a male cheer, a female cheer, a whistle, and applause sound.
(4)
The signal processing apparatus according to any one of (1) to (3),
wherein the vicinity-sound generation unit adds the vicinity sound to the input signal.
(5)
The signal processing apparatus according to any one of (1) to (4), further including:
a sound-quality adjustment unit configured to perform sound-quality adjustment based on the detection signal.
(6)
The signal processing apparatus according to (5),
wherein the feature detection unit further detects a tone from the input signal, and
wherein the sound-quality adjustment unit performs the sound-quality adjustment appropriate for the tone.
(7)
The signal processing apparatus according to (5) or (6),
wherein the sound-quality adjustment unit performs at least dynamic range control as the sound-quality adjustment.
(8)
The signal processing apparatus according to any one of (1) to (7), further including:
a signal extraction unit configured to extract predetermined sound as extracted sound from the input signal based on the detection signal.
(9)
The signal processing apparatus according to (8),
wherein the predetermined sound as extracted sound includes at least one of center sound and surround sound.
(10)
The signal processing apparatus according to (8) or (9),
wherein the signal extraction unit adds the extracted sound to the input signal.
(11)
The signal processing apparatus according to any one of (1) to (10), further including:
a reverberation adding unit configured to add reverberation to the input signal based on the detection signal.
(12)
The signal processing apparatus according to (11),
wherein the feature detection unit further detects a tone from the input signal, and
wherein the reverberation adding unit adds reverberation appropriate for the tone.
(13)
A signal processing method including:
detecting, from an input signal, a detection signal including at least one of audience-generated-sound likelihood and music likelihood; and
causing a processor to generate vicinity sound based on the detection signal.
(14)
A program for causing a computer to function as a signal processing apparatus including:
a feature detection unit configured to detect, from an input signal, a detection signal including at least one of audience-generated-sound likelihood and music likelihood; and
a vicinity-sound generation unit configured to generate vicinity sound based on the detection signal.
Number | Date | Country | Kind |
---|---|---|---
2013-239187 | Nov 2013 | JP | national |
Number | Date | Country |
---|---|---
2011-150143 | Aug 2011 | JP |
Number | Date | Country
---|---|---
20150142445 A1 | May 2015 | US |