The present invention relates to a technique for mixing collected sound signals of a plurality of microphones.
In recent years, interactive viewing has been attracting interest, due to advances in VR (Virtual Reality) and AR (Augmented Reality) technologies. For example, when playing an omnidirectional (360-degree) moving image, the user is able to freely designate the angle of the field of view (an angle section within 360 degrees) and display the moving image on a display at that angle of view. Sound collected omnidirectionally through 360 degrees is also played while the omnidirectional moving image is playing. For this playback, an Ambisonics system, a binaural system or a surround system is generally used for omnidirectional composition of sound fields.
A conventional circular microphone array collects sound with a plurality of directional microphones facing in different directions.
There are technologies for mixing acoustic signals collected by a plurality of microphones. PTL 1 discloses a stereo width control technique for adjusting (widening or narrowing) the width of the sound field range with respect to the acoustic signals collected by two microphones. According to PTL 1, two acoustic signals of a right channel and a left channel are generated from the collected sound signals of the two microphones, based on the expansion ratio of the sound field. The sound field range is adjusted by driving a set of stereo speakers with these two acoustic signals.
Also, PTL 2 discloses a stereo width control technique for acoustic signals collected by three or more microphones.
PTL 1: Japanese Patent No. 3905364
PTL 2: Japanese Patent Laid-Open No.2019-068210
NPL 1: Chapter 2: “Sound Source Separation”, Volume 6: “Acoustic Signal Processing”, Group 2: “Images, Sounds, Languages”, Forest of Knowledge, IEICE, accessed on Feb. 15, 2020, at: http://www.ieice-hbkb.org/files/02/02gun_06hen_02.pdf
NPL 2: ZYLIA ZM-1 microphone (multitrack recording microphone array), accessed on Feb. 16, 2020, at: https://www.minet.jp/brand/zylia/zylia-music-set/
NPL 3: Insta360 Pro2, accessed on Feb. 16, 2020, at: https://hacosco.com/insta360-pro2/
When the collected sound signals of a plurality of microphones are mixed directly, the collected sound signals are output from the speakers at the same level. In this case, even if the user is looking at the violin player, for example, the sound of the violin will be heard at the same level as the sounds of the other instruments. Thus, the user senses a divergence between the video image range that he or she is viewing and the sound field range.
The present disclosure provides a technique for mixing collected sound signals of a plurality of microphones, such that the user does not sense a divergence between the visual video image range and the auditory sound field range.
According to an aspect of the present invention, there is provided an apparatus for mixing collected sound signals, comprising: one or more processors; and one or more memory devices configured to store M collected sound signals, each of which is collected by a corresponding microphone of M microphones, M being 2 or more, and one or more computer programs executable by the one or more processors. The one or more programs, when executed by the one or more processors, cause the apparatus to function as: an angle section setting unit configured to set an angle section, selected by a user, at a single sound collection position; a frequency analysis unit configured to convert each of the M collected sound signals into a frequency component; a beamforming unit configured to multiply the M frequency components obtained through conversion by the frequency analysis unit by respective beamforming matrices to generate a plurality of two-channel acoustic signals; and a synthesized acoustic signal generation unit configured to synthesize the acoustic signals per channel and output an acoustic signal for every channel.
According to the present disclosure, collected sound signals of a plurality of microphones can be mixed, such that the user does not sense a divergence between the visual video image range and the auditory sound field range.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings. Note that the same reference numerals denote the same or like components throughout the accompanying drawings.
Hereinafter, embodiments will be described in detail with reference to the attached drawings. Note, the following embodiments are not intended to limit the scope of the claimed invention, and limitation is not made to an invention that requires a combination of all features described in the embodiments. Two or more of the multiple features described in the embodiments may be combined as appropriate. Furthermore, the same reference numerals are given to the same or similar configurations, and redundant description thereof is omitted.
A collected sound recording apparatus 2 includes a spherical microphone array 21 and an omnidirectional camera 22.
In the spherical microphone array 21, a plurality of (M) microphones are geometrically arranged to enable sound to be collected from different directions (see NPL 2). The geometrically arranged microphones are not limited to being arranged equidistantly from each other. Also, the microphones that are installed in the spherical microphone array 21 may be non-directional or directional.
In the omnidirectional camera 22, a plurality of cameras shoot in different directions to generate a 360-degree video image obtained by combining a plurality of captured video images (see NPL 3). The 360-degree video image is a sound field video image that captures the sound field range. The omnidirectional camera 22 shoots in sync with the sound collection of the spherical microphone array 21.
The collected sound recording apparatus 2 transmits the collected sound signal of every microphone collected by the spherical microphone array 21 and the sound field video image generated by the omnidirectional camera 22 to a media playback apparatus 1.
The media playback apparatus 1 receives the collected sound signals of the microphones and the 360-degree video image from the collected sound recording apparatus 2. The media playback apparatus 1 is a terminal operable by the user and equipped with at least a display and a speaker, such as a smartphone or a tablet terminal, for example.
The media playback apparatus 1 has a collected sound signal storage unit 101, a sound field video storage unit 102, a display 103 and speakers 104.
The collected sound signal storage unit 101 receives the collected sound signals of the plurality of microphones from the collected sound recording apparatus 2, and stores the collected sound signals.
The sound field video storage unit 102 receives the sound field video image that captures the sound field range from the collected sound recording apparatus 2, and stores the sound field video image.
The display 103 visually plays the video image stored in the sound field video storage unit 102. For example, the display 103 may be the display of a smartphone or tablet, or may be a VR head-mounted display. The display 103 is user-operable with a touch panel device or a pointing device, and is capable of changing the display position and enlarging or reducing the display range with respect to the video image of the visual range that is displayed.
The speakers 104 play the acoustic signal that is ultimately mixed. In the case of stereo, respective synthesized acoustic signals are output by the left channel speaker and the right channel speaker.
Also, the media playback apparatus 1 has an angle section setting unit 11, a frequency analysis unit 12, a beamforming unit 130 and a synthesized acoustic signal generation unit 14. These functional constituent units are realized by causing one or more processors of a computer installed in the media playback apparatus 1 to execute an appropriate program. The flow of processing by these functional constituent units can also be viewed as a media playback method.
The angle section setting unit 11 sets an angle section arbitrarily selected by the user at a single sound collection position. The set angle section is output to the beamforming unit 130. Also, the angle section setting unit 11 holds information on the disposition position and sound collection direction of each of the microphones.
Also, the angle section may be a section along a straight line or a curved line on which a plurality of microphones are disposed. Even in the case of a circle or a curved line, the section can be set by mapping the positions of the plurality of microphones onto a straight line.
The frequency analysis unit 12 executes a discrete Fourier transform on each of the M collected sound signals for every time interval, converting each collected sound signal into a frequency component. The frequency components are output to the beamforming unit 130 as an input acoustic signal x(ω) = (x1(ω), x2(ω), . . . , xM(ω))^T.
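By way of illustration only, the per-frame conversion performed by the frequency analysis unit 12 might be sketched as follows; the frame length, the window choice and the use of NumPy are our assumptions, not part of the disclosure.

```python
import numpy as np

def analyze_frame(frames: np.ndarray) -> np.ndarray:
    """Convert one time frame of M collected sound signals into frequency components.

    frames: array of shape (M, frame_len), one row per microphone.
    Returns x(omega): array of shape (M, frame_len // 2 + 1) of complex spectra.
    """
    window = np.hanning(frames.shape[1])        # analysis window (an assumption)
    return np.fft.rfft(frames * window, axis=1) # per-microphone DFT of the frame
```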
The beamforming unit 130 multiplies the input acoustic signal x(ω) = (x1(ω), x2(ω), . . . , xM(ω))^T obtained through conversion by the frequency analysis unit 12 by beamforming matrices. Multiplying the input acoustic signal x(ω) by one beamforming matrix generates acoustic signals of two channels.
“Beamforming” refers to signal processing for controlling directionality using a microphone array, as described in NPL 1. Signals from specific directions are enhanced or reduced by generating interference between signals whose phase or amplitude is controlled by delays and filters, exploiting the fact that sound wave propagation from the sound source differs from microphone to microphone. In the present invention, “fixed beamforming” is applied; specifically, a “filter-and-sum beamformer” that changes the relationship between frequency and directionality with a filter is used.
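As a minimal sketch of the underlying idea (not the patent's specific filters), a frequency-domain delay-and-sum beamformer for an assumed planar array can be written as follows; the array coordinates, speed of sound and normalization are our assumptions.

```python
import numpy as np

C = 343.0  # speed of sound [m/s] (assumed value)

def steering_weights(omega: float, mic_xy: np.ndarray, theta: float) -> np.ndarray:
    """Weights that phase-align the M microphones toward direction theta.

    omega: angular frequency [rad/s]; mic_xy: (M, 2) microphone positions [m];
    theta: look direction [rad]. Returns complex weights of shape (M,).
    """
    direction = np.array([np.cos(theta), np.sin(theta)])
    delays = mic_xy @ direction / C            # relative propagation delay per mic
    return np.exp(1j * omega * delays) / len(mic_xy)

def beamform_bin(omega: float, x: np.ndarray, mic_xy: np.ndarray, theta: float) -> complex:
    """Enhance the component arriving from theta by coherent summation."""
    w = steering_weights(omega, mic_xy, theta)
    return np.conj(w) @ x                      # x: (M,) complex spectra at omega
```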
The beamforming unit 130 uses beamforming to set virtual microphones whose sound collection directions differ from the sound collection directions of the microphones that are actually disposed.
Note that, in the following description, sound collection direction is used synonymously with directionality. Namely, changing the sound collection direction by beamforming also includes changing only the beam width, while maintaining the same center direction of the beam that is collected by the microphone. Changing the sound collection direction by beamforming also includes changing only the direction of sound collection (center direction of beam) without changing the beam width. Furthermore, changing the sound collection direction by beamforming also includes changing both the beam width and the center direction of the beam.
P number of virtual microphones can be set independently of the M number of microphones that are actually being used. In the present embodiment, P is an integer of two or more, and two virtual microphones positioned adjacent to each other constitute one set.
Note that, in the following description, the positions of the virtual microphones in beamforming are the same as the positions of the M microphones that are actually disposed, and only the directions of the virtual microphones are changed from the sound collection directions of the actual microphones. Accordingly, in the example described below, the P number of virtual microphones is the same as the M number of microphones that are actually disposed. Namely, in this example, P=M=6. Also, in the present embodiment, two adjacent virtual microphones constitute one set. Accordingly, in this example, the N number of sets of virtual microphones is the same as the P number of virtual microphones, namely, N=P=M=6. Note that, in the case where the virtual microphones are disposed linearly rather than in a circle, five sets are configured for the six virtual microphones.
The beamforming unit 130 generates a two-channel acoustic signal yn(ω) of an n-th set of virtual microphones (n being an integer from 1 to N) by the following formula.
yn(ω) = Bn(ω, bn) · x(ω) (1)
x(ω) is the input acoustic signal described above, and Bn(ω, bn) is the n-th beamforming matrix, which is a 2×M matrix.
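The explicit form of the matrix is not reproduced in this text. As a hedged illustration consistent with a filter-and-sum beamformer, Bn could be written with one row of filter coefficients per channel (virtual microphone) of the n-th set; the coefficient symbols w below are ours, not the patent's:

```latex
B_n(\omega, b_n) =
\begin{pmatrix}
w^{L}_{n,1}(\omega) & w^{L}_{n,2}(\omega) & \cdots & w^{L}_{n,M}(\omega) \\
w^{R}_{n,1}(\omega) & w^{R}_{n,2}(\omega) & \cdots & w^{R}_{n,M}(\omega)
\end{pmatrix}
```

Under this reading, each row weights the M input spectra so that yn(ω) = Bn(ω, bn)·x(ω) yields the two-channel acoustic signal of the set.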
The synthesized acoustic signal generation unit 14 synthesizes the acoustic signals of all sets of virtual microphones per channel to generate an acoustic signal y(ω) for every channel. y(ω) is represented by the following formula.
y(ω) = Σ_{n=1}^{N} yn(ω) (2)
The acoustic signal y(ω) for every channel is output to a set of speakers.
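A minimal sketch combining equations (1) and (2) at a single frequency bin, assuming the N beamforming matrices are precomputed and stacked into one array, might read:

```python
import numpy as np

def mix_bin(B: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply equation (1) per set and equation (2) across sets at one frequency bin.

    B: (N, 2, M) array holding the N beamforming matrices Bn.
    x: (M,) input acoustic signal x(omega).
    Returns y(omega): (2,) synthesized two-channel acoustic signal.
    """
    y_n = B @ x              # equation (1): (N, 2) two-channel outputs per set
    return y_n.sum(axis=0)   # equation (2): per-channel sum over all N sets
```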
The description in the aforementioned embodiment focused on beamforming. A scaling unit 131, a shift unit 132 and a masking unit 133 can also be further provided in addition to the beamforming unit 130. These functional constituent units are also described in detail in PTL 2.
The scaling unit 131 performs, for every set of virtual microphones, multiplication with a scaling matrix (scaling coefficient) that is the scaling ratio of the sound field between the virtual microphones, together with the beamforming unit 130. The scaling matrix is determined from the display range of the video image that appears on the display 103 and the disposition interval of the virtual microphones that is based on the beamforming.
Kn(ω, κn): scaling matrix for enlarging or reducing the sound field
κn: scaling coefficient (0 to 2) for controlling the sound field range
κn=1: no change, κn<1: reduced, κn>1: enlarged
ϕ(ω): principal value of the phase difference between the two acoustic signals (real number where −π<ϕ(ω)≤π)
For example, if the user performs an operation to enlarge the center of the video image that is displayed on the display 103, more virtual microphones are concentrated near the center, and κn in the center is increased and κn on the left and right is reduced.
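As a hedged sketch of such stereo width control (our own formulation, not necessarily the matrix of PTL 2), the phase difference ϕ(ω) between the two channel spectra can be multiplied by κ while preserving magnitudes:

```python
import numpy as np

def scale_width(a: complex, b: complex, kappa: float):
    """Scale the sound field between two channel spectra at one frequency bin.

    The principal phase difference phi (-pi < phi <= pi) between a and b is
    multiplied by kappa while magnitudes are preserved (kappa = 1: no change,
    kappa < 1: reduced, kappa > 1: enlarged).
    """
    phi = np.angle(a * np.conj(b))   # principal value of the phase difference
    mean = np.angle(a * b) / 2.0     # shared phase reference (an assumption)
    a_out = abs(a) * np.exp(1j * (mean + kappa * phi / 2))
    b_out = abs(b) * np.exp(1j * (mean - kappa * phi / 2))
    return a_out, b_out
```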
The shift unit 132 performs, for every set of virtual microphones, multiplication with a shift matrix (shift coefficient) that is the shift amount of left-right movement between the virtual microphones, together with the beamforming unit 130. The shift matrix is determined from the display range of the video image that appears on the display 103 and the disposition interval of the virtual microphones that is based on the beamforming.
Tn(ω, τn): shift matrix for moving sound field left/right
τn: shift amount (−c≤τn≤c, c: time constant)
For example, if the user performs an operation to enlarge the center of the video image that is displayed on the display 103, more virtual microphones are concentrated near the center, and τn on the left side is set to a negative value for leftward movement and τn on the right side to a positive value for rightward movement, with τn in the center left unchanged.
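A time shift τ corresponds to a linear phase in the frequency domain. The following sketch applies opposite-signed phases to the two channels; the symmetric ±τ/2 split is our assumption:

```python
import numpy as np

def shift_pair(a: complex, b: complex, omega: float, tau: float):
    """Shift the sound image left or right by a time offset tau.

    Applies opposite-signed linear phases exp(-j*omega*tau/2) and
    exp(+j*omega*tau/2), i.e. a time shift of +/- tau/2 per channel.
    """
    return a * np.exp(-1j * omega * tau / 2), b * np.exp(1j * omega * tau / 2)
```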
The masking unit 133 performs, for every set of virtual microphones, multiplication with a masking matrix (attenuation coefficient) that is the attenuation of the sound field between the virtual microphones, together with the beamforming unit 130. The masking matrix is determined from the display range of the video image that appears on the display 103 and the disposition interval of the virtual microphones that is based on the beamforming.
Mn(ω, mn(ω)) = diag(mn(ω), mn+1(ω))
Mn(ω, mn(ω)): masking matrix for realizing selective composition of sound fields between a plurality of channels
mn(ω): masking attenuation coefficient (0 to 1)
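Putting the pieces together, the per-set processing chain might be sketched as follows, reusing scale_width() and shift_pair() from the sketches above; the composition order of the matrices is our assumption:

```python
import numpy as np

def process_set(Bn: np.ndarray, x: np.ndarray, kappa: float,
                omega: float, tau: float, mA: float, mB: float) -> np.ndarray:
    """Process one set of virtual microphones: beamform, scale, shift, mask.

    Bn: (2, M) beamforming matrix of the set; x: (M,) input spectra at omega.
    Returns the masked two-channel output of the set at this frequency bin.
    """
    a, b = Bn @ x                         # equation (1): two-channel beamformer output
    a, b = scale_width(a, b, kappa)       # Kn: scaling of the sound field range
    a, b = shift_pair(a, b, omega, tau)   # Tn: left/right shift of the sound field
    return np.array([mA * a, mB * b])     # Mn = diag(mA, mB): masking attenuation
```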
Consider two virtual microphones A and B, and the following signals.
Input acoustic signal A of virtual microphone A
Input acoustic signal B of virtual microphone B
Output acoustic signal L of left channel
Output acoustic signal R of right channel
The settings of reference numeral 81 are as follows.
Masking attenuation coefficients: m1=1, m2=1
Shift amount: τ=0
Scaling coefficient: κ=1
In this case, matrices M and T do not change the input acoustic signals A and B, and the output acoustic signals will be as follows.
Output acoustic signal R=input acoustic signal A
Output acoustic signal L=input acoustic signal B
Thus, when speakers are respectively placed at the positions of the virtual microphones A and B and driven by the acoustic signals R and L, the sound field range in the direction in which the virtual microphones A and B are disposed will be equivalent to the sound collection range of the virtual microphones A and B.
The position of the center dashed line where the sound sources C and D of reference numeral 81 are located is an intermediate position between the virtual microphones A and B. In this case, the positions of the sound images of the sound source C and the sound source D that serve as the output acoustic signals are the same as the disposition positions of the sound source C and the sound source D.
The settings of reference numeral 82 are as follows.
Masking attenuation coefficients: m1=1, m2=1
Shift amount: τ=0
Scaling coefficient: κ<1
Here, the sound field range of the scaling coefficient κ<1 will be shorter than the sound field range of κ=1. At this time, when the speakers disposed at the positions of the virtual microphones A and B are driven with the output acoustic signals R and L, the position of the sound image of the sound source C will be the same as the disposition position of the sound source C, that is, the center dashed line. However, the position of the sound image of the sound source D will be closer to the center dashed line than the disposition position of the sound source D. Conversely, the sound field range of the scaling coefficient κ>1 will be longer than the sound field range of κ=1.
As indicated by reference numerals 81 and 82, when τ=0, the matrix T has no effect on the input acoustic signals A and B. On the other hand, when τ≠0, the matrix T changes the phases of the input acoustic signals A and B by amounts having the same absolute value but opposite signs. Thus, the positions of the sound images shift in the direction of the virtual microphone A or B according to the value of τ. Note that the shift direction is determined according to whether τ is positive or negative, and the shift amount increases as the absolute value of τ increases.
Reference numeral 83 indicates the sound field range when τ≠0 is set after having set κ to obtain the sound field range of reference numeral 82. The positions of the sound images of the sound sources C and D shift further to the left than in the case of reference numeral 82.
Initially, it is determined whether at least one virtual microphone is included in the angle section of the visual-auditory range.
1st Set: set in which both virtual microphones are included in the angle section
2nd Set: set in which neither virtual microphone is included in the angle section
3rd Set: set in which one virtual microphone is included in the angle section and the other virtual microphone is not included in the angle section
L1: overlapping range from the position of the one virtual microphone to the angle section boundary
L2: non-overlapping range from the position of the other virtual microphone to the angle section boundary
If no virtual microphone is included in the angle section, the sets are classified as follows.
3rd Set: set in which the virtual microphones are closest to the angle section
2nd Set: sets of virtual microphones other than the above third set
With regard to the first set, τ=0, κ=1, mA=1 and mB=1, for example. That is, the sound field is not scaled, shifted or attenuated. On the other hand, with regard to the third set, κ and τ are configured such that the sound field range depends on the overlapping interval. That is, the scaling coefficient κ of the third set is configured based on the length L1 of the overlapping interval. Specifically, the scaling coefficient κ for the third set is determined to obtain a scaling ratio of L1/L, where L is the distance between the two virtual microphones of the third set. The scaling coefficient κ of the third set is thereby determined such that the sound field range is shortened as the length of the overlapping interval of the third set decreases. Also, the shift coefficient τ of the third set is configured such that the center position of the sound field approaches the center position of the overlapping interval. Thus, the shift coefficient of the third set is determined according to the distance between the center of the disposition positions of the two virtual microphones and the center of the overlapping interval.
Furthermore, the attenuation coefficients of the two virtual microphones of the third set are set to mA=1 and mB=1, for example. Alternatively, in the third set, the attenuation coefficient of the virtual microphone that is included in the angle section is set to the same value as the attenuation coefficients of the two virtual microphones of the first set, and the attenuation coefficient of the virtual microphone that is not included in the angle section is configured such that its attenuation is larger than the attenuation of the virtual microphone that is included in the angle section. Alternatively, the attenuation coefficient of the virtual microphone of the third set that is not included in the angle section is configured such that the attenuation increases as the length of the non-overlapping interval, that is, the shortest distance L2 from the disposition position of the microphone to the angle section of the visual-auditory range, increases.
With regard to the second set, τ=0 and κ=1, for example, similarly to the first set. Here, the attenuation coefficients of the two virtual microphones are set to values at which the attenuation will be greater than the attenuation coefficients set for the microphones of the first and third sets. For example, the attenuation coefficients of the two virtual microphones of the second set are set to a value at which the attenuation is maximized, that is, to 0 or a predetermined value close to 0.
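The classification and coefficient assignment described above could be sketched as follows; the function name, the near-zero floor EPS and the reduction of τ to its sign are our assumptions:

```python
def assign_coefficients(inside_a: bool, inside_b: bool, L: float, L1: float):
    """Return (kappa, tau_sign, mA, mB) for one set of two virtual microphones.

    inside_a/inside_b: whether each virtual mic lies in the user's angle section.
    L: distance between the two virtual mics; L1: length of overlapping interval.
    tau is reduced to its sign here; its magnitude would follow the distance
    between the set's center and the overlap center.
    """
    EPS = 0.01                                 # "predetermined value close to 0"
    if inside_a and inside_b:                  # 1st set: no scaling, shift or attenuation
        return 1.0, 0.0, 1.0, 1.0
    if inside_a or inside_b:                   # 3rd set: fit sound field to the overlap
        kappa = L1 / L                         # scaling ratio L1/L from overlap length
        tau_sign = -1.0 if inside_a else 1.0   # shift toward the overlap (sign only)
        mA = 1.0 if inside_a else EPS          # attenuate the excluded microphone more
        mB = 1.0 if inside_b else EPS
        return kappa, tau_sign, mA, mB
    return 1.0, 0.0, EPS, EPS                  # 2nd set: attenuation maximized
```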
In one example, only virtual microphone A is included in the angle section, and the sets are classified as follows.
3rd set: set of virtual microphones A and B
3rd set: set of virtual microphones A and C
2nd set: other sets
Here, since the attenuation on the virtual microphones of the second sets is high, hardly any of the acoustic signals of these sets is included in the output acoustic signals R and L.
As described in detail above, according to an apparatus, program and method of the present invention, the collected sound signals of a plurality of microphones can be mixed, such that the user does not sense a divergence between the visual video image range and the auditory sound field range.
According to the present invention, the user can be presented with interactive viewing of a 360-degree moving image in which the sound images have high localization accuracy.
The present invention is also realizable with processing that involves supplying a program that realizes one or more of the functions of the above-described embodiment to a system or apparatus via a network or storage medium, and one or more processors in a computer of the system or apparatus reading out and executing the program. The present invention is also realizable by a circuit (e.g., ASIC) that realizes one or more of the functions.
The invention is not limited to the foregoing embodiments, and various variations/changes are possible within the spirit of the invention.
This application is a continuation of International Patent Application No. PCT/JP2021/005322 filed on Feb. 12, 2021, which claims priority to and the benefit of Japanese Patent Application No. 2020-025587 filed on Feb. 18, 2020, the entire disclosures of which are incorporated herein by reference.