Sound signal downmixing method, sound signal coding method, sound signal downmixing apparatus, sound signal coding apparatus, program and recording medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a U.S. National Stage Application filed under 35 U.S.C. § 371 claiming priority to International Patent Application No. PCT/JP2021/004642, filed on 8 Feb. 2021, which application claims priority to and the benefit of International Patent Application No. PCT/JP2020/010080, filed on 9 Mar. 2020; International Patent Application No. PCT/JP2020/010081, filed on 9 Mar. 2020; and International Patent Application No. PCT/JP2020/041216, filed on 4 Nov. 2020, the disclosures of which are hereby incorporated herein by reference in their entireties.

TECHNICAL FIELD

The present disclosure relates to a technique of obtaining a monaural sound signal from a plurality of channel sound signals for the purpose of monaural coding of a sound signal, coding of a sound signal by a combination of monaural coding and stereo coding, monaural signal processing of a sound signal, and signal processing of a stereo sound signal using a monaural sound signal.

BACKGROUND ART

A technique of obtaining a monaural sound signal from a 2-channel sound signal and performing embedded coding/decoding of the 2-channel sound signal and the monaural sound signal is disclosed in PTL 1. PTL 1 discloses a technique in which a monaural signal is obtained by averaging an input left-channel sound signal and an input right-channel sound signal for each corresponding sample, a monaural code is obtained by coding (monaural coding) the monaural signal, a monaural local decoding signal is obtained by decoding (monaural decoding) the monaural code, and the difference (predictive residual signal) of a predictive signal obtained from the monaural local decoding signal and the input sound signal is coded for each of the left channel and the right channel. In the technique disclosed in PTL 1, for each channel, the degradation of the sound quality of the decoding sound signal of each channel is suppressed by selecting a predictive signal, which is set as a signal provided with an amplitude ratio by delaying the monaural local decoding signal, with a delay and an amplitude ratio achieving a minimum error between the input sound signal and the predictive signal, or by subtracting a predictive signal from the input sound signal by using a predictive signal with a delay and an amplitude ratio that maximizes the mutual correlation between the input sound signal and the monaural local decoding signal, so as to obtain a predictive residual signal to be subjected to coding/decoding.

CITATION LIST
Patent Literature

PTL 1 WO2006/070751

SUMMARY OF THE INVENTION
Technical Problem

In the technique disclosed in PTL 1, the coding efficiency for each channel can be increased by optimizing the delay and the amplitude ratio given to the monaural local decoding signal when obtaining the predictive signal. In the technique disclosed in PTL 1, however, the monaural local decoding signal is obtained by coding/decoding the monaural signal obtained by averaging the left-channel sound signal and the right-channel sound signal. That is, the technique disclosed in PTL 1 is disadvantageous in that no contrivance is made to obtain a monaural signal useful for signal processing such as coding processing from a sound signal of a plurality of channels. An object of the present disclosure is to provide a technique for obtaining a monaural signal useful for signal processing such as coding processing from a sound signal of a plurality of channels.

Means for Solving the Problem

A sound signal downmix method according to an aspect of the present disclosure is a method of obtaining a downmix signal that is a monaural sound signal from input sound signals of N channels, N being an integer of three or greater, the sound signal downmix method including an inter-channel relationship information obtaining step of obtaining an inter-channel correlation value and preceding channel information of every pair of two channels included in the N channels, the inter-channel correlation value being a value indicating a degree of a correlation between input sound signals of the two channels, the preceding channel information being information indicating which of the input sound signals of the two channels is preceding, and a downmix step of obtaining the downmix signal by weighting and adding the input sound signals of the N channels, the input sound signal of each channel being weighted based on the inter-channel correlation value and the preceding channel information such that the larger a correlation with an input sound signal of a preceding channel that precedes the channel, the smaller a weight, whereas the larger a correlation with an input sound signal of a succeeding channel that succeeds the channel, the larger the weight, in which the inter-channel relationship information obtaining step includes a channel sorting step of sequentially performing sorting in an order from a first channel such that an adjacent channel is a channel with a most similar input sound signal among remaining channels, and obtaining a first sorted input sound signal to an Nth sorted input sound signal that are signals after the sorting of the N channels, and first original channel information to Nth original channel information of the N channels for the sorted input sound signals, the first original channel information to the Nth original channel information being channel numbers of the N channels for the input sound signals, an inter-adjacent-channel relationship information estimation step of obtaining an inter-channel correlation value and an inter-channel time difference of every pair of two channels after the sorting with adjacent channel numbers after the sorting among the first to Nth sorted input sound signals, and an inter-channel relationship information complement step including obtaining an inter-channel correlation value of every pair of two channels after the sorting with non-adjacent channel numbers after the sorting from the inter-channel correlation value of every pair of two channels after the sorting with adjacent channel numbers after the sorting, obtaining the inter-channel correlation value between the input sound signals of every pair of two channels included in the N channels by associating the inter-channel correlation value of every pair of channels after the sorting with a pair of channels for the input sound signals of the N channels by using the original channel information, obtaining an inter-channel time difference of every pair of two channels after the sorting with non-adjacent channel numbers after the sorting from the inter-channel time difference of every pair of two channels after the sorting with adjacent channel numbers after the sorting, and obtaining preceding channel information of every pair of two channels included in the N channels by establishing an association with a pair of channels for the input sound signals of the N channels by using the original channel information from the inter-channel time difference of every pair of channels after the sorting, and obtaining the preceding channel information based on whether the inter-channel time difference is positive, negative or zero, two channel numbers of every pair of two channels after the sorting with adjacent channel numbers after the sorting are denoted as i and i+1, i being an integer from 1 to N−1, the inter-channel correlation value of every pair of two channels after the sorting with adjacent channel numbers after the sorting is denoted as γ′_i(i+1), the inter-channel time difference of every pair of two channels after the sorting with adjacent channel numbers after the sorting is denoted as τ′_i(i+1), two channel numbers of every pair of two channels after the sorting with non-adjacent channel numbers after the sorting are denoted as n and m, n being an integer from 1 to N−2, m being an integer from n+2 to N, the inter-channel correlation value of every pair of two channels after the sorting with non-adjacent channel numbers after the sorting is denoted as γ′_nm, and the inter-channel time difference of every pair of two channels after the sorting with non-adjacent channel numbers after the sorting is denoted as τ′_nm, the inter-channel correlation value γ′_nmof every pair of two channels after the sorting with non-adjacent channel numbers after the sorting is a product or a geometric mean of all of one or more of the inter-channel correlation values γ′_i(i+1)including a minimum value of the inter-channel correlation values γ′_i(i+1)of the pairs of two channels with adjacent channel numbers after the sorting, i of the inter-channel correlation values γ′_i(i+1)being from n to m−1, and the inter-channel time difference τ′_nmof every pair of two channels after the sorting with non-adjacent channel numbers after the sorting is a value obtained by adding up all of the inter-channel time differences τ′_i(i+1)of the pairs of two channels with adjacent channel numbers after the sorting, i of the inter-channel time differences τ′_i(i+1)being from n to m−1.

A sound signal coding method according to an aspect of the present disclosure includes the sound signal downmix method as a sound signal downmix step, a monaural coding step of obtaining a monaural code by coding the downmix signal obtained in the downmix step, and a stereo coding step of obtaining a stereo code by coding the input sound signals of the N channels.

Effects of the Invention

According to the present disclosure, it is possible to obtain a monaural signal useful for signal processing such as coding processing from a sound signal of a plurality of channels.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram illustrating a sound signal downmix apparatus of a first example of a first embodiment.

FIG. 2 is a flowchart of processing of the sound signal downmix apparatus of the first example of the first embodiment.

FIG. 3 is a block diagram illustrating an example of a sound signal downmix apparatus of a second example of the first embodiment.

FIG. 4 is a flowchart of an example of processing of the sound signal downmix apparatus of the second example of the first embodiment.

FIG. 5 is a block diagram illustrating an example of a sound signal downmix apparatus of a first example of a second embodiment and a first example of a third embodiment.

FIG. 6 is a flowchart of an example of processing of the sound signal downmix apparatus of the first example of the second embodiment and the first example of the third embodiment.

FIG. 7 is a block diagram illustrating an example of a sound signal downmix apparatus of a second example of the second embodiment and a second example of the third embodiment.

FIG. 8 is a flowchart of an example of processing of the sound signal downmix apparatus of the second example of the second embodiment and the second example of the third embodiment.

FIG. 9 is a diagram schematically illustrating a 6-channel input sound signal input to a sound signal downmix apparatus.

FIG. 10 is a diagram schematically illustrating a 6-channel input sound signal input to a sound signal downmix apparatus.

FIG. 11 is a block diagram illustrating an example of an inter-channel relationship information estimation unit of the third embodiment.

FIG. 12 is a flowchart of an example of processing of the inter-channel relationship information estimation unit of the third embodiment.

FIG. 13 is a block diagram illustrating an example of a sound signal coding apparatus of a fourth embodiment.

FIG. 14 is a flowchart of an example of processing of the sound signal coding apparatus of the fourth embodiment.

FIG. 15 is a block diagram illustrating an example of a sound signal processing apparatus of a fifth embodiment.

FIG. 16 is a flowchart of an example of processing of the sound signal processing apparatus of the fifth embodiment.

FIG. 17 is a diagram illustrating an example of a functional configuration of a computer that implements apparatuses of the embodiments of the present disclosure.

DESCRIPTION OF EMBODIMENTS
First Embodiment

A 2-channel sound signal that is the target of signal processing such as coding processing is often a digital sound signal obtained through an AD conversion of sounds picked up by a left-channel microphone and a right-channel microphone disposed in a certain space. In this case, a left-channel input sound signal, which is a digital sound signal obtained through an AD conversion of a sound picked up by the left-channel microphone disposed in the space, and a right-channel input sound signal, which is a digital sound signal obtained through an AD conversion of a sound picked up by the right-channel microphone disposed in the space, are input to an apparatus for performing signal processing such as coding processing. The left-channel input sound signal and right-channel input sound signal each include the sound output by each sound source in the space with a given difference (so-called arrival time difference) between the arrival time at the left-channel microphone from the sound source and the arrival time at the right-channel microphone from the sound source.

In the above-described technique disclosed in PTL 1, a predictive residual signal is obtained by subtracting, from an input sound signal, a predictive signal, which is a monaural local decoding signal provided with a delay and an amplitude ratio, and the predictive residual signal is subjected to coding/decoding. That is, for each channel, the higher the similarity between the input sound signal and the monaural local decoding signal, the higher the efficiency of the coding. However, for example, in the case where only a sound output by one sound source in a certain space is included in the left-channel input sound signal and the right-channel input sound signal with a given arrival time difference, and the monaural local decoding signal is a signal obtained by coding/decoding a monaural signal obtained by averaging the left-channel sound signal and the right-channel sound signal, the similarity of the left-channel sound signal and the monaural local decoding signal is not significantly high, and the similarity of the right-channel sound signal and the monaural local decoding signal is also not significantly high, even though the left-channel sound signal, the right-channel sound signal, and the monaural local decoding signal each include only a sound output by the same single sound source. In this manner, when a monaural signal is obtained by only averaging the left-channel sound signal and the right-channel sound signal, a monaural signal useful for signal processing such as coding processing cannot be obtained in some situation.

In view of this, a sound signal downmix apparatus of a first embodiment performs downmix processing that takes into account the relationship between the left-channel input sound signal and the right-channel input sound signal so that a monaural signal useful for signal processing such as coding processing can be obtained. The sound signal downmix apparatus of the first embodiment will be described below.

First Example

First, a sound signal downmix apparatus of a first example of the first embodiment will be described. As illustrated in FIG. 1, a sound signal downmix apparatus 401 of the first example includes a left-right relationship information estimation unit 183 and a downmix unit 112. The sound signal downmix apparatus 401 obtains a downmix signal described later from an input 2-channel stereo time-domain sound signal in a frame unit of a predetermined time length of, for example, 20 ms and outputs the downmix signal. A 2-channel stereo time-domain sound signal input to the sound signal downmix apparatus 401 is, for example, a digital sound signal obtained through an AD conversion of a sound such as a voice and music picked up by each of two microphones, a digital decoded sound signal obtained by coding/decoding the digital sound signal, and a digital signal processed sound signal obtained through signal processing of the digital sound signal. The 2-channel stereo time-domain sound signal is composed of a left-channel input sound signal and a right-channel input sound signal. A downmix signal, which is a time-domain monaural sound signal obtained by the sound signal downmix apparatus 401, is input to a coding apparatus that performs coding of at least the downmix signal and a signal processing apparatus that performs signal processing of at least the downmix signal. When the number of samples per frame is T, left-channel input sound signals x_L(1), x_L(2) . . . , x_L(t) and right-channel input sound signals x_R(1), x_R(2) . . . , x_R(t) are input to the sound signal downmix apparatus 401 in a frame unit, and the sound signal downmix apparatus 401 obtains and outputs downmix signals x_M(1), x_M(2) . . . , x_M(T) in a frame unit. Here, T is a positive integer, and for example, when the frame length is 20 ms and the sampling frequency is 32 kHz, T is 640. The sound signal downmix apparatus 401 of the first example performs processing of step S183 and step S112 exemplified in FIG. 2 for each frame.

Left-Right Relationship Information Estimation Unit 183

A left-channel input sound signal input to the sound signal downmix apparatus 401 and a right-channel input sound signal input to the sound signal downmix apparatus 401 are input to the left-right relationship information estimation unit 183. The left-right relationship information estimation unit 183 obtains a left-right correlation value γ and preceding channel information from the left-channel input sound signal and the right-channel input sound signal and outputs the left-right correlation value γ and the preceding channel information (step S183).

The preceding channel information is information representing whether a sound output by a main sound source in a certain space has arrived first at the left-channel microphone disposed in the space or the right-channel microphone disposed in the space. That is, the preceding channel information is information indicating whether the same sound signal is included first in the left-channel input sound signal or the right-channel input sound signal. When the case where the same sound signal is included first in the left-channel input sound signal is referred to as “the left channel is preceding” or “the right channel is succeeding” and the case where the same sound signal is included first in the right-channel input sound signal is referred to as “the right channel is preceding” or “the left channel is succeeding”, the preceding channel information is information indicating which of the left channel and the right channel is preceding. The left-right correlation value γ is a correlation value that takes into account the time difference between the left-channel input sound signal and the right-channel input sound signal. That is, the left-right correlation value γ is a value indicating the degree of the correlation between the sample sequence of the input sound signal of the preceding channel and the sample sequence of the input sound signal of the succeeding channel shifted backward by i samples relative to the sample sequence of the preceding channel. In the following description, i is also referred to as a left-right time difference. The preceding channel information and the left-right correlation value γ are information indicating the relationship between the left-channel input sound signal and the right-channel input sound signal, and therefore can be referred to as left-right relationship information.

A case where, for example, the absolute value of a correlation coefficient is used as a value indicating the degree of the correlation will be described. For each candidate number of samples τ_candfrom τ_maxto τ_minset in advance (for example, τ_maxis a positive number and τ_minis a negative number), the left-right relationship information estimation unit 183 obtains and outputs, as the left-right correlation value γ, a maximum value of an absolute value γ_candof the correlation coefficient between the sample sequence of the left-channel input sound signal and the sample sequence of the right-channel input sound signal shifted backward relative to the sample sequence of the left-channel input sound signal by the candidate number of samples τ_cand, obtains and outputs information indicating that the left channel is preceding as the preceding channel information in the case where τ_candwhen the absolute value of the correlation coefficient is a maximum value is a positive value, and obtains and outputs information indicating that the right channel is preceding as the preceding channel information in the case where τ_candwhen the absolute value of the correlation coefficient is a maximum value is a negative value. In the case where τ_candwhen the absolute value of the correlation coefficient is a maximum value is zero, the left-right relationship information estimation unit 183 may obtain and output information indicating that the left channel is preceding as the preceding channel information or obtain and output information indicating that the right channel is preceding as the preceding channel information, while it is preferable to obtain and output information indicating that no channel is preceding as the preceding channel information.

Each candidate number of samples set in advance may be an integer value from τ_maxto τ_min, may include fractions and decimals between τ_maxand τ_min, and may not include any of integer values between τ_maxand τ_min. In addition, τ_maxmay or may not be equal to −τ_min. When it is assumed that an input sound signal whose preceding channel is unknown is targeted, it is preferable that τ_maxbe a positive number and that τ_minbe a negative number. When a special input sound signal in which any of channels is necessarily preceding is targeted, both τ_maxand τ_minmay be positive numbers, or negative numbers. To calculate the absolute value γ_candof the correlation coefficient, one or more samples of a past input sound signal continuous to the sample sequence of the input sound signal of the current frame may also be used. In this case, it suffices to store the sample sequences of the input sound signals in a predetermined number of past frames in a storage unit not illustrated in the drawing in the left-right relationship information estimation unit 183.

In addition, for example, instead of the absolute value of the correlation coefficient, a correlation value using information about a phase of a signal may be set as γ_candas follows. In this example, the left-right relationship information estimation unit 183 first obtains frequency spectra X_L(k) and X_R(k) at each frequency k of 0 to T−1 by performing Fourier transform on each of the left-channel input sound signals x_L(1), x_L(2) . . . , x_L(t) and the right-channel input sound signals x_R(1), x_R(2) . . . , x_R(t) as in the following Equation (1-1) and Equation (1-2).

$[Math . 1]$

$\begin{matrix} X_{L} (k) = \frac{1}{\sqrt{T}} \sum_{t = 0}^{T - 1} x_{L} (t + 1) e^{- j \frac{2 π kt}{T}} & (1 - 1) \end{matrix}$

$[Math . 2]$

$\begin{matrix} X_{R} (k) = \frac{1}{\sqrt{T}} x_{R} (t + 1) e^{- j \frac{2 π kt}{T}} & (1 - 2) \end{matrix}$

Next, the left-right relationship information estimation unit 183 obtains a phase difference spectrum φ(k) at each frequency k through the following Equation (1-3) by using the frequency spectra X_L(k) and X_R(k) at each frequency k obtained through Equation (1-1) and Equation (1-2).

$[Math . 3]$

$\begin{matrix} ϕ (k) = \frac{X_{L} (k) / ❘ X_{L} (k) ❘}{X_{R} (k) / ❘ X_{R} (k) ❘} & (1 - 3) \end{matrix}$

Next, the left-right relationship information estimation unit 183 obtains a phase difference signal ψ(τ_cand) for each candidate number of samples τ_candfrom τ_maxto τ_minas in the following Equation (1-4) by performing inverse Fourier transform on the phase difference spectrum obtained through Equation (1-3).

$[Math . 4]$

$\begin{matrix} ψ (τ_{cand}) = \frac{1}{\sqrt{T}} \sum_{k = 0}^{T - 1} ϕ (k) e^{j \frac{2 π k τ_{cand}}{T}} & (1 - 4) \end{matrix}$

The absolute value of the phase difference signal ψ(τ_cand) obtained through Equation (1-4) represents some kind of correlation corresponding to the plausibility of the time difference between the left-channel input sound signals x_L(1), x_L(2) . . . , x_L(t) and the right-channel input sound signals x_R(1), x_R(2) . . . , x_R(t), and therefore the left-right relationship information estimation unit 183 uses, as a correlation value γ_cand, the absolute value of the phase difference signal ψ(τ_cand) for each candidate number of samples τ_cand. Specifically, the left-right relationship information estimation unit 183 obtains and outputs a maximum value of the correlation value γ_candthat is the absolute value of the phase difference signal ψ(τ_cand) as the left-right correlation value γ, obtains and outputs information indicating that the left channel is preceding as the preceding channel information in the case where τ_candwhen the correlation value is a maximum value is a positive value, and obtains and outputs information indicating that the right channel is preceding as the preceding channel information in the case where τ_candwhen the correlation value is a maximum value is a negative value. In the case where τ_candwhen the correlation value is a maximum value is zero, the left-right relationship information estimation unit 183 may obtain and output information indicating that the left channel is preceding as the preceding channel information, and may obtain and output information indicating that the right channel is preceding as the preceding channel information, while it is preferable to obtain and output information indicating that no channel is preceding as the preceding channel information. Note that instead of using as it is the absolute value of the phase difference signal ψ(τ_cand) as the correlation value γ_cand, the left-right relationship information estimation unit 183 may use a normalized value such as a relative difference between the average of the absolute values of phase difference signals obtained for a plurality of candidate numbers of samples before and after τ_candand the absolute value of the phase difference signal ψ(τ_cand) for each τ_cand, for example. That is, the left-right relationship information estimation unit 183 may use, as γ_cand, a normalized correlation value obtained by obtaining an average value through the following Equation (1-5) using the positive number τ_rangeset in advance for each τ_cand, and by using the following Equation (1-6) using the obtained average value ψ_c(τ_cand) and phase difference signal ψ(τ_cand).

$[Math . 5]$

$\begin{matrix} ψ_{c} (τ_{cand}) = \frac{1}{2 τ_{range} + 1} \sum_{τ^{'} = τ_{cand} - τ_{range}}^{τ_{cand} + τ_{range}} ❘ ψ (τ^{'}) ❘ & (1 - 5) \end{matrix}$

$[Math . 6]$

$\begin{matrix} 1 - \frac{ψ_{c} (τ_{cand})}{❘ ψ (τ_{cand}) ❘} & (1 - 6) \end{matrix}$

Note that the normalized correlation value obtained through Equation (1-6) is a value from 0 to 1, with a property in which the higher the plausibility of τ_candas the left-right time difference, the closer it is to 1, whereas the lower the plausibility of τ_candas the left-right time difference, the closer it is to 0.

Downmix Unit 112

The left-channel input sound signal input to the sound signal downmix apparatus 401, the right-channel input sound signal input to the sound signal downmix apparatus 401, the left-right correlation value γ output by the left-right relationship information estimation unit 183, and the preceding channel information output by the left-right relationship information estimation unit 183 are input to the downmix unit 112. The downmix unit 112 obtains a downmix signal by weighting and averaging the left-channel input sound signal and the right-channel input sound signal such that as the left-right correlation value γ becomes larger, the input sound signal of the preceding channel of the left-channel input sound signal and the right-channel input sound signal is more included in the downmix signal, and the downmix unit 112 outputs the downmix signal (step S112).

For example, in the case where the absolute value of the correlation coefficient and the normalized value are used for the correlation value as in the above-described example of the left-right relationship information estimation unit 183, the left-right correlation value γ input from the left-right relationship information estimation unit 183 is a value from 0 to 1. Therefore, the downmix unit 112 may obtain a downmix signal x_M(t) obtained by weighting and adding the left-channel input sound signal x_L(t) and the right-channel input sound signal x_R(t) by using the weight set by the left-right correlation value γ for each corresponding sample number t. To be more specific, the downmix unit 112 may obtain the downmix signal x_M(t) as x_M(t)=((1+γ)/2)×x_L(t)+((1−γ)/2)×x_R(t) in the case where the preceding channel information is information indicating that the left channel is preceding, that is, in the case where the left channel is preceding, and the downmix unit 112 may obtain the downmix signal x_M(t) as x_M(t)=((1−γ)/2)×x_L(t)+((1+γ)/2)×x_R(t) in the case where the preceding channel information is information indicating that the right channel is preceding, that is, in the case where the right channel is preceding. When the downmix unit 112 obtains the downmix signal in the above-described manner, the smaller the left-right correlation value γ, that is, the smaller the correlation of the left-channel input sound signal and the right-channel input sound signal, the downmix signal is similar to a signal obtained by averaging the left-channel input sound signal and the right-channel input sound signal, whereas the larger the left-right correlation value γ, that is, the larger the correlation of the left-channel input sound signal and the right-channel input sound signal, the downmix signal is similar to the input sound signal of the preceding channel of the left-channel input sound signal and the right-channel input sound signal.

Note that in the case where no channel is preceding, it is preferable that the downmix unit 112 obtain and output the downmix signal by averaging the left-channel input sound signal and the right-channel input sound signal such that the left-channel input sound signal and the right-channel input sound signal are included in the downmix signal with the same weight. That is, in the case where the preceding channel information indicates that no channel is preceding, the downmix unit 112 preferably obtains, for each sample number t, the downmix signal x_M(t) as x_M(t)=(x_L(t)+x_R(t))/2 obtained by averaging the left-channel input sound signal x_L(t) and the right-channel input sound signal x_R(t).

Second Example

For example, in the case where an apparatus different from the sound signal downmix apparatus performs stereo coding processing of the left-channel input sound signal and the right-channel input sound signal, and in the case where the left-channel input sound signal and the right-channel input sound signal are signals obtained through the stereo decoding processing in an apparatus different from the sound signal downmix apparatus signal, either one or both of the preceding channel information and the left-right correlation value γ identical to that obtained by the left-right relationship information estimation unit 183 can possibly be obtained in the apparatus different from the sound signal downmix apparatus. In the case where either one or both of the left-right correlation value γ and the preceding channel information has been obtained in the different apparatus, either one or both of the left-right correlation value γ and the preceding channel information obtained in the different apparatus is input to the sound signal downmix apparatus, and the left-right relationship information estimation unit 183 obtains the left-right correlation value γ or the preceding channel information that has not been input to the sound signal downmix apparatus. Below, a second example, which is an example of the sound signal downmix apparatus on the assumption that either one or both of the left-right correlation value γ and the preceding channel information is input from the outside, will be described mainly about differences from the first example.

As illustrated in FIG. 3, a sound signal downmix apparatus 405 of the second example includes a left-right relationship information obtaining unit 185 and the downmix unit 112. As indicated by the dashed line in FIG. 3, either one or both of the left-right correlation value γ and the preceding channel information obtained by a different apparatus may be input to the sound signal downmix apparatus 405, in addition to the left-channel input sound signal and the right-channel input sound signal. The sound signal downmix apparatus 405 of the second example performs processing of step S185 and step S112 exemplified in FIG. 4 for each frame. The downmix unit 112 and step S112 are identical to those of the first example, and therefore the left-right relationship information obtaining unit 185 and step S185 will be described below.

Left-Right Relationship Information Obtaining Unit 185

The left-right relationship information obtaining unit 185 obtains and outputs the left-right correlation value γ, which is a value indicating the degree of the correlation of the left-channel input sound signal and the right-channel input sound signal, and the preceding channel information, which is information indicating which of the left-channel input sound signal and the right-channel input sound signal is preceding (step S185).

As indicated by the dashed line in FIG. 3, when both of the left-right correlation value γ and the preceding channel information are input to the sound signal downmix apparatus 405 from the different apparatus, the left-right relationship information obtaining unit 185 obtains the preceding channel information and the left-right correlation value γ input to the sound signal downmix apparatus 405, and outputs them to the downmix unit 112.

As indicated by the broken line in FIG. 3, in the case where one of the left-right correlation value γ and the preceding channel information is not input to the sound signal downmix apparatus 405 from the different apparatus, the left-right relationship information obtaining unit 185 includes the left-right relationship information estimation unit 183. The left-right relationship information estimation unit 183 of the left-right relationship information obtaining unit 185 obtains the left-right correlation value γ that is not input to the sound signal downmix apparatus 405 or the preceding channel information that is not input to the sound signal downmix apparatus 405 from the left-channel input sound signal and the right-channel input sound signal as with the left-right relationship information estimation unit 183 of the first example, and outputs them to the downmix unit 112. As indicated by the dashed line in FIG. 3, regarding the left-right correlation value γ input to the sound signal downmix apparatus 405 or the preceding channel information input to the sound signal downmix apparatus 405, the left-right relationship information obtaining unit 185 outputs, to the downmix unit 112, the left-right correlation value γ input to the sound signal downmix apparatus 405 or the preceding channel information input to the sound signal downmix apparatus 405.

As indicated by the broken line in FIG. 3, in the case where both of the left-right correlation value γ and the preceding channel information are not input to the sound signal downmix apparatus 405 from the different apparatus, the left-right relationship information obtaining unit 185 includes the left-right relationship information estimation unit 183. The left-right relationship information estimation unit 183 obtains the left-right correlation value γ and the preceding channel information from the left-channel input sound signal and the right-channel input sound signal as with the left-right relationship information estimation unit 183 of the first example, and outputs them to the downmix unit 112. That is, it can be said that the left-right relationship information estimation unit 183 and step S183 of the first example belong to the categories of the left-right relationship information obtaining unit 185 and step S185, respectively.

Second Embodiment

Even in the case where the number of channels is three or more, a monaural signal useful for signal processing such as coding processing can be obtained by setting the same relationship between the downmix signal and the input sound signal of each channel as that of the sound signal downmix apparatuses 401 and 405 of the first embodiment. This configuration will be described as a second embodiment.

The way of including the input sound signal of a certain channel in a downmix signal in the sound signal downmix apparatuses 401 and 405 of the first embodiment will be described below with the channel number of each of the left channel and the right channel set as n. The sound signal downmix apparatuses 401 and 405 of the first embodiment operate such that, for each nth channel, the larger the correlation of the input sound signal of a channel succeeding the nth channel and the input sound signal of the nth channel, the larger the weight of the input sound signal of the nth channel included in the downmix signal, whereas the larger the correlation of the input sound signal of a channel preceding the nth channel and the input sound signal of the nth channel, the smaller the weight of the input sound signal of the nth channel included in the downmix signal. The sound signal downmix apparatus of the second embodiment expands the above-described relationship between the input sound signal and the downmix signal, so as to support the case with a plurality of preceding channels, the case with a plurality of succeeding channels, and the case with both a preceding channel and a succeeding channel. The sound signal downmix apparatus of the second embodiment will be described below. Note that the sound signal downmix apparatus of the second embodiment is an apparatus that expands the sound signal downmix apparatus of the first embodiment so as to support the case where the number of channels is three or more, and operates in the same manner as that of the sound signal downmix apparatus of the first embodiment when the number of channels is two.

In the first embodiment, an example has been described in which the smaller the correlation of the input sound signals between channels, the similar the downmix signal obtained by the sound signal downmix apparatuses 401 and 405 is to a signal obtained by averaging all input sound signals. The above-described relationship between the input sound signal and the downmix signal can be achieved even when the number of channels is three or more, and therefore it is described as an example of the sound signal downmix apparatus of the second embodiment.

First Example

First, a sound signal downmix apparatus of a first example of the second embodiment will be described. As illustrated in FIG. 5, a sound signal downmix apparatus 406 of the first example includes an inter-channel relationship information estimation unit 186 and a downmix unit 116. The sound signal downmix apparatus 406 obtains a downmix signal described later from an input time-domain sound signal of N-channel stereo in a frame unit of a predetermined time length of, for example, 20 ms, and outputs the signal. The number of channels N is an integer of 2 or greater. It should be noted that the sound signal downmix apparatus of the second embodiment is especially useful for the case where N is an integer of three or greater because it suffices to use the sound signal downmix apparatus of the first embodiment in the case where the number of channels is two. Time-domain sound signals of the N channels are input to the sound signal downmix apparatus 406. Examples of such signals include a digital sound signal obtained through an AD conversion of a sound such as a voice and music picked up by each of N microphones, digital sound signals of the N channels, which are obtained by performing no processing or appropriately mixing these signals, a digital sound signal of one or more channels picked up at a plurality of points and subjected to an AD conversion, a digital decoded sound signal obtained by coding/decoding the above-described digital sound signals, and a digital signal processed sound signal obtained through signal processing of the above-described digital sound signals. A downmix signal that is a time-domain monaural sound signal obtained by the sound signal downmix apparatus 406 is input to a coding apparatus that performs coding of at least the downmix signal and a signal processing apparatus that performs signal processing of at least the downmix signal. The input sound signals of the N channels are input to the sound signal downmix apparatus 406 in a frame unit, and the sound signal downmix apparatus 406 obtains and outputs the downmix signal in a frame unit. In the following description, the number of samples per frame will be described as T. T is a positive integer, and for example, when the frame length is 20 ms and the sampling frequency is 32 kHz, T is 640. The sound signal downmix apparatus 406 of the first example performs the processing of step S186 and step S116 exemplified in FIG. 6 for each frame.

Inter-Channel Relationship Information Estimation Unit 186

The input sound signals of the N channels input to the sound signal downmix apparatus 406 are input to the inter-channel relationship information estimation unit 186. The inter-channel relationship information estimation unit 186 obtains an inter-channel correlation value and the preceding channel information from the input sound signals of the N channels input thereto and outputs the inter-channel correlation value and the preceding channel information (step S186). The inter-channel correlation value and the preceding channel information are information indicating the relationship between channels for the input sound signals of the N channels, and can be referred to as inter-channel relationship information.

The inter-channel correlation value is a value indicating the degree of the correlation for each pair of two channels included in the N channels in consideration of the time difference between input sound signals. (N×(N−1))/2 pairs of two channels are included in the N channels. In the case where n is an integer from 1 to N, m is an integer greater than n and equal to or smaller than N, and the inter-channel correlation value between the nth channel input sound signal and mth channel input sound signal is γ_nm, the inter-channel relationship information estimation unit 186 obtains the inter-channel correlation value γ_nmof each of (N×(N−1))/2 pairs of n and m.

The preceding channel information is information, for each pair of two channels included in the N channels, indicating which of the input sound signals of the two channels include the same sound signal first and thus indicating which of the two channels is preceding. In the case where the preceding channel information between the nth channel input sound signal and mth channel input sound signal is referred to as INFO_nm, the inter-channel relationship information estimation unit 186 obtains the preceding channel information INFO_nmof each of the above-described (N×(N−1))/2 pairs of n and m. Note that in the following description, for each pair of n and m, the case where the same sound signal is included in the nth channel input sound signal earlier than the mth channel input sound signal may be referred to as “the nth channel is preceding the mth channel”, “the nth channel precedes the mth channel”, “the mth channel is succeeding the nth channel”, “the mth channel succeeds the nth channel”, and the like. Likewise, in the following description, for each pair of n and m, the case where the same sound signal is included in the mth channel input sound signal earlier than the nth channel input sound signal may be referred to as “the mth channel is preceding the nth channel”, “the mth channel precedes the nth channel”, “the nth channel is succeeding the mth channel”, “the nth channel succeeds the mth channel”, and the like.

It suffices that the inter-channel relationship information estimation unit 186 obtains the inter-channel correlation value γ_nmand the preceding channel information INFO_nmas with the left-right relationship information estimation unit 183 of the first embodiment for each of the (N×(N−1))/2 pairs of the nth channel and the mth channel. Specifically, the inter-channel relationship information estimation unit 186 can obtain the inter-channel correlation value γ_nmand the preceding channel information INFO_nmof each pair of the nth channel and the mth channel by performing the same operation as that of each example of the left-right relationship information estimation unit 183 of the first embodiment for each of the (N×(N−1))/2 pairs of the nth channel and the mth channel. Here, in each example of the description of the left-right relationship information estimation unit 183 of the first embodiment, the left channel is read as the nth channel, the right channel is read as the mth channel, L is read as n, R is read as m, the preceding channel information is read as the preceding channel information INFO_nm, and the left-right correlation value γ is read as the inter-channel correlation value γ_nm, for example.

For example, the absolute value of a correlation coefficient is used as a value indicating the degree of the correlation. In such a case, for each candidate number of samples τ_candfrom τ_maxto τ_minset in advance for each of the (N×(N−1))/2 pairs of the nth channel and the mth channel, the inter-channel relationship information estimation unit 186 obtains and outputs, as an inter-channel correlation coefficient γ_nm, a maximum value of the absolute value γ_candof the correlation coefficient between the sample sequence of the nth channel input sound signal and the sample sequence of the mth channel input sound signal shifted backward relative to the sample sequence of the nth channel input sound signal by the candidate number of samples τ_cand, obtains and outputs information indicating that the nth channel is preceding as the preceding channel information INFO_nmin the case where τ_candwhen the absolute value of the correlation coefficient is a maximum value is a positive value, and obtains and outputs information indicating that the mth channel is preceding as the preceding channel information INFO_nmin the case where τ_candwhen the absolute value of the correlation coefficient is a maximum value is a negative value. In the case where τ_candwhen the absolute value of the correlation coefficient is a maximum value is zero, the inter-channel relationship information estimation unit 186 may obtain and output information indicating that the nth channel is preceding as the preceding channel information INFO_nm, or may obtain and output information indicating that the mth channel is preceding as the preceding channel information INFO_nm, for each pair of the nth channel and the mth channel. Note that τ_maxand τ_minare the same as those of the first embodiment.

In addition, for example, instead of the absolute value of the correlation coefficient, a correlation value using information about a phase of a signal may be set as γ_candas follows. In this example, first, the inter-channel relationship information estimation unit 186 obtains the frequency spectrum X_i(k) at each frequency k of 0 to T−1 by performing Fourier transform on input sound signals x_i(1), x_i(2) . . . , x_i(T) as in the following Equation (2-1) for each channel i from the first channel input sound signal to the Nth channel input sound signal.

$[Math . 7]$

$\begin{matrix} X_{i} (k) = \frac{1}{\sqrt{T}} \sum_{t = 0}^{T - 1} x_{i} (t + 1) e^{- j \frac{2 π kt}{T}} & (2 - 1) \end{matrix}$

Next, the inter-channel relationship information estimation unit 186 performs subsequent processing for each of the (N×(N−1))/2 pairs of the nth channel and the mth channel. First, the inter-channel relationship information estimation unit 186 obtains the phase difference spectrum (k) at each frequency k through the following Equation (2-2) by using the nth channel frequency spectrum X_n(k) and the mth channel frequency spectrum X_m(k) at each frequency k obtained through Equation (2-1).

$[Math . 8]$

$\begin{matrix} ϕ (k) = \frac{X_{n} (k) / ❘ X_{n} (k) ❘}{X_{m} (k) / ❘ X_{m} (k) ❘} & (2 - 2) \end{matrix}$

Next, the inter-channel relationship information estimation unit 186 obtains the phase difference signal ψ(τ_cand) for each candidate number of samples τ_candfrom τ_maxto Thun as in Equation (1-4) by performing inverse Fourier transform on the phase difference spectrum obtained through Equation (2-2). Next, the inter-channel relationship information estimation unit 186 obtains and outputs the maximum value of the correlation value γ_candthat is the absolute value of the phase difference signal ψ(τ_cand) as the inter-channel correlation value γ_nm, obtains and outputs information indicating that the nth channel is preceding as the preceding channel information INFO_nmin the case where τ_candwhen the correlation value is a maximum value is a positive value, and obtains and outputs information indicating that the mth channel is preceding as the preceding channel information INFO_nmin the case where τ_candwhen the correlation value is a maximum value is a negative value. In the case where τ_candwhen the correlation value is a maximum value is zero, the inter-channel relationship information estimation unit 186 may obtain and output information indicating that the nth channel is preceding as the preceding channel information INFO_nm, or information indicating that the mth channel is preceding as the preceding channel information INFO_nm.

Note that instead of using as it is the absolute value of the phase difference signal ψ(τ_cand) as the correlation value γ_cand, the inter-channel relationship information estimation unit 186, as with the left-right relationship information estimation unit 183, may use a normalized value such as a relative difference between the average of the absolute values of phase difference signals obtained for a plurality of candidate numbers of samples before and after τ_candand the absolute value of the phase difference signal ψ(τ_cand) for each τ_cand, for example. That is, the inter-channel relationship information estimation unit 186 may obtain an average value through Equation (1-5) by using the positive number τ_rangeset in advance for each τ_cand, and use, as γ_cand, a normalized correlation value obtained through Equation (1-6) by using the obtained average value ψ_c(τ_cand) and phase difference signal ψ(τ_cand).

Downmix Unit 116

The input sound signals of the N channels input to the sound signal downmix apparatus 406, the inter-channel correlation value γ_nmof each the (N×(N−1))/2 pairs of n and m (that is, the inter-channel correlation value of each pair of two channels included in the N channels) output by the inter-channel relationship information estimation unit 186, and the preceding channel information INFO_nmof each of the (N×(N−1))/2 pairs of n and m (that is, the preceding channel information of each pair of two channels included in the N channels) output by the inter-channel relationship information estimation unit 186 are input to the downmix unit 116. The downmix unit 116 weights the input sound signal of each channel such that the larger the correlation with the input sound signal of each channel that precedes the channel, the smaller the weight, whereas the larger the correlation with the input sound signal of each channel that succeeds the channel, the larger the weight, and thus obtains and outputs a downmix signal by weighting and adding the input sound signals of the N channels (step S116).

Specific Example 1 of Downmix Unit 116

A specific example 1 of the downmix unit 116 will be described below with the channel number of each channel (channel index) as i, input sound signals of the ith channel as x_i(1), x_i(2) . . . , x_i(T), and the downmix signals as x_M(1), x_M(2) . . . , x_M(T). Assume that in the specific example 1, the inter-channel correlation value is a value from 0 to 1 as with the absolute value of the correlation coefficient and the normalized value in the above-described example of the inter-channel relationship information estimation unit 186. In addition, here, M is not a channel number, but is a subscript indicating that a downmix signal is a monaural signal. The downmix unit 116 obtains a downmix signal by performing the processing of step S116-1 to step S116-3 described below, for example. First, for each ith channel, the downmix unit 116 obtains the set I_Liof the channel numbers of the channels preceding the ith channel and the set I_Fiof the channel numbers of the channels succeeding the ith channel from the preceding channel information of the (N−1) pairs of two channels including the ith channel of the preceding channel information INFO_nminput to the downmix unit 116 (step S116-1). Next, for each ith channel, the downmix unit 116 obtains a weight w_iof the ith channel through the following Equation (2-3) using the inter-channel correlation value of the (N−1) pairs of two channels including the ith channel of the inter-channel correlation value γ_nminput to the downmix unit 116, the set I_Liof the channel numbers of the channels preceding the ith channel, and the set I_Fiof the channel numbers of the channels succeeding the ith channel (step S116-2).

$[Math . 9]$

$\begin{matrix} w_{i} = \frac{1}{N} \prod_{j \in I_{Li}} (1 - γ_{ij}) (1 + \sum_{k \in I_{Fi}} γ_{ik}) & (2 - 3) \end{matrix}$

Note that for each pair of n and m described above, the inter-channel correlation value γ_nmis the same value as the inter-channel correlation value γ_nm, and therefore both an inter-channel correlation value γ_ijof the case where i is greater than j and an inter-channel correlation value γ_ikof the case where i is greater than k are included in the inter-channel correlation value γ_nminput to the downmix unit 116.

Next, the downmix unit 116 obtains the downmix signals x_M(1), x_M(2) . . . , x_M(T) by obtaining a downmix signal sample x_M(t) through the following Equation (2-4) for each sample number t (sample index t) by using the input sound signals x_i(1), x_i(2) . . . , x_i(T) of each ith channel whose i is from 1 to N, and the weight w_iof each ith channel whose i is from 1 to N (step S116-3).

$[Math 10]$

$\begin{matrix} x_{M} (t) = \sum_{i = 1}^{N} w_{i} \times x_{i} (t) & (2 - 4) \end{matrix}$

Note that the downmix unit 116 may obtain the downmix signal by using an equation in which the weight w_iof Equation (2-4) is replaced with the right side of Equation (2-3) instead of sequentially performing step S116-2 and step S116-3. Specifically, it suffices that the downmix unit 116 obtains each sample x_M(t) of the downmix signal through Equation (2-4) with the set of the channel numbers of the channels preceding each ith channel as I_Lithe set of the channel numbers of the channels succeeding each ith channel as I_Fi, the inter-channel correlation value of a pair of each ith channel and each channel j preceding the ith channel as γ_ij, the inter-channel correlation value of a pair of each ith channel and each channel k succeeding the ith channel as γ_ik, and the weight of each ith channel as w_iexpressed by Equation (2-3).

Equation (2-4) is an equation for obtaining a downmix signal by weighting and adding the input sound signals of the N channels, and Equation (2-3) is for obtaining the weight w_iof each ith channel given to the input sound signal of each ith channel in the weighted addition. The part of the following Equation (2-3-A) in Equation (2-3) sets the weight such that the larger the correlation between the input sound signal of the ith channel and the input sound signal of each channel preceding the ith channel, the smaller the value of the weight w_i, and that the weight w_iis set to a value close to zero when there is at least one channel with a significantly large correlation between the input sound signal of the ith channel and the input sound signal of the preceding channel in the channels preceding the ith channel.

$[Math . 11]$

$\begin{matrix} \prod_{j \in I_{Li}} (1 - γ_{ij}) & (2 - 3 - A) \end{matrix}$

The part of the following Equation (2-3-B) in Equation (2-3) sets the weight such that the larger the correlation with the input sound signal of each channel succeeding the ith channel, the more the weight w_ihas a value greater than 1.

$[Math . 12]$

$\begin{matrix} (1 + \sum_{k \in I_{Fi}} γ_{ik}) & (2 - 3 - B) \end{matrix}$

When the input sound signals of all channels are independent, i.e., when there is no correlation among the channels, it is desirable to set the simple additive average of the input sound signals of all the channels as the downmix signal. In view of this, in Equation (2-3), the weight w_iis obtained by multiplying Equation (2-3-A), Equation (2-3-B) and 1/N such that the maximum value of the part of Equation (2-3-A) is 1 and that the minimum value of the part of Equation (2-3-B) is 1. Thus, when all the correlations among channels have small values, the weight w_iof all channels is set to a value close to 1/N.

Specific Example 2 of Downmix Unit 116

Since at step S116-1 of the specific example 1, the sum of all channels of the weight w_iobtained by the downmix unit 116 is not 1 in some situation, the downmix unit 116 may obtain the downmix signal by using a value obtained by normalizing the weight w_iof each ith channel such that the sum of all channels of the weight is 1 instead of the weight w_iof Equation (2-4), or by using a transformed equation of Equation (2-4) including normalization of the weight w_isuch that the sum of all channels of the weight is 1. Differences of this example, referred to as a specific example 2 of the downmix unit 116, from the specific example 1 will be described below.

For example, the downmix unit 116 may obtain the downmix signals x_M(1), x_M(2) . . . , x_M(T) by obtaining the weight w_ifor each ith channel through Equation (2-3), obtaining a normalized weight w′i by normalizing the weight w_ifor each ith channel such that the sum of all channels is 1 (that is, obtaining the normalized weight w′i through the following Equation (2-5) for each ith channel), and obtaining the downmix signal sample x_M(t) through the following Equation (2-6) for each sample number t by using the input sound signals x_i(1), x_i(T) of each ith channel whose i is from 1 to N and the normalized weight w′_i.

$[Math . 13]$

$\begin{matrix} w_{i}^{'} = \frac{w_{i}}{\sum_{i = 1}^{N} w_{i}} & (2 - 5) \end{matrix}$

$[Math . 14]$

$\begin{matrix} x_{M} (t) = \sum_{i = 1}^{N} w_{i}^{'} \times x_{i} (t) & (2 - 6) \end{matrix}$

That is, it suffices that the downmix unit 116 obtains each sample x_M(t) of the downmix signal through Equation (2-6) with the set of the channel numbers of the channels preceding each ith channel as I_Li, the set of the channel numbers of the channels succeeding each ith channel as hi, the inter-channel correlation value of a pair of each ith channel and each channel j preceding the ith channel as γ_ij, the inter-channel correlation value of a pair of each ith channel and each channel k succeeding the ith channel as γ_ik, the weight of each ith channel as w_iexpressed by Equation (2-3), and the weight normalized for each ith channel as w′_iexpressed by Equation (2-5).

Second Example

For example, in the case where an apparatus different from the sound signal downmix apparatus performs the stereo coding processing on the input sound signals of the N channels, or the case where the input sound signals of the N channels are signals obtained through stereo decoding processing at an apparatus different from the sound signal downmix apparatus, any or all of the same inter-channel correlation value γ_nmand preceding channel information INFO_nmas those obtained by the inter-channel relationship information estimation unit 186 may possibly be obtained by the apparatus different from the sound signal downmix apparatus. It suffices that in the case where any or all of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmare obtained by the different apparatus, any or all of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmobtained by the different apparatus are input to the sound signal downmix apparatus, and the inter-channel relationship information estimation unit 186 obtains the inter-channel correlation value γ_nmand/or the preceding channel information INFO_nmthat has not been input to the sound signal downmix apparatus. Hereinafter, differences of a second example from the first example will be mainly described. The second example is an example of a sound signal downmix apparatus on the assumption that any or all of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmare input from the outside.

As illustrated in FIG. 7, a sound signal downmix apparatus 407 of the second example includes an inter-channel relationship information obtaining unit 187 and the downmix unit 116. As indicated by the dashed line in FIG. 7, in addition to the input sound signals of the N channels, any or all of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmobtained by a different apparatus may be input to the sound signal downmix apparatus 407. The sound signal downmix apparatus 407 of the second example performs the processing of step S187 and step S116 exemplified in FIG. 8 for each frame. The downmix unit 116 and step S116 are identical to those of the first example, and therefore the inter-channel relationship information obtaining unit 187 and step S187 will be described below.

Inter-Channel Relationship Information Obtaining Unit 187

The inter-channel relationship information obtaining unit 187 obtains and outputs the inter-channel correlation value γ_nm, which is a value indicating the degree of the correlation of each pair of two channels included in the N channels, and the preceding channel information INFO_nm, which is information indicating which of the input sound signals of two channels includes the same sound signal first, for each pair of two channels included in the N channels (step S187).

As indicated by the dashed line in FIG. 7, in the case where all of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmare input from the different apparatus to the sound signal downmix apparatus 407, the inter-channel relationship information obtaining unit 187 obtains the inter-channel correlation value γ_nmand the preceding channel information INFO_nminput to the sound signal downmix apparatus 407, and outputs them to the downmix unit 116.

As indicated by the broken line in FIG. 7, in the case where one of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmis not input to the sound signal downmix apparatus 407 from the different apparatus, the inter-channel relationship information obtaining unit 187 includes the inter-channel relationship information estimation unit 186. The inter-channel relationship information estimation unit 186 of the inter-channel relationship information obtaining unit 187, as with the inter-channel relationship information estimation unit 186 of the first example, obtains the inter-channel correlation value γ_nmthat is not input to the sound signal downmix apparatus 407 or the preceding channel information INFO_nmthat is not input to the sound signal downmix apparatus 407 from the input sound signals of the N channels, and outputs it to the downmix unit 116. As indicated by the dashed line in FIG. 7, regarding the inter-channel correlation value γ_nminput to the sound signal downmix apparatus 407 or the preceding channel information INFO_nminput to the sound signal downmix apparatus 407, the inter-channel relationship information obtaining unit 187 outputs, to the downmix unit 116, the inter-channel correlation value γ_nminput to the sound signal downmix apparatus 407 or the preceding channel information INFO_nminput to the sound signal downmix apparatus 407.

As indicated by the broken line in FIG. 7, in the case where all of the inter-channel correlation value γ_nmand the preceding channel information INFO_nmare not input to the sound signal downmix apparatus 407 from the different apparatus, the inter-channel relationship information obtaining unit 187 includes the inter-channel relationship information estimation unit 186. The inter-channel relationship information estimation unit 186, as with the inter-channel relationship information estimation unit 186 of the first example, obtains the inter-channel correlation value γ_nmand the preceding channel information INFO_nmfrom the input sound signals of the N channels, and outputs them to the downmix unit 116. That is, it can be said that the inter-channel relationship information estimation unit 186 and step S186 of the first example belong to the categories of the inter-channel relationship information obtaining unit 187 and step S187, respectively.

Note that there may be a case where a part of the inter-channel correlation value γ_nmis obtained by the different apparatus while the remaining part of the inter-channel correlation value γ_nmis not obtained by the different apparatus, a case where a part of the preceding channel information INFO_nmis obtained by the different apparatus while the remaining part of the preceding channel information INFO_nmis not obtained by the different apparatus, and the like. In such cases, it suffices to include the inter-channel relationship information estimation unit 186 in the inter-channel relationship information obtaining unit 187 such that, as described above, the inter-channel relationship information obtaining unit 187 outputs one obtained by the different apparatus and input to the sound signal downmix apparatus 407 to the downmix unit 116, and that the inter-channel relationship information estimation unit 186 obtains, from the input sound signals of the N channels, one that is not obtained by the different apparatus and not input to the sound signal downmix apparatus 407, and outputs it to the downmix unit 116, as with the inter-channel relationship information estimation unit 186 of the first example.

Third Embodiment

The inter-channel relationship information estimation unit 186 of the second embodiment obtains the inter-channel correlation value γ_nmand the preceding channel information INFO_nmfor each pair of two channels included in the N channels. There are (N×(N−1))/2 pairs of two channels included in the N channels, and as such, in the case where the inter-channel correlation value γ_nmand the preceding channel information INFO_nmare obtained by the method exemplified in the description of the inter-channel relationship information estimation unit 186 of the second embodiment, the amount of arithmetic processing can become an issue when the number of channels is large. The third embodiment describes a sound signal downmix apparatus performing inter-channel relationship information estimation processing of obtaining the inter-channel correlation value γ_nmand the preceding channel information INFO_nmin an approximate manner by a method with a smaller amount of arithmetic processing than the inter-channel relationship information estimation unit 186. The downmix processing of the third embodiment is the same as that of the second embodiment.

The downmix processing performed by the downmix unit 116 of the second embodiment is processing in which, for example, when only the same sound output by a certain sound source with a given time difference is included in each of signals of a plurality of channels, one of the input sound signals of the plurality of channels including the same sound output at the earliest timing is included in the downmix signal. This processing will be described with an example in which input sound signals of six channels from a first channel (1ch) to a sixth channel (6ch) are those schematically illustrated in FIG. 9. In this example, the first channel input sound signal and the second channel input sound signal are signals including only the same first sound signal output by the first sound source with a given time difference, and the first sound signal is included in the second channel input sound signal at the earliest timing. In this example, further, the third channel input sound signal to the sixth channel input sound signal are signals including the same second sound signal output by the second sound source with a given time difference, and the second sound signal is included in the sixth channel input sound signal at the earliest timing. In this example, the downmix unit 116 obtains a downmix signal that includes the second channel input sound signal in which the first sound signal is included at the earliest timing and the sixth channel input sound signal in which the second sound signal is included at the earliest timing, but does not include the first channel input sound signal and the third channel input sound signal to the fifth channel input sound signal. When such a downmix signal is to be obtained, no problem arises even when the inter-channel correlation values γ_nmof non-adjacent channels are obtained by the following equations in an approximate manner using the inter-channel correlation values of adjacent channels, γ₁₂=1, γ₂₃=0, γ₃₄=1, γ₄₅=1, and γ₅₆=1, with each inter-channel correlation value set to a value from 0 to 1.

γ₁₃=γ₁₂×γ₂₃=1×0=0
γ₁₄=γ₁₂×γ₂₃×γ₃₄=1×0×1=0
γ₁₅=γ₁₂×γ₂₃×γ₃₄×γ₄₅=1×0×1×1=0
γ₁₆=γ₁₂×γ₂₃×γ₃₄×γ₄₅×γ₅₆=1×0×1×1×1=0
γ₂₄=γ₂₃×γ₃₄=0×1=0
γ₂₅=γ₂₃×γ₃₄×γ₄₅=0×1×1=0
γ₂₆=γ₂₃×γ₃₄×γ₄₅×γ₅₆=0×1×1×1=0
γ₃₅=γ₃₄×γ₄₅=1×1=1
γ₃₆=γ₃₄×γ₄₅×γ₅₆=1×1×1=1
γ₄₆=γ₄₅×γ₅₆=1×1=1

Likewise, no problem arises even when the time differences of non-adjacent channels are obtained by the following equations in an approximate manner using time differences τ₁₂, τ₂₃, τ₃₄, τ₄₅, and τ₅₆of the adjacent channels, and the preceding channel information INFO_nmis obtained in an approximate manner based on whether each obtained time difference between channels is positive, negative, or zero.

τ₁₃=τ₁₂+τ₂₃
τ₁₄=τ₁₂+τ₂₃+τ₃₄
τ₁₅=τ₁₂+τ₂₃+τ₃₄+τ₄₅
τ₁₆=τ₁₂+τ₂₃+τ₃₄+τ₄₅+τ₅₆
τ₂₄=τ₂₃+τ₃₄
τ₂₅=τ₂₃+τ₃₄+τ₄₅
τ₂₆=τ₂₃+τ₃₄+τ₄₅+τ₅₆
τ₃₅=τ₃₄+τ₄₅
τ₃₆=τ₃₄+τ₄₅+τ₅₆
τ₄₆=τ₄₅+τ₅₆

It should be noted that the inter-channel correlation value γ_nmand the preceding channel information INFO_nmcan be obtained using the above-mentioned equations in an approximate manner only in the case where the input sound signals with the same or similar waveforms are located at successive channels as exemplified in FIG. 9, and in the case where there is a channel with an input sound signal with a significantly different waveform between channels of the input sound signals with the same or similar waveforms as exemplified in FIG. 10, the inter-channel correlation value γ_nmand the preceding channel information INFO_nmcannot be obtained in an approximate manner using the above-mentioned equations. In view of this, the sound signal downmix apparatus of the third embodiment sorts the input sound signals of the N channels such that there is no channel with a significantly different waveform of the input sound signal between the channels with the same or similar waveforms of the input sound signals, obtains the inter-channel correlation value γ_nmand the preceding channel information INFO_nmfor adjacent channels after the sorting, and obtains other inter-channel correlation values γ_nmand preceding channel information INFO_nmin an approximate manner by using the inter-channel correlation value γ_nmand the preceding channel information INFO_nmbetween the adjacent channels after the sorting.

First Example

A sound signal downmix apparatus of a first example of the third embodiment is described below. As illustrated in FIG. 5, a sound signal downmix apparatus 408 of the first example includes an inter-channel relationship information estimation unit 188 and the downmix unit 116. The sound signal downmix apparatus 408 of the first example performs processing of step S188 and step S116 exemplified in FIG. 6 for each frame. The downmix unit 116 and step S116 are identical to those of the first example of the second embodiment, and therefore the inter-channel relationship information estimation unit 188 and step S188 different from the first example of the second embodiment will be described below. Time-domain sound signals of the N channels are input to the sound signal downmix apparatus 408 as with the sound signal downmix apparatus 408 of the first example of the second embodiment, and a downmix signal that is a time-domain monaural sound signal is obtained and output by the sound signal downmix apparatus 408 as with the sound signal downmix apparatus 406 of the first example of the second embodiment.

Inter-Channel Relationship Information Estimation Unit 188

The input sound signals of the N channels input to the sound signal downmix apparatus 408 are input to the inter-channel relationship information estimation unit 188. While the number of channels N is an integer of 2 or greater in the second embodiment, the number of channels N is an integer of three or greater in the third embodiment because no channel with a significantly different waveform of the input sound signal can be present between the channels with the same or similar waveforms of the input sound signal when the number of channel N is two. As illustrated in FIG. 11, the inter-channel relationship information estimation unit 188 includes a channel sorting unit 1881, an inter-adjacent-channel relationship information estimation unit 1882, and an inter-channel relationship information complement unit 1883, for example. The inter-channel relationship information estimation unit 188 performs processing of step S1881, step S1882 and step S1883 exemplified in FIG. 12 for each frame, for example (step S188).

Channel Sorting Unit 1881

The channel sorting unit 1881 sequentially performs sorting in the order from the first channel such that the adjacent channel is the channel with highest similarity of the waveform of the input sound signal among the remaining channels when the time differences are aligned, and obtains and outputs a first sorted input sound signal to an Nth sorted input sound signal, which are signals after the sorting of the N channels, and first original channel information c₁to Nth original channel information c_N, which are the channel numbers (that is, the channel numbers of the input sound signals) when each input sound signal to be sorted has been input to the sound signal downmix apparatus 408, for example (step S1881A). As the similarity in waveform after the aligning of the time differences, it suffices that the channel sorting unit 1881 uses a value indicating the degree of the correlation such as a value indicating the closeness of the distance between the input sound signals of two channels after the aligning of the time differences, and a value obtained by dividing the inner product of the input sound signals of the two channels after the aligning of the time differences by the geometric mean of the energy of the input sound signals of two channels.

For example, when a value indicating the closeness of the distance between the input sound signals of two channels after the aligning of the time differences is used as the similarity in waveform after the aligning of the time differences, the channel sorting unit 1881 performs the following step S1881A-1 to step S1881A-N. First, the channel sorting unit 1881 obtains the first channel input sound signal as the first sorted input sound signal, and obtains “1” that is the channel number of the first channel as the first original channel information c₁(step S1881A-1).

Next, for each candidate number of samples τ_candfrom τ_maxto τ_minset in advance (for example, τ_maxis a positive number and τ_minis a negative number) for each channel m from the second channel to the Nth channel, the channel sorting unit 1881 obtains the distance between the sample sequence of the first sorted input sound signal and the sample sequence of the mth channel input sound signal shifted backward relative to the sample sequence of the first sorted input sound signal by the candidate number of samples τ_cand, obtains the input sound signal of the channel m with the minimum distance value as a second sorted input sound signal, and obtains the channel number of the channel m with the minimum distance value as second original channel information c₂(step S1881A-2).

Next, for each candidate number of samples τ_candfrom τ_maxto Tam for each channel m that has not been set as a sorted input sound signal among channels from the second channel to the Nth channel, the channel sorting unit 1881 obtains the distance between the sample sequence of the second sorted input sound signal and the sample sequence of the mth channel input sound signal shifted backward relative to the sample sequence of the second sorted input sound signal by the candidate number of samples τ_cand, and obtains the input sound signal of the channel m with a minimum distance value as a third sorted input sound signal, and obtains the channel number of the channel m with the minimum distance value as third original channel information c₃(step S1881A-3). Thereafter, the same processing is repeated until there is only one channel that has not been set as a sorted input sound signal left, so that a fourth sorted input sound signal to a (N−1)th sorted input sound signal, and fourth original channel information c₄to (N−1)th original channel information c_(N−1)are obtained (step S1881A-4 step S1881A-(N−1)).

Finally, the channel sorting unit 1881 obtains the input sound signal of the remaining one channel that has not been set as a sorted input sound signal as the Nth sorted input sound signal, and obtains the channel number of the remaining one channel that has not been set as a sorted input sound signal as the Nth original channel information c_N(step S1881A-N). Note that in the following description, the nth sorted input sound signal for each n from 1 to N is referred to also as the input sound signal of the nth channel after the sorting, and the n of the nth sorted input sound signal is referred to also as the channel number after the sorting.

Note that the channel sorting unit 1881 may perform the sorting by evaluating the similarity without aligning the time differences, considering that the purpose is to sort the input sound signals of the N channels such that there is no channel with a significantly different waveform of the input sound signal between the channels with the same or similar waveforms of the input sound signals, and that it is preferable that the amount of arithmetic processing for the sorting processing be small. For example, the channel sorting unit 1881 may perform the following step S1881B-1 to step S1881B-N. First, the channel sorting unit 1881 obtains the first channel input sound signal as the first sorted input sound signal, and obtains “1” that is the channel number of the first channel as the first original channel information c₁(step S1881B-1).

Next, the channel sorting unit 1881 obtains the distance between the sample sequence of the first sorted input sound signal and the sample sequence of the mth channel input sound signal for each channel m from the second channel to the Nth channel, obtains the input sound signal of the channel m with a minimum distance value as the second sorted input sound signal, and obtains the channel number of the channel m with a minimum distance value as the second original channel information c₂(step S1881B-2).

Next, for each channel m that has not been set as a sorted input sound signal among channels from the second channel to the Nth channel, the channel sorting unit 1881 obtains the distance between the sample sequence of the second sorted input sound signal and the sample sequence of the mth channel input sound signal, obtains the input sound signal of the channel m with a minimum distance value as the third sorted input sound signal, and obtains the channel number of the channel m with a minimum distance value as the third original channel information c₃(step S1881B-3). Thereafter, the same processing is repeated until there is only one channel that has not been set as a sorted input sound signal left, so that the fourth sorted input sound signal to the (N−1)th sorted input sound signal, and the fourth original channel information c₄to the (N−1)th original channel information c_(N−)are obtained (step S1881B-4 step S1881B-(N−1)).

In short, it suffices that regardless of whether or not the time differences are aligned or regardless of the value to be used as the similarity of the signals, the channel sorting unit 1881 sequentially performs the sorting in the order from the first channel such that the adjacent channel is the channel with the most similar input sound signal among the remaining channels, and obtains and outputs the first sorted input sound signal to the Nth sorted input sound signal as the signals after the sorting of the N channels, and the first original channel information c₁to the Nth original channel information c_Nas the channel numbers (that is, the channel numbers of the input sound signals) when each sorted input sound signal is input to the sound signal downmix apparatus 408 (step S1881).

Inter-Adjacent-Channel Relationship Information Estimation Unit 1882

The N sorted input sound signals from the first sorted input sound signal to the Nth sorted input sound signal are input to the inter-adjacent-channel relationship information estimation unit 1882. The inter-adjacent-channel relationship information estimation unit 1882 obtains and outputs the inter-channel correlation value and the inter-channel time difference of each pair of two channels after the sorting with adjacent channel numbers after the sorting in the N sorted input sound signals (step S1882).

The inter-channel correlation value obtained at step S1882 is a correlation value that takes into account the time difference between the sorted input sound signals for each pair of two channels after the sorting with adjacent channel numbers after the sorting, that is, a value indicating the degree of the correlation that takes into account the time difference between the sorted input sound signals. There are (N−1) pairs of two channels included in the N channels. In the case where n is an integer from 1 to N−1, and the inter-channel correlation value between the nth sorted input sound signal and the (n+1)th sorted channel input sound signal is γ′_n(n+1), the inter-adjacent-channel relationship information estimation unit 1882 obtains the inter-channel correlation value γ′_n(n+1)for each of (N−1) pairs of two channels after the sorting with adjacent channel numbers after the sorting.

The inter-channel time difference obtained at step S1882 is information indicating which of two sorted input sound signals includes the same sound signal and how much earlier the same sound signal is included for each pair of two channels after the sorting with adjacent channel numbers after the sorting. In the case where the inter-channel time difference between the nth sorted input sound signal and the (n+1)th sorted input sound signal is τ′_n(n+1), the inter-adjacent-channel relationship information estimation unit 1882 obtains the inter-channel time difference τ′_n(n+1)for each of (N−1) pairs of two channels after the sorting with adjacent channel numbers after the sorting.

For example, the absolute value of a correlation coefficient is used as a value indicating the degree of the correlation. In such a case, for each candidate number of samples τ_candfrom τ_maxto τ_minset in advance for each n from 1 to N−1 (that is, for each pair of two channels after the sorting with adjacent channel numbers after the sorting), the inter-adjacent-channel relationship information estimation unit 1882 obtains and outputs, as the inter-channel correlation value γ′_n(n+1), the maximum value of the absolute value γ_candof the correlation coefficient between the sample sequence of the nth sorted input sound signal and the sample sequence of the (n+1)th sorted input sound signal shifted backward relative to the sample sequence of the nth sorted input sound signal by the candidate number of samples τ_cand, and obtains and outputs, as the inter-channel time difference τ′_n(n+1), τ_candwhen the absolute value of the correlation coefficient is a maximum value.

In addition, for example, instead of the absolute value of the correlation coefficient, a correlation value using information about a phase of a signal may be set as γ_candas follows. In this example, first, for each channel i from the first channel input sound signal to the Nth channel input sound signal, the inter-adjacent-channel relationship information estimation unit 1882 obtains the frequency spectrum X_i(k) at each frequency k of 0 to T−1 by performing Fourier transform on the input sound signals x_i(1), x_i(2) . . . , x_i(T) as in Equation (2-1).

Next, the inter-adjacent-channel relationship information estimation unit 1882 performs the following processing for each n from 1 to N−1, that is, each pair of two channels after the sorting with adjacent channel numbers after the sorting. First, the inter-adjacent-channel relationship information estimation unit 1882 obtains the phase difference spectrum φ(k) at each frequency k through the following Equation (3-1) by using the frequency spectrum X_n(k) of the nth channel and the frequency spectrum X_(n+1)(k) of the (n+1)th channel at each frequency k obtained through Equation (2-1).

$[Math . 15]$

$\begin{matrix} ϕ (k) = \frac{X_{n} (k) / ❘ X_{n} (k) ❘}{X_{(n + 1)} (k) / ❘ X_{(n + 1)} (k) ❘} & (3 - 1) \end{matrix}$

Next, the inter-adjacent-channel relationship information estimation unit 1882 obtains the phase difference signal ψ(τ_cand) for each candidate number of samples τ_candfrom τ_maxto τ_minas in Equation (1-4) by performing inverse Fourier transform on the phase difference spectrum obtained through Equation (3-1). Next, the inter-adjacent-channel relationship information estimation unit 1882 obtains and outputs, as the inter-channel correlation value γ′_n(n+1), the maximum value of the correlation value γ_candthat is the absolute value of the phase difference signal ψ(τ_cand), and obtains and outputs, as the inter-channel time difference τ′_n(n+1), τ_candwhen the correlation value is a maximum value.

Note that instead of using as it is the absolute value of the phase difference signal ψ(τ_cand) as the correlation value γ_cand, the inter-adjacent-channel relationship information estimation unit 1882, as with the left-right relationship information estimation unit 183 and the inter-channel relationship information estimation unit 186, may use a normalized value such as a relative difference between the absolute value of the phase difference signal ψ(τ_cand) for each τ_candand the average of the absolute values of phase difference signals obtained for a plurality of candidate numbers of samples before and after τ_cand, for example. That is, the inter-adjacent-channel relationship information estimation unit 1882 may obtain an average value through Equation (1-5) by using the positive number τ_rangeset in advance for each τ_cand, and use, as γ_cand, a normalized correlation value obtained through Equation (1-6) by using the obtained average value ψ_c(τ_cand) and phase difference signal ψ(τ_cand).

Inter-Channel Relationship Information Complement Unit 1883

The inter-channel correlation value and the inter-channel time difference of each pair of two channels after the sorting with adjacent channel numbers after the sorting output by the inter-adjacent-channel relationship information estimation unit 1882, and the original channel information for each channel after the sorting output by the channel sorting unit 1881 are input to the inter-channel relationship information complement unit 1883. The inter-channel relationship information complement unit 1883 obtains and outputs the inter-channel correlation value and the preceding channel information for all pairs of two channels (that is, all pairs of two channels being the sorting targets) by performing processing of step S1883-1 step S1883-5 described below (step S1883).

First, from the inter-channel correlation value of each pair of two channels after the sorting with adjacent channel numbers after the sorting, the inter-channel relationship information complement unit 1883 obtains the inter-channel correlation value of each pair of two channels after the sorting with non-adjacent channel numbers after the sorting (step S1883-1). In the case where n is an integer from 1 to N−2, m is an integer from n+2 to N, and the inter-channel correlation value between the nth sorted input sound signal and the mth sorted input sound signal is γ′_nm, the inter-channel relationship information complement unit 1883 obtains the inter-channel correlation value γ′_nmof each pair of two channels after the sorting with non-adjacent channel numbers after the sorting.

In the case where the two channel numbers of each pair of two channels after the sorting with adjacent channel numbers after the sorting are i (i is an integer from 1 to N−1) and i+1, and the inter-channel correlation value of each pair of two channels after the sorting with adjacent channel numbers after the sorting is γ′_i(i+1), the inter-channel relationship information complement unit 1883 obtains, as the inter-channel correlation value γ′_nm, a value obtained by multiplying all inter-channel correlation values γ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting), for example. That is, the inter-channel relationship information complement unit 1883 obtains the inter-channel correlation value γ′_nmthrough the following Equation (3-2).

$[Math . 16]$

$\begin{matrix} γ_{n m}^{'} = \prod_{i = n}^{m - 1} γ_{i (i + 1)}^{'} & (3 - 2) \end{matrix}$

Note that the inter-channel relationship information complement unit 1883 may obtain, as the inter-channel correlation value γ′_nm, the geometric mean of all the inter-channel correlation values γ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting). That is, the inter-channel relationship information complement unit 1883 may obtain the inter-channel correlation value γ′_nmthrough the following Equation (3-3).

$[Math . 17]$

$\begin{matrix} γ_{n m}^{'} = \sqrt[m - n]{\prod_{i = n}^{m - 1} γ_{i (i + 1)}^{'}} & (3 - 3) \end{matrix}$

It should be noted that in the case where the inter-channel correlation value is a value whose upper limit is not 1 such as the absolute value of the correlation coefficient and the normalized value, it is preferable that the inter-channel relationship information complement unit 1883 obtain the geometric mean expressed by Equation (3-3) as the inter-channel correlation value γ′_nm, rather than the multiplication value expressed by Equation (3-2) such that the inter-channel correlation value of each pair of two channels after the sorting with non-adjacent channel numbers after the sorting does not exceed the normal upper limit of the inter-channel correlation value.

Note that for example, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting), if there is a pair whose correlation is extremely small due to different sound signals included in two input sound signals of the pair in the pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1, the inter-channel correlation value γ′_nmmay be set to a value that depends on the inter-channel correlation value γ′_i(i+1)of that pair. For example, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting), the inter-channel relationship information complement unit 1883 may obtain, as the inter-channel correlation value γ′_nm, the minimum value of the inter-channel correlation values γ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1. In addition, for example, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting), the inter-channel relationship information complement unit 1883 may obtain, as the inter-channel correlation value γ ′_nm, a multiplication value or a geometric mean of a plurality of the inter-channel correlation values γ′_i(i+1)including the minimum value in the inter-channel correlation values γ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1. It should be noted that in the case where the inter-channel correlation value is a value whose upper limit is not 1 such as the absolute value of the correlation coefficient and the normalized value, it is preferable that the inter-channel relationship information complement unit 1883 obtain the geometric mean rather than the multiplication value as the inter-channel correlation value γ′_nmsuch that the inter-channel correlation value of each pair of two channels after the sorting with non-adjacent channel numbers after the sorting does not exceed the normal upper limit of the inter-channel correlation value.

In short, in the case where the two channel numbers of each pair of two channels after the sorting with adjacent channel numbers after the sorting are i (i is an integer from 1 to N−1) and i+1, the inter-channel correlation value of each pair of two channels after the sorting with adjacent channel numbers after the sorting is γ′_i(i+1), n is an integer from 1 to N−2, m is an integer from n+2 to N, and the inter-channel correlation value between the nth sorted input sound signal and mth sorted input sound signal is γ′_nm, it suffices that, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting), the inter-channel relationship information complement unit 1883 obtains, as the inter-channel correlation value γ′_nm, a value that has a monotonically non-decreasing relationship with each of one or more of the inter-channel correlation values γ′_i(i+1)including the minimum value of the inter-channel correlation values γ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1. Further, in the case where the two channel numbers of each pair of two channels after the sorting with adjacent channel numbers after the sorting are i (i is an integer from 1 to N−1) and i+1, the inter-channel correlation value of each pair of two channels after the sorting with adjacent channel numbers after the sorting is γ′_i(i+1), n is an integer from 1 to N−2, m is an integer from n+2 to N, and the inter-channel correlation value between the nth sorted input sound signal and mth sorted input sound signal is γ′_nm, it suffices that, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting), the inter-channel relationship information complement unit 1883 obtains, as the inter-channel correlation value γ′_nm, a value that has a monotonically non-decreasing relationship with each of one or more of the inter-channel correlation values γ′_i(i+1)including the minimum value of the inter-channel correlation values γ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1 within the possible range of the inter-channel correlation value.

The inter-channel correlation value of each pair of two channels after the sorting with adjacent channel numbers after the sorting obtained by the inter-adjacent-channel relationship information estimation unit 1882 has been input, and the inter-channel correlation value of each pair of two channels after the sorting with non-adjacent channel numbers after the sorting is obtained at step S1883-1. Therefore, at the time point when step S1883-1 is performed, the inter-channel relationship information complement unit 1883 has all inter-channel correlation values for (N×(N−1))/2 pairs of two channels after the sorting included in the N channels after the sorting. That is, in the case where n is an integer from 1 to N, m is an integer greater than n and equal to or smaller than N, and the inter-channel correlation value between the nth sorted input sound signal and the mth sorted input sound signal is γ′_nm, the inter-channel relationship information complement unit 1883 has the inter-channel correlation value γ′_nmfor each of (N×(N−1))/2 pairs of two channels after the sorting at the time point when step S1883-1 is performed.

After step S1883-1, the inter-channel relationship information complement unit 1883 obtains the inter-channel correlation value between input sound signals for each pair of two channels included in the N channels by associating the inter-channel correlation value γ′_nmfor each of the (N×(N−1))/2 pairs of two channels after the sorting with a pair of channels for the input sound signals of the N channels (that is, a pair of channels being the sorting targets) by using the original channel information c₁to C_Nfor the channels after the sorting (step S1883-2). In the case where n is an integer from 1 to N, m is an integer greater than n and equal to or smaller than N, and the inter-channel correlation value between the nth channel input sound signal and the mth channel input sound signal is γ_nm, the inter-channel relationship information complement unit 1883 obtains the inter-channel correlation value γ_nmfor each of (N×(N−1))/2 pairs of two channels.

In addition, the inter-channel relationship information complement unit 1883 obtains the inter-channel time difference of each pair of two channels after the sorting with non-adjacent channel numbers after the sorting from the inter-channel time difference of each pair of two channels after the sorting with adjacent channel numbers after the sorting (step S1883-3). In the case where n is an integer from 1 to N−2, m is an integer from n+2 to N, and the inter-channel time difference between the nth channel sorted input sound signal and the mth channel sorted input sound signal is τ′_nm, the inter-channel relationship information complement unit 1883 obtains an inter-channel time difference τ′_nmof each pair of two channels after the sorting with non-adjacent channel numbers after the sorting. In the case where the two channel numbers of each pair of two channels after the sorting with adjacent channel numbers after the sorting are i (i is an integer from 1 to N−1) and i+1, and the inter-channel time difference of each pair of two channels after the sorting with adjacent channel numbers after the sorting is τ′_i(i+1), the inter-channel relationship information complement unit 1883 obtains, as the inter-channel time difference τ′_nm, a value obtained by adding up all of inter-channel time differences τ′_i(i+1)of pairs of two channels with adjacent channel numbers after the sorting whose i is from n to m−1, for each pair of n and m (that is, for each pair of two channels after the sorting with non-adjacent channel numbers after the sorting). That is, the inter-channel relationship information complement unit 1883 obtains the inter-channel time difference τ′_nmthrough the following Equation (34).

$[Math . 18]$

$\begin{matrix} τ_{n m}^{'} = \sum_{i = n}^{m - 1} τ_{i (i + 1)}^{'} & (3 - 4) \end{matrix}$

The inter-channel time difference of each pair of two channels after the sorting with adjacent channel numbers after the sorting obtained by the inter-adjacent-channel relationship information estimation unit 1882 has been input, and the inter-channel time difference of each pair of two channels after the sorting with non-adjacent channel numbers after the sorting is obtained at step S1883-3. Therefore, at the time point when step S1883-3 is performed, the inter-channel relationship information complement unit 1883 has all the inter-channel time differences of (N×(N−1))/2 pairs of two channels after the sorting included in the N channels after the sorting. That is, in the case where n is an integer from 1 to N, m is an integer greater than n and equal to or smaller than N, and the inter-channel time difference of the pair of the nth channel after the sorting and the mth channel after the sorting is τ_nm, the inter-channel relationship information complement unit 1883 has the inter-channel time difference τ′_nmof each of the (N×(N−1))/2 pairs of two channels after the sorting at the time point when step S1883-3 is performed.

After step S1883-3, the inter-channel relationship information complement unit 1883 obtains the inter-channel time difference between input sound signals for each pair of two channels included in the N channels by associating the inter-channel time difference τ′_nmfor each of the (N×(N−1))/2 pairs of two channels after the sorting with a pair of channels for the input sound signal of the N channels (that is, a pair of channels being the sorting targets) by using the original channel information c₁to c_Nfor the channels after the sorting (step S1883-4). In the case where n is an integer from 1 to N, m is an integer greater than n and equal to or smaller than N, and the inter-channel time difference between the nth channel input sound signal and the mth channel input sound signal is τ_nm, the inter-channel relationship information complement unit 1883 obtains the inter-channel time difference τ_nmof each of the (N×(N−1))/2 pairs of two channels.

After step S1883-4, the inter-channel relationship information complement unit 1883 obtains the preceding channel information INFO_nmof each of the (N×(N−1))/2 pairs of two channels from the inter-channel time difference τ_nmof each of the (N×(N−1))/2 pairs of two channels (step S1883-5). The inter-channel relationship information complement unit 1883 obtains information indicating that the nth channel is preceding as the preceding channel information INFO_nmwhen the inter-channel time difference τ_nmis a positive value, and obtains information indicating that the mth channel is preceding as the preceding channel information INFO_nmwhen the inter-channel time difference τ_nmis a negative value. The inter-channel relationship information complement unit 1883 may obtain, for each pair of two channels, information indicating that the nth channel is preceding as the preceding channel information INFO_nmwhen the inter-channel time difference τ_nmis zero, or information indicating that the mth channel is preceding as the preceding channel information INFO_nm.

Note that instead of step S1883-4 and step S1883-5, the inter-channel relationship information complement unit 1883 may perform step S1883-4′ of obtaining preceding channel information INFO′_nmfrom the inter-channel time difference τ′_nmas in step S1883-5 for each of the (N×(N−1))/2 pairs of two channels after the sorting, and step S1883-5′ of obtaining the preceding channel information INFO_nmof each pair of two channels included in the N channels by associating the preceding channel information INFO′_nmfor each of the (N×(N−1))/2 pairs of two channels after the sorting obtained at step S1883-4′ with a pair of channels for the input sound signals of the N channels (that is, a pair of channels being the sorting targets) by using the original channel information c₁to c_Nfor the channels after the sorting. That is, it suffices that the inter-channel relationship information complement unit 1883 obtains the preceding channel information INFO_nmof each pair of two channels included in the N channels by establishing an association with a pair of channels for the input sound signals of the N channels using the original channel information c₁to c_N, from the inter-channel time difference τ′_nmof each of the (N×(N−1))/2 pairs of two channels after the sorting, and by obtaining the preceding channel information based on whether the inter-channel time difference is positive, negative or zero.

Second Example

Instead of the inter-channel relationship information estimation unit 186 of the second example of the second embodiment, the inter-channel relationship information estimation unit 188 of the first example of the third embodiment may be used. In this case, it suffices that the inter-channel relationship information obtaining unit 187 of the sound signal downmix apparatus 407 includes the inter-channel relationship information estimation unit 188 instead of the inter-channel relationship information estimation unit 186, and that the inter-channel relationship information obtaining unit 187 performs an operation in which the inter-channel relationship information estimation unit 186 is read as the inter-channel relationship information estimation unit 188. In this case, the sound signal downmix apparatus 407 has the apparatus configuration exemplified in FIG. 7, and the sound signal downmix apparatus 407 performs the processing as exemplified in FIG. 8.

Fourth Embodiment

It is possible to provide the sound signal downmix apparatus of the second and third embodiments as a sound signal downmix unit in a coding apparatus for coding sound signals, and this configuration will be described as a fourth embodiment.

Sound Signal Coding Apparatus 106

As illustrated in FIG. 13, a sound signal coding apparatus 106 of the fourth embodiment includes a sound signal downmix unit 407 and a coding unit 196. The sound signal coding apparatus 106 of the fourth embodiment obtains and outputs a sound signal code by performing coding of an input time-domain sound signal of N-channel stereo in a frame unit of a predetermined time length of, for example, 20 ms. The time-domain sound signal of N-channel stereo to be input to the sound signal coding apparatus 106 is, for example, a digital voice signal or an acoustic signal obtained through an AD conversion of a sound such as a voice and music picked up by each of N microphones, and is composed of N input sound signals of the first channel input sound signal to the Nth channel input sound signal. A sound signal code output by the coding apparatus is input to the decoding apparatus. A sound signal coding apparatus 105 of the fourth embodiment performs processing of step S407 and step S196 exemplified in FIG. 14 for each frame. The sound signal coding apparatus 106 of the fourth embodiment will be described with reference to the second embodiment and the third embodiment as appropriate.

Sound Signal Downmix Unit 407

The sound signal downmix unit 407 obtains and outputs a downmix signal from N input sound signals of the first channel input sound signal to the Nth channel input sound signal input to the sound signal coding apparatus 106 (step S407). As with the sound signal downmix apparatus 407 of the second embodiment or the third embodiment, the sound signal downmix unit 407 includes the inter-channel relationship information obtaining unit 187 and the downmix unit 116. The inter-channel relationship information obtaining unit 187 performs the above-described step S187, and the downmix unit 116 performs the above-described step S116. That is, the sound signal coding apparatus 106 includes the sound signal downmix apparatus 407 of the second embodiment or the third embodiment as the sound signal downmix unit 407, and performs the processing of the sound signal downmix apparatus 407 of the second embodiment or the third embodiment as step S407.

Coding Unit 196

At least the downmix signal output by the sound signal downmix unit 407 is input to the coding unit 196. The coding unit 196 obtains a sound signal code by performing at least coding on the input downmix signal, and outputs the signal (step S196). The coding unit 196 may also perform coding on N input sound signals of the first channel input sound signal to the Nth channel input sound signal, and may output the sound signal code including the code obtained through the coding. In this case, as indicated with the broken line in FIG. 13, N input sound signals of the first channel input sound signal to the Nth channel input sound signal are also input to the coding unit 196.

The coding processing performed by the coding unit 196 is not limited. For example, a sound signal code may be obtained by performing coding on the downmix signals x_M(1), x_M(2) . . . , x_M(T) of input T samples by a monaural coding scheme such as 3GPP EVS standard. Moreover, for example, in addition to obtaining a monaural code by coding a downmix signal, a stereo code may be obtained by coding N input sound signals of the first channel input sound signal to the Nth channel input sound signal by a stereo coding scheme supporting a stereo decoding scheme of MPEG-4 AAC standard, and a combination of the monaural code and the stereo code may be obtained and output as a sound signal code. Furthermore, for example, in addition to obtaining a monaural code by coding a downmix signal, a stereo code may be obtained by performing coding on the weighted difference and the difference from the downmix signal for each channel for N input sound signals of the first channel input sound signal to the Nth channel input sound signal, and a combination of the monaural code and the stereo code may be obtained and output as a sound signal code.

Fifth Embodiment

It is possible to provide the sound signal downmix apparatus of the second embodiment and the third embodiment as a sound signal downmix unit in a signal processing apparatus for processing a sound signal, and this configuration is described as a fifth embodiment below.

Sound Signal Processing Apparatus 306

As illustrated in FIG. 15, a sound signal processing apparatus 306 of the fifth embodiment includes the sound signal downmix unit 407 and a signal processing unit 316. The sound signal processing apparatus 306 of the fifth embodiment performs a signal processing on an input time-domain sound signal of N-channel stereo in a frame unit of a predetermined time length of, for example, 20 ms, and thus obtains and outputs a signal processing result. Examples of the time-domain sound signal of N-channel stereo input to the sound signal processing apparatus 306 include a digital voice signal or an acoustic signal obtained through an AD conversion of a sound such as a voice and music picked up by N microphones, a digital voice signal or an acoustic signal obtained by processing the digital voice signal or acoustic signal, and, a digital decoding voice signal or a decoding acoustic signal obtained through decoding of a stereo code at a decoding apparatus. The time-domain sound signal of N-channel stereo is composed of N input sound signals of the first channel input sound signal to the Nth channel input sound signal. The sound signal processing apparatus 306 of the fifth embodiment performs processing of step S407 and step S316 exemplified in FIG. 16 for each frame. The sound signal processing apparatus 306 of the fifth embodiment will be described with reference to the second embodiment and the third embodiment as appropriate.

Sound Signal Downmix Unit 407

The sound signal downmix unit 407 obtains a downmix signal from the N input sound signals of the first channel input sound signal to the Nth channel input sound signal input to the sound signal processing apparatus 306, and outputs the downmix signal (step S407). As with the sound signal downmix apparatus 407 of the second embodiment or the third embodiment, the sound signal downmix unit 407 includes the inter-channel relationship information obtaining unit 187 and the downmix unit 116. The inter-channel relationship information obtaining unit 187 performs the above-described step S187, and the downmix unit 116 performs the above-described step S116. That is, the sound signal processing apparatus 306 includes the sound signal downmix apparatus 407 of the second embodiment or the third embodiment as the sound signal downmix unit 407, and performs the processing of the sound signal downmix apparatus 407 of the second embodiment or the third embodiment as step S407.

Signal Processing Unit 316

At least the downmix signal output by the sound signal downmix unit 407 is input to the signal processing unit 316. The signal processing unit 316 performs at least signal processing on the input downmix signal, and obtains and outputs a signal processing result (step S316). The signal processing unit 316 may also perform a signal processing on the N input sound signals of the first channel input sound signal to the Nth channel input sound signal and obtain a signal processing result. In this case, as indicated with the broken line in FIG. 15, the N input sound signals of the first channel input sound signal to the Nth channel input sound signal are also input to the signal processing unit 316, and the signal processing unit 316 performs a signal processing using a downmix signal on the input sound signal of each channel, and obtains an output sound signal of each channel as a signal processing result, for example.

Program and Recording Medium

The processing of each unit of each sound signal downmix apparatus, sound signal coding apparatus and sound signal processing apparatus may be implemented using a computer, and in this case, the processing detail of the function that should be provided in each apparatus is described in a program. When this program is read in a storage unit 1020 of a computer 1000 illustrated in FIG. 17 to operate an arithmetic processing unit 1010, an input unit 1030, an output unit 1040 and the like, the various processing function in each apparatus is implemented on the computer.

A program in which processing content thereof has been described can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-temporary recording medium, specifically, a magnetic recording device, an optical disk, or the like.

Further, distribution of this program is performed, for example, by selling, transferring, or renting a portable recording medium such as a DVD or CD-ROM on which the program has been recorded. Further, the program may be distributed by being stored in a storage device of a server computer and transferred from the server computer to another computer via a network.

For example, a computer executing such a program first temporarily stores the program recorded on the portable recording medium or the program transmitted from the server computer in an auxiliary recording unit 1050 that is its own non-temporary storage device. Then, when executing the processing, the computer reads the program stored in the auxiliary recording unit 1050 that is its own storage device to the storage unit 1020 and executes the processing in accordance with the read program. Further, as another execution mode of this program, the computer may directly read the program from the portable recording medium to the storage unit 1020 and execute processing in accordance with the program, or, further, may sequentially execute the processing in accordance with the received program each time the program is transferred from the server computer to the computer. Further, a configuration in which the above-described processing is executed by a so-called application service provider (ASP) type service for implementing a processing function according to only an execution instruction and result acquisition without transferring the program from the server computer to the computer may be adopted. It is assumed that the program in the present embodiment includes information provided for processing of an electronic calculator and being pursuant to the program (such as data that is not a direct command to the computer, but has properties defining processing of the computer).

Further, in this embodiment, although the present device is configured by a predetermined program being executed on the computer, at least a part of processing content of thereof may be achieved by hardware.

It is needless to say that the present disclosure can appropriately be modified without departing from the gist of the present disclosure.

Number	Date	Country	Kind
PCT/JP2020/010080	Mar 2020	WO	international
PCT/JP2020/010081	Mar 2020	WO	international
PCT/JP2020/041216	Nov 2020	WO	international

Number	Name	Date	Kind
20070223708	Villemoes	Sep 2007	A1
20080010072	Yoshida et al.	Jan 2008	A1
20160142846	Herre	May 2016	A1
20160142854	Fueg	May 2016	A1
20160255453	Fueg	Sep 2016	A1

Sound signal downmixing method, sound signal coding method, sound signal downmixing apparatus, sound signal coding apparatus, program and recording medium

Information

Patent Number

Date Filed

Date Issued

Inventors

Original Assignees

Examiners

CPC

Field of Search

US

CPC

International Classifications

Disclaimer

Term Extension

Abstract

Description

Claims

Priority Claims (3)

PCT Information

US Referenced Citations (5)

Foreign Referenced Citations (1)

Related Publications (1)