An example embodiment of the present invention relates generally to analysis and synthesis of multichannel signals.
There are several methods to generate a binaural audio signal from a multichannel signal that are based on a fixed filterbank structure. Some variations use a non-uniform filterbank structure or structures based on alternative auditory scales. Although binaural signals can be satisfactorily generated, such methods are not suitable for manipulating the components present within the audio signal. The spatial analysis of a multichannel signal is performed on a single band, which may contain contributions from multiple auditory sources (e.g., a multipitch signal could have very closely spaced harmonics). It may therefore not be possible to obtain the spatial distribution of the different components present in the entire spectrum of the signal. Pitch synchronous analysis of such signals is restricted to signals containing a single pitch, since multipitch signals tend to be difficult to analyze and require complex algorithms.
Many signal processing applications require detecting a tone and estimating its location from a signal. Some examples where detection of tones from audio signal spectrum is required include sinusoidal modeling requiring detection of spectral peaks and psychoacoustic models requiring identification of tone and noise like components in spectrum to apply the appropriate masking rules. A voice signal is characterized by harmonic structure and detecting harmonicity in spectrum requires detection of tone. Further, most musical instruments produce sounds containing tonal structure (it could be harmonic or inharmonic). Alternative applications include detection of interfering tones or selecting tone from noisy background or estimation of periodicity.
Performance of tone detection methods can suffer due to noise. Some tonal component detection methods may require estimating an approximate pitch in the time domain and then refining the spectral peak estimate in the spectral domain. In such scenarios, performance of pitch detection can degrade in the presence of multiple periodicities in the signal. Many techniques detect tones using distance measures, correlation, or geometrical and search based methods, and require comparison with a threshold at some stage of decision making. Thresholds on spectral mismatches are prone to errors in the presence of noise and also need normalization based on signal strengths.
A method, apparatus and computer program product are therefore provided according to an example embodiment of the present invention in order to perform categorical analysis and synthesis of a multichannel signal to synthesize binaural signals and extract, separate, and manipulate components within the audio scene of the multichannel signal that were captured through multichannel audio means.
In one embodiment, a method is provided that at least includes receiving a multichannel signal, computing the spectrum for the multichannel signal, determining tonality of bands within the spectrum, and generating a band structure for the spectrum. The method of this embodiment also includes performing spatial analysis of the bands, performing source filtering using the bands, performing synthesis on the filtered band components, and generating an output signal.
In some embodiments, the method may further include determining the tonality of bands within the spectrum on only one channel in the multichannel signal. In some embodiments, determining the tonality of bands within the spectrum comprises determining if the band is tonal or non-tonal. In some embodiments, the width of the bands may be variable. For example, one choice for the widths of the bands may be {29.6 Hz, 41 Hz, 52.75 Hz, 64.5 Hz, 76 Hz}.
In some embodiments, the method may further include a tonality determination of bands in the spectrum based on statistical goodness of fit tests. In some embodiments, the tonality determination comprises comparing a spectral component distribution in a band to an expected spectral component distribution. In some embodiments, the expected spectral component distribution may be generated by an ideal sinusoid. In some embodiments, comparison of the spectral component distributions may include using a test of goodness of fit, such as a chi-square test.
In some embodiments, the method may further include generating a band structure for the spectrum by categorizing bands as tonal or non-tonal and computing upper and lower limits of tonal and non-tonal bands. In some embodiments, generating a band structure for the spectrum may include consolidating multiple continuous tonal bands into a single band.
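The band-structure step described above can be sketched as follows. This is a minimal illustration, assuming tonality has already been decided per DFT bin (or per candidate band) and is supplied as a list of flags; the function name `build_band_structure` and the tuple layout are hypothetical and are not taken from the embodiments.

```python
from typing import List, Tuple

# (lower_limit, upper_limit, is_tonal); limits are inclusive indices.
Band = Tuple[int, int, bool]


def build_band_structure(tonal_flags: List[bool]) -> List[Band]:
    """Group tonal/non-tonal flags into bands with lower and upper limits.

    Runs of contiguous entries sharing the same flag are consolidated
    into a single band, so adjacent tonal entries become one tonal band.
    """
    bands: List[Band] = []
    start = 0
    for k in range(1, len(tonal_flags) + 1):
        # Close the current run at the end of the list or when the flag changes.
        if k == len(tonal_flags) or tonal_flags[k] != tonal_flags[start]:
            bands.append((start, k - 1, tonal_flags[start]))
            start = k
    return bands
```

For instance, the flags `[False, True, True, True, False, False, True]` would yield one three-entry tonal band, one single-entry tonal band, and the non-tonal bands between them.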
In some embodiments, spatial analysis of the bands may include determining the spatial location of a source. In some embodiments, source filtering of the bands may include processing the bands with head related transfer function (HRTF) filters. In some embodiments, synthesis on the filtered band components may include applying an inverse Discrete Fourier Transform and applying overlap-add synthesis. In some embodiments, the output signal may be an individual source in an audio scene of the multichannel signal, a binaural signal, source relocation within an audio scene of the multichannel signal, or directional component separation.
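The synthesis step (inverse Discrete Fourier Transform followed by overlapping and adding frames) can be illustrated with a small sketch. It assumes a sine window applied at both analysis and synthesis with 50% overlap, which is one common arrangement and not necessarily the one used in the embodiments; the function names and the frame length are hypothetical, and a naive DFT is used only to keep the example self-contained.

```python
import cmath
import math
from typing import List


def sine_window(n: int) -> List[float]:
    # w[i]^2 + w[i + n/2]^2 == 1, so 50%-overlapped frames sum back to unity.
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]


def dft(x: List[float]) -> List[complex]:
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(s: List[complex]) -> List[float]:
    n = len(s)
    return [sum(s[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def overlap_add_resynthesis(signal: List[float], frame_len: int = 16) -> List[float]:
    """Window, transform, inverse-transform, window again, and overlap-add."""
    hop = frame_len // 2
    win = sine_window(frame_len)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * win[i] for i in range(frame_len)]
        spectrum = dft(frame)
        # Band-wise filtering (e.g. HRTF processing) would modify `spectrum` here.
        rebuilt = idft(spectrum)
        for i in range(frame_len):
            out[start + i] += rebuilt[i] * win[i]
    return out
```

With an unmodified spectrum, samples covered by two overlapping frames are reconstructed to within rounding error; the first and last half-frames are covered by only one frame and are attenuated by the window.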
In another embodiment, an apparatus is provided that includes at least one processor and at least one memory including computer program instructions with the at least one memory and the computer program instructions configured to, with the at least one processor, cause the apparatus at least to receive a multichannel signal, compute the spectrum for the multichannel signal, determine tonality of bands within the spectrum, and generate a band structure for the spectrum. The at least one memory and the computer program instructions are also configured to, with the at least one processor, cause the apparatus at least to perform spatial analysis of the bands, perform source filtering of the bands, perform synthesis on the filtered band components, and generate an output signal.
In a further embodiment, a computer program product is provided that includes at least one non-transitory computer-readable storage medium bearing computer program instructions embodied therein for use with a computer with the computer program instructions including program instructions configured to receive a multichannel signal, compute the spectrum for the multichannel signal, determine tonality of bands within the spectrum, and generate a band structure for the spectrum. The program instructions are further configured to perform spatial analysis of the bands, perform source filtering of the bands, perform synthesis on the filtered band components, and generate an output signal.
In another embodiment, an apparatus is provided that includes at least means for receiving a multichannel signal, means for computing the spectrum for the multichannel signal, means for determining tonality of bands within the spectrum, and means for generating a band structure for the spectrum. The apparatus of this embodiment also includes means for performing spatial analysis of the bands, means for performing source filtering of the bands, means for performing synthesis on the filtered band components, and means for generating an output signal.
Having thus described certain embodiments of the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
Some embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms “data,” “content,” “information,” and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present invention.
Additionally, as used herein, the term ‘circuitry’ refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of ‘circuitry’ applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term ‘circuitry’ also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term ‘circuitry’ as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
As defined herein, a “computer-readable storage medium,” which refers to a non-transitory physical storage medium (e.g., volatile or non-volatile memory device), can be differentiated from a “computer-readable transmission medium,” which refers to an electromagnetic signal.
A method, apparatus and computer program product are provided in accordance with an example embodiment of the present invention to perform categorical analysis and synthesis of a multichannel signal to synthesize binaural signals and extract, separate, and manipulate components within the audio scene of the multichannel signal that were captured through multichannel audio means.
Embodiments of the present invention may perform analysis and synthesis of a multichannel signal to synthesize binaural signals and extract, separate, and manipulate components within the audio scene of the multichannel signal that were captured through multichannel audio means. Embodiments of the present invention do not require pitch estimation in time and frequency domains. The embodiments may perform spatial analysis categorically on the spectrum rather than on the entire spectrum. The categorization may be based on a tonal nature of regions or bands within the spectrum. The categorical analysis-synthesis enables various functions such as source separation, source manipulation, and binaural synthesis.
In some embodiments, spatial cues for the multichannel signal may be captured by analyzing fewer components (e.g. tonal components) in the spectrum, which are more relevant for carrying information about the direction. In some embodiments, operations may be more computationally efficient since only the bands specific to tonal regions need analysis and/or synthesis. Additionally, the tonality computation does not require pitch detection and is also suitable for use with multipitch signals.
In one embodiment, a method is provided that at least includes receiving a multichannel signal, computing the spectrum for the multichannel signal, determining tonality of bands within the spectrum, and generating a band structure for the spectrum. The method of this embodiment also includes performing spatial analysis of the bands, performing source filtering of the bands, performing synthesis on the filtered band components, and generating an output signal.
Further embodiments provide for determining tonality for regions of a spectrum by detecting peaks within the spectrum using a parametric statistical goodness of fit test. Such embodiments do not require a priori pitch estimation or temporal processing and use the spectrum as input for the tonality detection. For example, even if a signal is a combination of harmonic and non-harmonic components, spectral peaks can be reliably estimated. The tonality detection operation is flexible enough to allow gradual tuning by changing its parameters.
Some embodiments of the present invention may use a statistical goodness of fit method for identifying tonality in the spectrum. A real sinusoid is the sum of two complex exponentials with the same frequency of oscillation, 0.5*(exp(−jωt)+exp(jωt)), and would therefore give two spectral lines: one at the positive (+ve) and one at the negative (−ve) frequency. Once the signal is windowed, the lines smear and the spectrum is given by the Discrete Fourier Transform (DFT) of the windowed signal. Smearing may also occur if the N in an N-point DFT is not large enough to provide sufficient spectral resolution. In some embodiments, the ideal shape of the windowed spectrum of a tone is used as the reference or expected spectral content distribution, to which the region in the spectrum to be tested for tonality (the observed distribution) is compared. In essence, this process corresponds to comparing the shape of a region in a spectrum to the ideal spectral shape of a windowed tone. The interval over which the tonality is detected may be variable and can be changed based on the region in which it is applied. To be able to apply a statistical goodness of fit test, however, the expected and observed sets of samples cannot be compared as they are; rather, they need to resemble discrete probability distributions. As such, the observed and expected distribution functions are normalized by the sum of the magnitudes of their spectral values over the interval of comparison. This ensures that the spectral samples sum to unity.
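For reference, the relation behind the expected spectral shape can be written out. The symbols w(n) for the analysis window, W(ω) for its transform, and ω_0 for the tone frequency are notation assumed for this illustration only:

```latex
\cos(\omega_0 n) = \tfrac{1}{2}\left(e^{j\omega_0 n} + e^{-j\omega_0 n}\right)
\quad\Longrightarrow\quad
\mathcal{F}\{\,w(n)\cos(\omega_0 n)\,\}(\omega)
  = \tfrac{1}{2}\left[\,W(\omega-\omega_0) + W(\omega+\omega_0)\,\right]
```

That is, windowing replaces each spectral line of the tone with a shifted copy of the window's own spectrum, which is why the window transform centered on the candidate bin can serve as the expected distribution.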
In some embodiments, once such normalization is carried out, a goodness of fit test may be performed. In example embodiments, this can be any of the well-known statistical tests such as the Chi-Square, Anderson-Darling, or Kolmogorov-Smirnov test. Such tests require a statistic to be computed and a hypothesis test to be carried out at a particular significance level. In an example embodiment, the null hypothesis is that a tonal component is present; if the test statistic is higher than a threshold value (decided by the significance level), the null hypothesis is rejected. In an example embodiment, the statistic may be computed at every DFT bin value; when a tone is found, the chi-square statistic takes a low value. This also means that the shape of the spectral region found in the spectrum closely matches that of the ideal harmonic at the selected significance level.
The statistical nature of the test in such embodiments provides flexibility to tune the whole procedure through various parameters, such as using different significance levels for different regions and using variable intervals across the spectrum over which a goodness of fit is carried out.
In some embodiments, the DFT bins where tones are found may be stored and used for further computation along with their corresponding interval sizes.
An embodiment of the present invention may include an apparatus 100 as generally described below in conjunction with
It should also be noted that while
Referring now to
In some embodiments, the processor (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor) may be in communication with the memory device via a bus for passing information among components of the apparatus. The memory device may include, for example, a non-transitory memory, such as one or more volatile and/or non-volatile memories. In other words, for example, the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor). The memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present invention. For example, the memory device could be configured to buffer input data for processing by the processor 102. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processor.
In some embodiments, the apparatus 100 may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single “system on a chip.” As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.
The processor 102 may be embodied in a number of different ways. For example, the processor may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
In an example embodiment, the processor 102 may be configured to execute instructions stored in the memory device 104 or otherwise accessible to the processor. Alternatively or additionally, the processor may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly. Thus, for example, when the processor is embodied as an ASIC, FPGA or the like, the processor may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor is embodied as an executor of software instructions, the instructions may specifically configure the processor to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processor may be a processor of a specific device configured to employ an embodiment of the present invention by further configuration of the processor by instructions for performing the algorithms and/or operations described herein. The processor may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor.
Meanwhile, the communication interface 106 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the apparatus 100. In this regard, the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface may alternatively or also support wired communication. As such, for example, the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
The apparatus 100 may include a user interface 108 that may, in turn, be in communication with the processor 102 to provide output to the user and, in some embodiments, to receive an indication of a user input. For example, the user interface may include a display and, in some embodiments, may also include a keyboard, a mouse, a joystick, a touch screen, touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. The processor may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as a display and, in some embodiments, a speaker, ringer, microphone and/or the like. The processor and/or user interface circuitry comprising the processor may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor (e.g., memory 104, and/or the like).
The method, apparatus, and computer program product may now be described in conjunction with the operations illustrated in
The apparatus 100 may further include means, such as the processor 102, the memory 104, or the like, for computing the spectrum of a received multichannel signal. See block 204 of
As shown in block 206 of
Any of a variety of methods may be used to determine which bands of the spectrum are tonal, such as peak picking, an F-ratio test, or interpolation based techniques to determine spectral peaks. In an exemplary embodiment, the tonality of the bands in the spectrum may be based on statistical goodness of fit tests as described below.
Using a statistical goodness of fit test, tonality is detected by comparing the spectral component distribution in a band (i.e. the observed distribution) to a spectral component distribution generated by an ideal sinusoid (i.e. the expected distribution). The comparison is carried out using a chi-square test of goodness of fit. However, other goodness of fit tests such as Kolmogorov-Smirnov or Anderson-Darling may be used as well. A goodness of fit test is commonly used for comparing probability distributions; hence the first operation is to ensure that the functions to be compared have the properties of probability density functions. This is achieved by normalizing the spectrum over the band by the sum of its magnitudes in that band. A similar normalization is carried out on the Discrete Fourier Transform of the sine window centered on the harmonic. Once the two functions resemble probability density functions, a chi-square test is performed. The width of the band becomes the degrees of freedom for the chi-square distribution. In one example, the significance level is set to 10% but can be changed based on the strictness of the test.
In an example embodiment, the statistic is computed as follows:
$$\chi^2 = \sum_{k=1}^{n} \frac{\left(S_o(k) - S_i(k)\right)^2}{S_i(k)}$$
where χ² is the chi-square statistic, and S_o and S_i are the normalized observed and expected spectral magnitude distributions, respectively. S_i is derived from the Discrete Fourier Transform samples of the sine window function (used for the Discrete Fourier Transform computation) centered on the harmonic, while S_o is derived from the observed contiguous set of samples in the Discrete Fourier Transform spectrum. ‘n’ is the interval size over which the statistic is computed. In one example, the interval size can be chosen from five different sizes. The value of ‘n’ also determines the degrees of freedom of the chi-square distribution used for the hypothesis test. S_i and S_o are not used directly from the window and signal themselves; rather, they are normalized by the sum of magnitudes of the Discrete Fourier Transform samples over the interval. This is necessary in order to make them resemble frequency distributions so that the hypothesis test can be applied.
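The computation described above can be sketched in a few lines. This is an illustration only, assuming the standard Pearson form of the chi-square statistic and assuming that the threshold for the chosen significance level and interval size is supplied by the caller (for example from a chi-square table); the function names are hypothetical.

```python
from typing import List, Sequence


def normalize(magnitudes: Sequence[float]) -> List[float]:
    """Scale spectral magnitudes over the interval so that they sum to unity."""
    total = sum(magnitudes)
    return [m / total for m in magnitudes]


def chi_square_statistic(observed_mag: Sequence[float],
                         expected_mag: Sequence[float]) -> float:
    """Pearson chi-square statistic between normalized observed and expected shapes.

    Expected magnitudes must be non-zero over the interval.
    """
    s_o = normalize(observed_mag)
    s_i = normalize(expected_mag)
    return sum((o - e) ** 2 / e for o, e in zip(s_o, s_i))


def is_tonal(observed_mag: Sequence[float],
             expected_mag: Sequence[float],
             threshold: float) -> bool:
    """Keep the null hypothesis (tone present) unless the statistic exceeds the threshold."""
    return chi_square_statistic(observed_mag, expected_mag) <= threshold
```

Because both inputs are normalized first, a region whose shape matches the expected one up to a scale factor gives a statistic of zero, while a flat region gives a larger value.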
The subplot 406 of
As shown in block 208 of
As shown in block 210 of
The delay may be transformed into an angle in azimuthal plane using basic geometry. The angle may be used to determine the spatial location of the source of the signal. Typically, the bands generated due to a source in a particular direction would result in similar value of azimuthal angle.
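One simple way to carry out this conversion is sketched below. It assumes a far-field (plane-wave) source and a single pair of microphones with known spacing, neither of which is specified in the text; the function name, the parameter names, and the speed-of-sound constant are assumptions made for this illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly room temperature


def delay_to_azimuth_deg(delay_s: float,
                         mic_spacing_m: float,
                         speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Convert an inter-microphone delay to an azimuth angle in degrees.

    Uses sin(theta) = c * delay / d for a plane wave arriving at a
    two-microphone pair; 0 degrees is broadside to the pair.
    """
    ratio = speed_m_s * delay_s / mic_spacing_m
    # Clamp so that noisy delay estimates cannot leave the domain of asin.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

Bands whose delays map to nearly the same angle can then be attributed to the same source direction.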
As shown in block 212 of
In some embodiments, bands categorized as tonal may constitute a directional component and the remaining spectral lines or bands may constitute the ambience component of the signal. A respective synthesis of these components may provide dominant and ambient signal separation. A clustering algorithm applied to the angles of the different bands may be used to reveal the distribution of audio components along spatial directions. In an alternative embodiment, for video containing two or three visible audio sources in the field of view, it may be possible to capture the rough directions of the sources from lens parameters. Such information can be used to segment the bands by direction, and the segmented bands may then be synthesized separately to recover the individual sources. The sources identified in this manner need not be separated; instead, the entire band could be translated, allowing source relocation to be realized with the same analysis-synthesis framework. In some embodiments, after the angles of arrival for tonal bands are obtained, pruning and/or cleaning operations may be carried out to improve the performance in reverberant environments.
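The clustering of band angles mentioned above could take many forms. The sketch below uses a simple gap-based grouping of sorted angles, chosen here only for brevity and not stated in the text; the function name and the `max_gap_deg` tolerance are hypothetical.

```python
from typing import Dict, List


def cluster_bands_by_angle(band_angles_deg: Dict[int, float],
                           max_gap_deg: float = 10.0) -> List[List[int]]:
    """Group band indices whose azimuth angles lie close together.

    Bands are sorted by angle; a new cluster starts whenever the gap to
    the previous angle exceeds max_gap_deg. Angle wrap-around is ignored.
    """
    ordered = sorted(band_angles_deg.items(), key=lambda item: item[1])
    clusters: List[List[int]] = []
    previous_angle = None
    for band_id, angle in ordered:
        if previous_angle is None or angle - previous_angle > max_gap_deg:
            clusters.append([])
        clusters[-1].append(band_id)
        previous_angle = angle
    return clusters
```

Each resulting cluster lists the bands that may be synthesized together as one directional component.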
As shown in block 214 of
As shown in block 216 of
In some example embodiments, the band structure used in the analysis-synthesis may be dynamic and may therefore adapt to dynamic changes in the signal. For example, if the spectral components of two sources overlap, a fixed band structure offers no effective way to identify the two components within the band. With a dynamic band structure, however, the probability of each of these components being detected is higher. The probability of determining a correct direction for each tone is also higher, leading to improved spatial synthesis. Additionally, with a fixed band structure, multiple sources could be present in a single band, or a single band could only partially cover the spectral contribution of a single audio source. Using a dynamic band structure overcomes this limitation by positioning bands around the tonal components.
A dynamic band structure may also allow different resolution across the frequency bands. The interval over which tonality detection happens may also be varied allowing the use of a narrower interval in lower frequency regions and a wider interval in the higher frequency regions.
An example of tonality determination performed by some embodiments of the present invention may now be described in conjunction with the operations illustrated in
$$S(k)=\sum_{n=0}^{N-1} x(n)\,e^{-j2\pi kn/N}.$$
The window function and the signal in that window are shown in
As shown in block 504 of
$$S_o=\{S(k),\,S(k+1),\,\ldots,\,S(k+M_i-1)\}$$
and
$$S_e=\{W(k),\,W(k+1),\,\ldots,\,W(k+M_i-1)\},$$
where M_i is the size of the interval over which the goodness of fit is performed, and ‘i’ indexes the interval size, since multiple interval sizes may be used. S_o and S_e cannot be used as they are; they should resemble discrete probability density functions. Therefore, each is normalized by the sum of its magnitudes over the interval.
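The normalization just described can be written out explicitly. The hatted symbols and the index m are notation assumed for this sketch and do not appear in the original text:

```latex
\hat{S}_o(m) = \frac{|S(k+m)|}{\sum_{l=0}^{M_i-1} |S(k+l)|},
\qquad
\hat{S}_e(m) = \frac{|W(k+m)|}{\sum_{l=0}^{M_i-1} |W(k+l)|},
\qquad m = 0, 1, \ldots, M_i-1
```

With this scaling, each normalized set sums to unity over the interval, which is the property the goodness of fit test relies on.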
Example normalized expected and observed distributions are shown in
As shown in block 506 of
As shown in block 508 of
As shown in block 510 of
As described above,
Accordingly, blocks of the flowchart support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowchart, and combinations of blocks in the flowchart, can be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Number | Date | Country | Kind |
---|---|---|---|
4164/CHE/2012 | Oct 2012 | IN | national |