This application includes a Computer Listing Appendix on compact disc, hereby incorporated by reference.
1. Field of the Invention
The present invention relates generally to audio processing and, more particularly, to a method, apparatus, and system for synthesizing an audio performance in which one or more acoustic characteristics, such as acoustic space, microphone modeling and placement, are varied using pseudo-convolution processing techniques.
2. Description of the Prior Art
Digital music synthesizers are known in the art. An example of such a digital music synthesizer is disclosed in U.S. Pat. No. 5,502,747, hereby incorporated by reference. The system disclosed in the '747 patent employs multiple component filters and is based on hybrid time domain and frequency domain processing. Unfortunately, the methodology utilized in the '747 patent is relatively computationally intensive and is thus not efficient. As such, the system disclosed in the '747 patent is primarily only useful in academic and scientific applications where computation time is not critical. Thus, there is a need for a synthesizer that is relatively more efficient than those in the prior art.
The present invention relates to a method, apparatus, and system for use in synthesizing an audio performance in which one or more acoustic characteristics, such as acoustic space, microphone modeling and placement, can selectively be varied. In order to reduce processing time, the system utilizes pseudo-convolution processing techniques that greatly reduce the processor load. The system is able to emulate the audio output in different acoustic spaces; separate musical sources (instruments and other sound sources) from musical context; interactively recombine musical source and musical context with relatively accurate acoustical integrity, including surround sound contexts; emulate microphone models and microphone placement; create acoustic effects, such as reverberation; emulate instrument body resonance; and interactively switch emulated instrument bodies on a given musical instrument.
The present invention relates to an audio processing system for synthesizing an acoustic response in which one or more acoustic characteristics are selectably varied. For example, the audio response in a selectable musical context or acoustical space can be emulated. In particular, a model of virtually any acoustic space, for example, Carnegie Hall, can be recorded and stored. In accordance with one aspect of the invention, the system emulates the acoustic response in the selected acoustic space model, such that the audio input sounds as if it were played in Carnegie Hall, for example.
In accordance with one aspect of the invention, the system has the ability to separate musical sources (i.e. instruments and other sound sources) from the musical context (i.e. the acoustic space in which the sound sources are played). By emulating the response to selectable musical contexts, as described above, the acoustic response to various musical sources can be emulated for virtually any acoustic space, including the back seat of a station wagon.
Various techniques can be used for generating a model of an acoustic space. The model may be considered a fingerprint of a room or other space or musical context. The model is created, for example, by recording the room response to a sound impulse, such as a shot from a starter pistol or other acoustic input. Alternatively, the impulse response may be obtained by placing a speaker in the room or space to be modeled and playing a frequency sweep. More particularly, a common technique is the sine sweep method, which uses a sweep tone and a complementary decode tone. The convolution of the sweep tone and the decode tone is a perfect single-sample spike (impulse). After the sweep tone is played through the speaker and recorded by a microphone in the room, the resulting recording is convolved with the decode tone, which reveals the room impulse response. Alternatively, various “canned” acoustic space models are currently available on the Internet at http://www.echochamber.ch; http://altiverb.claw-mac.com; and http://noisevault.com.
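By way of illustration only, the sine sweep measurement described above might be implemented along the following lines. This is a simplified sketch using general-purpose numerical libraries; the function and parameter names (make_sweep, room_impulse_response) are hypothetical and are not taken from the Computer Listing Appendix.

```python
# Sketch of the sine-sweep method: an exponential sweep tone, its complementary
# decode tone, and recovery of the room impulse response from a recording of the
# sweep played through a speaker in the space being modeled.
import numpy as np
from scipy.signal import fftconvolve

def make_sweep(f0, f1, duration, fs):
    """Exponential sine sweep from f0 to f1 Hz plus its decode tone; the
    convolution of the two approximates a single-sample spike."""
    t = np.arange(int(duration * fs)) / fs
    k = duration / np.log(f1 / f0)
    sweep = np.sin(2 * np.pi * f0 * k * (np.exp(t / k) - 1.0))
    # Time-reverse and apply a decaying envelope so sweep * decode ~ impulse.
    decode = sweep[::-1] * np.exp(-t / k)
    decode /= np.max(np.abs(fftconvolve(sweep, decode)))
    return sweep, decode

def room_impulse_response(recorded, decode):
    """Convolve the microphone recording of the sweep with the decode tone to
    reveal the room impulse response (the acoustic "fingerprint")."""
    return fftconvolve(recorded, decode)
```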
In accordance with other aspects of the invention, the system is able to emulate other acoustic characteristics, such as the response of one or more predetermined microphones, for example a vintage AKG C-12 microphone. The microphone is emulated in the same manner as the musical context. In particular, the acoustic response of the vintage microphone, for example, to an acoustic impulse is recorded and stored. Any musical source played through the system is processed so that it sounds as if it were played through the vintage microphone.
The system is also able to emulate other acoustic characteristics, such as the location of an audio source within an audio context. In particular, in accordance with another aspect of the invention, the system is able to treat a sound source, the response of an acoustic space, a microphone response, and an instrument body resonance response as separate, reconfigurable elements of an audio performance. For example, when an instrument, say a violin, is performed in a room and recorded through a microphone, the resulting audio contains tonality and reverberation dictated by multiple impulse elements, namely the microphone, the room acoustics, and the violin body. In many cases it is desirable to control these three elements individually, separate from each other and from the string vibration of the violin. By doing so, different choices of microphone, room environment, or violin body can be independently selected by the user or content author for an audio performance. In addition, the system is able to optionally emulate the response to another audio characteristic, such as the location of an audio source relative to the microphone placement, thus allowing the audio source to be virtually moved relative to the microphone. As such, drums, for example, can be made to sound closer to or farther from the microphones.
In accordance with another aspect of the invention, the system is a real time audio processing system that is significantly less computationally intensive than known music synthesizers, such as the audio processing system disclosed in the '747 patent discussed above. In particular, various techniques are used to reduce the processing load relative to known systems. For example, as will be described in more detail below, in a “Turbo” mode of operation, the system processes input audio samples at a slower sample rate than the input sample rate, thus reducing the processor load by up to 75%, for example.
An exemplary host computing platform for use with the present invention is illustrated in
A button 114 is provided for selectively enabling and disabling a “cascade” feature associated with application of the raw impulse selected via the drop-down menu 104 to an audio track. A button 116 is provided for selectively enabling and disabling an “encode” feature which permits the application of a user-selected acoustic model to the instrument selected via the drop-down menu 106. A display area 118 optionally may show a graphical or photographic representation of the musical context selected by the drop-down menu 102.
A button 120 is provided for selectively activating and deactivating a mid/side (M/S) microphone pair arrangement for left-side and right-side microphones. Additional buttons 121, 122, 123, and 124 are provided for specifying groups of microphones, including, for example, all microphones (button 121), front (“F”) microphones (button 122), wide (“W”) microphones (button 123), and rear or surround (“S”) microphones (button 124).
The user also may enter microphone polar patterns and roll-off characteristics for each of the microphones employed in any given simulation. For that purpose, buttons 124, 125, 126, 127, 128, and 129 are provided for selecting a microphone roll-off characteristic or response. For example, buttons 125 and 126 select two different low-frequency bumps; button 127 selects a flat response, and buttons 128 and 129 select two different low-frequency roll-off responses, respectively. Similarly, buttons 130-134 allow a user to select one of several different well-recognized microphone polar patterns, such as an omni-directional pattern (button 130), a wide-angle cardioid pattern (button 131), a cardioid pattern (button 132), a hyper cardioid pattern (button 133), or a so-called “figure-8” pattern (button 134).
The control panel 100 also includes a placement control section 135, which, in the illustrated embodiment, contains a plurality of placement selector/indicator buttons (designated by numbers 1 through 18). These placement selector/indicator buttons allow a user to specify a position of musical instruments within the user-selected musical context (e.g., the position of the instrument selected by the drop-down menu 106 relative to the user-specified microphone(s)). The graphical display area 118 may display a depiction of the perspective of the room or musical context selected by the drop-down menu 102 corresponding to the placement within that room or musical context specified by the particular placement selector/indicator button actuated by the user. Of course, as will be readily apparent to those of ordinary skill in the art, many different alternative means may be employed to permit a user to select instrument placements within a particular musical context in addition to or instead of the placement selector/indicator buttons shown in
As also shown in
The mic-to-output control section 136 also includes a button 140 for selectively enabling and disabling a “simulated stereo” mode in which a single microphone simulation or output is processed to develop two (i.e., stereo) mixer output channels. This may be used, for example, to enable a simulated stereo output to be produced by a slow computer which does not have sufficient processing power to handle full stereo real-time processing. A button 142 is provided for selectively enabling a “true stereo” mode, which simply couples left and right stereo microphone simulations or outputs to two mixer output channels. Further, a button 144 is provided for selectively enabling and disabling a “seven-channel” mode in which each of seven microphone simulations or outputs is coupled to a respective mixer output channel to provide for full seven-channel surround sound output.
A button 146 is provided for selectively enabling and disabling a “tail extend” feature which causes the illustrated synthesizer to derive the first N seconds of the synthesized response by performing a full convolution and then to derive an approximation of the tail or terminal portion of the synthesized response using a recursive algorithm (described in more detail below) which is lossy but computationally efficient. Where exact acoustical simulation is not required, enabling the tail extend feature provides a trade-off between exact acoustical simulation and computational overhead. Associated with the tail extend feature are three parameters, Overlap, Level, and Cutoff, and a respective slider control 148, 150, and 152 is provided for adjustment of each of these parameters.
More particularly, the slider control 148 permits adjustment of an amount of overlap between the recursively generated tail portion of the synthesized response or output signal and a time-wise prior portion of the output signal which is calculated by convolution at a particular sample rate. The slider control 150 permits adjustment of the level of the recursively generated portion of the output signal so that it more closely matches the level of the time-wise prior convolved portion of the output signal. The slider control 152 permits adjustment of the frequency-domain cutoff between the recursively generated portion of the output signal and the time-wise prior convolved portion thereof to thereby smooth the overall spectral damping of the synthesized response or output signal such that the frequency-domain bandwidth of the recursively generated portion of the output signal more closely matches the frequency domain bandwidth of the convolved portion thereof at the transition point between those two portions.
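By way of illustration only, the effect of the Overlap, Level, and Cutoff adjustments when splicing a recursively generated tail onto the convolved portion of the output might look roughly as sketched below. The crossfade shape, the one-pole low-pass filter, and all function names are assumptions, not the code of the illustrated embodiment.

```python
# Sketch of splicing a recursively generated tail onto the convolved early
# portion using overlap, level, and cutoff parameters (hypothetical names).
import numpy as np

def splice_tail(convolved_early, recursive_tail, fs,
                overlap_s=0.1, level=0.5, cutoff_hz=4000.0):
    convolved_early = np.asarray(convolved_early, dtype=float)
    recursive_tail = np.asarray(recursive_tail, dtype=float)

    # One-pole low-pass so the tail's bandwidth matches the convolved portion
    # at the transition point (the "Cutoff" control).
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    filtered = np.empty_like(recursive_tail)
    y = 0.0
    for n, x in enumerate(recursive_tail):
        y = (1.0 - a) * x + a * y
        filtered[n] = y
    filtered *= level                      # the "Level" control

    # Linear crossfade over the overlap region (the "Overlap" control).
    n_ov = int(overlap_s * fs)
    fade = np.linspace(0.0, 1.0, n_ov)
    out = np.concatenate([convolved_early, filtered[n_ov:]])
    out[len(convolved_early) - n_ov:len(convolved_early)] = (
        convolved_early[-n_ov:] * (1.0 - fade) + filtered[:n_ov] * fade)
    return out
```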
A plurality of further slider controls may be provided to allow a user to adjust the level corresponding to each microphone used in a particular simulation. In the illustrated embodiment, slider controls 154-160 are provided for adjusting recording levels of each of seven recording channels, each corresponding to one of the available microphones in the illustrated simulation or synthesizer system. In addition, a master slider control 161 is provided to allow a user to simultaneously adjust the levels set by each of the slider controls 154-160. As shown, a digital read-out is provided in tandem with each slider control 154-161 to indicate numerically to the user the level set at any given time by the corresponding slider control 154-161. In the illustrated embodiment, the levels are represented by 11-bit numbers ranging from 0 to 2047. However, it should be evident to those of ordinary skill in the art that any other suitable range of levels in any suitable units could be used instead.
The control panel 100 also includes a level button 164, a perspective button 166, and a pre-delay button 168. The level button 164 allows a user to selectively activate and deactivate the level controls 154-161. The perspective button 166 allows the user to selectively activate and deactivate a perspective feature which allows the slider controls 154-161 to be used to adjust a parameter which simulates, for any given simulation, varying the physical dimensions of the musical context or room selected by the drop-down menu 102. The pre-delay button 168 allows the user to employ the slider controls 154-161 to adjust a parameter which simulates echo response speed (by adjusting the simulated lag between the initial echo in a recorded signal and a predetermined amount of echo density buildup).
Alternate exemplary graphical user interfaces (GUIs) are illustrated in
In order to reduce runtime CPU resource utilization, the loadtime coefficient processing routine 62 pre-processes, at load time, the time domain impulse coefficients from the storage device 60 with audio signal processing to facilitate changes to the audio response based on user input, and converts the resulting time domain coefficient data into the frequency domain. The runtime sequencing, control, and data manager 52 processes the audio source input samples and the processed impulse response coefficients so as to facilitate CPU load balancing and efficient real time processing. The processed samples and coefficients from the runtime sequencing, control, and data manager 52 are applied to the process channel module 53 in order to produce audio output samples 68, which emulate the audio response of the input audio source to various user selected audio characteristics.
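By way of illustration only, the kind of load-time pre-processing described above, partitioning a time-domain impulse response into frames and converting each frame to the frequency domain so that runtime processing is limited to frequency-domain multiplies, might be sketched as follows. The frame length and zero-padding scheme are assumptions.

```python
# Sketch of load-time coefficient processing: split the time-domain impulse
# response into fixed-length partitions and pre-compute their FFTs.
import numpy as np

def preprocess_coefficients(impulse, frame_len):
    impulse = np.asarray(impulse, dtype=float)
    n_frames = int(np.ceil(len(impulse) / frame_len))
    padded = np.zeros(n_frames * frame_len)
    padded[:len(impulse)] = impulse
    parts = padded.reshape(n_frames, frame_len)
    # Zero-pad each partition to 2*frame_len, as needed for overlap-add
    # frequency-domain convolution at run time.
    return np.fft.rfft(parts, n=2 * frame_len, axis=1)
```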
A fast Fourier transform (FFT) module 82, including FFT routines 84, 86, and 88, is provided for converting frames of data, which are represented in the time domain in frame buffers 76 and 78, into corresponding frequency-domain data. More particularly, the FFT routine 84 produces a fast Fourier transform of an XLB frame from the frame buffer 76 and provides the transformed data to a frequency domain buffer (XLBF) 94. In a turbo mode, frame data from the frame buffer (XLA) 78 is filtered by a low-pass filter, for example a 2:1 filter, to reduce the sample rate to ½ of the audio input source sample rate. The low-pass filter simply reduces the audio bandwidth to one half of the input sample bandwidth and decimates the result by saving only every other sample. The filtered samples are stored in a decimation frame buffer (XLHP) 92. This decimation frame buffer 92 contains the band-reduced and decimated samples produced by low-pass filtering and discarding every other sample, and passes these samples to the FFT routine 86, which performs an FFT on the decimated, filtered frame data and stores the resulting frequency domain frame data in a frequency domain buffer (XLAF) 96.
In the event a user wishes not to employ tail end processing (i.e., preferring instead to achieve the acoustic accuracy of full-sample-rate convolution, which requires greater processing power), the FFT module 88 may be operated at the full sample rate (i.e., the same sample rate as the input samples) to transform the frame data from the frame buffer (XLA) 78 at its original sample rate and thus provide full-sample-rate frequency domain data to the frequency domain buffer 96 (XLAF).
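By way of illustration only, the “Turbo” mode decimation path (band-limiting a frame to half the input bandwidth, keeping every other sample, and transforming to the frequency domain) and the alternative full-sample-rate path might look roughly as follows. The choice of anti-alias filter is an assumption.

```python
# Sketch of the "Turbo" mode path versus the full-sample-rate path.
import numpy as np
from scipy.signal import butter, lfilter

def turbo_transform(frame, fft_len):
    # Band-limit to half the input bandwidth (cutoff at half of Nyquist),
    # then keep every other sample (2:1 decimation) before the FFT.
    b, a = butter(8, 0.5)              # 8th-order Butterworth is an assumption
    filtered = lfilter(b, a, frame)
    decimated = filtered[::2]
    return np.fft.rfft(decimated, n=fft_len)   # half-rate frequency-domain frame

def full_rate_transform(frame, fft_len):
    # Full-sample-rate alternative when acoustic accuracy is preferred over
    # reduced processor load.
    return np.fft.rfft(frame, n=fft_len)
```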
Operation of the frame copy routines (B) and (A) 72 and 74, the tail maintenance routine 80, the FFT module 82, and the low-pass filter 90 is handled by a frame control process routine 98. The frame control process routine synchronizes the timing of the frames so that they work in phase together, assembling a frequency domain frame which is larger than the time domain frame size, such that an entire frequency domain frame is made up of multiple time domain frames. The frame control process also synchronizes the multiple sample rates and frame sizes of the XLA, XLB, XLAF, and XLBF buffers, as fed into the real time scheduling and CPU load balancing routines within the runtime sequencing, control and data manager 52.
A plurality of T output buffers OUT 1, OUT 2 . . . OUT T, identified with the reference numerals 112, 113, 114, are provided in the run-time memory 100. Each of the output buffers 112, 113 and 114 is sized to receive one frame of output audio samples at a time for outputting the respective T output sample streams. The output buffer pointers pIOBuf1, pIOBuf2 . . . pIOBufT for the user selected audio characteristic of each channel CH. 1, CH. 2 . . . CH. N of the input audio samples are time multiplexed by the channel sequencing module 118 to provide independent references to the process channel module 53, which synthesizes audio output streams in real time into the output buffers OUT 1, OUT 2 . . . OUT T, identified with the reference numerals 112, 113 and 114.
Multiple copies or multiple instances of the same audio processing system 48 can be used simultaneously or in time multiplex. The multiple instances allow for simultaneous processing of, for example, different musical instruments, so that the relative location of each instrument in an orchestra relative to a microphone can be simulated. Since such instruments are played simultaneously, multiple copies or instances of the audio processing system 48 are required in order to synthesize the effects in real time. As such, the channel sequencing module 118 must provide appropriate references for all of the copies or instances to the process channel module 53. Accordingly, an instance data buffer I, identified with the reference numeral 116, is provided in the runtime memory 100 for each instance of the audio processing system 48 being employed.
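By way of illustration only, the time-multiplexed channel and instance sequencing described above might be organized roughly as sketched below, where each instance (for example, one per instrument) owns its channels and the sequencer hands the process-channel routine one channel at a time. All structure and names here are hypothetical.

```python
# Rough sketch of time-multiplexed channel/instance sequencing.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Channel:
    input_frame: list                 # one frame of input audio samples
    coeff_index: int                  # which impulse-response coefficients to use
    out_buffer: list = field(default_factory=list)   # pIOBuf-style destination

@dataclass
class Instance:
    channels: List[Channel]           # e.g., one instance per instrument

def sequence(instances: List[Instance],
             process_channel: Callable[[Channel], None]) -> None:
    # Visit every channel of every instance once per frame period, handing the
    # process-channel routine an independent reference each time.
    for inst in instances:
        for ch in inst.channels:
            process_channel(ch)
```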
In order to provide a clear understanding of the audio processing involved in the present invention, a time-domain representation of an exemplary impulse response input signal is shown graphically in
There is a unique coefficient for the “a” portion and the “b” portion, HindexA and HindexB, respectively.
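By way of illustration only, splitting the impulse response into an early portion processed at the full sample rate and a later portion processed at the reduced rate, each with its own coefficient index, might be sketched as follows. The mapping of the letters to the portions follows the buffer description given later in the text, and the split point and simplified decimation are assumptions.

```python
# Sketch of partitioning an impulse response into an early full-rate portion
# and a later reduced-rate portion, each tracked by its own coefficient index.
import numpy as np

def split_impulse(impulse, early_len):
    impulse = np.asarray(impulse, dtype=float)
    early_b = impulse[:early_len]          # early portion, full sample rate
    # Later portion, decimated 2:1 (in practice low-pass filtered first, as
    # described elsewhere in the text).
    late_a = impulse[early_len:][::2]
    return early_b, late_a

# Per-frame coefficient indices advance independently through each portion.
HindexB = 0   # index into frequency-domain frames of the early ("b") portion
HindexA = 0   # index into frequency-domain frames of the late ("a") portion
```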
As shown in
The data structure 150 may include a plurality of exemplary data fields 152, 154, 156, 158, 160, 162, 164, 166, and 168, as shown. As shown in
The field 154 (
The field 156 is dynamically generated and contains an intermediate part of the product of a vector multiplication, by a vector multiplier 172, of the FIR coefficients pointed to by a Coefficient Index Sequencing Routine 170 and the frequency domain audio source input data XLBF, XLAF, illustrated in the box identified with the reference numeral 174, for the N channels. The buffer XLBF contains the full sample rate, early portion of the impulse response or FIR filter coefficients in the frequency domain output from (
Hlen represents in the time domain the equivalent of one frame of frequency domain data; halfHlen represents in the time domain the equivalent of one-half frame of frequency domain data.
The field 160 contains indices to past and present frames in the audio collection buffer for the B portion of the impulse response (acolindexprevB and acolindexB, respectively) and for the A portion of the impulse response (acolindexprevA and acolindexA, respectively). The field 162 contains the audio collection buffer (acol) corresponding to the processing which occurs at the full sample rate as indicated by the block 178a (
As shown in
As shown in (
After all half sample rate processing is offset according to appropriate phase by collection indices and overlap added into acolh, the 1:2 upsample block field 180 converts the half sample rate data into the full sample rate and accumulates the result into the audio collect full sample rate buffer field 162.
After all half sample rate tail extension processing is offset according to appropriate phase by collection indices and overlap added into acolDH, the tail extension 1:2 upsample block field 182 converts this tail extension half sample rate data into the full sample rate and accumulates the result into the tail extension audio collect delay full rate buffer, acolD, field 168.
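By way of illustration only, the 1:2 upsample-and-accumulate step described above, converting half-sample-rate overlap-added data back to the full rate and summing it into a full-rate collection buffer, might be sketched as follows. Linear interpolation is an assumption; the actual embodiment may use a different interpolator.

```python
# Sketch of the 1:2 upsample-and-accumulate step from a half-rate collection
# buffer segment into the full-rate audio collection buffer.
import numpy as np

def upsample_accumulate(half_rate_segment, full_rate_buffer, offset):
    seg = np.asarray(half_rate_segment, dtype=float)
    n = len(seg)
    # Interpolate the half-rate samples onto a grid twice as dense (1:2).
    full_rate = np.interp(np.arange(2 * n) / 2.0, np.arange(n), seg)
    # Accumulate (overlap-add) into the full-rate collection buffer.
    full_rate_buffer[offset:offset + 2 * n] += full_rate
    return full_rate_buffer
```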
Tail extension processing is optionally enabled by the user in order to model the very end portion of an impulse response, to mitigate the fact that convolution processing is very CPU-intensive. More particularly, rather than spend valuable computation time on portions of an impulse response that may be nearing the point of inaudibility or that are otherwise less significant than earlier portions of an impulse response, tail extension modeling employs an algorithmic model at a far lower computational load. For example, if an impulse response is 4 seconds in duration, the last second may be modeled, reserving premium convolution processing time for only the early part of the response.
The later portion of the convolution processing, for example the third second in our 4 second impulse example, may be copied into a buffer, acolDH, at the half sample rate, or acolD at the full sample rate. The tail extension model, similar to a conventional reverberation algorithm, is synchronized and applied to the late response. There are low pass filters for timbre matching, volume control for volume matching to the tail level of the actual impulse, feedback and overlap parameters, all of which facilitate a smooth transition from convolution processing to algorithm processing.
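By way of illustration only, one way such a tail model could be realized is a simple damped feedback recursion seeded from the late convolved samples, with a low-pass filter in the loop for timbre matching and a gain for level matching, as sketched below. This is an illustrative reverberation-style recursion, not the specific algorithm of the embodiment, and the parameter names are hypothetical.

```python
# Illustrative tail-extension model: a damped, comb-style feedback recursion
# seeded from the last convolved samples (e.g., the contents of acolD), with a
# one-pole low-pass in the loop (timbre matching) and an output gain (level
# matching). Assumes delay <= len(seed).
import numpy as np

def extend_tail(seed, n_out, delay, feedback=0.7, damping=0.3, level=1.0):
    seed = np.asarray(seed, dtype=float)
    buf = np.zeros(len(seed) + n_out)
    buf[:len(seed)] = seed                 # late convolved portion as the seed
    lp = 0.0
    for n in range(len(seed), len(buf)):
        lp = (1.0 - damping) * buf[n - delay] + damping * lp   # low-pass in loop
        buf[n] = feedback * lp
    # Tail to overlap-add onto the convolved response at the transition point.
    return level * buf[len(seed):]
```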
An important aspect of the invention relates to embedding and controlling convolution technology within a sampler or synthesizer engine, so that the convolution processing becomes part of the description of a virtual musical instrument. One example relates to modeling an acoustic piano. In that example, the resonant behavior of the piano soundboard is emulated. In this example, the parameters that control the impulse response of the piano soundboard may be saved into a file description which contains both the original samples of the individual notes on the piano and control parameters to dynamically scale the convolution parameters in real time, such that the behavior of the modeled soundboard matches that of an acoustic piano soundboard. So, in essence, the system embeds and controls convolution-related parameters within a synthesizer engine, thus embedding the convolution process inside the virtual musical instrument processing itself. Typically, a sampler or synthesizer engine includes an interpolator that provides pitch control, a low-frequency oscillator (LFO), and an envelope generator that provides dynamic control of amplitude over time. The audio processed by these stages is routed through a convolution process, with other aspects of the control and modeling of the sound coming from the synthesizer engine, which dynamically controls the convolution process. Examples of dynamically controlling the convolution process include controlling the pre- and post-convolution levels; damping audio energy within the convolution buffers, for example to simulate a damping of a piano soundboard as when the damper pedal is raised; changing the wet/dry mix; adding and subtracting various impulse responses representing various attributes of a sound; and changing the “perspective control”. With regard to “perspective control,” this changes the envelope of the impulse response in real time as a musical instrument is being played. By combining all of these processes, physical instruments can be modeled with far greater detail and accuracy than before.
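By way of illustration only, a synthesizer voice that routes its audio through a convolution stage whose parameters are dynamically controlled by the engine might be sketched as follows. The control names, scaling, and damping curve are hypothetical and correspond only loosely to the description above.

```python
# Rough sketch of a synthesizer voice whose output is routed through a
# convolution stage with dynamically controlled pre/post levels, wet/dry mix,
# and a damping control that shortens the simulated soundboard resonance.
import numpy as np
from scipy.signal import fftconvolve

class ConvolvedVoice:
    def __init__(self, note_sample, soundboard_ir):
        self.note = np.asarray(note_sample, dtype=float)
        self.ir = np.asarray(soundboard_ir, dtype=float)
        self.pre_level = 1.0     # pre-convolution level control
        self.post_level = 1.0    # post-convolution level control
        self.wet = 0.8           # wet/dry mix
        self.damping = 0.0       # larger values shorten the simulated resonance

    def render(self):
        dry = self.note
        # Damping shrinks the effective impulse-response envelope in real time.
        env = np.exp(-5.0 * self.damping * np.linspace(0.0, 1.0, len(self.ir)))
        wet = fftconvolve(self.pre_level * dry, self.ir * env)[:len(dry)]
        return self.post_level * ((1.0 - self.wet) * dry + self.wet * wet)
```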
Various file structures can be employed in which the impulse responses associated with the sound of a musical instrument, the control parameters associated with the impulse responses, the digital sound samples representing single or multiple notes of an instrument, control parameters for the synthesizer engine filters, LFO, envelope generators, interpolators, and sound generators are stored together into a file structure representation of a musical instrument. This file structure has single or multiple data fields representing each of these characteristics of the synthesized sound, which may be organized in a variety of ways using a variety of file data types. This musical instrument file structure may include the ambient environment, instrument body resonance, microphone type, microphone placement, or other audio character of the synthesized sound. An example file structure is as follows: Impulse Response 1 . . . Impulse Response(n), Impulse Response 1 impulse control 1 . . . Impulse Response 1 impulse control(m), Impulse Response(n) impulse control 1 . . . Impulse Response(n) impulse control(m), digital sound sample 1 . . . digital sound sample(p), sampler engine control parameter 1 . . . sampler engine control parameter (q), synthesizer engine control parameter 1 . . . synthesizer engine control parameter (r), pointer to other file 1, . . . pointer to other file(n). Together, these parameters represent the sound behavior of a musical instrument or sound texture generator, in which the impulse responses and their interactivity within the synthesizer engine via user performance data contribute to the sound produced by the instrument model.
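By way of illustration only, such a musical instrument file structure could be laid out roughly as follows. The field names and types are hypothetical and merely follow the example ordering given above.

```python
# Hypothetical layout for a musical-instrument file structure combining impulse
# responses, their control parameters, note samples, engine parameters, and
# pointers to other files.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ImpulseResponse:
    samples: List[float]                 # impulse response data
    controls: Dict[str, float]           # impulse control 1 .. impulse control (m)

@dataclass
class InstrumentFile:
    impulse_responses: List[ImpulseResponse]   # Impulse Response 1 .. (n)
    note_samples: List[List[float]]             # digital sound sample 1 .. (p)
    sampler_params: Dict[str, float]            # sampler engine parameter 1 .. (q)
    synth_params: Dict[str, float]              # synthesizer engine parameter 1 .. (r)
    other_files: List[str] = field(default_factory=list)   # pointers to other files
```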
An exemplary channel data structure is illustrated below. The Channel Sequencing Routine 118 (
[Data missing or illegible when filed.]
The foregoing description is for the purpose of teaching those skilled in the art the best mode of carrying out the invention and is to be construed as illustrative only. Numerous modifications and alternative embodiments of the invention will be apparent to those skilled in the art in view of this description, and the details of the disclosed structure may be varied substantially without departing from the spirit of the invention. Accordingly, the exclusive use of all modifications within the scope of the appended claims is reserved.
This application claims the benefit of U.S. provisional patent application Nos. 60/510,068 and 60/510,019, both filed on Oct. 9, 2003.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/US04/33290 | 10/8/2004 | WO | 00 | 3/5/2007
Number | Date | Country
---|---|---
60/510,019 | Oct 2003 | US
60/510,068 | Oct 2003 | US