All material in this document, including the figures, is subject to copyright protections under the laws of the United States and other countries. The owner has no objection to reproduction of this document or its disclosure as it appears in official governmental records. All other rights are reserved.
The technical fields are audio-visual technology, computer technology, and measurement.
Performed music typically consists of notes played from a scale, such as an equal-tempered 12-tone scale. Different music notes, with their overtones, appear with different intensities and durations during the course of the performance. These tones generally span several octaves. In harmonic and polyphonic music, a number of tones may be dominant in intensity (loudness) at one time. Music is usually digitized as a time series at a fixed sample rate, such as the CD standard of 44.1 kHz. It is desirable to observe music data in the frequency domain, quantitatively and accurately, through spectral analysis.
Spectral analysis of sound, including music, is typically done with a Discrete Fourier Transform (DFT) on the digitized signal. The aperture for DFT analysis is a block of time-series data of fixed sample size. For real input data, the DFT spectral output is half that sample size in complex numbers, representing the spectral content of the time-series data. For computational efficiency, a Fast Fourier Transform (FFT), an efficient method for some DFT computations, is usually employed. This is a well-known procedure.
The DFT/FFT approach to analyzing music for its spectral content has some disadvantages:
In a DFT, the resulting spectral components are linearly distributed into frequency bins, determined by sampling rate and sample size. To illustrate, a sample of 2,048 time series data taken at a sampling rate of 44.1 kHz are Fourier Transformed into 1,024 spectral bins equally spaced at 21.53 Hz apart. They are fixed at 0.00, 21.53, 43.07, 64.60, . . . , 22,028.47 Hz. In music, fundamental and overtones are not linearly, but rather logarithmically spaced. For example, in a 440 equal-tempered scale, starting with low E to two octaves above middle C, the tones are 82.41, 87.3, 92.5, . . . , 987.8, 1046.5 Hz. (See
In summary, using FFTs to analyze music suffers from poor frequency resolution for low tones. Spectral components cannot be aligned with music tones, making spectral analysis necessarily imprecise. Restricting frame size to powers-of-two samples in FFTs places further constraints. FFTs are susceptible to sizeable distortion due to glitches and the Gibbs phenomenon.
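The resolution mismatch summarized above can be checked with a few lines of arithmetic. The following is a minimal sketch using the 2,048-sample frame and 44.1 kHz rate quoted in the text; the function names are illustrative and not part of the described method.

```python
def fft_bin_spacing(sample_rate_hz, frame_size):
    """Spacing in Hz between adjacent DFT bins."""
    return sample_rate_hz / frame_size


def semitone_gap(pitch_hz):
    """Gap in Hz from a pitch to the next equal-tempered semitone above it."""
    return pitch_hz * (2 ** (1 / 12) - 1)


spacing = fft_bin_spacing(44_100, 2_048)   # linear bin spacing, same everywhere
low_gap = semitone_gap(82.41)              # low E up to F
high_gap = semitone_gap(987.8)             # B up to the C two octaves above middle C

print(f"DFT bin spacing:        {spacing:.2f} Hz")
print(f"Semitone gap at 82.41:  {low_gap:.2f} Hz")
print(f"Semitone gap at 987.8:  {high_gap:.2f} Hz")
```

At the low end several adjacent tones fall inside a single DFT bin, while at the high end each semitone spans more than two bins.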
This invention, which I will call Regression Spectral Analysis (RSA), is better suited to analyzing music than DFTs. RSA eschews the Fourier Transform in the spectral analysis of music. Instead, it uses regression techniques from statistics to find a least-squares best fit: a mathematical projection of a music vector onto a set of vectors for a predefined set of tones. Analysis produces a “best” estimate of the magnitude and phase of the individual music tones present. The number of tones in a typical music scale is limited. A piano has eighty-some notes. A chorus of mixed singers covers half that range. Instead of thousands of badly placed frequency bins in FFT, RSA frequency bins are the nominal music tones themselves, and are therefore far fewer. Less computation is required and more precision results. Glitches are effectively averaged out by the “best-fit” process, causing minimal distortion to the result. There is no distortion at spectrum frame boundaries due to the Gibbs phenomenon, so no extraneous “windowing” of music data is necessary. In RSA, data frames are not limited to powers-of-two samples, and can be chosen to trade off optimally between low-note coverage and analysis agility.
On the right is the operation process flow of RSA. This can be done in real time for driving visual display or in stop-frame mode for music evaluation and editing. It segments the long audio stream into Audio Frames, which are represented as vectors whose number of dimensions equals the number of samples in the Audio Frame, and whose components are discrete amplitude values. Each Audio Frame vector is multiplied by the WVM from calibration to form the Keyboard Transform KBT. The KBT is not the final result in RSA as its basis vectors are not orthogonal. The final analysis result is the complex spectral vector CSV. Standard rectangular-to-polar conversion produces real vectors Magnitude Spectral Vector MSV and Phase Spectral Vector PSV.
The following describes preferred embodiments. However, the invention is not limited to those embodiments. The description that follows is for purpose of illustration and not limitation. Other systems, methods, features and advantages will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the inventive subject matter, and be protected by the accompanying claims.
A specific embodiment and example application illustrate the RSA process well. By way of non-limiting example, let us examine a coverage range that spans 45 tones from a low F (87.307 Hz) to a high C-sharp (1108.731 Hz) on a 12-tone equal-tempered scale. Source data is from a digital audio music stream in CD format. The stream is segmented into consecutive audio frames of 2,938 samples (about 66.62 ms each) for analysis. Results are reported 15 times a second, or every 2,940 samples (66.67 ms), after each frame, in the form of the magnitude and phase of each tone detected within that frame. These sample numbers are purposely chosen to illustrate that a gap of two samples between frames causes no observable disturbance in the analysis. A few of the many possible illustrative examples are explored, showing how the data can be used to monitor, archive, characterize, evaluate, and edit the audio. Other examples show how the analysis can be used in real time to drive tone-based visual display of the music or electronic instrument accessories. It should be noted that RSA is scale, range, and frame size agnostic. Other embodiments of the invention with different ranges, frame sizes, and arbitrary scales are accommodated by RSA without deviation from the basic approach. RSA can also accommodate overlapping as well as non-contiguous frames, or losses or breaks in stream data, with no ill effect.
There are two distinct parts in the process of real-time regression spectral analysis (RSA) for music:
1. Instrument calibration; and
2. Analysis operation.
Performing a new calibration is necessary only when analyzing new music tuned to a different scale. The left side of
First a scale, described by a fixed range of discrete frequencies, must be selected. This scale can contain any finite range of or collection of nominal frequencies or pitches. The pitches need not be “evenly” or “regularly” spaced, need not contain octaves, etc. The number of pitches is limited solely by computing power and computational precision. The upper and lower bounds are limited only by the quality of the sample data to be used in the analysis phase. The proximity of adjacent tones is limited by potential singularity in the matrix inverse operation.
For the purposes of illustration, let us use a common 12-tone equal-tempered scale of 45 tones with a reference pitch of 440 Hz (commonly referred to by musicians as “A4”, or the “A above middle C”). Constructing a 12-tone equal-tempered scale of 45 tones starts with that reference pitch. All other tone-pitches are referenced to it by the fixed ratio of r, the twelfth-root of 2 between adjacent tones:
$$p_n = p_{\mathrm{ref}} \cdot r^{\,n-29}$$
where:
pref is the reference pitch in Hz (e.g., 440)
A 45 tone scale where pref is 440 Hz, and n is in the range [1, 45], would be:
To re-tune, to Baroque 415 for example, the reference pitch would be changed to 415, and the values recalculated. Again, RSA is scale agnostic. Other scales use other algorithms to assign tone pitches. Even arbitrary values may be used.
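The scale construction can be sketched in a few lines. This is a minimal illustration of the formula with r as the twelfth root of 2 and the reference tone at index 29; the function and parameter names are illustrative only.

```python
def equal_tempered_scale(ref_hz=440.0, tones=45, ref_index=29):
    """Return pitches p_1..p_tones with p_n = ref_hz * r**(n - ref_index)."""
    r = 2 ** (1 / 12)  # fixed ratio between adjacent tones
    return [ref_hz * r ** (n - ref_index) for n in range(1, tones + 1)]


scale_440 = equal_tempered_scale(440.0)
scale_415 = equal_tempered_scale(415.0)  # Baroque tuning: same formula, new reference

print(f"p1  = {scale_440[0]:.3f} Hz")    # low F
print(f"p29 = {scale_440[28]:.3f} Hz")   # reference A
print(f"p45 = {scale_440[44]:.3f} Hz")   # high C-sharp
print(f"p29 at Baroque pitch = {scale_415[28]:.3f} Hz")
```

Re-tuning only changes `ref_hz`; a scale that is not equal-tempered would replace the list comprehension with its own rule or with a literal list of pitches.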
Let P be the set of tone pitches in the scale, from p1 to pm, where m is the number of tones. In our example, m is 45, p1 is a low F, and pm (or p45) is a high C-sharp.
Let S be the number of samples in the audio frame, and let Fs be the sample frequency in Hz. In our example, S is 2,938, and Fs is 44.1 kHz or 44,100.
Now, for each pn in the set of tone pitches p1 through pm, construct two Wave Vectors, each of length S, as follows:
For vector index i in [0, S−1]:
Or, in our example:
For vector index i in [0, 2937]:
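A minimal sketch of the Wave Vector construction for a single tone follows. It assumes the conventional sampled forms cos(2π·pn·i/Fs) and sin(2π·pn·i/Fs); that exact form, its sign, and its scaling are assumptions made for illustration, and `wave_vectors` is a hypothetical name.

```python
import math


def wave_vectors(pitch_hz, frame_size, sample_rate_hz):
    """Return (cosine_vector, sine_vector) for one tone, each of length frame_size."""
    step = 2 * math.pi * pitch_hz / sample_rate_hz  # phase advance per sample, radians
    cos_vec = [math.cos(step * i) for i in range(frame_size)]
    sin_vec = [math.sin(step * i) for i in range(frame_size)]
    return cos_vec, sin_vec


# Example values from the text: S = 2,938 samples at Fs = 44,100 Hz, tone A at 440 Hz.
cos_a, sin_a = wave_vectors(440.0, 2_938, 44_100)
print(len(cos_a), len(sin_a))
```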
Form a Wave-Matrix WVM with the Wave Vectors by “stacking” first the Cosine vectors, then the Sine vectors. The first m rows are the Cosine vectors in ascending pitches, and the last m rows are the Sine vectors in the same order. The matrix then has 2m rows and S columns:
In our example:
Create a Cross-Wave Product Matrix XWP by multiplying the Wave-Matrix WVM by its own transpose WVMᵀ. The XWP matrix is square with 2m rows and 2m columns.
$$\mathrm{XWP} = \mathrm{WVM} \cdot \mathrm{WVM}^{T}$$
Invert the XWP matrix to create the inverse XWP⁻¹. Accurately inverting a matrix this large or larger usually requires high-precision computation tools of the kind available to scientists. Persons of ordinary skill in the art will appreciate that matrix inversion is performed “off-line” only once per calibration in RSA and is not performed in the analysis operation. Time requirements aside, a very large matrix inverse is difficult to compute with sufficient precision for satisfactory results.
Identifying and quantifying a range of tones (e.g., a music scale), computing the Wave Matrix WVM, and computing its Inverse Cross-Wave Matrix XWP⁻¹ completes the calibration process for RSA.
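The calibration steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a definitive implementation: the wave vectors are assumed to be plain sampled cosines and sines, `calibrate` is a hypothetical name, and ordinary double precision is used. The printed condition number gives a rough indication of how much precision the inversion demands; where low tones are closely spaced relative to the frame length it can be large, and double precision may then be inadequate.

```python
import numpy as np


def calibrate(pitches_hz, frame_size, sample_rate_hz):
    """Return (WVM, XWP, XWP_inv) for the given tone pitches."""
    i = np.arange(frame_size)
    phase = 2 * np.pi * np.outer(pitches_hz, i) / sample_rate_hz
    wvm = np.vstack([np.cos(phase), np.sin(phase)])  # 2m rows, S columns
    xwp = wvm @ wvm.T                                # 2m x 2m cross-wave products
    xwp_inv = np.linalg.inv(xwp)                     # done once per calibration
    return wvm, xwp, xwp_inv


# Example scale from the text: 45 equal-tempered tones, reference 440 Hz at n = 29.
pitches = 440.0 * 2 ** ((np.arange(1, 46) - 29) / 12)
wvm, xwp, xwp_inv = calibrate(pitches, 2_938, 44_100)
print(wvm.shape, xwp.shape)
print(f"condition number of XWP: {np.linalg.cond(xwp):.3g}")
```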
Music in digital format, whether it is digitized from a live performance or a playback from a recording, consists of long streams of data, with one stream per channel. The right side of
In our example, the long stream of data is segmented into frames of 2,938 samples, giving an analysis aperture of 66.62 ms. For a standard sampling rate of 44.1 kHz, 15 frames are analyzed every second. Frame size must be large enough to accurately discern low tones and small enough not to confound fast moving music. In RSA, frame size is not confined to powers-of-two samples. The frames are sequential, but need not be exactly contiguous. A small gap between frames, e.g. two-sample in the example, has little perturbing effect on the spectrum as long as it is known and accounted for in timing calculations.
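Frame segmentation with a known gap can be sketched as below. The frame size of 2,938 samples and the frame period of 2,940 samples come from the example; the generator name and the placeholder data are illustrative.

```python
def audio_frames(samples, frame_size=2_938, hop=2_940):
    """Yield (start_index, frame) pairs; hop - frame_size samples are skipped between frames."""
    start = 0
    while start + frame_size <= len(samples):
        yield start, samples[start:start + frame_size]
        start += hop


one_second = [0.0] * 44_100                # placeholder for one second of audio at 44.1 kHz
frames = list(audio_frames(one_second))
print(len(frames))                          # 15
print(frames[1][0])                         # start index of the second frame: 2940
```

Keeping the start index with each frame is what allows the gap to be accounted for in timing calculations; overlapping frames would simply use a hop smaller than the frame size.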
By way of continuing our example, to perform the analysis phase, multiply each frame of 2,938 samples, now called the Audio Frame, by the set of vectors in the Wave Matrix WVM. In precise mathematical terms, perform a matrix multiplication of the (90×2,938) matrix WVM and the (2,938×1) Audio Frame Vector. The result is a (90×1) vector designated as the Keyboard Transform Vector KBT. The complex KBT is analogous to, but distinctly different from, the Discrete Fourier Transform (DFT) of the Audio Frame vector. In the DFT, the basis vectors are mutually orthogonal. In the KBT, they are not. Even a pure tone may spill into several bins of the KBT. While imprecise, the KBT is a strong indicator of where the significant tones are. The KBT is an intermediate result and not the final product of RSA. It needs to be “cleaned up”.
To perform such a “clean up”, produce a (2m×1) Complex Spectral Vector CSV by multiplying the matrix XWP⁻¹ by the vector KBT. Because the wave vectors are not orthogonal, the tonal bins of the KBT contain contributions that are not caused by spectral components of the Audio Frame at those tones. Multiplication by XWP⁻¹ minimizes this artifact in a “best fit” manner. The CSV is essentially a vector of m complex numbers. It contains quantitative information on both magnitude and phase (in rectangular form) of the detected tones in the frame. The CSV, in the polar form of magnitude and phase, is the desired end-product of RSA.
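The two analysis multiplications can be sketched on a synthetic frame. The sketch assumes sampled cosine and sine wave vectors and double-precision inversion; the test signal (tone 29 at amplitude 1.0 plus tone 33 at amplitude 0.5) is invented for illustration, and array indices are zero-based, so tone n sits at index n−1.

```python
import numpy as np

FS, S, M = 44_100, 2_938, 45
pitches = 440.0 * 2 ** ((np.arange(1, M + 1) - 29) / 12)
t = np.arange(S) / FS

# Calibration products (assumed cosine/sine wave vectors).
phase = 2 * np.pi * np.outer(pitches, t)
wvm = np.vstack([np.cos(phase), np.sin(phase)])
xwp_inv = np.linalg.inv(wvm @ wvm.T)

# Synthetic Audio Frame: tone 29 (A, 440 Hz) plus a weaker tone 33 (C-sharp).
frame = 1.0 * np.cos(phase[28]) + 0.5 * np.sin(phase[32])

kbt = wvm @ frame      # Keyboard Transform: spills into neighbouring tones
csv = xwp_inv @ kbt    # Complex Spectral Vector: spill cleaned up

kbt_mag = np.hypot(kbt[:M], kbt[M:]) / (S / 2)   # scaled only to compare with CSV
csv_mag = np.hypot(csv[:M], csv[M:])
for n in (28, 29, 30, 33):
    print(f"tone {n}: |KBT| ~ {kbt_mag[n - 1]:.3f}, |CSV| = {csv_mag[n - 1]:.3f}")
```

The neighbours of tone 29 show a clearly non-zero KBT magnitude although the frame contains no such tones, while the CSV assigns them essentially nothing.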
To convert from rectangular-form to the more useful polar form of magnitude and phase for the m tones in the scale, index n from 1 to m, perform the standard transformation:
Atan2[y, x] will be apparent to those skilled in the art to mean a four-quadrant arctangent function in radians with the respective rectangular coordinate arguments. Phase angles are expressed in units of cycles through division by 2π. The above will result in a Magnitude Spectral Vector MSV and a Phase Spectral Vector PSV.
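A minimal sketch of the rectangular-to-polar step follows. It assumes the CSV holds the m cosine coefficients followed by the m sine coefficients, matching the row order of WVM. The choice of which coefficient is passed as y, and its sign, are assumptions here: the sign is chosen so that the phase is the lead of the tone over the cosine wave vector.

```python
import math


def to_polar(csv, m):
    """Split a 2m-element CSV into MSV (magnitudes) and PSV (phases in cycles)."""
    msv, psv = [], []
    for n in range(m):
        x = csv[n]        # cosine coefficient of tone n+1
        y = csv[n + m]    # sine coefficient of tone n+1
        msv.append(math.hypot(x, y))
        # Assumed sign convention: a*cos(wt) + b*sin(wt) = M*cos(wt + 2*pi*phase).
        psv.append(math.atan2(-y, x) / (2 * math.pi))
    return msv, psv


# Two-tone toy CSV: tone 1 has (cos, sin) = (3, 4); tone 2 has (0, -1).
msv, psv = to_polar([3.0, 0.0, 4.0, -1.0], m=2)
for n, (mag, ph) in enumerate(zip(msv, psv), start=1):
    print(f"tone {n}: magnitude {mag:.4f}, phase {ph:+.4f} cycles")
```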
In our example, for each n from 1 to 45:
In
Pitch deviation can be obtained from the phase spectral vector PSV phases in two consecutive frames. This allows actual tone pitches contained in the Audio Frame to deviate from the nominal, and the deviation can be calculated for any tone, particularly those tones which are prominent. Small tones at the background noise level will not produce meaningful results.
The procedure for determining frequency deviation for a specific tone is best illustrated by an example. A “trombone” note C-sharp was synthesized and analyzed by RSA with a frame size s of 2,205. The MSV magnitudes are shown in
More precisely stated, the phase deviation ΔΦ for this example is ΔΦ = [−0.11344 − 0.04277 + Q] − [155.563 × (1/20)] = [−0.15621 + Q] − 7.7780. Q is a whole number which should be chosen to minimize |ΔΦ|, that is, to bring it nearest zero. For example, for Q = 8, ΔΦ = 0.06579, which is the smallest in absolute value. (Q = 9 would give 1.06579 and Q = 7 would give −0.93421, both of which are larger in absolute value. Other integers would result in values even further from zero.) The pitch deviation Δp would then be ΔΦ/(1/20) ≈ +1.31 Hz. Generally:
$$\Delta\Phi = \Phi_n(c) - \Phi_n(c-1) + Q - p_n \cdot T, \qquad \Delta p = \frac{\Delta\Phi}{T}$$
where c is the current audio frame, c−1 is the previous audio frame, each Φ is a value from the PSV expressed in cycles, and pn is the nominal pitch in Hz of the prominent tone n in question. The factor T is the time from the start of one frame to the start of the next, including any gaps or overlaps.
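The deviation calculation can be sketched as follows, using the relation read from the worked example: take Φn(c) − Φn(c−1) − pn·T, add the whole number Q that brings the result nearest zero, and divide by T. The function name is illustrative, and the two phase values are those quoted in the example.

```python
def pitch_deviation(phi_now, phi_prev, nominal_hz, frame_period_s):
    """Return (delta_phi_cycles, delta_p_hz) for one prominent tone.

    phi_now and phi_prev are PSV phases in cycles from consecutive frames.
    """
    raw = (phi_now - phi_prev) - nominal_hz * frame_period_s
    q = -round(raw)                  # whole number that brings raw + q nearest zero
    delta_phi = raw + q
    return delta_phi, delta_phi / frame_period_s


# Frames of 2,205 samples at 44.1 kHz give T = 1/20 s.
d_phi, d_p = pitch_deviation(-0.11344, 0.04277, 155.563, 1 / 20)
print(f"phase deviation {d_phi:+.5f} cycles, pitch deviation {d_p:+.2f} Hz")
```

Small differences in the last digits relative to the hand calculation come from rounding of the intermediate product pn·T.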
Frequency deviation calculation may continue for any prominent tones. If the frequency deviation is found to be fluctuating at a few hertz rate, then it is vibrato. The extent and rate characterize this vibrato. If the deviation is constant and does not vary with time, then it is due to de-tuning. It can be both, vibrato and detuning, if the deviation fluctuates about an offset.
Another method of illustrating frequency deviation, favored by instrument tuners, is to observe a spinning inhomogeneous disc. The direction of spin signifies sharp or flat, and the rate of spin signifies the amount of detuning, with a frozen disc signifying in-tune. This can be accomplished with PSV data Φ, for any prominent tone n:
$$\theta_n(c) = \theta_n(c-1) + \Phi_n(c) - \Phi_n(c-1) - p_n \cdot T$$
where θn(c) is the current disc angle and θn(c−1) is the disc angle in the previous frame. The range for θn is [0, 1) as it spins, ignoring all whole revolutions. Φn(c) and Φn(c−1) are the PSV values for the current frame and previous frame respectively. T is the time from the start of one frame to the start of the next, including any gap or overlap.
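The disc update is a one-line recurrence. A minimal sketch, with angles in revolutions and an illustrative function name, reusing the phase values from the trombone example:

```python
def update_disc_angle(theta_prev, phi_now, phi_prev, nominal_hz, frame_period_s):
    """Advance the tuning-disc angle by one frame; all angles are in revolutions."""
    theta = theta_prev + phi_now - phi_prev - nominal_hz * frame_period_s
    return theta % 1.0   # keep only the fractional revolution


theta = 0.0
theta = update_disc_angle(theta, -0.11344, 0.04277, 155.563, 1 / 20)
print(f"disc angle after one frame: {theta:.5f} revolutions")
```

Calling the function once per frame makes the disc turn steadily in one direction for a sharp tone and in the other for a flat tone; an in-tune tone leaves the angle unchanged.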
The following are but a few of the nearly limitless uses of RSA. RSA now makes forms of editing accessible that were previously very difficult, if not impossible. By using magnitude and phase data provided by MSV and PSV, individual tone magnitudes can be modified to create different tone qualities without otherwise changing the music. For example, to remove one offending tone, one would add to the music vector a tone of the same frequency and magnitude but opposite in phase as expressed by MSV and PSV. This can be done even in the presence of other notes. The same can be done to overtones of the offending note.
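Tone removal can be sketched on a small synthetic frame. The three-tone scale, the amplitudes, and the choice of the middle tone as the offending one are all invented for illustration; the wave vectors are again assumed to be sampled cosines and sines.

```python
import numpy as np

FS, S = 44_100, 2_938
pitches = np.array([440.0, 554.365, 659.255])   # illustrative three-tone scale
m = len(pitches)
t = np.arange(S) / FS
phase = 2 * np.pi * np.outer(pitches, t)
wvm = np.vstack([np.cos(phase), np.sin(phase)])

# Frame containing all three tones; the middle one is the offending tone.
frame = 0.8 * np.cos(phase[0]) + 0.3 * np.sin(phase[1]) + 0.5 * np.cos(phase[2] + 1.0)

csv = np.linalg.inv(wvm @ wvm.T) @ (wvm @ frame)

k = 1                                            # zero-based index of the tone to remove
offending = csv[k] * wvm[k] + csv[k + m] * wvm[k + m]
edited = frame - offending                       # same as adding the tone in opposite phase

expected = 0.8 * np.cos(phase[0]) + 0.5 * np.cos(phase[2] + 1.0)
print(f"largest residual after removal: {np.max(np.abs(edited - expected)):.2e}")
```

The other two tones are left in place because only the fitted component of the selected tone is subtracted.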
Why does a particular violin, or voice, or organ pipe sound better than another? RSA can be a tool for technical analysis by experts through observing the relative magnitudes, perhaps even phases, of overtones for the same notes played or sung.
A spinning wheel visual display may depict pitch deviation, with direction and rotation rate indicative of polarity and extent of the deviation. Application to tuning musical instruments is obvious.
Visual Display of music can be controlled by individual tones with data from MSV. Different colors may illuminate whenever specific chords are detected. The possibilities are endless, limited only by the artistry of the display programmer. Tones identified can be used to electronically activate audio accompaniment accessories in near real time. One important difference from previous visual display or audio accompaniment techniques is that they are music content-activated in real time, providing automatic synchronization without detailed prior knowledge of the music through a score, and without beat-by-beat human intervention.
The analysis process shown in
There is an alternative method to use Regression Spectral Analysis (RSA) on a selected number of prominent tones determined by the |KBT|.
However, RSA can be applied only to the most prominent tones indicated by |KBT|². Doing so validates the truly prominent tones and eliminates tones that only appear to be prominent, reducing computation without sacrificing accuracy. The assumption, shown to be valid, is that truly prominent tones will appear prominent in |KBT|², but not every tone that is prominent in the KBT is truly prominent.
Identify a set of tones P. Let S be the number of samples in the audio frame, and let Fs be the sample frequency in Hz. In our example, P is a 12-tone equal-tempered scale of 45 tones that includes a reference pitch, such as the common 440 Hz for A; S is 2,938; and Fs is 44.1 kHz or 44,100.
For each pi in the set of tone pitches P, construct two Wave Vectors, each the same length as the sample size S, as follows:
For vector index n in [0, 2937]:
Form a Wave-Matrix WVM with the Wave Vectors by “stacking” first the Cosine vectors, then the Sine vectors. In our example, the first 45 rows are the Cosine vectors in ascending pitches, and the last 45 rows are the Sine vectors in the same order. The matrix then has 90 rows and 2,938 columns. The order in which the vectors are placed is immaterial as long as it is consistent, and uniquely represents the tones in the scale.
Create a Cross-Wave Product Matrix XWP by multiplying the Wave-Matrix WVM by its own transpose WVMᵀ. The XWP matrix is square with 90 rows and 90 columns. Thus far, the operations of RSA and SRSA are identical. However, SRSA eliminates the computationally expensive step of calculating XWP⁻¹.
Identifying and quantifying a range of tones (music scale), computing the Wave Matrix WVM, and the Cross Wave Matrix XWP completes the calibration process of SRSA.
The right side of
The beginning operations of RSA and SRSA are the same. The long stream of data is segmented into frames of 2,938 samples, giving an analysis aperture of 66.62 milliseconds (ms). For a standard sampling rate of 44.1 kHz, 15 frames are analyzed every second, one every 2,940 samples. Multiply each frame of 2,938 samples, now called the Audio Frame, by the set of vectors in the Wave Matrix WVM. In precise mathematical terms, perform a matrix multiplication of the (90×2,938) Wave Matrix by the (2,938×1) Audio Frame Vector. The result is a (90×1) vector designated as the Keyboard Transform KBT.
The following operations of SRSA differ from those of RSA. Produce an (m×1) squared-magnitude vector |KBT|². Index n from 1 to m as follows:
$$|\mathrm{KBT}(n)|^2 = \mathrm{KBT}^2(n) + \mathrm{KBT}^2(n+m)$$
In our example:
$$|\mathrm{KBT}(n)|^2 = \mathrm{KBT}^2(n) + \mathrm{KBT}^2(n+45)$$
Rank these squared magnitudes and note the respective index n for each magnitude squared. Choose the largest six and note their indices.
Create a (d×1) decimated-KBT vector by selecting, for each of the chosen tones, its cosine entry (index n) and its sine entry (index n+m). In our example, the six chosen tones give d = 12.
Create a (d×d) (e.g., (12×12)) decimated-XWP by selecting only rows and columns of XWP with the same indices.
Invert the decimated-XWP to get a (d×d) decimated-XWP⁻¹.
Multiply the decimated-XWP⁻¹ by the decimated-KBT to get a (d×1) (e.g., (12×1)) decimated-CSV vector.
Embed the decimated-CSV vector in zeros to form a full (2m×1) (e.g., (90×1)) CSV vector, placing the decimated-CSV elements in their original indices.
To convert from rectangular form to the polar form of magnitude and phase for the six selected tones, whose indices n lie among 1 to 45 (the m tones in the range):
Atan2[y, x] means a four-quadrant arctangent function in radians. Phase angles are expressed in units of cycles through division by 2π. The above will result in a Magnitude Spectral Vector MSV and a Phase Spectral Vector PSV for SRSA.
The CSV vector and its polar equivalent MSV and PSV found by SRSA should differ little from that found by the more comprehensive RSA provided that the actual prominent tones are among those selected for analysis by SRSA.
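The SRSA steps can be sketched end to end on a synthetic frame. As before, the cosine/sine wave-vector form is an assumption, the test signal is invented, and the variable names are illustrative. Six tones are kept, so the decimated system has d = 12 entries.

```python
import numpy as np

FS, S, M, TONES_KEPT = 44_100, 2_938, 45, 6
pitches = 440.0 * 2 ** ((np.arange(1, M + 1) - 29) / 12)
t = np.arange(S) / FS
phase = 2 * np.pi * np.outer(pitches, t)
wvm = np.vstack([np.cos(phase), np.sin(phase)])   # assumed wave-vector form
xwp = wvm @ wvm.T                                 # calibration ends here: no 90x90 inverse

frame = 1.0 * np.cos(phase[28]) + 0.5 * np.sin(phase[32])   # tones 29 and 33
kbt = wvm @ frame

power = kbt[:M] ** 2 + kbt[M:] ** 2               # |KBT(n)|^2 for each tone
top = np.sort(np.argsort(power)[-TONES_KEPT:])    # zero-based indices of strongest tones
idx = np.concatenate([top, top + M])              # cosine and sine entries: d = 12

dec_xwp = xwp[np.ix_(idx, idx)]                   # (d x d) decimated XWP
dec_csv = np.linalg.inv(dec_xwp) @ kbt[idx]       # (d x 1) decimated CSV

csv = np.zeros(2 * M)
csv[idx] = dec_csv                                # embed in zeros at original indices

msv = np.hypot(csv[:M], csv[M:])
print("selected tones:", (top + 1).tolist())
print("magnitudes:", np.round(msv[top], 4).tolist())
```

Tones that merely looked prominent in |KBT|² because of spill from tone 29 come out with near-zero magnitude, while the two tones actually present keep their amplitudes.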
It is not possible to analyze all sound as music. Percussion, for example, cannot easily be separated into distinct tones. In the embodiments, adjacent tones are separated by 100 cents, a frequency ratio of about 6%. A tone that is off-key by 50 cents may be considered either 50 cents higher than the lower nominal tone or 50 cents lower than the higher nominal tone. Therefore it is theoretically impossible to analyze it unambiguously. Even before a tone becomes that far off-key, the MSV will show spurious values for supposedly vacant tones. For well-tuned instrumental music and disciplined vocal music, the tones are usually not that far off-key. There is always the option of tuning the apparatus to suit the music by adjusting the reference frequency (e.g., from 440 Hz) to something more appropriate. Should the music be undisciplined a cappella (unaccompanied) singing in which the pitch degenerates very rapidly, it is an artistic judgment call when to retune. The inventor has no suggestion. In some natural music scales, there may be many more than 12 notes in an octave. A D-sharp may be distinct from an E-flat although the two may be very close. It is not recommended that they both be entered as nominal frequencies. Rather, a mean tone should be used as the nominal frequency and the pitch “deviation” techniques be used for close-in analysis.
The invention pertains to analysis of digital audio signals and any industry where that may be of value or importance.