Technology exists to analyze the beat-related characteristics of an audio item. However, analyzing the characteristics of audio information may be a computationally intensive operation, and existing technology may not enable this task to be performed in a suitably efficient manner. This potential deficiency, in turn, may restrict the uses to which this technology may be applied.
A beat analysis module is described for determining beat information associated with an audio item. The beat analysis module uses a statistical modeling approach (such as an Expectation-Maximization approach) to determine an average beat period. In one illustrative implementation, the modeling approach performs correlation over diverse representations of the audio item. Next, the beat analysis module uses the average beat period to determine beat onset information associated with the commencement of the beats in the audio item. The beat onset information identifies the average onset of beats in the audio item and the actual onset for each individual beat.
Various applications can make use of the analysis performed by the beat analysis module. According to one illustrative aspect, the beat analysis module is configured to determine the beat information in a relatively short period of time. As such, the beat analysis module can perform its analysis together with another application task without disrupting the real time performance of that application task.
For example, in one illustrative application, the beat analysis module can be used to analyze beat information in the context of operations performed by a game module. In this approach, a user may select one or more audio items to be used in the course of a game. The beat analysis module can analyze the beat information and apply the beat information in the course of the game without disrupting the real time performance of the game.
According to one illustrative aspect, an application (such as a game module application) allows the user to select his or her own audio items to be used with the application. In other words, the providers of the application do not dictate a collection of audio items to be used with the application.
The above approach can be manifested in various types of systems, components, methods, computer readable media, data structures, and so on.
This Summary is provided to introduce a selection of concepts in a simplified form; these concepts are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The same numbers are used throughout the disclosure and figures to reference like components and features. Series 100 numbers refer to features originally found in FIG. 1, series 200 numbers refer to features originally found in FIG. 2, series 300 numbers refer to features originally found in FIG. 3, and so on.
This disclosure sets forth an approach for analyzing an audio item to determine beat information. The disclosure also sets forth various applications of the approach.
The disclosure is organized as follows. Section A describes an illustrative beat analysis module for determining beat information from an audio item. Section B describes various applications of the beat analysis module of Section A. Section C describes illustrative processing functionality that can be used to implement any aspect of the features described in Sections A and B.
As a preliminary matter, some of the figures describe concepts in the context of one or more structural components, variously referred to as functionality, modules, features, elements, etc. The various components shown in the figures can be implemented in any manner, for example, by software, hardware (e.g., discrete logic components, etc.), firmware, and so on, or any combination of these implementations. In one case, the illustrated separation of various components in the figures into distinct units may reflect the use of corresponding distinct components in an actual implementation. Alternatively, or in addition, any single component illustrated in the figures may be implemented by plural actual components. Alternatively, or in addition, the depiction of any two or more separate components in the figures may reflect different functions performed by a single actual component.
Other figures describe the concepts in flowchart form. In this form, certain operations are described as constituting distinct blocks performed in a certain order. Such implementations are illustrative and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, certain blocks can be broken apart into plural component blocks, and certain blocks can be performed in an order that differs from that which is illustrated herein (including a parallel manner of performing the blocks). The blocks shown in the flowcharts can be implemented by software, hardware (e.g., discrete logic components, etc.), firmware, manual processing, etc., or any combination of these implementations.
As to terminology, the phrase “configured to” encompasses any way that any kind of functionality can be constructed to perform an identified operation. The functionality can be configured to perform an operation using, for instance, software, hardware (e.g., discrete logic components, etc.), firmware etc., and/or any combination thereof.
The term “logic” encompasses any functionality for performing a task. For instance, each operation illustrated in the flowcharts corresponds to logic for performing that operation. An operation can be performed using, for instance, software, hardware (e.g., discrete logic components, etc.), firmware, etc., and/or any combination thereof.
A. Illustrative System
A.1. Overview of Illustrative Beat Analysis Module
The beat analysis module 102 includes an audio receiving module 104 for receiving the audio item (or multiple audio items) and storing the audio item in an audio buffer store 106. In one case, the beat analysis module 102 selects a relatively small portion of the audio item for analysis, such as, without limitation, a sample of 4-10 seconds in duration. However, the beat analysis module 102 can perform its analysis on audio items of any length. For example, the beat analysis module 102 can perform its analysis over the span of an entire audio item (e.g., an entire song). In the following explanation, the operations of the beat analysis module 102 will be described as being performed on an “audio item,” where it is to be understood that the audio item may refer to a sample of the originally received audio item of any duration or the entire audio item.
The rhythmic content of the audio item may contribute to the appearance of regularly occurring patterns in its waveform. For instance, each instance of a regularly occurring pattern may include a distinct spike in audio level (or other telltale signal form). This spike may be attributed to a drum strike or other musical occurrence that marks out the tempo of a song. According to the terminology used herein, each instance of a regularly occurring pattern is referred to as a beat. As such, the audio item includes a sequence of beats. In formal musical notation, the beat of an audio item may have some relation to a measure of a song, which, in turn, is governed by the time signature and tempo of the song. For example, a beat may correspond to a portion of a measure.
A pre-processing module 108 performs pre-processing on the audio item to place it in an appropriate form for further processing. In one case, for example, the audio item may include multiple channels. The pre-processing module 108 can convert the multiple channels into a single audio item by averaging the channels together. That is, in the case that there are n channels (j=1 to n), each sample v_i of the resultant single-channel audio item is determined by:
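(One plausible form of equation (1), consistent with the averaging just described, using the assumed notation v_{i,j} for the i-th sample of channel j:)

$$v_i = \frac{1}{n} \sum_{j=1}^{n} v_{i,j} \qquad (1).$$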
The pre-processing module 108 may also either downsample or upsample the audio item to a desired sample rate. For example, in one particular but non-limiting case, the pre-processing module 108 may downsample or upsample the audio item to 16 kHz.
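A minimal sketch of this pre-processing stage, assuming NumPy/SciPy, a 16 kHz target rate, and illustrative function and variable names (none of which are taken from the source):

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def preprocess(audio, orig_rate, target_rate=16000):
    """Average multi-channel audio down to one channel and resample it.

    audio: array of shape (num_samples, num_channels) or (num_samples,).
    """
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)        # channel averaging, as in equation (1)
    g = gcd(target_rate, orig_rate)
    # polyphase resampling to the desired rate (up/down factors in lowest terms)
    return resample_poly(audio, target_rate // g, orig_rate // g)
```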
An average beat period determination (ABPD) module 110 analyzes the audio item using a statistical modeling approach, such as an Expectation-Maximization (EM) approach. The ABPD module 110 determines the average beat period of beats within the audio item.
A beat onset determination (BOD) module 112 uses the average beat period to first determine the average beat onset for the audio item. That is, the onset of a beat determines when the beat is considered to commence. The average beat onset is formed by taking the average of individual beat onsets within the audio item. The BOD module 112 also determines the beat onset for each individual beat within the audio item. An individual beat onset is referred to herein as an actual beat onset for that particular beat.
The average beat period, the average beat onset, and actual beat onsets may be referred to herein as beat information. Also, any part of this information is referred to as beat information (for example, the average beat period can generically be referred to as beat information). The beat analysis module 102 can store the beat information in an analyzed beat information store 114.
An application module 116 may use the beat information to perform any type of application task (referred to in the singular below for brevity). For example, a game module may use the beat information in the course of the play of a game. For instance, the game module may use the beat information to synchronize action in the game to an audio item, to synchronize an audio item to action in the game, to select an appropriate audio item from a collection of audio items, and so on. No limitation is placed on the uses of the beat information. Section B will provide additional information regarding illustrative applications of the beat information.
Later figures will be used to explain in detail how the ABPD module 110 and the BOD module 112 may be configured to operate. At this point, suffice it to say that the beat analysis module 102 is configured to compute the beat information in a relatively short period of time, for example, in one case, in a fraction of a second. This enables the application module 116 to perform beat analysis in an integrated manner with other application tasks. In other words, because the beat analysis is performed so quickly, it does not unduly interfere with the performance of the application tasks. This makes it possible to perform the beat analysis in an integrated fashion with other application tasks, rather than, for example, in off-line fashion prior to the application tasks. In one concrete case, a game module can incorporate beat analysis in the course of a game playing operation without unduly affecting the real-time operation of the game.
A.2. General Mathematical Basis for Beat Analysis
As a preliminary matter, this section sets out general mathematical principles for use in determining beat information. The next section (Section A.3) describes one illustrative implementation of the mathematical approach in this section. There are many ways to implement the analysis in this section; the specific implementation in Section A.3 represents one particularly fast and accurate approach for performing beat analysis, and does not follow automatically from the general principles described in this section.
Let um denote the signal energy at frame m of an audio item. To compute um, the waveform of the audio item can be analyzed in the time domain. The approach applies a window function at equally spaced time points, indexed by m=1, . . . , M. um is the mean squared value of the windowed signal.
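As one hedged formalization (the window shape and spacing are not specified in the text), let x denote the audio waveform, w a window of length K, and H the spacing between successive window positions; then

$$u_m = \frac{1}{K} \sum_{k=1}^{K} \bigl(w_k\, x_{(m-1)H + k}\bigr)^{2}, \qquad m = 1, \ldots, M.$$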
The approach can model the beat by assuming that um is approximately periodic in m, with beat period τ. To estimate τ, the approach can use the following model:
$$u_m = \eta\, u_{m-\tau} + \rho_m \qquad (2).$$
Here, ρm is, for example, Gaussian noise with mean zero and variance σ². This defines a probabilistic model in which the um are the observed variables, τ is a hidden variable, and η and σ are parameters. The model can be expressed by:
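(One plausible form of this model, treating each u_m as conditionally Gaussian with mean η u_{m−τ} and variance σ²; the exact expression is assumed rather than quoted:)

$$p(\{u_m\} \mid \tau, \eta, \sigma) = \prod_{m} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\bigl(u_m - \eta\, u_{m-\tau}\bigr)^{2}}{2\sigma^{2}}\right).$$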
To complete the definition of the model, the prior distribution p(τ) can be defined as a flat distribution. That is, p(τ)=const.
The Expectation-Maximization (EM) algorithm can then be used to estimate the period τ and the model parameters. EM is an iterative algorithm, where the E-step updates the sufficient statistics and the M-step updates the parameter estimates. In the present context, the sufficient statistics correspond to the full posterior distribution over the beat period, conditioned on the data. This posterior is computed via Bayes' rule:
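(Written out with the normalization constant z discussed next:)

$$p(\tau \mid \{u_m\}) = \frac{1}{z}\, p(\{u_m\} \mid \tau)\, p(\tau).$$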
Here, z is a normalization constant. It can be shown to be equal to the data distribution, z=p({um}), but since it is independent of τ it does not need to be actually computed. This posterior can be computed efficiently for any value of τ by observing that its logarithm is the autocorrelation of um:
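(Under the Gaussian model above, and dropping additive terms that are at least approximately independent of τ, one plausible form of this relationship is:)

$$\log p(\tau \mid \{u_m\}) = \frac{\eta}{\sigma^{2}} \sum_{m} u_m\, u_{m-\tau} + \mathrm{const}.$$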
The posterior can be computed using Fast Fourier Transform (FFT). The resulting complexity of the E-step is O(M log M).
The M-step update rules can be derived by maximizing the expected complete-data log-likelihood E log p({um}|τ)p(τ), where the operator E performs averaging over τ with respect to the posterior formulation provided above in equation (4). The following expressions are obtained:
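(The following forms are an assumption consistent with the stated model rather than a quotation of equations (6) and (7); they result from setting the derivatives of the expected complete-data log-likelihood with respect to η and σ to zero, with E denoting the posterior average over τ:)

$$\eta \leftarrow \frac{E\!\left[\sum_{m} u_m\, u_{m-\tau}\right]}{E\!\left[\sum_{m} u_{m-\tau}^{2}\right]}, \qquad \sigma^{2} \leftarrow \frac{1}{M}\, E\!\left[\sum_{m} \bigl(u_m - \eta\, u_{m-\tau}\bigr)^{2}\right].$$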
As in the E-step, the computations involved in equations (6) and (7) can be performed efficiently using FFT.
Finally, the beat period can be obtained by using a maximum a posteriori (MAP) estimate:
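(That is, choosing the value of τ that maximizes the posterior computed above:)

$$\hat{\tau} = \arg\max_{\tau}\; p(\tau \mid \{u_m\}).$$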
Experimentally, the posterior over τ is relatively narrow. In the following, τ can be used to refer to τ̂.
To compute the average beat onset, the approach can divide um into consecutive non-overlapping sequences of length τ. The sequence i can be denoted by (u_1^i, u_2^i, . . . , u_τ^i), where u_n^i = u_(i−1)τ+n and n=1, . . . , τ. The approach can then perform averaging over those sequences. The average sequence can be denoted by (ū_1, . . . , ū_τ). The average onset can then be determined from this average sequence.
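(With I denoting the number of τ-long sequences, the averaging step can be written explicitly as follows; this form is an assumption consistent with the description above:)

$$\bar{u}_n = \frac{1}{I} \sum_{i=1}^{I} u_n^{i}, \qquad n = 1, \ldots, \tau.$$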
The actual beat onset for an individual beat can be computed for each τ-long sequence above. It can be assumed, in one case, that the onset time l for a given sequence may deviate from the average onset time by only a relatively small amount.
The onset times li can be converted back to the time domain where they form part of the beat information.
A.3. Particular Illustrative Implementation of Beat Analysis
This section describes one particular implementation of the statistical modeling approach of Section A.2. One way in which the particular implementation of this section improves on the approach in Section A.2 is by performing correlation over a diverse set of representations of the audio item. In the following explanation, the beat period will be referred to as P. More generally, the definition of symbols used in this section is to be found within this section, not the prior section.
Starting with FIG. 4, this figure shows an overview of one illustrative procedure 400 for determining and applying beat information.
In block 404, the ABPD module 110 determines the average beat period P by performing correlations over plural representations of the audio item. Subsequent figures will explain how this operation is performed.
In block 406, the BOD module 112 determines the average onset for the beats in the audio item.
In block 408, the BOD module 112 determines the actual onsets for individual beats in the audio samples.
In block 410, the application module 116 applies the above-defined beat information for use in performing any application task.
Starting with FIG. 5, this figure shows a procedure 500 that sets forth, in greater detail, one illustrative manner in which the beat analysis module 102 can determine the beat information.
In block 504, the pre-processing module 108 can perform pre-processing operations on the original audio item to convert it into a form that is suitable for further analysis. In one case, the pre-processing may entail extracting a portion of the audio item for analysis, such as, without limitation, a portion of 4-10 seconds in duration. Pre-processing may also entail converting the multiple channels of the audio item into a single channel (e.g., using the averaging technique of equation (1)). The pre-processing may also entail downsampling or upsampling the audio item to a desired sampling rate, such as, without limitation, 16 kHz. As a result of these operations, the audio item defines a linear sequence v of N samples, that is, v=(v1, v2, . . . , vN). Expression 802 of FIG. 8 illustrates this linear sequence of samples.
In block 506, the ABPD module 110 reshapes the linear sequence of samples in the audio item into an M×B array of samples V, that is, an array having B rows of M samples each. In other words, the ABPD module 110 populates the elements of the matrix V one row of M samples at a time. Matrix 804 of FIG. 8 illustrates this reshaping operation.
In one case, there is no overlap in samples in the matrix V. In this case, the element v21 at the start of the second row is the next element following v1M, which is the last element in the first row; in other words, if element v1M corresponds to element vj in the sequence of linear samples, then element v21 corresponds to element vj+1. In another implementation, there is an overlap of samples between rows of the matrix V. For example, assuming that M is 512, then the first element in the second row (v21) could start at, for example, element v440 in the sequence of linear samples, even though the last element in the first row (v1M) corresponds to the element vM (i.e., v512) in the linear sequence.
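A minimal sketch of this reshaping step, assuming NumPy and one M-sample frame per row; the hop parameter, which controls row overlap, is an illustrative choice:

```python
import numpy as np

def frame_signal(v, frame_len, hop):
    """Arrange a 1-D sample sequence into a matrix with one frame per row.

    With hop == frame_len the rows do not overlap; a smaller hop (for example,
    a second row that begins near sample 440 when frame_len is 512) produces
    overlapping rows, as described above.
    """
    starts = range(0, len(v) - frame_len + 1, hop)
    return np.stack([v[s:s + frame_len] for s in starts])
```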
In block 508, the ABPD module 110 computes the FFT of each of the rows of the matrix V. As shown in expression 806 of FIG. 8, this operation produces a matrix S of complex frequency-domain values, one row per row of V.
In block 510, the ABPD module 110 constructs a vector y that contains the average frequency spectrum energy in each of the rows of S. To produce this vector y, the ABPD module 110 can square the magnitude of each of the elements in the matrix S. For instance, the ABPD module 110 can square the magnitude of the element s11 by adding the square of its real component to the square of its imaginary component, to yield the corresponding real-valued energy element. The ABPD module 110 then averages these energy values within each row of S to produce one element of y per row.
The vector y has B real elements.
In block 512, the ABPD module 110 normalizes the vector y by dividing each element of the vector y by the standard deviation (std) of the vector y. Expression 904 in FIG. 9 illustrates this normalization operation.
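Blocks 508-512 can be sketched as follows, assuming the matrix V holds one M-sample frame per row (B rows in total); variable names are illustrative:

```python
import numpy as np

def normalized_row_energy(V):
    """FFT each row, average the spectral energy within each row, then
    normalize the resulting B-element vector by its standard deviation."""
    S = np.fft.fft(V, axis=1)       # block 508: FFT of each row of V
    energy = np.abs(S) ** 2         # real and imaginary components squared
    y = energy.mean(axis=1)         # block 510: average energy per row
    return y / y.std()              # block 512: normalize by the std of y
```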
Advancing to FIG. 6, this figure shows a procedure by which the ABPD module 110 uses the normalized vector y to determine the average beat period in an iterative (EM) manner.
In block 602, the ABPD module 110 begins by calculating the vector a=FFT(y) (which is a complex vector), the vector b=|a|² (which is a real vector), and the vector c=FFT(y²) (which is a complex vector).
In block 604, the ABPD module 110 determines the vector q as follows:
$$q = \beta\, e^{\lambda\, \mathrm{Re}\left[\mathrm{FFT}\left(b - \max(b)\right)\right]} \qquad (11).$$
In expression (11), λ is a scaling factor and β is chosen such that Σq=1. Values of (b−max(b)) are real. To create a complex vector from this real vector, the ABPD module 110 can set the real component of the complex vector to (b−max(b)) and the imaginary component to zero.
In block 606, the ABPD module 110 next determines the vectors f=FFT(q) (which defines a complex vector), g=FFT−1(f·a) (which defines a real vector), and h=FFT−1(f·c) (which defines a real vector).
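The computations described for blocks 602-606 can be sketched as follows (a sketch only: variable names are illustrative, lam stands in for the scaling factor λ of equation (11), and the update performed in block 608 is not reproduced in the text, so it is omitted here):

```python
import numpy as np

def em_iteration_vectors(y, lam=1.0):
    """Compute the vectors a, b, c, q, f, g, and h described above."""
    a = np.fft.fft(y)                 # block 602: complex vector
    b = np.abs(a) ** 2                # real vector
    c = np.fft.fft(y ** 2)            # complex vector

    # block 604, equation (11): q = beta * exp(lam * Re[FFT(b - max(b))]),
    # with beta chosen so that the elements of q sum to one
    q = np.exp(lam * np.real(np.fft.fft(b - b.max())))
    q /= q.sum()

    f = np.fft.fft(q)                 # block 606: complex vector
    g = np.real(np.fft.ifft(f * a))   # treated as a real vector, per the text
    h = np.real(np.fft.ifft(f * c))
    return q, g, h
```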
In block 608, the ABPD module 110 next determines:
At this point, the loop in FIG. 6 can be repeated one or more times, for example, until the vector q stabilizes or a predetermined number of iterations has been performed.
In block 610, the ABPD module 110 can now extract the average beat period from the vector q upon the completion of the last iteration. That is, the index at which the maximum value in q occurs corresponds to the average beat period. This index can be converted to an actual beat period t (where t is the index multiplied by some large constant, such as 200) by iteratively multiplying t by 2 or dividing t by 2 until the value of t satisfies the expression 0.7<fs/t<2.3, where fs is the sampling frequency.
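A sketch of the index-to-period conversion just described, assuming a nonzero index (the constant of 200 and the 0.7<fs/t<2.3 bounds follow the text):

```python
def index_to_beat_period(index, fs, scale=200):
    """Convert the argmax index of q into a beat period t in samples.

    Since fs/t is a beat frequency in Hz, the 0.7-2.3 bounds correspond to
    roughly 42-138 beats per minute.
    """
    t = index * scale
    while fs / t >= 2.3:   # period too short, so lengthen it
        t *= 2
    while fs / t <= 0.7:   # period too long, so shorten it
        t /= 2
    return t
```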
At this point, the ABPD module 110 has performed its task of determining the average beat period P of the audio item (that is, P=t). As noted above, the iterative EM procedure is implemented over a diverse set of correlations, e.g., by performing the correlations using different representations of the audio item. In the context of FIG. 6, for instance, the correlations involve the vectors a and c, which are derived from two different representations of the audio item (the vector y and its element-wise square y², respectively).
Advancing to FIG. 7, this figure shows a procedure by which the BOD module 112 determines the beat onset information.
In block 704, the BOD module 112 forms a vector W by taking the average signal energy across different beats. As shown in expression 1006 of FIG. 10, this operation averages the signal energy over consecutive beat-period-long segments of the audio item.
In block 706, the BOD module 112 next forms a circular moving average over the vector W. As indicated by waveform 1008 of FIG. 10, this operation produces a smoothed waveform from which the average beat onset can be identified.
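A sketch of the circular moving average; the width of the averaging window is an illustrative parameter that the text does not specify:

```python
import numpy as np

def circular_moving_average(w, win):
    """Moving average over a beat-period-long vector, wrapping around its ends."""
    w = np.asarray(w, dtype=float)
    if win <= 1:
        return w.copy()
    # wrap the last (win - 1) samples to the front so every output point
    # averages exactly `win` samples, treating the vector as circular
    ext = np.concatenate([w[-(win - 1):], w])
    return np.convolve(ext, np.ones(win) / win, mode="valid")
```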
Finally, in block 708, the BOD module 112 determines the beat onset for each of the individual beats in the audio sample. To perform this task, the BOD module 112 can take the circular moving average of an individual beat in the audio sample, as represented by operation 1012 of FIG. 10; the actual onset for that beat can then be identified from the result, which may deviate somewhat from the average onset.
The information calculated in procedure 500 (the average beat period, the average beat onset, and the actual beat onsets) defines beat information.
B. Illustrative Applications
As described above, different types of applications can make use of the beat analysis module 102 of FIG. 1. In the illustrative system 1100 of FIG. 11, an application module 1102 interacts with the beat analysis module 102 to obtain and apply the beat information.
In this system 1100, the user may have access to a collection of audio items 1104. In one case, the user may own these audio items 1104. For example, the user may have acquired various free audio items from any source of such items. In addition, or alternatively, the user may have purchased various audio items 1104 from any source of such items. In addition, or alternatively, the user may have created various audio items 1104 (for example, the user may have recorded his or her own songs). In any event, a provider of the application module 1102 does not necessarily dictate the audio items that the user is expected to use in the application module 1102. Rather, the provider enables the user to select his or her own audio items from any source of audio items. This aspect of the system 1100 has various advantages. The user may consider this feature to be desirable because it empowers the user to select his or her own audio items.
An interface module 1106 defines any functionality by which the user can select one or more of the audio items 1104 for use by the application module 1102. In one case, the application module 1102 may provide a user interface that enables the user to select audio items for use with the application module 1102.
The beat analysis module 102 can compute the beat information relatively quickly. In one case, for example, the beat analysis module 102 can compute the beat information in a fraction of a second. In view of this feature, the operations performed by the beat analysis module 102 can be integrated together with the other application tasks performed by the application module 1102 without unduly interfering with these application tasks. In one concrete case, a game module can perform beat analysis at various junctures in the game without slowing down the game or otherwise interfering with the game. As such, the game module does not need to perform the beat analysis in off-line fashion, although part of the analysis (or all of the analysis) can also be performed in off-line fashion.
The application module 1102 itself can use the beat information in many different ways. In one example, the application module 1102 may include a synchronization module 1108. In one case, the synchronization module 1108 can use the beat information associated with an audio item to synchronize any kind of action (such as any kind of action happening in a game, or, more generally, behavior exhibited by a game) with the tempo of the audio item. In another example, the synchronization module 1108 can synchronize the audio item to any kind of action (such as any kind of action happening in a game, physical action performed by a human user, etc.). The synchronization module 1108 can synchronize the audio item to action by changing the tempo of the audio item (e.g., by slowing down or speeding up the audio item to match the action). In another example, the synchronization module 1108 can use the beat information to synchronize one audio item with respect to another audio item. The synchronization module 1108 can perform this operation, for example, by changing the tempo of one of the audio items to match the other, or by changing the tempos of both audio items until they are the same or similar. This type of synchronizing operation may be appropriate where it is desirable to create a smooth transition from one song to the next. Still other types of synchronization operations can be performed.
A clip selection module 1110 can use the beat information to select an appropriate audio item or to select multiple appropriate audio items. For example, the user may have identified a collection of audio samples that he or she would like to use with the application module 1102. The clip selection module 1110 can select the audio item at a particular juncture that is most appropriate in view of events occurring at that particular juncture. For example, a game module can select an audio item that matches the tempo of action happening at a particular juncture of the game. An exercise-related module can select an audio item that matches the pace of physical actions performed by the user, and so on. To perform this task, the application module 1102 can analyze the beat information of one or more audio items in real time when an audio item is needed. It is also possible for the application module 1102 to perform this operation off-line, e.g., before the audio item is needed. In similar fashion, the clip selection module 1110 can select an audio item which most appropriately matches the tempo of another audio item.
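As one hedged illustration of tempo-based clip selection (the function, data, and values below are hypothetical, not taken from the source), the clip selection module could pick the item whose average beat period lies closest to a target period:

```python
def select_best_clip(clips, target_period):
    """Pick the audio item whose average beat period best matches a target.

    clips maps a clip identifier to its average beat period in seconds, as
    produced by the beat analysis module.
    """
    return min(clips, key=lambda cid: abs(clips[cid] - target_period))

# Example: choose among three clips for action paced at 0.5 seconds per beat.
clips = {"song_a": 0.46, "song_b": 0.60, "song_c": 0.52}
print(select_best_clip(clips, target_period=0.5))   # -> "song_c"
```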
The application module 1102 can make yet other uses of the beat information. For example, although not shown, the application module 1102 can use the beat information to form an identification label for an audio item. The application module 1102 can then use the identification label to determine whether an unknown audio item matches a previously-encountered audio item (e.g., by comparing the computed identification label for the unknown audio item with a list of known identification labels).
In block 1204, the beat analysis module 102 is used to determine beat information for one or more audio items. As explained above, the application module 1102 can invoke the beat analysis module 102 in off-line fashion (e.g., before performing other application tasks) or on-line fashion (e.g., in the course of performing other application tasks).
In block 1206, the application module 1102 performs any type of application based on the beat information. Without limitation, these applications can include: synchronizing events to beats in the audio item; synchronizing the audio item to events (e.g., by changing the tempo of the audio item); synchronizing an audio item with another audio item; selecting an appropriate audio item; determining a beat identification label; using a beat identification label to retrieve an audio item or perform some other task, and so on.
C. Representative Processing Functionality
In the context of the features described in Sections A and B, the processing functionality 1300 can be used to implement any aspect of the beat analysis module 102 and/or the application modules described above.
The processing functionality 1300 can include volatile and non-volatile memory, such as RAM 1302 and ROM 1304. The processing functionality 1300 also optionally includes various media devices 1306, such as a hard disk module, an optical disk module, and so forth. More generally, instructions and other information can be stored on any computer-readable medium 1308, including, but not limited to, static memory storage devices, magnetic storage devices, optical storage devices, and so on. The term “computer-readable medium” also encompasses plural storage devices. The term “computer-readable medium” also encompasses signals transmitted from a first location to a second location, e.g., via wire, cable, wireless transmission, etc.
The processing functionality 1300 also includes one or more processing modules 1310 (such as one or more computer processing units, or CPUs). The processing functionality 1300 also may include one or more special purpose processing modules 1312 (such as one or more graphic processing units, or GPUs). A graphics processing module performs graphics-related tasks. One or more components of the special purpose processing modules 1312 can also be used to efficiently perform operations (such as FFT operations) used to analyze beat information.
The processing functionality 1300 also includes an input/output module 1314 for receiving various inputs from a user (via input module(s) 1316), and for providing various outputs to the user (via output module(s) 1318). One particular type of input module is a game controller 1320. The game controller 1320 can be implemented as any mechanism for controlling a game. The game controller 1320 may include various direction-selection mechanisms (e.g., 1322, 1324) (such as joystick-type mechanisms), various trigger mechanisms (1326, 1328) for firing weapons, and so on. One particular output module is a presentation module 1330, such as a television screen, computer monitor, etc.
The processing functionality 1300 can also include one or more network interfaces 1332 for exchanging data with other devices via a network 1334. The network 1334 may represent any type of mechanism for allowing the processing functionality 1300 to interact with any kind of network-accessible entity. One or more communication buses 1336 communicatively couple the above-described components together.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.