The present disclosure relates generally to the determination or classification of audio channels included in audio data, and, more particularly, to techniques that may be utilized to identify which type of audio channel corresponds to a particular set of audio data.
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present techniques, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Content, such as television shows, movies, films, audiobooks, and songs, may include audio data that has multiple audio channels. For example, the audio data may be included in a multichannel audio file that includes channels for particular speakers or sets of speakers that are to generate sound corresponding to the audio data of the audio channels. For example, for 5.1 surround sound, a multichannel audio file may have six channels, each having one of the following channel types: (front) left, (front) right, center, low-frequency effects, surround left, and surround right. In some cases, the audio data may not indicate or be indicative of which type of channel (e.g., corresponding to a particular speaker or set of speakers) one or more sets of audio data correspond to. Traditionally, to ensure that the audio content is played back using the correct speakers, audio data is analyzed manually (e.g., by human analysts) to identify which type of channel a particular channel is. However, this traditional manual approach to characterizing audio content may be labor intensive, time-consuming, inconsistent, inaccurate, and inefficient.
Certain embodiments commensurate in scope with the originally claimed subject matter are summarized below. These embodiments are not intended to limit the scope of the claimed subject matter, but rather these embodiments are intended only to provide a brief summary of possible forms of the subject matter. Indeed, the subject matter may encompass a variety of forms that may be similar to or different from the embodiments set forth below.
The current embodiments relate to systems and methods for characterizing audio data, for instance, by determining which type of audio channel a particular audio data set is associated with in a (multichannel) audio file, and whether a particular order or mode (e.g., film or Society of Motion Picture and Television Engineers (SMPTE)) of audio channels exists within the audio file. The techniques described below may additionally determine discrepancies in received audio data, such as the audio channels of the audio data being in an incorrect order or the audio channels being unsynchronized. In some embodiments, machine-learning may be employed to make such determinations. By utilizing the techniques described herein, audio channels may be more efficiently, accurately, and quickly identified relative to manual analysis techniques.
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
One or more specific embodiments of the present disclosure will be described below. In an effort to provide a concise description of these embodiments, all features of an actual implementation may not be described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” “the,” and “said” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
As set forth above, there exists an opportunity to more efficiently, quickly, and accurately determine the audio channels in audio data, such as audio files. As discussed herein, machine-learning may be employed to process audio data to determine several characteristics of the audio data, such as which type of audio channel a particular set of audio data is associated with in a (multichannel) audio file, and whether a particular order or mode (e.g., film or SMPTE) of audio channels exists within the audio file. The techniques described below may additionally determine discrepancies in received audio data, such as the audio channels of the audio data being in an incorrect order or the audio channels being unsynchronized. By utilizing the techniques described herein, audio channels may be more efficiently, accurately, and quickly identified relative to manual analysis techniques.
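By way of illustration only, the film and SMPTE six-channel order formats referred to herein might be represented as ordered lists of channel labels, as in the following Python sketch (the shorthand labels and function name are hypothetical and not part of the disclosed system):

```python
# Illustrative only: hypothetical shorthand labels for the two six-channel
# order formats discussed herein; these names are placeholders.
FILM_ORDER_5_1 = ["L", "C", "R", "Ls", "Rs", "LFE"]    # film: left, center, right, surround L/R, LFE
SMPTE_ORDER_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]   # SMPTE: left, right, center, LFE, surround L/R


def order_format(labels):
    """Return which six-channel order format a list of channel labels matches, if any."""
    if labels == FILM_ORDER_5_1:
        return "film"
    if labels == SMPTE_ORDER_5_1:
        return "SMPTE"
    return "unknown"


print(order_format(["L", "R", "C", "LFE", "Ls", "Rs"]))  # -> SMPTE
```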
The characterized audio data 14 may be audio data (e.g., an audio file) that has metadata (e.g., as applied by the audio processing system 10) indicating which channel types the audio channels of the audio data are. For example, in the context of the audio data 12 having six audio channels (e.g., channel 1, channel 2, channel 3, channel 4, channel 5, channel 6), the characterized audio data 14 may include metadata indicating which type of channel (e.g., (front) left, center, (front) right, LFE, surround left, surround right) each particular channel is. The characterized audio data 14 may also include metadata (applied by the audio processing system 10) indicating a particular order or order format of the audio channels of the audio data 12. For example, the characterized audio data 14 may include metadata indicative of the characterized audio data 14 having a particular mode, order, or order format, such as a film order (e.g., (front) left, center, (front) right, surround left, surround right, LFE for content with six channels) or SMPTE order (e.g., (front) left, (front) right, center, LFE, surround left, surround right for content with six channels). As discussed below with respect to
The audio processing system 10, for instance, may be implemented utilizing a computing device or computing system (e.g., a cloud-based system). Accordingly, the audio processing system 10 may include processing circuitry 16 and memory/storage 18. The audio processing system 10 may also include suitable wired and/or wireless communication interfaces configured to receive the audio data 12, for example, from other computing devices or systems. The processing circuitry 16 may include one or more general purpose central processing units (CPUs), one or more graphics processing units (GPUs), one or more microcontrollers, one or more reduced instruction set computer (RISC) processors, one or more application-specific integrated circuits (ASICs), one or more programmable logic controllers (PLCs), one or more field programmable gate arrays (FPGAs), one or more digital signal processing (DSP) devices, and/or any combination thereof, as well as any other circuit or processing device capable of executing the functions described herein. The memory/storage 18, which may also be referred to as “memory,” may include a computer-readable medium, such as a random access memory (RAM), and a computer-readable non-volatile medium, such as a flash memory. Alternatively, a floppy disk, a compact disc—read only memory (CD-ROM), a magneto-optical disk (MOD), and/or a digital versatile disc (DVD) may also be used. As such, the memory/storage 18 may include one or more non-transitory computer-readable media capable of storing machine-readable instructions that may be executed by the processing circuitry 16.
The memory/storage 18 may include a channel classification application 20 that may be executed by the processing circuitry 16 to generate the characterized audio data 14 from the audio data 12. More specifically, the processing circuitry 16 may generate audio channel representations 22 from the audio data 12 and execute the channel classification application 20 to analyze the audio channel representations 22 to generate the characterized audio data 14. While the audio channel representations 22 are discussed in more detail below, there may be one audio channel representation for each channel of the audio data 12, and the audio channel representations 22 may be any suitable computer-readable representations of the audio data 12 including, but not limited to, one or more graphs, one or more images, one or more waveforms, one or more spectrograms, or a combination thereof.
The channel classification application 20 may include a machine-learning module 24 (e.g., stored in the memory/storage 18), though it should be noted that, in other embodiments, the machine-learning module 24 may be kept elsewhere in the memory/storage 18 (e.g., not included in the channel classification application 20). The machine-learning module 24 may include any suitable machine-learning algorithms to perform supervised learning, semi-supervised learning, or unsupervised learning, for example, using training data 26. The processing circuitry 16 may make the determinations discussed herein by executing the machine-learning module 24 to utilize machine-learning techniques to analyze the audio channel representations 22.
As used herein, machine-learning may refer to algorithms and statistical models that computer systems (e.g., including the audio processing system 10) use to perform a specific task with or without using explicit instructions. For example, a machine-learning process may generate a mathematical model based on a sample of data (e.g., the training data 26) in order to make predictions or decisions without being explicitly programmed to perform the task.
Depending on the inferences to be made, the machine-learning module 24 (or processing circuitry 16 executing the machine-learning module 24) may implement different forms of machine-learning. For example, in some embodiments (e.g., when particular known examples exist that correlate to future predictions or estimates that the machine-learning engine may be tasked with generating), a machine-learning engine (e.g., implemented by the processing circuitry 16) may implement supervised machine-learning. In supervised machine-learning, a mathematical model of a set of data contains both inputs and desired outputs. This data, which may be the training data 26, may include a set of training examples. Each training example may have one or more inputs and a desired output, also known as a supervisory signal. In a mathematical model, each training example is represented by an array or vector, sometimes called a feature vector, and the training data 26 may be represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms may learn a function that may be used to predict an output associated with new inputs. An optimal function may allow the algorithm to correctly determine the output for inputs that were not a part of the training data 26. An algorithm that improves the accuracy of its outputs or predictions over time is said to have learned to perform that task.
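As a non-limiting illustration of supervised learning in this context, the following Python sketch trains a classifier on hypothetical feature vectors (standing in for features derived from per-channel spectrograms) paired with known channel-type labels; the library, feature dimensions, and data values are assumptions made solely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

CHANNEL_TYPES = ["L", "R", "C", "LFE", "Ls", "Rs"]
rng = np.random.default_rng(0)

# Placeholder stand-in for the training data 26: each row is a feature vector
# (e.g., band energies or peak counts from one channel's spectrogram), and each
# label is that channel's known type (the supervisory signal).
X_train = rng.normal(size=(600, 16))
y_train = rng.choice(CHANNEL_TYPES, size=600)

# Iteratively fit a function mapping feature vectors to channel types.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict the channel type for a new, unlabeled feature vector.
new_features = rng.normal(size=(1, 16))
print(model.predict(new_features)[0])
```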
Supervised learning algorithms may include classification and regression techniques. Classification algorithms may be used when the outputs are restricted to a limited set of values, and regression algorithms may be used when the outputs have a numerical value within a range. Similarity learning is an area of supervised machine-learning closely related to regression and classification, but the goal is to learn from examples using a similarity function that measures how similar or related two objects are. Similarity learning has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification.
Additionally and/or alternatively, in some situations, it may be beneficial for the machine-learning engine (e.g., implemented by the processing circuitry 16) to utilize unsupervised learning (e.g., when particular output types are not known). Unsupervised learning algorithms take a set of data that contains only inputs, and find structure in the data, like grouping or clustering of data points. The algorithms, therefore, learn from test data that has not been labeled, classified, or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data.
That is, the machine-learning module 24 may implement cluster analysis, which is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to one or more predesignated criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness, or the similarity between members of the same cluster, and separation, the difference between clusters. In additional or alternative embodiments, the machine-learning module 24 may implement other machine-learning techniques, such as those based on estimated density and graph connectivity.
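For illustration only, the following Python sketch clusters hypothetical per-channel feature vectors without any labels, so that similar channels (e.g., a stereo pair) may fall into the same cluster; the library, feature values, and cluster count are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Placeholder per-channel feature vectors (one row per audio channel).
channel_features = rng.normal(size=(6, 8))

# Group the six channels by feature similarity, with no labels involved.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(channel_features)
print(kmeans.labels_)  # channels sharing a label form a candidate cluster (e.g., a stereo pair)
```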
Once the machine-learning module 24 is trained using the training data 26, the processing circuitry 16 may utilize the machine-learning module 24 to generate the characterized audio data 14. For example, the processing circuitry 16 may determine which types of channels the channels of the audio data 12 are and apply metadata indicative of which type of channel each of the channels is to the audio data 12 to generate the characterized audio data 14.
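One non-limiting way such metadata could be recorded, purely as an illustration, is a sidecar file mapping channel indices to predicted channel types; the file names and fields below are hypothetical:

```python
import json

# Hypothetical channel-type predictions produced after training.
predicted_types = {"1": "L", "2": "R", "3": "C", "4": "LFE", "5": "Ls", "6": "Rs"}

characterized = {
    "source": "audio_data_12.wav",       # hypothetical input file name
    "channel_types": predicted_types,    # metadata indicating each channel's type
    "order_format": "SMPTE",             # detected order format
}
with open("characterized_audio_14.json", "w") as fh:
    json.dump(characterized, fh, indent=2)
```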
By automating the classification of audio channels of the audio data 12, time-consuming tasks that have typically required significant human subjectivity can be reduced. For example, automatic classification of the audio channels of the audio data 12 may be performed. This may result in audio channels being more accurately identified as well as higher-quality content (e.g., when content is played back by providing correct audio channel data to a corresponding speaker (or speakers) of a sound system (e.g., a surround sound system)).
Keeping the foregoing in mind,
At process block 42, the processing circuitry 16 may receive the audio data 12. For example, the audio processing system 10 may be communicatively coupled to an electronic device (e.g., a computing device or a storage device) via a wired or wireless connection and receive the audio data 12 from such a device. In one embodiment, the processing circuitry 16 may receive the audio data 12 from a database or cloud-based storage system.
At process block 44, the processing circuitry 16 may generate representations of the audio channels in the audio data 12. In other words, the processing circuitry 16 may generate the audio channel representations 22. The processing circuitry 16 may generate an audio channel representation for each audio channel of the audio data 12. The audio channel representations 22 may be any suitable computer-readable representations of the audio data 12 including, but not limited to, one or more graphs, one or more images, one or more waveforms, one or more spectrograms, or a combination thereof. Thus, in an example in which there are six audio channels in the audio data 12, the processing circuitry 16 may generate six audio channel representations 22, such as the spectrograms 60 (referring collectively to spectrogram 60A, spectrogram 60B, spectrogram 60C, spectrogram 60D, spectrogram 60E, and spectrogram 60F). In particular, the spectrograms 60 include spectrogram 60A for a first audio channel of the audio data 12, spectrogram 60B for a second audio channel of the audio data 12, spectrogram 60C for a third audio channel of the audio data 12, spectrogram 60D for a fourth audio channel of the audio data 12, spectrogram 60E for a fifth audio channel of the audio data 12, and spectrogram 60F for a sixth audio channel of the audio data 12. Each of the spectrograms 60 may be indicative of frequency (e.g., as indicated by axis 62) over time (e.g., as indicated by axis 64). Furthermore, it should be noted that, in some embodiments, the processing circuitry 16 may generate multiple audio channel representations 22 for each channel. For example, the processing circuitry 16 may generate audio channel representations 22 representative of particular blocks of time (e.g., a particular number of frames of data, duration of audio content, portion of a file size, etc.). Accordingly, the processing circuitry 16 may process the audio data 12 to generate the audio channel representations 22 (e.g., the spectrograms 60).
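As a non-limiting sketch of process block 44, the following Python example computes one spectrogram per channel from a synthetic six-channel signal standing in for the audio data 12 (the sample rate, signal content, and transform parameters are assumptions):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 48_000                                 # assumed sample rate
t = np.arange(2 * fs) / fs                  # two seconds of audio
# Hypothetical six-channel signal standing in for the audio data 12
# (each row is one audio channel).
audio = np.vstack([np.sin(2 * np.pi * f0 * t) for f0 in (440, 440, 300, 60, 220, 220)])

channel_spectrograms = []
for channel in audio:
    # Frequency (cf. axis 62) by time (cf. axis 64) representation of one channel.
    freqs, times, sxx = spectrogram(channel, fs=fs, nperseg=1024)
    channel_spectrograms.append(sxx)

print(len(channel_spectrograms), channel_spectrograms[0].shape)  # six spectrograms
```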
Returning to
At process block 72, the processing circuitry 16 (e.g., utilizing the channel classification application 20 and/or the machine-learning module 24) may receive the audio channel representations 22. In other embodiments, the operations of process block 72 may be performed by the processing circuitry 16 at process block 44 of the process 40, in which the processing circuitry 16 may generate the audio channel representations 22.
At process block 74, the processing circuitry 16 may determine data points in the audio channel representations 22. For example,
Returning to
The processing circuitry 16 may also analyze the data points 92 to determine the types of audio channels based on data points 92 corresponding to maxima in the audio channel representations 22 and whether the audio data represented in the audio channel representations 22 is indicative of dialogue. For example,
For example, spectrogram 100C, corresponding to a third audio channel of the audio data 12, may have a maximum (e.g., peak) data point representing the highest value (e.g., frequency value) among the local and/or absolute maxima of the data points 92 (as indicated by arrow 102) across the spectrograms 100. Furthermore, the spectrogram 100C may be indicative of the audio data 12 including dialogue (as indicated by arrow 104). Accordingly, the processing circuitry 16 may preliminarily (and ultimately) identify the third audio channel as being the center channel based, at least in part, on the features that the spectrogram 100C has the maximum data point and/or the most data points among the spectrograms 100.
The processing circuitry 16 may also identify pairs (e.g., left and right channels, surround left and surround right channels, rear left and rear right channels) based on the data points. For instance, the processing circuitry 16 may identify a first audio channel corresponding to the spectrogram 100A as being the (front) left channel based on data points (indicated by arrows 106), for example, based on the maximum values of the spectrogram 100A being the next highest in value. The processing circuitry 16 may identify a second audio channel corresponding to the spectrogram 100B as being the (front) right channel based on data points (indicated by arrows 108) having maximum values most similar to (and less than) those of the spectrogram 100A. In an aspect, after channel pairs are identified (e.g., based on the similarities of their data points), the processing circuitry 16 may identify which channel pair is the front channel pair and which is the surround channel pair. For example, the front channel pair may tend to include more data points than the surround channel pair, and therefore, the processing circuitry 16 may classify the channel pair with more data points as the front channel pair and the remaining channel pair as the surround channel pair. As between the left and right channels of the front channel pair, the processing circuitry 16 may use techniques to identify which is the left channel versus the right channel. In an aspect, the processing circuitry 16 may utilize machine learning to identify common differences between front left and front right channels and use those differences to classify the channels within the front channel pair. For example, the front left channel may have more data points than the front right channel (or vice versa). In another aspect, the front left channel may have more high-frequency and/or more low-frequency data points compared to the front right channel (or vice versa). Similar techniques may be used to distinguish between the surround left and surround right channels.
Somewhat similarly, the processing circuitry 16 may identify a fifth audio channel corresponding to the spectrogram 100E as being the surround left channel based on data points (indicated by arrows 110), for example, based on the maximum values of the spectrogram 100E being the next highest in value. The processing circuitry 16 may identify a sixth audio channel corresponding to the spectrogram 100F as being the surround right channel based on data points (indicated by arrows 112) having maximum values most similar to (and less than) those of the spectrogram 100E.
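A simplified, non-limiting sketch of the heuristics described above is shown below: the channel with the strongest peak is treated as the center candidate, and the remaining channels are paired by similarity of their peak levels. The threshold, placeholder spectrograms, and helper names are assumptions for illustration only:

```python
from itertools import combinations

import numpy as np


def significant_points(sxx, threshold=1e-3):
    """Count of spectrogram bins above a (hypothetical) significance threshold."""
    return int(np.count_nonzero(sxx > threshold))


def peak_level(sxx):
    """Maximum value in the spectrogram (a stand-in for the peak data point)."""
    return float(sxx.max())


def most_similar_pair(spectrograms, candidates):
    """Among candidate channel indices, return the pair with the closest peak levels."""
    return min(
        combinations(candidates, 2),
        key=lambda pair: abs(peak_level(spectrograms[pair[0]]) - peak_level(spectrograms[pair[1]])),
    )


# Placeholder spectrograms (frequency x time), one per channel of the audio data 12.
rng = np.random.default_rng(2)
spectrograms = [rng.random((129, 50)) * scale for scale in (0.8, 0.8, 1.0, 0.2, 0.5, 0.5)]

center = int(np.argmax([peak_level(s) for s in spectrograms]))  # strongest peak -> center candidate
remaining = [i for i in range(len(spectrograms)) if i != center]
pair = most_similar_pair(spectrograms, remaining)
# The pair with more significant points could then be treated as the front pair.
print("center candidate:", center, "candidate pair:", pair,
      "points per channel:", [significant_points(s) for s in spectrograms])
```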
Lastly, in the example provided in
The processing circuitry 16 may also determine that the format of the audio data 12, in the example provided in
Returning to
At process block 80, the processing circuitry 16 may assign channel types to the channels based on the probabilities determined at process block 78. For example, the processing circuitry 16 may assign a channel as being a particular type of channel based on that channel having the highest probability of being the particular channel type (e.g., among the probabilities determined at process block 78).
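As a minimal illustration of process block 80, the following sketch assigns each channel the type with the highest probability; the probability values and type labels are hypothetical:

```python
import numpy as np

CHANNEL_TYPES = ["L", "R", "C", "LFE", "Ls", "Rs"]
# Hypothetical per-channel probabilities (e.g., from process block 78);
# rows correspond to channels of the audio data 12, columns to candidate types.
probabilities = np.array([
    [0.70, 0.15, 0.05, 0.02, 0.05, 0.03],
    [0.15, 0.70, 0.05, 0.02, 0.03, 0.05],
    [0.05, 0.05, 0.80, 0.04, 0.03, 0.03],
    [0.02, 0.02, 0.04, 0.88, 0.02, 0.02],
    [0.05, 0.03, 0.03, 0.02, 0.72, 0.15],
    [0.03, 0.05, 0.03, 0.02, 0.15, 0.72],
])

# Each channel is assigned the type with the highest probability.
assigned = [CHANNEL_TYPES[i] for i in probabilities.argmax(axis=1)]
print(assigned)  # -> ['L', 'R', 'C', 'LFE', 'Ls', 'Rs']
```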
Returning to
In other embodiments, additionally or alternatively, the characterized audio data 14 may be or include data that is visually presentable, for example, in the form of a user interface, report, or image that is presentable on an electronic display. Bearing this in mind,
The mode indicator 130 may indicate an order format of the audio data 12 as determined by the processing circuitry 16 (e.g., during performance of the process 40). The mode indicator 130 may also be indicative of the number of audio channels in the audio data 12. For example, in the illustrated embodiment, the “5.1” is indicative of the audio data 12 having six channels. More specifically, the “5.1” is indicative of the audio data 12 having five full-bandwidth channels and one LFE channel. The “SMPTE” is indicative of the six channels of the audio data 12 having the SMPTE order format described above. In other embodiments, the mode indicator 130 may indicate another mode, such as film mode, or another number of channels.
The channel indicators 132 may include a channel indicator 132 for each channel of the audio data 12 (e.g., as determined to be present in the audio data 12 by the processing circuitry 16) that indicates which channel (e.g., type of channel) a particular channel of the audio data 12 is. For example, in the illustrated example, the channel indicator 132A indicates that a first channel is the (front) left channel, the channel indicator 132B indicates that a second channel is the (front) right channel, the channel indicator 132C indicates that a third channel is the LFE channel, the channel indicator 132D indicates that a fourth channel is the center channel, the channel indicator 132E indicates that a fifth channel is the left surround channel, and the channel indicator 132F indicates that a sixth channel is the right surround channel.
The channel order indicator 134 may indicate whether the channels are in the correct order, with the correct order being the order the channels should have according to the format indicated by the mode indicator 130. For instance, for SMPTE 5.1 content, the first channel should be the (front) left channel, the second channel should be the (front) right channel, the third channel should be the center channel, the fourth channel should be the LFE channel, the fifth channel should be the left surround channel, and the sixth channel should be the right surround channel. In the illustrated embodiment, the third channel is the LFE channel (as indicated by the channel indicator 132C), and the fourth channel is the center channel, meaning the channels do not have the correct order. As such, the channel order indicator 134 is indicative of the channels being out of order. Somewhat similarly, the channel order message 136 indicates that the channels are out of order. In some embodiments, the channel order message 136 may indicate which channels are out of order. Additionally, upon determining that the channels are in the correct order, the processing circuitry 16 may cause a different symbol to be utilized as the channel order indicator 134, such as a check mark (as used for the channel synchronicity indicator 140). Also, when the channels have the correct order, the channel order message 136 may indicate that the channels have the correct order.
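A minimal sketch of such an order check, assuming hypothetical channel-type labels, compares the detected channel types (in file order) against the expected SMPTE 5.1 order and reports the mismatched positions:

```python
SMPTE_ORDER_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]


def out_of_order_channels(detected, expected=SMPTE_ORDER_5_1):
    """Return the 1-based positions whose detected type does not match the expected order."""
    return [i + 1 for i, (d, e) in enumerate(zip(detected, expected)) if d != e]


# As in the illustrated example: LFE and center are swapped at positions 3 and 4.
detected = ["L", "R", "LFE", "C", "Ls", "Rs"]
print(out_of_order_channels(detected))  # -> [3, 4], i.e., those channels are out of order
```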
The characterized audio data 14 may also include a selectable channel reordering element 138, which may be a graphical user interface (GUI) item that may be selected by a user (e.g., using an input device such as a mouse or keyboard or, for touchscreen displays, a finger or stylus) to cause the processing circuitry 16 to reorder the channels to have the correct order. For example, in response to receiving an input indicative of a selection of the selectable channel reordering element 138, the processing circuitry 16 may generate audio data (e.g., another form of the characterized audio data 14) that includes the channels in the correct order and, in some embodiments, metadata indicating the identity (e.g., type of channel) of each of the channels of the generated audio data.
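As a non-limiting illustration of the reordering that may follow selection of the selectable channel reordering element 138, the sketch below permutes the rows of a channels-by-samples array into the expected SMPTE order; the labels and array contents are placeholders:

```python
import numpy as np

SMPTE_ORDER_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]


def reorder(audio, detected, expected=SMPTE_ORDER_5_1):
    """Return a copy of `audio` (channels x samples) with rows rearranged into the expected order."""
    permutation = [detected.index(label) for label in expected]
    return audio[permutation, :]


audio = np.arange(12).reshape(6, 2)            # placeholder six-channel audio
detected = ["L", "R", "LFE", "C", "Ls", "Rs"]  # LFE and center swapped, as in the example above
print(reorder(audio, detected))                # rows 3 and 4 are swapped back into SMPTE order
```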
As also illustrated, the characterized audio data 14 may include the channel synchronicity indicator 140 and the channel synchronicity message 142, which may both indicate whether the audio channels are synchronous or not. For example, in the illustrated embodiment, the channel synchronicity indicator 140 is a check mark, and the channel synchronicity message 142 states that the channels of the audio data 12 are synchronous. When the audio channels of the audio data 12 are asynchronous, the channel synchronicity indicator 140 may be different, such as an error symbol like the channel order indicator 134, and the channel synchronicity message 142 may indicate that the channels are asynchronous. More specifically, the channel synchronicity message 142 may indicate which channel or channels are asynchronous from other channels (e.g., one or two channels being asynchronous from five or four other channels of the audio data 12 in the example of the audio data 12 being for 5.1 surround sound systems).
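One possible (non-limiting) way to check synchronicity, assuming access to the raw channel samples, is to estimate each channel's lag against a reference channel via cross-correlation and flag channels whose lag exceeds a tolerance, as in the following sketch with placeholder audio:

```python
import numpy as np
from scipy.signal import correlate


def estimated_lag(reference, channel):
    """Lag (in samples) at which `channel` best aligns with `reference`."""
    corr = correlate(channel, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)


rng = np.random.default_rng(3)
fs = 48_000
reference = rng.normal(size=fs)                              # one second of placeholder audio
delayed = np.concatenate([np.zeros(960), reference[:-960]])  # same audio shifted by 20 ms

lags = [estimated_lag(reference, ch) for ch in (reference, delayed)]
asynchronous = [i for i, lag in enumerate(lags) if abs(lag) > 48]  # tolerance of roughly 1 ms
print(lags, asynchronous)  # -> [0, 960] [1]: the delayed channel is flagged as asynchronous
```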
Accordingly, the presently disclosed techniques enable the identities (e.g., types) of audio channels of audio content to be identified. Additionally, as described above, the techniques provided herein enable a format of the audio content (e.g., corresponding to an order of the audio channels) to be identified. As also discussed herein, the presently disclosed techniques may be utilized to determine whether audio channels are synchronized and in an order consistent with a determined format of the audio content.
While only certain features of the present disclosure have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the present disclosure.