The present disclosure generally relates to audio signal processing and more particularly to systems and methods for comparing audio signals.
Previous approaches for comparing audio samples with a database of reference audio samples have typically relied on traditional audio processing techniques. These approaches often involve analyzing the audio samples using various signal-processing algorithms to extract relevant features such as spectral content, temporal patterns, time-frequency landmarks, and amplitude variations. However, when attempting to quantify whether two audio signals match without being exact copies, these approaches, absent normalization, may not adequately account for variations in the overall energy levels, frequency balance, and signal length between the query audio sample and the reference audio samples.
In some cases, audio processing systems have attempted to address this issue by applying normalization techniques that adjust the overall characteristics of the query audio sample itself. While these techniques can partially mitigate the effects of variations in loudness, overall frequency balance, and signal length, they may not provide a comprehensive solution that takes into account the specific spectro-temporal pattern of the query audio sample.
Other approaches have focused on comparing audio samples using statistical methods such as dynamic time warping or hidden Markov models. These methods aim to align the query audio sample with the reference audio samples by considering the temporal relationships between different segments of the audio signals. However, these approaches may not provide accurate results when comparing audio samples with significant variations in loudness or energy levels, or with slight frequency shifts between spectral components.
Therefore, there is still a need in the art for an audio processing system that provides a comprehensive comparison approach to accurately compare a query audio sample with a database of multiple reference audio samples.
It is an object of some embodiments to provide a system and a method for comparing audio signals or audio samples. Additionally or alternatively, it is an object of some embodiments to provide a system and a method for comparing audio samples with each other and/or with a plurality of other audio samples.
Some embodiments are based on recognizing that for a fair comparison, the audio samples need to be normalized. For example, the audio samples can be modified to be of the same length and loudness. Such a normalization modifies the audio sample itself and is referred to herein as an internal normalization. However, such an internal normalization is insufficient for a fair comparison of different audio samples. One of the reasons for such a deficiency is that the similarity of two audio samples depends on how the frequency content of an audio sample varies over time, which can also be described as comparing the spectro-temporal patterns between two audio samples. It is recognized that some sounds may have spectro-temporal patterns that are well-matched with a wide variety of other sounds. An extreme example of this is a white noise sound, which has energy at all frequencies and is constant across time, and may be well-matched to any sounds with constant frequency characteristics across time. Alternatively, a single brief click sound surrounded by silence may be well-matched to any other sound with a short duration in time, irrespective of the frequency content of the click. This issue persists even after normalizing sounds in terms of volume and length before comparison.
Hence, there is a need to normalize the audio comparison beyond the internal normalization of the audio samples. Such normalization is referred to herein as an external normalization that does not modify the audio sample itself but modifies the result of the comparison to make it more fair for different kinds of audio samples. The external normalization allows for the comparison of multiple audio queries to determine whether a given query matches any of the reference samples without a need for a query-specific threshold. Using external normalization, a single threshold can be used to detect matches across multiple types of audio queries. In some embodiments, this external normalization determines a bias term based on a spectro-temporal pattern of a query audio sample and adds this bias term to the similarity scores produced by comparing the query audio sample with other reference audio samples. Hence, additionally or alternatively to normalizing the audio samples by internal normalization, some embodiments normalize the similarity scores by external normalization.
In some aspects, the techniques described herein relate to an audio processing system for comparing a query audio sample with a database of multiple reference audio samples using an external normalization, including: at least one processor; and memory having instructions stored thereon that, when executed by the processor, cause the audio processing system to: determine a bias term of the external normalization based on a spectro-temporal pattern of the query audio sample; compare the query audio sample with each of the reference audio samples to produce a similarity score for each comparison; combine the bias term with each of the similarity scores to produce normalized similarity scores; compare the normalized similarity scores with a threshold to produce a result of the comparison; and output the result of the comparison.
In some aspects, the techniques described herein relate to an audio processing method for comparing a query audio sample with a database of multiple reference audio samples using an external normalization, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor, carry out steps of the method, including: determining a bias term of the external normalization based on a spectro-temporal pattern of the query audio sample; comparing the query audio sample with each of the reference audio samples to produce a similarity score for each comparison; combining the bias term with each of the similarity scores to produce normalized similarity scores; comparing the normalized similarity scores with a threshold to produce a result of the comparison; and outputting the result of the comparison.
In some aspects, the techniques described herein relate to a non-transitory computer-readable storage medium having embodied thereon a program executable by a processor for performing a method for comparing a query audio sample with a database of multiple reference audio samples using an external normalization, the method including: determining a bias term of the external normalization based on a spectro-temporal pattern of the query audio sample; comparing the query audio sample with each of the reference audio samples to produce a similarity score for each comparison; combining the bias term with each of the similarity scores to produce normalized similarity scores; comparing the normalized similarity scores with a threshold to produce a result of the comparison; and outputting the result of the comparison.
The present disclosure is further described in the detailed description which follows, in reference to the noted plurality of drawings by way of non-limiting examples of exemplary embodiments of the present disclosure, in which like reference numerals represent similar parts throughout the several views of the drawings. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.
While the above-identified drawings set forth presently disclosed embodiments, other embodiments are also contemplated, as noted in the discussion. This disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art that fall within the scope and spirit of the principles of the presently disclosed embodiments.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, apparatuses and methods are shown in block diagram form only in order to avoid obscuring the present disclosure. Contemplated are various changes that may be made in the function and arrangement of elements without departing from the spirit and scope of the subject matter disclosed as set forth in the appended claims.
As used in this specification and claims, the terms “for example,” “for instance,” and “such as,” and the verbs “comprising,” “having,” “including,” and their other verb forms, when used in conjunction with a listing of one or more components or other items, are each to be construed as open-ended, meaning that the listing is not to be considered as excluding other, additional components or items. The term “based on” means at least partially based on. Further, it is to be understood that the phraseology and terminology employed herein are for the purpose of the description and should not be regarded as limiting. Any heading utilized within this description is for convenience only and has no legal or limiting effect.
Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements in the subject matter disclosed may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Further, like reference numbers and designations in the various drawings indicate like elements.
Some embodiments are based on recognizing that for a fair comparison, the audio samples need to be normalized. For example, the audio samples can be modified to be of the same length, same loudness, same average frequency characteristics, etc. Such a normalization modifies the audio sample itself and is referred to herein as an internal normalization. However, such an internal normalization is insufficient for a fair comparison of different audio samples. One of the reasons for such a deficiency is that the similarity of two audio samples depends on how the frequency content of an audio sample varies over time, which can also be described as comparing the spectro-temporal patterns between two audio samples. It is recognized that some sounds may have spectro-temporal patterns that are well-matched with a wide variety of other sounds. An extreme example of this is a white noise sound, which has energy at all frequencies and is constant across time, and may be well-matched to any sounds with constant frequency characteristics across time. Alternatively, a single brief click sound surrounded by silence may be well-matched to any other sound with a short duration in time, irrespective of the frequency content of the click. This issue persists even after normalizing sounds in terms of volume, equalization and/or length before comparison.
Hence, there is a need to normalize the audio comparison beyond the internal normalization of the audio samples. Such normalization is referred to herein as an external normalization that does not modify the audio sample itself but modifies the result of the comparison to make it more fair for different kinds of audio samples. In some embodiments, this external normalization determines a bias term based on a spectro-temporal pattern of a query audio sample and adds this bias term to the similarity scores produced by comparing the query audio sample with other reference audio samples. Hence, additionally or alternatively to normalizing the audio samples by internal normalization, some embodiments normalize the similarity scores by external normalization.
Another type of internal normalization includes applying a gain to the audio signals such that they are at the same loudness, where loudness can be measured using non-perceptual measures such as root mean square (RMS) level, or perceptually motivated measures such as loudness units relative to full scale (LUFS). Some implementations also internally normalize audio signals in terms of their overall frequency content using equalization filters. Alternative embodiments omit such equalization because this type of internal normalization is often not beneficial when comparing sounds with highly varying content.
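As a minimal sketch of this loudness-based internal normalization, the snippet below applies a scalar gain so that two signals reach a common RMS level; the target level, the array shapes, and the use of NumPy are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def rms(x, eps=1e-12):
    # Root-mean-square level of a mono signal.
    return np.sqrt(np.mean(np.square(x)) + eps)

def normalize_rms(x, target_rms=0.1):
    # Apply a scalar gain so the signal reaches the target RMS level.
    return x * (target_rms / rms(x))

# Example: bring two signals of different loudness to the same RMS level.
loud = 0.5 * np.random.randn(16000)
quiet = 0.05 * np.random.randn(16000)
loud_n, quiet_n = normalize_rms(loud), normalize_rms(quiet)
```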
In contrast, the external normalization 260 happens after comparison 220 and modifies the results of comparison 240. Instead of modifying the audio files for a fair comparison, the external normalization adds a bias term to the similarity score determined as a function of a spectro-temporal pattern of the query audio sample.
At step 310, method 300 determines a bias term of the external normalization based on a spectro-temporal pattern of the query audio sample. At step 320, method 300 compares the query audio sample with each of the reference audio samples to produce a similarity score for each comparison. At step 330, method 300 adds the bias term to each of the similarity scores to produce normalized similarity scores. At step 340, method 300 compares the normalized similarity scores with a threshold to produce a result of the comparison. At step 350, method 300 outputs the result of the comparison. Performing the comparison in this manner adjusts its results to account for the specific characteristics of the audio files.
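A minimal sketch of steps 310-350 is shown below; compute_bias and similarity are hypothetical placeholder callables standing in for the bias-term computation and the similarity measure described elsewhere in this disclosure, and the threshold value is illustrative.

```python
import numpy as np

def compare_with_database(query, references, compute_bias, similarity, threshold=0.85):
    # Step 310: determine the bias term from the query's spectro-temporal pattern.
    bias = compute_bias(query)
    # Step 320: compare the query with each reference to get raw similarity scores.
    scores = np.array([similarity(query, r) for r in references])
    # Step 330: combine the bias term with each score (the bias may be negative
    # when it acts as a discount).
    normalized = scores + bias
    # Step 340: compare the normalized scores against a single threshold.
    matches = normalized > threshold
    # Step 350: output the result of the comparison.
    return matches, normalized
```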
Notably, this method can be used to compare multiple audio samples with each other when some of the audio samples are regarded as queries and the other audio samples are regarded as the reference set. For each pair of a query q and a reference r, a similarity score is computed, resulting in a pairwise similarity matrix. Some embodiments follow a two-stage semi-manual approach consisting of a retrieval stage and a verification stage. In the retrieval stage, the embodiments retrieve the queries q whose top-1 match similarity score is above a certain threshold τ.
The similarity score s(q, r) between query q and reference r is computed from descriptors q, r ∈ ℝ^d extracted from q and r, using a similarity measure including, but not limited to, the cosine similarity:

s(q, r) = (q · r) / (‖q‖ ‖r‖)      (1)
Some embodiments use a low-dimensional log mel spectrogram as a sample descriptor. This is a natural choice since most audio generative models employing latent diffusion extract the latent representation from a mel spectrogram, and using a low-dimensional mel spectrogram helps to smooth out fine-grained details in the audio signal that can otherwise make it difficult to detect near duplication in datasets. Some embodiments use a contrastive language audio pretraining (CLAP) descriptor, which is an embedding vector from a pre-trained deep neural network that aligns audio content with its semantic description. The choice of the CLAP model varies among embodiments. For example, some embodiments replace the CLAP model with other deep neural network audio models that provide an embedding vector for an audio signal.
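The sketch below extracts such a low-dimensional log mel spectrogram descriptor with librosa; the number of mel bands, the FFT size, and the hop length are illustrative assumptions, not values prescribed by the embodiments.

```python
import numpy as np
import librosa

def log_mel_descriptor(path, sr=16000, n_mels=32, n_fft=1024, hop_length=512):
    # Load the audio as mono at a fixed sample rate.
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Low-dimensional mel spectrogram (few mel bands smooth fine-grained detail).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Log (decibel) scaling, with the maximum bin mapped to 0 dB.
    return librosa.power_to_db(mel, ref=np.max)
```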
Additionally, for each query q, some embodiments discount its similarity with each reference r using a bias term based on the average similarity between q and its K nearest neighbors in a background set of other samples, resulting in the normalized similarity score:

s_n(q, r) = s(q, r) − (β / K) Σ_{k=1}^{K} s(q, b_k)      (2)
where b_k is the k-th nearest neighbor of q in the background set (based on the similarity of their descriptors), and β is a scalar. Therefore, in the retrieval stage, we retrieve the following set of queries:

{ q : max_r s_n(q, r) > τ }      (3)
which are then inspected along with their top-1 matches in the verification stage.
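A vectorized NumPy sketch of this retrieval stage, assuming descriptors for the queries, references, and a background set have already been extracted and stacked row-wise, might look as follows; the values of K, β, and τ are illustrative.

```python
import numpy as np

def cosine_matrix(A, B, eps=1e-12):
    # Pairwise cosine similarity (equation (1)) between rows of A and rows of B.
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + eps)
    return A @ B.T

def retrieval_stage(Q, R, B, K=10, beta=0.5, tau=0.8):
    # Q, R, B: query, reference, and background descriptors, one per row.
    s_qr = cosine_matrix(Q, R)                    # raw scores s(q, r)
    s_qb = cosine_matrix(Q, B)                    # similarities to the background set
    # Average similarity to each query's K nearest background neighbors.
    knn_mean = np.sort(s_qb, axis=1)[:, -K:].mean(axis=1)
    s_norm = s_qr - beta * knn_mean[:, None]      # normalized scores (equation (2))
    retrieved = np.flatnonzero(s_norm.max(axis=1) > tau)   # equation (3)
    top1 = s_norm.argmax(axis=1)[retrieved]       # top-1 matches for verification
    return retrieved, top1, s_norm
```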
The similarity measure 430 takes as input two audio signals 110 and 460 and outputs a single number between 0 and 1, where 1 indicates that the two audio signals are identical and 0 means they share no similar features. The cosine similarity from equation (1) above is typically used as the similarity measure 430. Using only the raw output of the similarity measure to find potential matches, however, is not sufficient to find approximate matches in sets of real-world audio signals. By normalizing the score, the method prevents certain sounds (e.g., noise or clicks) from always being returned when searching for approximate matches in a data-driven manner.
The method normalizes the similarity score by subtracting 440 a bias term 425 from the output of a similarity measure 430. The bias term of the external normalization is determined 410 based on a spectro-temporal pattern of the query audio sample. For example, in some embodiments, the method extracts the spectro-temporal pattern of the query audio sample and processes the extracted spectro-temporal pattern with a predetermined analytical function to produce the bias term. Some examples of analytical functions used by some implementations are based on low-level characteristics of the audio signal, such as the zero-crossing rate or spectral flatness, which can be used as measures of how noise-like an audio sample is. Thus, sounds with a higher zero-crossing rate or spectral flatness could have a higher bias term 425, as noise-like sounds, which typically have energy at most frequencies, tend to be more similar to a wide variety of sounds in terms of their spectro-temporal patterns. In different embodiments, the method processes the extracted spectro-temporal pattern with a learned function trained with machine learning to produce the bias term.
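As one hedged illustration of such an analytical function, the snippet below maps the mean spectral flatness and zero-crossing rate of a query to a bias value using librosa; the file path, the equal weighting of the two features, and the overall scale are assumptions made only for illustration.

```python
import librosa

def noise_like_bias(y, scale=0.5):
    # Spectral flatness in [0, 1]; values near 1 indicate noise-like content.
    flatness = float(librosa.feature.spectral_flatness(y=y).mean())
    # Zero-crossing rate, also higher for noise-like signals.
    zcr = float(librosa.feature.zero_crossing_rate(y).mean())
    # Illustrative mapping: more noise-like queries receive a larger bias,
    # which penalizes them more when the bias is subtracted from the score.
    return scale * 0.5 * (flatness + zcr)

# "query.wav" is a placeholder path for the query audio sample.
y, sr = librosa.load("query.wav", sr=None, mono=True)
print(noise_like_bias(y))
```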
In some implementations, the bias term computed 410 by similarity normalization uses a background or training dataset having sounds that are different from those currently being searched for potential matches. The bias term is computed as the average similarity between the query sound q and the subset of the background dataset most similar to it, as described in equation (2), where the subset consists of the K nearest neighbors in the background dataset. It is useful for the background dataset to be diverse in terms of the spectro-temporal patterns of the included sounds. If the background dataset over-represents sounds with certain spectro-temporal patterns, the similarity normalization may overly penalize sounds with spectro-temporal patterns similar to those that are over-represented. When combining the similarity measure computed by comparing query audio sample q and reference audio sample r with the bias from similarity normalization, it is beneficial to have a scaling weight 420 to trade off between the two components of the similarity score. This scaling weight is a nonnegative number specified by the user and allows one to trade off the importance of the raw similarity measure against the similarity normalization bias. When the background dataset lacks diversity, which lowers the quality of the bias from similarity normalization, a smaller value for the scaling weight is selected to compensate for that deficiency. However, if the background dataset is extensive and diverse, a larger value for the scaling weight can be used. Some implementations use a scaling weight of 0.5.
Examples of extracted spectro-temporal patterns include the onset times of different sound events in audio signals, different harmonic patterns, etc. These spectro-temporal patterns are typically represented using a time-frequency representation such as a spectrogram, and the frequency axis can also reflect a perceptual grouping of frequencies, as in a mel spectrogram. Features extracted from powerful deep learning models can also be used as the representation of the spectro-temporal pattern of an audio signal. Examples of the predetermined analytical function 620 include the spectral flatness or zero-crossing rate of the audio signal as a proxy for how noise-like the signal is. Examples of the learned function include a neural network or support vector machine trained with machine learning, or a simple K-nearest-neighbor average over a background set. For example, in some embodiments, the learned function is trained with supervised machine learning using bias terms determined based on averaged similarity measures 520 of training audio samples.
In some implementations, the embodiment is configured to normalize the mel spectrograms with an internal normalization 770. For example, the internal normalization 770 converts the mel spectrograms 720 and 725 to decibel scale 730 and 735, respectively, and normalizes 740 and 745 them such that the maximum bin in each mel spectrogram has a value of 0 dB, and all bin values below −40 dB are clipped to −40 dB. Then, the embodiment flattens 750 and 755 the two-dimensional mel spectrograms (containing time and frequency dimensions) into one-dimensional vectors and computes the cosine similarity 760 between the two vectors as shown in equation (1).
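A brief sketch of this internal normalization 770 and the subsequent comparison, assuming both mel spectrograms were already computed and internally normalized to the same length so the flattened vectors match in size, might be:

```python
import numpy as np
import librosa

def normalize_mel_db(mel_power, floor_db=-40.0):
    # Decibel scale with the maximum bin at 0 dB, then clip everything below -40 dB.
    mel_db = librosa.power_to_db(mel_power, ref=np.max)
    return np.maximum(mel_db, floor_db)

def mel_cosine_similarity(mel_q, mel_r, eps=1e-12):
    # Flatten the two-dimensional (time x frequency) representations and
    # compute the cosine similarity of equation (1).
    q = normalize_mel_db(mel_q).ravel()
    r = normalize_mel_db(mel_r).ravel()
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r) + eps))
```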
For example, this embodiment can use a contrastive language audio pre-training (CLAP) model. This is a deep neural network that learns a common embedding space between sound signals and their text descriptions. The CLAP model takes as input an audio signal and returns a 512-dimensional vector. The embodiment can compute a similarity measure by computing the cosine similarity as in equation (1) using the CLAP embeddings computed for the query audio sample q and reference audio sample r.
In some embodiments, the database of multiple reference audio samples includes the query audio sample such that the reference audio samples are compared with each other. One embodiment uses the comparison to prune the database of multiple reference audio samples upon detecting duplications indicated by the result of the comparison. The pruned database of multiple reference audio samples can be used to train an audio deep-learning model more efficiently, thereby improving the operation of the computers.
The embodiments find connected components 950 in the binary similarity matrix treated as a graph adjacency matrix containing N nodes, where a value of 1 at the (i, j) index of the similarity matrix indicates that two nodes are connected in the graph and a value of zero means they are not connected. Various methods can be used to find 950 connected components in a graph, where each connected component can be considered a cluster of duplicated sounds. As an example, the processing of the similarity matrix for the case of N=4 is also illustrated for each processing step. In this example, there is one connected component spanning rows/columns 3 and 4, indicating that the sound files in the training set with indices 3 and 4 are duplicates.
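The N=4 example can be reproduced with SciPy's graph utilities as sketched below; the use of scipy.sparse.csgraph and the keep-the-first-member pruning rule are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# Binary similarity matrix for N=4: files 3 and 4 (zero-based indices 2 and 3)
# are marked as duplicates of each other.
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

n_components, labels = connected_components(A, directed=False)
print(n_components, labels)   # 3 components, labels such as [0 1 2 2]

# Prune the training set by keeping only the first member of each cluster.
keep = [int(np.flatnonzero(labels == c)[0]) for c in range(n_components)]
print(keep)                   # [0, 1, 2]
```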
The introduction of audio-generative models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. However, generative models used for text and images have shown extensive evidence of replicating their training data rather than generating novel text/images. How these results extend to audio-generative models is currently unknown. In this work, we propose a technique for detecting when an audio-generative model replicates its training data. Our approach uses score normalization to mitigate the issue that certain sounds may be inherently more similar to all other sounds. Detecting such replications could be useful for preventing future training-data replication and ensuring that the training data remains private. Furthermore, our detection approach is also useful for finding near duplicates in large training datasets, where knowledge of these duplicates can enable more efficient training algorithms, potentially saving large compute costs.
The ability to generate novel digital data such as text, audio, and music using a trained machine learning model has recently exploded into public consciousness, often referred to as “Generative AI.” These models are typically controlled by natural language and are easy to use. However, it is recognized that generative models do not always create novel data and sometimes output exact or near-exact replicas of data included in the training set. This can cause security concerns, as training data may be private, and replications may violate copyright law. Research attempting to answer the technical question of how generative models may memorize and/or copy their training data has begun to appear in the case of images. However, audio generative models, such as those based on latent diffusion models, are less mature, and methods for robustly detecting replicated training data in audio models are an underexplored area.
Recent works on text-to-music generation have uncovered evidence of training data memorization for transformer-based models and the lack of novelty of generated samples from a diffusion-based model. However, those works do not account for the fact that some sounds (e.g., constant white noise) may be inherently more similar to all other sounds, and as such many of the matches they find are sounds that lack diversity (i.e., have stationary characteristics). Furthermore, methods such as audio fingerprinting are good at finding exact matches in short snippets of audio but may fail for approximate matches, e.g., in identifying training data replications in audio generative models that may have artifacts or other changes but remain perceptually very similar to training data samples.
Different embodiments detect duplication during training and/or execution of the audio-generative models. For example, one embodiment is configured to train an audio-generative model using the database of multiple reference audio samples; and detect duplication of audio samples generated by the audio-generative model and reference samples in the database of multiple reference audio samples. Additionally or alternatively, another embodiment is configured to execute an audio-generative model trained using the database of multiple reference audio samples to generate an audio sample; and transmit the generated audio sample unless the generated audio sample with the external normalization is a duplication of one or more of the reference audio samples.
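A minimal sketch of the second configuration, in which a generated sample is transmitted only if it is not a near-duplicate of the training data, is given below; gen_model.generate and descriptor_fn are hypothetical placeholders, the reference and background descriptors are assumed to be L2-normalized rows, and the parameter values are illustrative.

```python
import numpy as np

def generate_and_filter(gen_model, prompt, ref_desc, bg_desc, descriptor_fn,
                        K=10, beta=0.5, tau=0.8):
    # Execute the audio-generative model (placeholder API) to produce a sample.
    audio = gen_model.generate(prompt)
    # Descriptor of the generated sample, L2-normalized for cosine similarity.
    q = descriptor_fn(audio)
    q = q / (np.linalg.norm(q) + 1e-12)
    sims_ref = ref_desc @ q                       # similarities to training references
    sims_bg = bg_desc @ q                         # similarities to the background set
    bias = beta * np.sort(sims_bg)[-K:].mean()    # bias term as in equation (2)
    if (sims_ref - bias).max() > tau:
        return None       # withhold: the sample appears to duplicate training data
    return audio          # transmit the generated sample
```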
The audio processing system 102 includes a hardware processor 1308 in communication with a computer storage memory, such as a memory 1310. Memory 1310 includes stored data, including algorithms, instructions, and other data that may be implemented by the hardware processor 1308. It is contemplated that the hardware processor 1308 may include two or more hardware processors depending upon the requirements of the specific application. The two or more hardware processors may be either internal or external. The audio processing system 102 may be incorporated with other components including output interfaces and transceivers, among other devices.
In some alternative embodiments, the hardware processor 1308 may be connected to a network 1312, which is in communication with one or more data source(s) 1314, a computer device 1316, a mobile phone device 1318, and a storage device 1320. The network 1312 may include, by non-limiting example, one or more local area networks (LANs) and/or wide area networks (WANs). The network 1312 may also include enterprise-wide computer networks, intranets, and the Internet. The audio signal processing system 1300 may include one or more client devices, storage components, and data sources. Each of the one or more client devices, storage components, and data sources may comprise a single device or multiple devices cooperating in a distributed environment of the network 1312.
In some other alternative embodiments, the hardware processor 1308 may be connected to a network-enabled server 1322 connected to a client device 1324. The hardware processor 1308 may be connected to an external memory device 1326, and a transmitter 1328. Further, an output for each target speaker may be outputted according to a specific user's intended use 1330. For example, the specific user's intended use 1330 may correspond to displaying speech in text (such as speech commands) on one or more display devices, such as a monitor or screen, or inputting the text for each target speaker into a computer-related device for further analysis, or the like.
The data source(s) 1314 may comprise data resources for training the restoration operator 104 for a speech enhancement task. For example, in an embodiment, the training data may include acoustic signals of multiple speakers talking simultaneously along with background noise. The training data may also include acoustic signals of single speakers talking alone, acoustic signals of single or multiple speakers talking in a noisy environment, and acoustic signals of noisy environments.
The data source(s) 1314 may also comprise data resources for training the restoration operator 104 for a speech recognition task. The data provided by data source(s) 1314 may include labeled and un-labeled data, such as transcribed and un-transcribed data. For example, in an embodiment, the data includes one or more sounds and may also include corresponding transcription information or labels that may be used for initializing the speech recognition task.
Further, unlabeled data in the data source(s) 1314 may be provided by one or more feedback loops. For example, usage data from spoken search queries performed on search engines can be provided as un-transcribed data. Other examples of data sources may include by way of example, and not limitation, various spoken-language audio or image sources including streaming sounds or video, web queries, mobile device camera or audio information, webcam feeds, smart-glasses and smart-watch feeds, customer care systems, security camera feeds, web documents, catalogs, user feeds, SMS logs, instant messaging logs, spoken-word transcripts, gaming system user interactions such as voice commands or captured images (e.g., depth camera images), tweets, chat or video-call records, or social-networking media. Specific data source(s) 1314 used may be determined based on the application including whether the data is a certain class of data (e.g., data only related to specific types of sounds, including machine systems, entertainment systems, for example) or general (non-class-specific) in nature.
The audio processing system 102 may also include third-party devices, which may include any type of computing device, such as an automatic speech recognition (ASR) system on the computing device. For example, the third-party devices may include a computer device, or a mobile device 1318. The mobile device 1318 may include a personal data assistant (PDA), a smartphone, a smart watch, smart glasses (or other wearable smart device), an augmented reality headset, a virtual reality headset, a laptop, a tablet, a remote control, an entertainment system, a vehicle computer system, an embedded system controller, an appliance, a home computer system, a security system, a consumer electronic device, or other similar electronics device. The mobile device 1318 may also include a microphone or line-in for receiving audio information, a camera for receiving video or image information, or a communication component (e.g., Wi-Fi functionality) for receiving such information from another source, such as the Internet or a data source 1314. In one example embodiment, the mobile device 1318 may be capable of receiving input data such as audio and image information. For instance, the input data may include a query of a speaker into a microphone of the mobile device 1318 while multiple speakers in a room are talking. The input data may be processed by the ASR in the mobile device 1318 using the audio processing system 102 to determine a content of the query. The audio processing system 102 may enhance the input data by reducing noise in the environment of the speaker, separating the speaker from other speakers, or enhancing audio signals of the query and enabling the ASR to output an accurate response to the query.
In some example embodiments, the storage 1320 may store information including data, computer instructions (e.g., software program instructions, routines, or services), and/or data related to the neural network model of the audio processing system 102. For example, the storage 1320 may store data from one or more data source(s) 1314, one or more deep neural network models, information for generating and training deep neural network models, and the computer-usable information outputted by one or more deep neural network models.
Furthermore, embodiments of the subject matter disclosed may be implemented, at least in part, either manually or automatically. Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine-readable medium. A processor(s) may perform the necessary tasks.
The above-described embodiments of the present disclosure may be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code may be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component. However, a processor may be implemented using circuitry in any suitable format.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, the embodiments of the present disclosure may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts concurrently, even though shown as sequential acts in illustrative embodiments. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the present disclosure.
Although the present disclosure has been described with reference to certain preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the present disclosure. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the present disclosure.
Number | Date | Country
--- | --- | ---
63589645 | Oct 2023 | US