Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are hereby incorporated by reference under 37 CFR 1.57.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document and/or the patent disclosure as it appears in the United States Patent and Trademark Office patent file and/or records, but otherwise reserves all copyrights whatsoever.
The present disclosure generally relates to audio content processing, and in particular, to methods and systems for adjusting the volume levels of speech in media files.
Conventional approaches for leveling speech in media files have proven deficient. For example, certain conventional techniques for leveling speech are time consuming, highly manual, and require the expert technical knowledge of audio professionals. Certain other existing techniques often change the level of the audio at the wrong time and place, thereby failing to retain the emotion and human characteristics of the speaker.
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
An aspect of the present disclosure relates to a system comprising: at least one processing device operable to: receive audio data; receive an identification of specified deliverables; access metadata corresponding to the specified deliverables, the metadata specifying target parameters including at least specified target loudness parameters for the specified deliverables; normalize an audio level of the audio data to a first specified target level using a corresponding gain to provide normalized audio data; perform loudness measurements on the normalized audio data; obtain a probability that speech audio is present in a given portion of the normalized audio data and identify a corresponding time duration; determine if the probability of speech being present within the given portion of the normalized audio data satisfies a first threshold; at least partly in response to determining that the probability of speech being present within the given portion of the normalized audio data satisfies the first threshold and that the corresponding time duration satisfies a second threshold, associate a speech indicator with the given portion of the normalized audio data; based at least in part on the loudness measurements, associate a given portion of the normalized audio data with a pair of change point indicators that indicate a short term change in loudness that satisfies a third threshold, the pair of change point indicators defining an audio segment; determine a gain needed to reach an interim target audio level for the given audio segment associated with the speech indicator; use the determined gain needed to reach the interim target audio level to perform volume leveling on the given audio segment to thereby provide a volume-leveled given audio segment; use one or more dynamics audio processors to process the volume-leveled given audio segment to satisfy one or more of the target parameters, including at least a specified target loudness parameter.
An aspect of the present disclosure relates to a computer implemented method comprising: accessing audio data; receiving an identification of specified deliverables; accessing metadata corresponding to the specified deliverables, the metadata specifying target parameters including at least specified target loudness parameters; performing loudness measurements on the audio data; obtaining a likelihood that speech audio is present in a given portion of the audio data and identifying a corresponding time duration; determining if the likelihood of speech being present within the given portion of the audio data satisfies a first threshold; at least partly in response to determining that the likelihood of speech being present within the given portion of the audio data satisfies the first threshold and that the corresponding time duration satisfies a second threshold, associating a speech indicator with the given portion of the audio data; based at least in part on the loudness measurements, associating a given portion of the audio data with a pair of change point indicators that indicate a short term change in loudness that satisfies a third threshold, the pair of change point indicators defining an audio segment; determining a gain needed to reach an interim target audio level for the given audio segment associated with the speech indicator; using the determined gain needed to reach the interim target audio level to perform volume leveling on the given audio segment to thereby provide a volume-leveled given audio segment; using one or more dynamics audio processors to process the volume-leveled given audio segment to satisfy one or more of the target parameters, including at least a specified target loudness parameter; generating a file comprising audio data processed to satisfy one or more of the target parameters; and providing the file generated using the processed audio data to one or more destinations.
Embodiments will now be described with reference to the drawings summarized below. These drawings and the associated description are provided to illustrate example aspects of the disclosure, and not to limit the scope of the invention.
The following description is presented to enable any person skilled in the art to make and use the apparatus and is provided in the context of particular applications of the apparatus and their requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art and the general principles defined herein may be applied to other embodiments and applications without departing from the scope of the present apparatus. Thus, the present apparatus is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Conventional methods for leveling speech in media files are time consuming, slow, inaccurate, and have proven deficient. For example, using one conventional approach users identify audio leveling problems in real-time by listening to audio files and watching the feedback from various types of meters. After an audio leveling problem is identified, the user needs to determine how to correct the audio leveling problem. One conventional approach uses an audio editing software application graphical user interface that requires the user to “draw” in the required volume changes by hand with a pointing device, a tedious, inaccurate, and time-consuming task.
Other conventional methods (often used in combination with the foregoing “drawing” approach) utilize various types of dynamic range processors (DRPs). One critical problem with DRPs is that DRPs are generally configured for music or singing and so perform poorly on recorded speech, whose signals vary both in dynamic range and amplitude. Even those DRPs configured to perform audio leveling specifically for speech are often deficient when it comes to sound quality, and are unable to retain the emotional character of the speaker, often increasing or decreasing the volume in the middle of words and at the wrong times.
Other conventional methods employ audio normalizers and loudness processors that utilize an integrated target value and thus fail to level speech at the right times, or miss short term and momentary speech fluctuations entirely. Thus, conventional techniques for leveling speech are time consuming, highly manual, inaccurate, and require expert knowledge from users who are rarely audio professionals. Conventional techniques often change the level of the audio at the wrong time or by an incorrect amount, failing to retain the emotion and human characteristics of the speaker.
Audio postproduction tasks are conventionally a mostly manual process, with all the inaccuracies and deficiencies associated with manual approaches. Non-linear editing software, digital audio workstations, audio hardware, and plugins have brought certain improvements in sonic quality and speed, but fail to adequately automate tasks, are tedious to utilize, and require trial and error in attempting to obtain a desired result.
When it comes to audio leveling for dialogue and speech, users currently have several conventional options to choose from, many of which have a common theme of needing manual intervention, such as drawing volume automation by hand, clip-based audio gain, audio normalization and various dynamic range processors including automatic gain control, compressors, and limiters.
Conventional automated speech leveling solutions may add noise, increase the volume of breaths and mouth clicks, miss short, long, and momentary volume fluctuations altogether, destroy dynamic range, sound unnatural, turn up or turn down speech volume at the wrong times, compress speech so that it sounds lifeless, and may produce an audible pumping effect.
Additionally, conventional speech leveling solutions may require users to choose the dynamic range needed and target loudness values, set several parameters manually and have a vast understanding of advanced audio concepts. Further, conventional speech leveling systems generally focus on meeting the integrated loudness target but may miss the short time, short time max and momentary loudness specifications completely. Further, conventional speech leveling systems may not produce a final deliverable audio file at the proper audio codec and channel format.
In order to solve some or all of the technical deficiencies of conventional techniques, disclosed are methods and systems configured to automate leveling speech, while meeting loudness and delivery file formats. Such an example system and process are illustrated in
In a first aspect of the present disclosure there is provided a method for automating various types of audio related tasks associated with volume leveling, audio loudness, audio dithering and audio file formatting. Such tasks may be performed in batch mode, where several audio records may be analyzed and processed in non-real time (e.g., when loading is otherwise relatively light for processing systems).
According to an embodiment of the first aspect, an analysis apparatus is configured to perform analysis tasks, including extracting audio loudness statistics, detecting loudness peaks, detecting speech (e.g., spoken words by one or more people) and classifying audio content (e.g., into speech and non-speech content audio types, and optionally into still additional categories).
By way of example, the systems and methods described herein may be configured to determine when and how much gain to apply to an audio signal.
With reference to
A system may be configured to receive audio loudness information and audio file format information from an audio deliverables database 100. The audio deliverables database 100 may optionally be classified by distributor, platform, and/or various audio loudness standards.
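By way of illustration only, such a deliverables record might be represented as a simple keyed data structure, as in the following Python sketch. The field names, tolerance values, and file format entries are assumptions drawn from the loudness values discussed elsewhere in this disclosure, and are not a required schema.

```python
# Hypothetical sketch of audio deliverables metadata keyed by deliverable name.
# Field names and values are illustrative assumptions based on the examples
# discussed in this disclosure (Broadcast ATSC/A85 and the "LUFS20" deliverable).
AUDIO_DELIVERABLES = {
    "Broadcast ATSC/A85": {
        "target_loudness_lkfs": -24.0,   # integrated loudness target
        "max_true_peak_db": -2.0,
        "tolerance_db": 2.0,             # assumed delivery tolerance
        "file_format": {"bit_depth": 24, "channels": 2},
    },
    "LUFS20": {
        "target_loudness_lufs": -20.0,
        "max_true_peak_db": -1.0,
        "tolerance_db": 1.0,             # assumed delivery tolerance
        "file_format": {"bit_depth": 16, "channels": 2},
    },
}

def lookup_deliverable(name: str) -> dict:
    """Return the target parameters for a specified deliverable."""
    return AUDIO_DELIVERABLES[name]
```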
With reference to
With reference to
Referring to
The speech analyzer system 300 may utilize speech detection. The speech detection process may utilize a window size of 10 ms such that probabilities of speech (or other speech likelihood indicators) are determined for each frame; optionally, other window sizes may be used, such as 1 ms, 5 ms, 200 ms, 5 seconds, and other values in between.
The speech detection process may be configured for the time domain as input, with other domains possible such as the frequency domain. If the domain of the input is specified as time, the input signal may be windowed and then converted to the frequency domain according to the window and sidelobe attenuation specified. The speech detection process may utilize a Hann window, although other windows may be used. The sidelobe attenuation may be 60 dB, with other values possible such as 40 dB, 50 dB, 80 dB, and other values in between. The FFT (Fast Fourier Transform) length may be 480, with other lengths possible, such as 512, 1024, 2048, 4096, 8192, 48000, and other values in between.
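By way of illustration only, the windowing and FFT step described above might be sketched as follows in Python (using NumPy/SciPy). Because a Hann window has a fixed sidelobe attenuation, the sketch substitutes a Chebyshev window when a specific attenuation value is requested; that substitution is an assumption made for illustration, as the disclosure only states that a window and sidelobe attenuation are specified.

```python
import numpy as np
from typing import Optional
from scipy.signal.windows import hann, chebwin

def frame_to_spectrum(frame: np.ndarray, fft_length: int = 480,
                      sidelobe_attenuation_db: Optional[float] = None) -> np.ndarray:
    """Window one audio frame and convert it to the frequency domain.

    A Hann window is used by default; a Chebyshev window (parameterized by
    attenuation) is substituted when a specific sidelobe attenuation is
    requested -- an assumption for illustration.
    """
    n = len(frame)
    if sidelobe_attenuation_db is None:
        window = hann(n, sym=False)
    else:
        window = chebwin(n, at=sidelobe_attenuation_db, sym=False)
    return np.fft.rfft(frame * window, n=fft_length)

# Example: a 10 ms frame at 48 kHz is 480 samples, matching the FFT length above.
spectrum = frame_to_spectrum(np.random.randn(480), fft_length=480,
                             sidelobe_attenuation_db=60.0)
```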
If the domain of the input is specified as frequency, the input is assumed to be a windowed Discrete Time Fourier Transform (DTFT) of an audio signal. The signal may be converted to the power domain. Noise variance is optionally estimated according to Martin, R. “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics.” IEEE Transactions on Speech and Audio Processing. Vol. 9, No. 5, 2001, pp. 504-512, the content of which is incorporated herein by reference in its entirety.
The posterior and prior SNR are optionally estimated according to the Minimum Mean-Square Error (MMSE) formula described in Ephraim, Y., and D. Malah. “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator.” IEEE Transactions on Acoustics, Speech, and Signal Processing. Vol. 32, No. 6, 1984, pp. 1109-1121, the content of which is incorporated herein by reference in its entirety.
A log likelihood ratio test and Hidden Markov Model (HMM)-based hang-over scheme are optionally used to determine the probability that the current frame contains speech, according to Sohn, Jongseo., Nam Soo Kim, and Wonyong Sung. “A Statistical Model-Based Voice Activity Detection.” Signal Processing Letters IEEE. Vol. 6, No. 1, 1999.
The speech detection process may optionally be implemented using an application or system that analyzes an audio file and returns probabilities of speech in a given frame or segment. The speech detection process may extract full-band and low-band frame energies, a set of line spectral frequencies, and the frame zero crossing rate, and based on the foregoing perform various initialization steps (e.g., an initialization of the long-term averages, setting of a voice activity decision, initialization for the characteristic energies of the background noise, etc.). Various difference parameters may then be calculated (e.g., a difference measure between current frame parameters and running averages of the background noise characteristics). For example, difference measures may be calculated for spectral distortion, energy, low-band energy, and zero-crossing. Using multi-boundary decision regions in the space of the foregoing difference measures, a voice activity decision may be made.
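By way of illustration only, the following simplified sketch computes a few of the per-frame features mentioned above (full-band energy, low-band energy, and zero-crossing rate) and makes a crude energy-based activity decision. The line spectral frequencies, long-term averaging, and multi-boundary decision regions are omitted, and the margins shown are assumptions rather than values taken from this disclosure.

```python
import numpy as np
from scipy.signal import butter, lfilter

def frame_features(frame, sample_rate=48000):
    """Per-frame features loosely following the description above: full-band
    energy, low-band energy (below ~1 kHz), and zero-crossing rate."""
    full_band_energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    b, a = butter(4, 1000.0 / (sample_rate / 2.0), btype="low")
    low_band = lfilter(b, a, frame)
    low_band_energy_db = 10.0 * np.log10(np.mean(low_band ** 2) + 1e-12)
    zero_crossing_rate = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return full_band_energy_db, low_band_energy_db, zero_crossing_rate

def crude_voice_activity(frame, noise_energy_db, sample_rate=48000,
                         energy_margin_db=6.0, max_zcr=0.25):
    """Illustrative stand-in for a voice activity decision: speech is assumed
    when the frame energy sufficiently exceeds a running noise estimate and
    the zero-crossing rate is low.  The thresholds here are assumptions."""
    full_db, _low_db, zcr = frame_features(frame, sample_rate)
    return (full_db - noise_energy_db) > energy_margin_db and zcr < max_zcr
```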
A speech decision engine may utilize a system to determine speech and non-speech using the speech probability output. The system may utilize a speech segment rule, a non-speech segment rule, and pad time to accomplish this, so that the segment includes a certain amount of non-speech audio to ensure that the beginning of the speech segment is included in the volume leveling process. The speech decision engine may further determine where initial non-speech starts and ends.
A speech decision engine 400 may utilize short term loudness measurements 303 to identify significant changes in volume amplitude. The system may optionally utilize non-speech timecodes to identify where to start the short-term loudness search. The search may calculate multiple (e.g., 2) different mean values, searching backward and forward in time, using a window (e.g., a 3 second window, and optionally other window sizes may be used such as 0.5 seconds, 2 seconds, 5 seconds, or up to the duration of each segment). The system may optionally evaluate each non-speech segment location to determine if a change point is present. When complete, a collection of time codes and change point indicators may represent the initial start and end points of candidate speech segments to be leveled. A change point may be defined as a condition where the audio levels may change by at least a threshold amount, and a change point indicator may be associated with a given change point.
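By way of illustration only, the backward/forward mean comparison described above might be sketched as follows. The 6 dB change threshold in the sketch is an assumption; the disclosure only requires that the short-term change satisfy a threshold.

```python
import numpy as np

def find_change_points(short_term_loudness, times, non_speech_times,
                       window_s=3.0, threshold_db=6.0):
    """At each non-speech location, compare the mean short-term loudness over
    a window looking backward in time with the mean over a window looking
    forward; flag a change point when the difference satisfies the threshold."""
    change_points = []
    times = np.asarray(times)
    stl = np.asarray(short_term_loudness)
    for t in non_speech_times:
        back = stl[(times >= t - window_s) & (times < t)]
        fwd = stl[(times >= t) & (times < t + window_s)]
        if len(back) == 0 or len(fwd) == 0:
            continue
        if abs(fwd.mean() - back.mean()) >= threshold_db:
            change_points.append(t)   # candidate speech segment boundary
    return change_points
```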
The speech decision engine 400 may optionally be configured to identify immutable change points using the resolve adjacent change points system 406. A change point may be classified as immutable, meaning once set the change point indicator is not to be removed. Immutable may be defined as when a non-speech duration exceeds a threshold period of time (e.g., 3 seconds, and optionally other non-speech durations may be used such as 0.5 seconds, 1 second, 5 seconds, and other values up to the duration of the longest non-speech segment).
The speech decision engine 400 may optionally be configured to resolve adjacent short time loudness change points using the resolve adjacent change points system 406. Adjacent change points may be identified as those occurring within a specified minimum time distance of each other. For example, the duration between change points may be <=3 seconds, although optionally other durations may be used, such as 0.5 seconds, 2 seconds, 10 seconds, 60 seconds, or other values in between.
The speech decision engine 400 may be configured to merge, add, remove, and/or correct the end points of candidate audio speech segments to determine the final audio speech segments using the interim target audio level system 410 for leveling. For example, similar audio segments may be merged. For example, adjacent audio segments within 2.5 dB (or other specified threshold range) of each other may be merged, thereby reducing the number of audio segments.
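By way of illustration only, the merging of similar adjacent segments might be sketched as follows. The segment representation and the duration-weighted level averaging are assumptions made for this sketch.

```python
def merge_similar_segments(segments, merge_threshold_db=2.5):
    """Merge adjacent speech segments whose levels are within a threshold of
    each other.  Each segment is assumed (for this sketch) to be a dict with
    'start', 'end', and 'loudness_db' keys, sorted by start time."""
    if not segments:
        return []
    merged = [dict(segments[0])]
    for seg in segments[1:]:
        prev = merged[-1]
        if abs(seg["loudness_db"] - prev["loudness_db"]) <= merge_threshold_db:
            # Extend the previous segment; a simple duration-weighted dB
            # average is used here as a simplification.
            d_prev = prev["end"] - prev["start"]
            d_seg = seg["end"] - seg["start"]
            prev["loudness_db"] = ((prev["loudness_db"] * d_prev +
                                    seg["loudness_db"] * d_seg) / (d_prev + d_seg))
            prev["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged
```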
The speech decision engine 400 may optionally determine an interim target audio level (ITAL) using the interim target audio level system 411, which may also be used to merge similar, adjacent audio segments. The ITAL may be dynamically updated based on the audio deliverables database output. The ITAL may optionally be utilized to provide audio gain instructions for the audio speech segment leveler. The ITAL may enable the dynamics audio processors threshold values to be in range more often than they would have otherwise.
Referring to
The volume leveler system 500 may utilize dynamics audio processors 502 to meet various international audio loudness requirements or specifications including target loudness, integrated loudness, short time loudness, and/or max true peak. The dynamics audio processors 502 may process frames of audio continuously over time when a given condition is met, for example, when the amplitude exceeds or is less than a pre-determined value. The parameters may be pre-determined or the parameters may update dynamically, dependent on the output of the audio deliverables database.
The dynamics audio processors 502 may be optimized for upward expanding 503 (e.g., to increase the dynamic range of the audio signal). The upward expander floor value may utilize the previously calculated noise floor determined from the non-speech segments within a speech detection table. The amount of gain increase in the output of the upward expander may be dependent on the upward expander 503 threshold. The upward expander may optionally utilize the output of the audio deliverables database 100 to dynamically update the threshold where needed. The upward expander 503 may utilize a range parameter. The upward expander 503 range may be used to limit the max amount of gain that can be applied to the output.
A post processing system 600 may be configured to transcode audio files (see, e.g.,
The system may optionally be configured for distributed gain staging (see, e.g., the example waveform 506 illustrated in
As noted above, the system and processes illustrated in
Optionally, the calculation excludes near silence and silence from the measurement, which improves accuracy in regard to speech volume. Once the difference is determined, the pre-processing system 200 may effectively normalize any file it receives to the same average level. While the preferred method is to filter after normalizing, it may also be beneficial to normalize after filtering. For example, when the ratio of non-speech to speech is high and the SNR is poor, normalization may perform less than adequately. Therefore, in such a scenario, normalization may be performed after filtering.
The pre-processing system 200 may optionally filter audio according to human speech. The filters may be high-pass and/or low-pass. The low-pass filter slope may be calculated in decibels per octave and be set at 48, although optionally other values may be used, such as 42, 44, 52, 58, and other values in between. The low-pass filter cutoff frequency may be set to 12 kHz, although optionally other cutoff frequencies may be utilized, such as 8 kHz, 14 kHz, 18 kHz, 20 kHz, or other values. For example, when noise hiss is at a lower frequency, the filter cutoff frequency may be set to a corresponding lower value to reduce the hiss.
The pre-processing system filter settings may be pre-determined or change dynamically, with the preferred method being pre-determined. The high-pass filter slope may be calculated in decibels per octave and be set at 48. The high-pass filter cutoff frequency may be set to 80 Hz, although optionally other slopes and cutoffs may be utilized, such as 40 Hz, 60 Hz, 200 Hz, and other values in between. For example, when recorded microphones vary greatly in bass response, the filter cutoff frequency may be set to a corresponding higher value to reduce the differences in the range of bass frequencies across the various speakers. Another benefit of the filters is added precision in the dynamics processors 502 with regard to threshold. For example, excessive amounts of low frequency noise outside the human speech range have been known to artificially raise the level of audio. This in turn affects the dynamics processors threshold value in a negative way; therefore, eliminating noise outside the human voice range provides an added benefit in the volume leveler system 500.
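By way of illustration only, the speech-band filtering described above might be sketched as follows. The 48 dB/octave slope is approximated by an 8th-order Butterworth filter (roughly 6 dB/octave per order); that mapping is an assumption of this sketch, not a statement of the disclosure's filter topology.

```python
from scipy.signal import butter, sosfilt

def speech_band_filter(audio, sample_rate, hp_cutoff_hz=80.0,
                       lp_cutoff_hz=12000.0, slope_db_per_octave=48):
    """Band-limit audio to the human speech range with high-pass and low-pass
    Butterworth filters (cascaded second-order sections for stability)."""
    order = max(1, int(round(slope_db_per_octave / 6)))   # ~6 dB/octave per order
    nyquist = sample_rate / 2.0
    sos_hp = butter(order, hp_cutoff_hz / nyquist, btype="highpass", output="sos")
    sos_lp = butter(order, lp_cutoff_hz / nyquist, btype="lowpass", output="sos")
    return sosfilt(sos_lp, sosfilt(sos_hp, audio))
```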
The example loudness measurements system 302 may optionally utilize the BS.1770 standard, while other standards, such as EBU R 128, may be used. A frame, or window, size of a certain number of samples may be utilized (e.g., 9600 samples, although optionally other numbers of samples may be utilized, such as 1024, 2048, 4096, 14400, 48000, etc.). The loudness measurements may be placed in an array comprising time code, momentary loudness, momentary maximum loudness, short time loudness, integrated loudness, loudness range, loudness peak, and/or the like.
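By way of illustration only, BS.1770-style measurements might be obtained with a third-party package such as pyloudnorm, as in the following sketch; the disclosure does not require any particular library. Short-term values are approximated here by metering 3-second slices, which ignores the gating differences between integrated and short-term loudness.

```python
import numpy as np
import soundfile as sf          # third-party: reads audio files
import pyloudnorm as pyln       # third-party: BS.1770 loudness meter

def loudness_measurements(path, short_term_window_s=3.0, hop_s=0.2):
    """Measure integrated loudness, approximate short-term loudness, and
    sample peak for an audio file."""
    audio, rate = sf.read(path)
    meter = pyln.Meter(rate)                       # BS.1770 meter
    integrated = meter.integrated_loudness(audio)

    window, hop = int(short_term_window_s * rate), int(hop_s * rate)
    short_term = [(start / rate,
                   meter.integrated_loudness(audio[start:start + window]))
                  for start in range(0, len(audio) - window + 1, hop)]
    peak_dbfs = 20.0 * np.log10(np.max(np.abs(audio)) + 1e-12)
    return {"integrated_lufs": integrated,
            "short_term_lufs": short_term,
            "peak_dbfs": peak_dbfs}
```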
The system illustrated in
As discussed above,
The speech analyzer illustrated in
The error correction process may search for peak levels in the peak statistics 307 less than the Non-Speech Peak value, and when such peak levels are found, set the corresponding speech probability within the speech detection array to a probability of 0%, which may then indicate non-speech.
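By way of illustration only, this error-correction pass might be sketched as follows; the per-frame record layout is an assumption made for the sketch.

```python
def correct_speech_probabilities(speech_detection, non_speech_peak_db):
    """Error-correction pass described above: any frame whose measured peak
    level is below the Non-Speech Peak value has its speech probability forced
    to 0%, marking it as non-speech.  Each entry is assumed to be a dict with
    'peak_db' and 'speech_probability' keys."""
    for frame in speech_detection:
        if frame["peak_db"] < non_speech_peak_db:
            frame["speech_probability"] = 0.0
    return speech_detection
```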
Referring to
The example speech decision engine 400 illustrated in
To identify non-speech, the system may utilize the following example rule. For example, non-speech segments may be defined as when the speech probability falls below 75% for a minimum duration of 100 ms. Optionally, other probabilities may be used that are typically less than the speech probability threshold but may also be greater. Other minimum non-speech durations may be used, as low as 1 ms or up to the file duration.
The example speech decision engine 400 may utilize a duration of time, known as pad time, to help ensure the detected time code is located within non-speech and not speech. For example, after all the speech and non-speech segments have been identified, pad time may be applied to each segment start and end time code. The pad time may be added to the start of each non-speech segment and subtracted from the end of each non-speech segment. The pad time may be defined as 100 ms and optionally other times as short as 1 ms or as long as the previously identified non-speech segment and any value between may be used.
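By way of illustration only, the segmentation and pad-time rules above might be sketched as follows; the sketch is simplified and does not reproduce the disclosure's complete rule set.

```python
def segment_speech(frame_probabilities, frame_duration_s=0.01,
                   speech_threshold=0.75, min_non_speech_s=0.1, pad_s=0.1):
    """Turn per-frame speech probabilities into speech / non-speech segments
    using the example rule above (probability below 75% for at least 100 ms
    marks non-speech), then apply a 100 ms pad so that a little non-speech
    remains attached to the surrounding speech."""
    # 1. Group frames into runs of identical classification.
    flags = [p >= speech_threshold for p in frame_probabilities]
    runs = []
    for i, flag in enumerate(flags):
        if runs and runs[-1]["speech"] == flag:
            runs[-1]["end"] = (i + 1) * frame_duration_s
        else:
            runs.append({"start": i * frame_duration_s,
                         "end": (i + 1) * frame_duration_s, "speech": flag})
    # 2. Reclassify non-speech runs that are too short as speech.
    for run in runs:
        if not run["speech"] and (run["end"] - run["start"]) < min_non_speech_s:
            run["speech"] = True
    # 3. Merge adjacent runs that now share the same classification.
    segments = []
    for run in runs:
        if segments and segments[-1]["speech"] == run["speech"]:
            segments[-1]["end"] = run["end"]
        else:
            segments.append(dict(run))
    # 4. Pad time: non-speech starts later and ends earlier.
    for seg in segments:
        if not seg["speech"]:
            seg["start"] += pad_s
            seg["end"] = max(seg["start"], seg["end"] - pad_s)
    return segments
```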
SoftNonSpeech=dB(rms(non-speech segment))
The speech audio segments may be measured to find the softest speech by moving through each speech segment using a window and step, where the window size may be 400 ms in duration (optionally other values, as short as 10 ms and as long as the speech segment, may be used). The step size may be 100 ms in duration (optionally other values, as short as 1 ms and as long as the speech segment, may be used). The following describes how each speech segment is searched for the softest window:
SpeechLevel=dB(rms(speech window segment))
Each measurement of SpeechLevel may need to pass acceptance tests before being accepted. The acceptance tests may be defined where:
If the Speech Level passes the acceptance tests the Speech Level may be set as the new Softest Speech.
This process may continue evaluating the acceptance tests for all the speech segments, and when complete, Softest Speech may contain the value and location of the softest speech.
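By way of illustration only, the softest-speech search might be sketched as follows. Because the acceptance tests are not reproduced above, they are represented by a caller-supplied predicate in this sketch.

```python
import numpy as np

def find_softest_speech(audio, sample_rate, speech_segments, window_s=0.4,
                        step_s=0.1, passes_acceptance=lambda level_db: True):
    """Slide a 400 ms window in 100 ms steps through each speech segment,
    measure the RMS level in dB, and keep the softest window that passes the
    acceptance tests (supplied here as a placeholder predicate)."""
    def db_rms(x):
        return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

    window, step = int(window_s * sample_rate), int(step_s * sample_rate)
    softest = {"level_db": float("inf"), "time_s": None}
    for seg in speech_segments:
        start = int(seg["start"] * sample_rate)
        end = int(seg["end"] * sample_rate)
        for pos in range(start, max(start + 1, end - window), step):
            level_db = db_rms(audio[pos:pos + window])
            if level_db < softest["level_db"] and passes_acceptance(level_db):
                softest = {"level_db": level_db, "time_s": pos / sample_rate}
    return softest
```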
The mark change point system 405 of the speech decision engine 400 may identify immutable change points. A change point may be classified as immutable indicating the change point may never be removed. Immutable may be defined as when a non-speech duration exceeds 3 seconds with other durations possible such as 1 second, 5 seconds or any value between 0.5 seconds and up to the duration of the longest non-speech segment. Some or all of the change points may be evaluated and if a non-speech duration meets the definition of immutable the change point may be marked as immutable.
A further refinement of the identification of speech and non-speech content may be performed using the processing at blocks 406, 407, 408.
The resolve adjacent change points system 406 may check to determine if any of the pairs of change points are marked as immutable. If both change points are marked immutable the change points may be skipped. If one of the change points is immutable the change point that is not marked immutable may be removed. If none of the change points are marked immutable then a further check may be performed.
For example, if one or both of the change points have a Diff1 value >= 10 dB, then the change point with the smallest Diff1 value may be removed. This further check may also improve the sound such that when the speech suddenly rises or falls by an extreme amount over a short time period, the volume fluctuation may be reduced. If, however, both change points have a Diff1 < 10 dB, the system may remove the second change point. When the resolve adjacent change points system 406 has completed its resolving process, the change points remaining may be another step closer to the final list.
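By way of illustration only, the adjacent change-point resolution might be sketched as follows; the change-point record layout is an assumption made for the sketch.

```python
def resolve_adjacent_change_points(change_points, min_distance_s=3.0,
                                   diff_threshold_db=10.0):
    """Resolve pairs of adjacent change points per the rules above.  Each
    change point is assumed to be a dict with 'time', 'immutable', and 'diff1'
    keys, and the list is assumed sorted by time."""
    resolved = list(change_points)
    i = 0
    while i < len(resolved) - 1:
        first, second = resolved[i], resolved[i + 1]
        if second["time"] - first["time"] > min_distance_s:
            i += 1                      # not adjacent; nothing to resolve
        elif first["immutable"] and second["immutable"]:
            i += 1                      # both immutable: skip the pair
        elif first["immutable"]:
            resolved.pop(i + 1)         # remove the mutable change point
        elif second["immutable"]:
            resolved.pop(i)
        elif max(first["diff1"], second["diff1"]) >= diff_threshold_db:
            # Remove the change point with the smaller Diff1 value.
            resolved.pop(i if first["diff1"] < second["diff1"] else i + 1)
        else:
            resolved.pop(i + 1)         # both below threshold: remove the second
    return resolved
```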
Speech levels may at times change more slowly than normal and not be identified; therefore, the system may provide a method for identifying these slow-moving changes.
A second test may check for when both the current and previous non-speech segments are not change points. If either the first test or the second test passes, then a third test may be evaluated. The third test may check for certain conditions (e.g., 2 conditions), any of which must be true to pass:
Optionally Diff2 may be compared to other values such as 1 dB, 3 dB, 6 dB up to 24 dB with values between. Optionally Diff3 may be compared to other values such as 1 dB, 3 dB, 6 dB up to 24 dB with values between. If the third test passes it may be determined the audio has changed by a sufficient amount to justify adding a new change point for the current non-speech segment.
A significant point is that removing a change point may merge two segments into one new longer segment.
For each change point the following validation steps may be processed in the order indicated where:
The validation process may include the following acts:
If this condition is not checked, short speech segments (<=3 seconds) may rise or fall in volume at erratic times and by drastic amounts.
The system 409 may optionally measure the integrated loudness for each audio buffer. The system 409 may utilize the final occurrence of the integrated loudness measurements for each speech segment for determining the speech segment integrated loudness.
The system 410 may define a second condition as when the duration of a current speech segment may be less than a predefined minimum duration, such as 3 seconds. Optionally other minimum durations may be used such as 0.5 seconds, 5 seconds, and others up to 60 seconds with any value between. If the minimum speech duration is detected for a speech segment, then the speech segment may be merged into the next speech segment.
In a second aspect, the system 411 may measure the integrated loudness for each audio buffer. In a third aspect, the system 411 may utilize the final occurrence of the integrated loudness measurements for each speech segment for determining the speech segment integrated loudness.
In a second step, the gain, when applied, may raise or lower the speech segment audio level to reach the interim target audio level as determined by the output from the audio deliverables database 100. The interim target gain may be calculated by the system 411 where:
The audio gain instructions may be stored in a storage location whereby the speech segment leveler system 501 may access the audio gain instructions.
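By way of illustration only, and assuming that the interim target gain is simply the difference between the interim target audio level and the segment's measured integrated loudness (the exact formula is not reproduced above, so this is an assumption), the audio gain instructions might be generated as follows.

```python
def interim_target_gain(segment_integrated_loudness_db,
                        interim_target_audio_level_db):
    """Assumed relationship for this sketch: the gain needed to move a speech
    segment to the interim target audio level is the difference between the
    interim target and the segment's measured integrated loudness."""
    return interim_target_audio_level_db - segment_integrated_loudness_db

def build_gain_instructions(speech_segments, interim_target_audio_level_db):
    """Produce per-segment gain instructions for the speech segment leveler."""
    return [
        {"start": seg["start"], "end": seg["end"],
         "gain_db": interim_target_gain(seg["integrated_loudness_db"],
                                        interim_target_audio_level_db)}
        for seg in speech_segments
    ]
```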
The volume leveler system 500 may utilize distributed gain staging.
The audio speech segment leveler may be configured so that the signal envelope and dynamics remain unaltered for each audio segment.
By way of example the audio speech segment leveler may utilize a max gain limit function. The max gain limit may change dynamically based on the output of the audio deliverables database 100. The max gain limit may be calculated as:
max gain limit=ABS(NTAL−ITAL)+10 dB
Additionally, the following rules may be applied to the max gain limit function: if the audio segment gain instructions > max gain limit, then apply the max gain limit; otherwise, apply the segment gain instructions.
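By way of illustration only, the max gain limit rule might be applied as follows. Reading NTAL as the normalization target audio level is an assumption based on context, and attenuating (negative) gains are passed through unchanged in this sketch.

```python
def limited_segment_gain(segment_gain_db, normalization_target_audio_level_db,
                         interim_target_audio_level_db, headroom_db=10.0):
    """Apply the max gain limit rule: the limit is |NTAL - ITAL| + 10 dB, and
    any requested segment gain above the limit is replaced by the limit."""
    max_gain_limit = abs(normalization_target_audio_level_db
                         - interim_target_audio_level_db) + headroom_db
    # Gains above the limit are clamped; attenuating gains pass through.
    return min(segment_gain_db, max_gain_limit)
```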
Referring again to
The parameters may be pre-determined or may update dynamically, dependent on the output of the audio deliverables database 100. If the parameters are not pre-determined they may be calculated or matched to an audio loudness requirement or specification. For example, if the audio deliverables database output is Broadcast ATSC/85, the threshold value for the upward expander may be calculated as:
threshold=deliverable target audio level*0.5
By way of example the above may translate to:
threshold (−12 dB)=deliverable target audio level (−24)*0.5
In another example, if the audio deliverables database output is LUFS20 (Loudness Unit Full Scale 20) in accordance with the EBU R 128 standard, the threshold value for the upward expander may be calculated, by way of example, as:
threshold (−10 dB)=deliverable target audio level (−20)*0.5
Furthermore, if the audio deliverables database output is LUFS20, the threshold for the hard limiter may utilize the max true peak specification from the LUFS20 loudness standard.
threshold (−1 dB)=LUFS20 true peak (−1 dB)
By way of further example, if the audio deliverables database output is Broadcast ATSC/A85, the threshold for the limiter may utilize the max true peak specification from the Broadcast ATSC/A85 loudness standard.
threshold (−2 dB)=ATSC/A85 true peak (−2 dB)
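By way of illustration only, the threshold relationships above might be derived from a deliverables record as follows, reusing the hypothetical record layout sketched earlier in this description.

```python
def dynamics_thresholds(deliverable):
    """Derive dynamics-processor thresholds from a deliverables record using
    the example relationships above: the upward expander threshold is half the
    deliverable target audio level, and the hard limiter threshold follows the
    deliverable's max true peak specification."""
    target = deliverable.get("target_loudness_lkfs",
                             deliverable.get("target_loudness_lufs"))
    return {
        "upward_expander_threshold_db": target * 0.5,   # e.g. -24 * 0.5 = -12 dB
        "hard_limiter_threshold_db": deliverable["max_true_peak_db"],
    }

# Example for Broadcast ATSC/A85: expander threshold -12 dB, limiter threshold -2 dB.
```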
The dynamics audio processor 502, may be optimized for upward expanding using the upward expander 503. The upward expander 503 attack and release functions may be optimized for speech. The upward expander attack may control how long it takes for the gain to be increased once the signal is below the threshold. The upward expander release may be used to control how long the gain takes to return to 0 dB of gain when the signal is above the threshold, with other methods possible. The upward expander release time value may be 0.0519 seconds and the attack time value may be 0.0052 seconds with other attack and release time values possible.
The upward expander 503 may be optimized for reducing noise by increasing the output gain of the signal only when the input signal is less than the threshold and greater than the floor value. The floor in the upward expander 503 may therefore provide an added benefit of enabling background noise to be reduced or remain at the same level relative to the signal regardless of how many dB the upward expander output gain may increase. The upward expander floor value may utilize the previously calculated noise floor determined from the non-speech segments within the speech detection table with other values or methods possible. The upward expander ratio value may be pre-determined as 0.5 with many other ratio values possible. The amount of gain increase in the output may be dependent on the upward expander ratio value. The upward expander threshold may be calculated as:
(deliverable target audio level*upward expander ratio)
The upward expander gain may be calculated as:
(upward expander threshold+(signal dB−upward expander threshold)*upward expander ratio)−signal dB
The upward expander 503 may utilize a range parameter. The upward expander range may be used to limit the max amount of gain that can be applied to the output.
The upward expander range may be calculated as:
(interim target audio level−deliverable target audio level)+deliverable tolerance
The range calculation may not always be precise enough to meet the number of different deliverable audio target levels.
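By way of illustration only, the upward expander gain and range formulas above might be expressed as follows; the interim target and tolerance values in the worked example are assumptions.

```python
def upward_expander_gain(signal_db, deliverable_target_db, ratio=0.5):
    """Gain added by the upward expander for a frame below threshold (and
    above the floor), per the formulas above: threshold = target * ratio, and
    gain = (threshold + (signal - threshold) * ratio) - signal."""
    threshold = deliverable_target_db * ratio
    return (threshold + (signal_db - threshold) * ratio) - signal_db

def upward_expander_range(interim_target_db, deliverable_target_db, tolerance_db):
    """Maximum gain the expander may add, per the range calculation above."""
    return (interim_target_db - deliverable_target_db) + tolerance_db

# Example with a -24 LKFS deliverable, an assumed -20 dB interim target, and an
# assumed 2 dB tolerance: a -30 dB frame gets
# (-12 + (-30 - (-12)) * 0.5) - (-30) = 9 dB of gain,
# capped at (-20 - (-24)) + 2 = 6 dB by the range parameter.
```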
By way of example, the dynamics audio processors 502 may utilize a compressor 504. The compressor 504 may reduce the output gain when the signal is above a threshold. The compressor threshold may be calculated as:
threshold=deliverable target audio level/2
This calculation may allow the threshold in the compressor 504 to be automatically updated to support different outputs from the audio deliverables database 100. Optionally, the compressor threshold may be set to the equivalent “short max loudness” metadata found in a given loudness specification 102, such as that illustrated in
By way of example, the dynamics audio processors 502 may utilize a limiter 505, which may include one or more limiters, such as a hard limiter and/or a peak limiter. The hard limiter may be configured such that no signal will ever be louder than the threshold. The hard limiter threshold may be set to the equivalent “true peak” audio loudness metadata output from the audio deliverables and standards database 102. This calculation may allow the hard limiter threshold to be automatically updated to support various loudness audio standards depending on the output from the audio deliverables database 100. The remaining values and time parameters in the hard limiter may be pre-determined as the following: the hard limiter knee width may be 0 dB, the hard limiter release time may be 0.00519 seconds, and the hard limiter attack time may be 0.000 seconds; optionally, the release and attack times may contain other time-based values between 0 and 10 seconds.
By way of example, the dynamics audio processors 502 may optionally utilize a peak limiter, which may process the audio signal prior to the hard limiter. The peak limiter may be configured such that some signals may still pass the threshold. The peak limiter may be utilized to improve the performance of the hard limiter. For example, the peak limiter threshold value may be calculated so as to reduce the number of peaks the hard limiter must process. The peak limiter threshold value may change depending on the true peak audio loudness specification, such as set forth in the audio deliverables and standards 102. Further, the peak limiter threshold may be automatically updated to support various loudness audio standards depending on the output from the audio deliverables database 100. The remaining values and time parameters in the peak limiter may be pre-determined as the following: the peak limiter knee width may be 5 dB, the peak limiter release time may be 0.05519 seconds, and the peak limiter attack time may be 0.000361 seconds, with other values and times possible.
The post processing dither and noise shaping functions may utilize the following methods, although other methods may be used. For example, if the audio deliverables database output file format is 24 bit, the post processing system dither and noise shaping 602 may utilize triangular_hp (triangular dither with high pass) and if the audio deliverables database output file format is 16 bit, the post processing system dither and noise shaping method may utilize low shibata. The post processing system may utilize other dither and noise shaping methods including rectangular, triangular, lipshitz, shibata, high shibata, f weighted, modified e weighted, and improved e weighted.
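By way of illustration only, the bit-depth-to-dither-method selection might be sketched as follows; mapping these method names onto a particular resampler's or encoder's option names is left as an assumption.

```python
def choose_dither_method(output_bit_depth: int) -> str:
    """Pick a dither / noise-shaping method from the output bit depth, per the
    example above: triangular dither with high pass for 24-bit output and
    low shibata noise shaping for 16-bit output.  The returned strings follow
    the naming used in this description; the fallback for other bit depths is
    an assumption."""
    if output_bit_depth >= 24:
        return "triangular_hp"
    if output_bit_depth == 16:
        return "low_shibata"
    return "triangular"   # assumed fallback for other bit depths
```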
Additionally, the system illustrated in
Other implementations of content classification may be used. In the description herein, various functionalities may be described and depicted in terms of components or modules. Furthermore, it may be appreciated that certain embodiments may be configured to improve encoding/decoding of audio signals such as AAC, MPEG-2, and MPEG-4.
Optionally, certain embodiments may be used for identifying content broadcast on FM/AM digital radio bit streams.
Certain embodiments may enhance the measurement of audience viewing analytics by logging content classifications, which may be transmitted (in real-time or non-real-time) for further analysis, from which viewing habits, trends, etc. may be derived for an individual or group of consumers.
Optionally, specific content identification information may be embedded within the audio signal(s) for the purpose of accurately determining information such as content title, start date/time, duration, channel and content classifications.
Optionally, channels and/or content may be excluded from processing, from automation actions, or from other options. The exclusion options may be activated through the use of information within a Bitstream Control Database or from downloaded information.
Certain embodiments may also be used to enhance intelligence gathering or the interception of signals between people (“communications intelligence”—COMINT), whether involving electronic signals not directly used in communication (“electronic intelligence”—ELINT), or combinations of the two.
The methods and processes described herein may have fewer or additional steps or states, and the steps or states may be performed in a different order. Not all steps or states need to be reached. The methods and processes described herein may be embodied in, and fully or partially automated via, software code modules executed by one or more general purpose computers. The code modules may be stored in any type of computer-readable medium or other computer storage device. Some or all of the methods may alternatively be embodied in whole or in part in specialized computer hardware. The systems described herein may optionally include displays, user input devices (e.g., touchscreen, keyboard, mouse, voice recognition, etc.), network interfaces, etc.
The results of the disclosed methods may be stored in any type of computer data repository, such as relational databases and flat file systems that use volatile and/or non-volatile memory (e.g., magnetic disk storage, optical storage, EEPROM and/or solid state RAM).
The various illustrative logical blocks, modules, routines, and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.
Moreover, the various illustrative logical blocks and modules described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a general purpose processor device, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor device can be a microprocessor, but in the alternative, the processor device can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor device can include electrical circuitry configured to process computer-executable instructions. In another embodiment, a processor device includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor device can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor device may also include primarily analog components. A computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
The elements of a method, process, routine, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor device, or in a combination of the two. A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of a non-transitory computer-readable storage medium. An exemplary storage medium can be coupled to the processor device such that the processor device can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor device. The processor device and the storage medium can reside in an ASIC. The ASIC can reside in a user terminal. In the alternative, the processor device and the storage medium can reside as discrete components in a user terminal.
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.
While the phrase “click” may be used with respect to a user selecting a control, menu selection, or the like, other user inputs may be used, such as voice commands, text entry, gestures, etc. User inputs may, by way of example, be provided via an interface, such as via text fields, wherein a user enters text, and/or via a menu selection (e.g., a drop down menu, a list or other arrangement via which the user can check via a check box or otherwise make a selection or selections, a group of individually selectable icons, etc.). When the user provides an input or activates a control, a corresponding computing system may perform the corresponding operation. Some or all of the data, inputs and instructions provided by a user may optionally be stored in a system data store (e.g., a database), from which the system may access and retrieve such data, inputs, and instructions. The notifications/alerts and user interfaces described herein may be provided via a Web page, a dedicated or non-dedicated phone application, computer application, a short messaging service message (e.g., SMS, MMS, etc.), instant messaging, email, push notification, audibly, a pop-up interface, and/or otherwise.
The user terminals described herein may be in the form of a mobile communication device (e.g., a cell phone), laptop, tablet computer, interactive television, game console, media streaming device, head-wearable display, networked watch, etc. The user terminals may optionally include displays, user input devices (e.g., touchscreen, keyboard, mouse, voice recognition, etc.), network interfaces, etc.
While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it can be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms illustrated can be made without departing from the spirit of the disclosure. As can be recognized, certain embodiments described herein can be embodied within a form that does not provide all of the features and benefits set forth herein, as some features can be used or practiced separately from others. The scope of certain embodiments disclosed herein is indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.