Identification of patterns in sound data, also known as pattern matching, may be utilized to support a wide variety of different usage scenarios. This may include audio source separation, which may involve identification of sound data that corresponds to different sound sources. For example, audio source separation may be performed to remove noise from a recording, separate different speakers in a dialog, and so on. In another example, pattern matching may be used to support word spotting and audio retrieval, such as a part of voice recognition (e.g., a virtual phone menu) by identifying particular keywords in the sound data, to locate sound data having desired keywords or other sounds, and so on.
Conventional techniques that were utilized to identify patterns in sound data, however, typically relied on a matrix representation of the sound data. This representation could be resource intensive to analyze, even when confronted with sparse sound data in which most of the frequency energies are close to zero. Consequently, such representations may be ill suited to real time scenarios and result in needless consumption of computational resources.
Pattern identification using convolution is described. In one or more implementations, a representation of a pattern is obtained that is described using data points that include frequency coordinates, time coordinates, and energy values. An identification is made as to whether sound data described using irregularly positioned data points includes the pattern, the identifying including use of a convolution of the frequency or time coordinates to determine correspondence with the representation of the pattern.
In one or more implementations, sound data is represented using a plurality of vectors that reference a frequency coordinate, a time coordinate, and an energy value of each time/frequency point in the sound data. Irregularly positioned underlying patterns are identified in the represented sound data in different time or frequency positions in the sound data.
In one or more implementations, a system includes at least one module implemented at least partially in hardware and configured to generate a pattern using data points that include frequency coordinates, time coordinates, and energy values. The system also includes one or more modules implemented at least partially in hardware and configured to identify whether the pattern is included in sound data using convolutions to address irregular data points in a vectorized landmark space in the sound data as part of nonnegative matrix factorization (NMF).
This Summary introduces a selection of concepts in a simplified form that are further described below in the Detailed Description. As such, this Summary is not intended to identify essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different instances in the description and the figures may indicate similar or identical items. Entities represented in the figures may be indicative of one or more entities and thus reference may be made interchangeably to single or plural forms of the entities in the discussion.
Overview
Pattern identification and other sound processing techniques may utilize matrices to represent sound data. However, in some instances sound spectrums are sparse (e.g., contain relatively low amounts of sound) and thus portions of the sound spectrum may contain frequency energies that are close to zero. Consequently, these portions that are sparse may not contribute to identification of patterns or other processing but are still processed nonetheless, thereby consuming computational resources.
Pattern identification techniques that leverage convolution are described. In one or more implementations, a compact representation of sound data is leveraged to identify patterns and minimize processing of sparse portions of sound data. For example, a representation may be formed that employs data points that describe frequency and time coordinates in a landmark space as well as energies (e.g., how loud) at those data points, e.g., as one or more vectors. Convolution may then be applied to discover underlying patterns that can appear at different time or frequency positions in the sound data, e.g., by adjustments in time and/or frequency to find patterns that correspond to each other. For example, pre-learned patterns may be used as fixed basis images to discover their activations in a landmark space as part of Nonnegative Factor Deconvolution (NFD), which is an extended version of Nonnegative Matrix Factorization that includes convolutions. Further discussion of these and other techniques may be found in the following sections.
In the following discussion, an example environment is first described that may employ the techniques described herein. Example procedures are then described which may be performed in the example environment as well as other environments. Consequently, performance of the example procedures is not limited to the example environment and the example environment is not limited to performance of the example procedures.
The computing device 102, for instance, may be configured as a desktop computer, a laptop computer, a mobile device (e.g., assuming a handheld configuration such as a tablet or mobile phone), and so forth. Thus, the computing device 102 may range from full resource devices with substantial memory and processor resources (e.g., personal computers, game consoles) to a low-resource device with limited memory and/or processing resources (e.g., mobile devices). Additionally, although a single computing device 102 is shown, the computing device 102 may be representative of a plurality of different devices, such as multiple servers utilized by a business to perform operations “over the cloud” as further described in relation to
The sound capture device 104 may also be configured in a variety of ways. The illustrated example of one such configuration involves a standalone device, but other configurations are also contemplated, such as part of a mobile phone, video camera, tablet computer, part of a desktop microphone, array microphone, and so on. Additionally, although the sound capture device 104 is illustrated separately from the computing device 102, the sound capture device 104 may be configured as part of the computing device 102, the sound capture device 104 may be representative of a plurality of sound capture devices, and so on.
The sound capture device 104 is illustrated as including a sound capture module 106 that is representative of functionality to generate sound data 108. The sound capture device 104, for instance, may generate the sound data 108 as a recording of an audio scene 110 having one or more sources. This sound data 108 may then be obtained by the computing device 102 for processing.
The computing device 102 is illustrated as including a sound processing module 112. The sound processing module 112 is representative of functionality to process the sound data 108. Although illustrated as part of the computing device 102, functionality represented by the sound processing module 112 may be further divided, such as to be performed “over the cloud” via a network 114 connection, further discussion of which may be found in relation to
Examples of functionality of the sound processing module 112 are represented as a pattern identification module 116, a representation generation module 118, and a deconvolution module 120. The pattern identification module 116 is representative of functionality to identify patterns in sound data 108. This may include pre-learning of patterns as well as identification of those patterns in sound data. As previously described, pattern identification may be utilized to support a wide variety of functionality, such as source separation (e.g., audio denoising, music transcription, music remixing, audio-based forensics) and sound identification (e.g., for word spotting and audio retrieval).
The representation generation module 118 is representative of functionality to generate a representation of the sound data 108 for processing by the pattern identification module 116. The representation, for instance, may describe a landmark space of the sound data 108 using data points that describe time and frequency positions as well as an energy at those positions, e.g., as a vectorized representation. In this way, the representation may provide a compact description of the landmark space that may be processed by the pattern identification module 116 with improved efficiency while consuming fewer computational resources.
The deconvolution module 120 is representative of functionality to discover underlying patterns in the sound data 108 in conjunction with the pattern identification module 116 that may appear at different time and/or frequency positions in the sound data 108. Further discussion of convolution is described as follows and shown in a corresponding figure.
Further, the representation may use a relatively small number of selected data points to stand in for the comprehensive matrices otherwise used to analyze audio data. In this way, a majority of the data points of a matrix form of a sound signal may be ignored, because sound spectrums are often sparse, with most of the frequency energies being close to zero in some instances. However, although the matrices can effectively represent the entirety of the sound data 108, it is challenging to apply existing pattern discovery methods to those irregular and sparse feature spaces, since matrices with a regular grid are typically filled with zeros.
Accordingly, a Nonnegative Factor Deconvolution (NFD) technique is described in the following that may function using this sparse landmark data and that can be applied to any irregularly positioned data points. NFD may be utilized to discover underlying patterns that can appear at different horizontal and vertical (time and frequency) positions as shown in
Compact representations of data (e.g., sound data 108) may be used to improve speed and efficiency of a pattern matching process, such as in audio applications. For example, each short time frame of an audio signal may be converted to a frequency domain, e.g., using a Short-Time Fourier Transform (STFT), with the magnitude of the resulting complex-valued matrix then being computed. However, this may be problematic in some cases.
For example, for an irregular transform a typical STFT grid may not provide desired resolution for the time or frequency dimension. Although alternative irregular transforms may be used to tackle this issue, these transforms may result in non-matrix form data structures, which prevents the use of ordinary matrix-based techniques.
In another example, for sparse landmarks each of the elements except the local maxima may be discarded to obtain a compact representation. However, the resulting representation is a sparse matrix in which a majority of the elements are zeros. When matrices of this kind are represented as pairs of a position and a value, a compact representation may be obtained, but it is not suitable for matrix-based techniques.
In the following, Nonnegative Matrix Factorization (NMF) techniques (e.g., NMF-like decompositions) are configured to support irregular data types. NMF for irregularly-sampled data is described first, followed by a discussion of convolution along a single axis and along plural axes, e.g., in two dimensions.
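As an illustration of the sparse-landmark idea above, the following minimal sketch keeps only strict local maxima of a magnitude matrix and stores them as position/value vectors. The function name `local_maxima_landmarks`, the 3x3 neighbourhood, and the `threshold` parameter are assumptions made for this example and are not taken from the description.

```python
import numpy as np

def local_maxima_landmarks(S, threshold=0.0):
    """Return (f, t, x) vectors for entries of S that exceed `threshold`
    and are strictly larger than all eight neighbours (assumed peak rule)."""
    M, N = S.shape
    # Pad with -inf so that border entries can still qualify as peaks.
    padded = np.pad(S, 1, mode="constant", constant_values=-np.inf)
    is_peak = S > threshold
    for dm in (-1, 0, 1):
        for dn in (-1, 0, 1):
            if dm == 0 and dn == 0:
                continue
            neighbour = padded[1 + dm : 1 + dm + M, 1 + dn : 1 + dn + N]
            is_peak &= S > neighbour
    f_idx, t_idx = np.nonzero(is_peak)
    return f_idx, t_idx, S[f_idx, t_idx]

# Toy magnitude matrix with two isolated peaks and one weaker neighbour.
S = np.zeros((5, 6))
S[1, 2] = 3.0
S[1, 3] = 1.0   # discarded: smaller than its neighbour at (1, 2)
S[4, 5] = 2.0
f, t, x = local_maxima_landmarks(S)
```

The three returned vectors have one entry per retained landmark, so their length depends on how many peaks survive rather than on the size of the original grid.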
NMF for Irregularly-Sampled Data
A regular factorization of a time/frequency matrix may be defined as:
X=W·H (1)
where $X \in \mathbb{R}_+^{M \times N}$ is a matrix containing time/frequency energies, and $W = [w_1, w_2, \ldots, w_Z] \in \mathbb{R}_+^{M \times Z}$ and $H = [h_1^T, h_2^T, \ldots, h_Z^T]^T \in \mathbb{R}_+^{Z \times N}$ represent $Z$ frequency and time factors, respectively. NMF is a factorization technique that estimates the two factors using the following iterative process:
where $1^{m \times n}$ is an $m \times n$ matrix of ones, and $\odot$ and the fraction bar stand for element-wise multiplication and division, respectively. The value $w_z$ is normalized by the sum of $h_z$ at the end of each iteration in order to obtain a spectrum estimate that is unbiased by how much the estimate appears over time. This also sets the magnitude of $W$ so that there are not multiple solutions that transfer energy between the two factors.
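The iterative process described above can be sketched as follows. This is one common expectation-maximization style form of NMF that matches the description (a per-component weighting of the input, sums taken with vectors of ones, and normalization of each $w_z$ by the sum of $h_z$); it is offered as an assumption and may differ in detail from the equation referenced in the text. The function name `nmf_em`, the random initialization, and the small constant `eps` are illustrative choices.

```python
import numpy as np

def nmf_em(X, Z, n_iter=200, seed=0):
    """EM-style NMF sketch: X (M x N) ~ W (M x Z) @ H (Z x N)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, Z)) + 1e-3
    H = rng.random((Z, N)) + 1e-3
    eps = 1e-12  # guards against division by zero (assumption)
    for _ in range(n_iter):
        R = W @ H + eps                        # current reconstruction
        W_new = np.empty_like(W)
        H_new = np.empty_like(H)
        for z in range(Z):
            P_z = np.outer(W[:, z], H[z]) / R  # share of component z per cell
            XP = X * P_z                       # element-wise weighting of X
            W_new[:, z] = XP @ np.ones(N)      # sum over time
            H_new[z] = np.ones(M) @ XP         # sum over frequency
        # Normalise each w_z by the sum of h_z so every column of W sums to one.
        W = W_new / (H_new.sum(axis=1) + eps)
        H = H_new
    return W, H

X = np.random.default_rng(1).random((6, 8)) + 0.1
W, H = nmf_em(X, Z=2)
```

With this normalization the columns of `W` act as spectra that sum to one, and the overall energy of the input is carried by `H`.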
This formulation involves uniform sampling in the frequency and time axes, meaning that at each time point an energy reading is obtained for each of the frequency values, and vice versa. However, for certain types of time/frequency transforms, such as constant-Q transforms, wavelets, and reassigned spectrograms, this assumption does not hold and the resulting time/frequency energies cannot be represented using a finite-sized matrix. For such representations, a different format may be used in which each energy value has its exact frequency and time location attached. In order to factorize such transforms, the factorization process may be redefined to accept this new format as follows.
Reformulation of NMF into a Vectorized Form
In this section, an assumption is made that the transforms that are used are regularly sampled as above. However, a different representation is used that permits extension of this formulation to non-regularly sampled transforms later. Instead of using a matrix “X” to represent the time/frequency energies, three vectors are used as follows:
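For a regularly sampled transform, the three vectors can be obtained by flattening the matrix, as in the minimal sketch below. The names `f`, `t`, and `x` follow the vectorized inputs named later in the description; the helper name `matrix_to_vectors`, the one-based indices, and the column-major ordering (chosen to agree with the usual meaning of a vec operator) are assumptions.

```python
import numpy as np

def matrix_to_vectors(X):
    """Flatten an M x N energy matrix into frequency, time, and energy vectors."""
    M, N = X.shape
    m_idx, n_idx = np.meshgrid(
        np.arange(1, M + 1), np.arange(1, N + 1), indexing="ij"
    )
    f = m_idx.ravel(order="F")  # frequency index of every entry
    t = n_idx.ravel(order="F")  # time index of every entry
    x = X.ravel(order="F")      # energy of every entry
    return f, t, x

X = np.arange(6, dtype=float).reshape(2, 3)
f, t, x = matrix_to_vectors(X)
```

Each position `i` then satisfies `X[f[i] - 1, t[i] - 1] == x[i]`, so no information is lost on a regular grid, while the same three-vector format can also hold points that do not lie on a grid.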
Using the above-described formulation, the factorization process may be rewritten as follows:
where now the pair of vectors $v_z \in \mathbb{R}_+^{MN \times 1}$ and $g_z \in \mathbb{R}_+^{MN \times 1}$ correspond to the values of the factors $W$ and $H$ as they are evaluated at the frequencies and times denoted by $f$ and $t$. With this, the iterative multiplicative update rules turn into the following form:
Therefore, if the frequency/time indices lie on a regular integer grid, i.e., $f(i) \in \{1, 2, \ldots, M\}$, the same operations are being performed as in Equation (2) above. Additionally, Equation (4) may be rewritten to process each of the components simultaneously as follows:
where the matrices $P$, $V$, and $G$ contain $Z$ concatenated column vectors, each of which is for a latent variable $z$, e.g., $P = [p_1, p_2, \ldots, p_Z]$. Additionally, $D_f, D_t \in \{0, 1\}^{MN \times MN}$ denote two matrices defined as:
Multiplying with these matrices results in summing over each of the elements that have the same frequency or time value respectively. The difference between the formulation in this section and in Equation (2) is that the two factors are obtained in a different format so that:
$w_z(m) = v_z(i), \ \forall i : f(i) = m$
$h_z(n) = g_z(i), \ \forall i : t(i) = n$
$\mathrm{vec}(w_z \cdot h_z) = v_z \odot g_z$ (7)
where “m” and “n” are uniform indices defined in the ranges, “{1, 2, . . . , M}” and “{1, 2, . . . , N},” respectively.
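The vectorized formulation can be sketched as below for points on a regular grid. The binary kernels follow the property stated in the text (an entry is one when two points share a frequency or time value), and multiplying by them sums over points with matching coordinates. The exact update equations are not reproduced in the text, so the loop is an assumption modeled on the EM-style description of NMF; `equality_kernel`, `vectorized_nmf`, and `eps` are hypothetical names.

```python
import numpy as np

def equality_kernel(c):
    """Binary kernel: entry (i, j) is 1 when c[i] == c[j], else 0."""
    return (c[:, None] == c[None, :]).astype(float)

def vectorized_nmf(f, t, x, Z, n_iter=100, seed=0):
    """NMF on (f, t, x) vectors; V and G hold one column per latent component."""
    rng = np.random.default_rng(seed)
    Df, Dt = equality_kernel(f), equality_kernel(t)
    V = rng.random((x.size, Z)) + 1e-3
    G = rng.random((x.size, Z)) + 1e-3
    eps = 1e-12  # guards against division by zero (assumption)
    for _ in range(n_iter):
        VG = V * G
        P = VG / (VG.sum(axis=1, keepdims=True) + eps)  # share per component
        XP = x[:, None] * P          # x repeated across Z columns, weighted by P
        V = Df @ XP                  # sum over points sharing a frequency
        G = Dt @ XP                  # sum over points sharing a time
        V = V / (XP.sum(axis=0) + eps)  # normalise as in the matrix case
    return V, G

# Regular 4 x 5 grid, flattened in column-major order.
M, N = 4, 5
f = np.tile(np.arange(1, M + 1), N)
t = np.repeat(np.arange(1, N + 1), M)
x = np.random.default_rng(1).random(M * N) + 0.1
V, G = vectorized_nmf(f, t, x, Z=2)
```

On a regular grid every row of `V` that shares a frequency index holds the same values, and likewise for `G` and time indices, which is the correspondence expressed by Equation (7).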
Non-Negative Non-Regular Matrix Factorization
In this example, the frequency and time vectors are real-valued and potentially comprised of unique elements. Accordingly, the summations in Equation (4) may become meaningless, since they sum over single points only and thus do not capture the correlations that form as multiple frequencies are excited at roughly the same time.
To illustrate such a case, consider the simple example as shown in “(a)” in
This means that “Df(i,j)=1, ∀i,j:f(i)=f(j)” and “Dt(i,j)=1, ∀i,j:t(i)=t(j)” are still maintained but in the case where two frequency or time labels are close but not exactly the same these values are still summed, albeit using a lower weight. For distant points the corresponding values in these matrices are close to zero, so no significant summation takes place.
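A kernel with the behaviour described above can be sketched with a Gaussian of the coordinate difference. The precise kernel equation is not reproduced in the text, so the form below, the helper name `gaussian_kernel`, and the width parameter `sigma` are assumptions chosen only to show the three stated properties: weight one for identical labels, a lower weight for nearby labels, and a weight near zero for distant labels.

```python
import numpy as np

def gaussian_kernel(c, sigma):
    """Soft kernel: entry (i, j) decays with the squared distance c[i] - c[j]."""
    d = c[:, None] - c[None, :]
    return np.exp(-(d ** 2) / (sigma ** 2))

# Four irregular frequency labels: two identical, one close, one far away.
f = np.array([100.0, 100.0, 100.5, 400.0])
Df = gaussian_kernel(f, sigma=1.0)
```

Substituting such a matrix for the binary kernel lets points with slightly different coordinates still contribute to a shared sum, while `sigma` controls how quickly that contribution falls off.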
Using this proposed approach, the results in “(b)” and “(c)” of
Nonnegative Factor Deconvolution (NFD) for Irregularly-Sampled Data
An extended version of the non-regular NMF is described in this section. Instead of assuming the linear decomposition model underlying NMF algorithms, a set of basis vectors is used as a basis image that can be convolved with filters. In this way, frequently adjacent basis vectors may be grouped to represent a certain temporal structure of the data, which is difficult to capture using conventional NMF. Accordingly, NFD may be applied to irregular data points, which was not supported using conventional techniques.
In the following sections, a model is first introduced where the convolution happens along a single axis, e.g., time. A discussion follows in which a model is introduced where the convolution is performed along both time and frequency axes.
Nonnegative Factor Deconvolution Along a Single Dimension (1D-NFD)
When basis matrices are assumed, each of which holds a unique time-varying set of spectra, the NMF problem can be extended to a deconvolution model as follows:
where $w_z^{\tau}$ is the $\tau$-th of the successive spectra of a basis matrix, and the operation $\overset{\tau\rightarrow}{h_z}$ shifts the matrix $h_z$ to the right by $\tau$ elements while filling the leftmost $\tau$ columns with zeros. The input $X$ is then reconstructed as a sum of filtered basis matrices. Here, one filter $h_z$ is utilized per latent component and is convolved with the basis matrix. This new reconstruction model leads to a new set of update rules involving those temporal dynamics:
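Separately from the update rules, the reconstruction model itself (a sum over components and lags of a spectrum multiplied by a right-shifted activation) can be sketched as follows. The array layout, in which `Wb[tau][:, z]` holds the spectrum of component `z` at lag `tau`, and the function names `shift_right` and `reconstruct_1d` are assumptions made for this example.

```python
import numpy as np

def shift_right(h, tau):
    """Shift columns right by tau, filling the leftmost tau columns with zeros."""
    out = np.zeros_like(h)
    n = h.shape[-1]
    if tau < n:
        out[..., tau:] = h[..., : n - tau]
    return out

def reconstruct_1d(Wb, H):
    """Wb: (T, M, Z) spectra per lag; H: (Z, N) activations. Returns M x N."""
    T = Wb.shape[0]
    return sum(Wb[tau] @ shift_right(H, tau) for tau in range(T))

# One component whose pattern is a two-frame upward step in frequency.
Wb = np.zeros((2, 3, 1))
Wb[0, 0, 0] = 1.0   # lag 0: energy in frequency bin 0
Wb[1, 1, 0] = 1.0   # lag 1: energy in frequency bin 1
H = np.array([[0.0, 1.0, 0.0, 0.0]])  # pattern starts at time index 1
X_hat = reconstruct_1d(Wb, H)
```

In this toy case the single activation at time index 1 places the whole two-frame pattern into the reconstruction, which is what allows one basis matrix to capture temporal structure.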
Reformulation of 1D-NFD into a Vectorized Form
For the same types of vectorized inputs $f(i)$, $t(i)$, and $x(i)$ as before, the deconvolution is defined as:
where $v_z^{\tau} \in \mathbb{R}_+^{MN \times 1}$ and $g_z^{\tau} \in \mathbb{R}_+^{MN \times 1}$. The multiplicative update rules are now:
Note the shift notation is not used here as the input is not a grid anymore. But, “gzτ” in Equation (15) is a shifted version of “gz” from the previous iteration using Equation (19).
This may be rewritten with matrix notation as follows:
where $X = x \cdot 1^{1 \times Z}$. The kernel matrices $D_f$ and $D_t$ are the same as those in Equation (6), but $D^{\tau}$ and $(D^{\tau})^{-1}$ are configured to take the time lags $\tau$ into account, and both operations involve shifts to the left and right as well:
Non-Negative Non-Regular Factor Deconvolution (Non-Regular 1D-NFD)
The non-regular version of 1D-NFD discussed here uses the proposed vectorized update rules in Equation (20) except that kernel matrices in Equation (21) are replaced with corresponding Gaussians, such as in Equation (22):
NFD Along Both Dimensions (2D-NFD)
The reconstruction model may be expressed as:
where $K_z^{\phi,\tau} \in \mathbb{R}_+^{F \times T}$ is the discretized kernel at the position $(\phi, \tau)$, and the shift operation moves the matrix $A_z$ to the right by $\tau$ elements and up by $\phi$ elements while filling the leftmost $\tau$ columns and bottom $\phi$ rows with zeros. The input $X$ is then reconstructed as a sum of filtered basis matrices. Here, a two dimensional filter $A_z$ is used per latent component and is convolved with the basis matrix. This reconstruction model leads to a set of update rules involving those temporal and frequency dynamics:
Note that now the posterior probabilities $P$ are represented via a five dimensional tensor with axes for $z$, $t$, $f$, $\phi$, and $\tau$.
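The two dimensional reconstruction model described above (each kernel value scaling a copy of the filter matrix that is shifted in both time and frequency) can be sketched as follows. The array layouts, the names `shift_2d` and `reconstruct_2d`, and the convention that the row index grows with frequency (so that shifting "up" moves content toward higher row indices and the "bottom" rows are the lowest indices) are assumptions made for this example.

```python
import numpy as np

def shift_2d(A, phi, tau):
    """Shift A up by phi rows and right by tau columns, zero-filling the gap.
    Assumes row index increases with frequency."""
    M, N = A.shape
    out = np.zeros_like(A)
    if phi < M and tau < N:
        out[phi:, tau:] = A[: M - phi, : N - tau]
    return out

def reconstruct_2d(K, A):
    """K: (Z, F, T) kernels; A: (Z, M, N) filters. Returns the M x N estimate."""
    Z, F, T = K.shape
    X_hat = np.zeros(A.shape[1:])
    for z in range(Z):
        for phi in range(F):
            for tau in range(T):
                X_hat += K[z, phi, tau] * shift_2d(A[z], phi, tau)
    return X_hat

# One component: a kernel with two taps and a single activation.
K = np.zeros((1, 2, 2))
K[0, 0, 0] = 1.0
K[0, 1, 1] = 0.5
A = np.zeros((1, 4, 5))
A[0, 1, 1] = 2.0
X_hat = reconstruct_2d(K, A)
```

The single activation in `A` produces one copy of the kernel pattern anchored at that position, which illustrates how the same pattern can be located at different time and frequency positions.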
Reformulation of 2D-NFD into a Vectorized Form
For the same types of vectorized inputs $f(i)$, $t(i)$, and $x(i)$ as before, the deconvolution is defined as:
where $g_z^{\phi,\tau} \in \mathbb{R}_+^{MN \times 1}$. The multiplicative update rules are now:
Note that as before the shift notation is not used here as the input is not a grid anymore. But, “gzφ,τ” in Equation (29) is introduced to represent the shifted version of “gz” along both directions from the previous iteration using Equation (33).
Equations (32) and (33) may now be rewritten with matrix notation as follows:
The kernel matrices $D^{\phi\tau}$ and $(D^{\phi\tau})^{-1}$ may also consider time lags $\tau$ and frequency shifts $\phi$, and so are defined by:
Non-Negative Non-Regular Factor Deconvolution (Non-Regular 2D-NFD)
Another non-regular version of 2D-NFD is now described that uses the proposed vectorized update rules from Equations (29) to (31), and (34) except that the kernel matrices in Equation (35) are replaced with corresponding Gaussians, such as in Equation (22):
The following discussion describes pattern matching techniques that may be implemented utilizing the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, or software, or a combination thereof. The procedures are shown as a set of blocks that specify operations performed by one or more devices and are not necessarily limited to the orders shown for performing the operations by the respective blocks. In portions of the following discussion, reference will be made to
An identification is made as to whether sound data described using irregularly positioned data points includes the pattern, the identification including use of a convolution of the frequency or time coordinates to determine correspondence with the representation of the pattern (block 604). The identification, for instance, may involve changing an amount of time over which the pattern is performed and/or a number of frequencies involved in the pattern, e.g., by shrinking or stretching corresponding ranges. In this way, irregular patterns may be identified in the sound data as described above.
Irregularly positioned underlying patterns are identified in the represented sound data in different time or frequency positions in the sound data (block 704). Continuing with the previous example, these vectors may then be used to identify irregular patterns that have differences along time and/or frequency axes, e.g., patterns that consume different amounts of either one and thus may be distinguished from regular patterns. A variety of other examples are also contemplated without departing from the spirit and scope thereof.
The example computing device 802 as illustrated includes a processing system 804, one or more computer-readable media 806, and one or more I/O interfaces 808 that are communicatively coupled, one to another. Although not shown, the computing device 802 may further include a system bus or other data and command transfer system that couples the various components, one to another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. A variety of other examples are also contemplated, such as control and data lines.
The processing system 804 is representative of functionality to perform one or more operations using hardware. Accordingly, the processing system 804 is illustrated as including hardware elements 810 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. The hardware elements 810 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, processors may be comprised of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable storage media 806 is illustrated as including memory/storage 812. The memory/storage 812 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage component 812 may include volatile media (such as random access memory (RAM)) and/or nonvolatile media (such as read only memory (ROM), Flash memory, optical disks, magnetic disks, and so forth). The memory/storage component 812 may include fixed media (e.g., RAM, ROM, a fixed hard drive, and so on) as well as removable media (e.g., Flash memory, a removable hard drive, an optical disc, and so forth). The computer-readable media 806 may be configured in a variety of other ways as further described below.
Input/output interface(s) 808 are representative of functionality to allow a user to enter commands and information to computing device 802, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone, a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to recognize movement as gestures that do not involve touch), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 802 may be configured in a variety of ways as further described below to support user interaction.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, such modules include routines, programs, objects, elements, components, data structures, and so forth that perform particular tasks or implement particular abstract data types. The terms “module,” “functionality,” and “component” as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer-readable media. The computer-readable media may include a variety of media that may be accessed by the computing device 802. By way of example, and not limitation, computer-readable media may include “computer-readable storage media” and “computer-readable signal media.”
“Computer-readable storage media” may refer to media and/or devices that enable persistent and/or non-transitory storage of information in contrast to mere signal transmission, carrier waves, or signals per se. Thus, computer-readable storage media refers to non-signal bearing media. The computer-readable storage media includes hardware such as volatile and non-volatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage device, tangible media, or article of manufacture suitable to store the desired information and which may be accessed by a computer.
“Computer-readable signal media” may refer to a signal-bearing medium that is configured to transmit instructions to the hardware of the computing device 802, such as via a network. Signal media typically may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as carrier waves, data signals, or other transport mechanism. Signal media also include any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media.
As previously described, hardware elements 810 and computer-readable media 806 are representative of modules, programmable device logic and/or fixed device logic implemented in a hardware form that may be employed in some embodiments to implement at least some aspects of the techniques described herein, such as to perform one or more instructions. Hardware may include components of an integrated circuit or on-chip system, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), and other implementations in silicon or other hardware. In this context, hardware may operate as a processing device that performs program tasks defined by instructions and/or logic embodied by the hardware, as well as hardware utilized to store instructions for execution, e.g., the computer-readable storage media described previously.
Combinations of the foregoing may also be employed to implement various techniques described herein. Accordingly, software, hardware, or executable modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage media and/or by one or more hardware elements 810. The computing device 802 may be configured to implement particular instructions and/or functions corresponding to the software and/or hardware modules. Accordingly, implementation of a module that is executable by the computing device 802 as software may be achieved at least partially in hardware, e.g., through use of computer-readable storage media and/or hardware elements 810 of the processing system 804. The instructions and/or functions may be executable/operable by one or more articles of manufacture (for example, one or more computing devices 802 and/or processing systems 804) to implement techniques, modules, and examples described herein.
The techniques described herein may be supported by various configurations of the computing device 802 and are not limited to the specific examples of the techniques described herein. This functionality may also be implemented all or in part through use of a distributed system, such as over a “cloud” 814 via a platform 816 as described below.
The cloud 814 includes and/or is representative of a platform 816 for resources 818. The platform 816 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 814. The resources 818 may include applications and/or data that can be utilized while computer processing is executed on servers that are remote from the computing device 802. Resources 818 can also include services provided over the Internet and/or through a subscriber network, such as a cellular or Wi-Fi network.
The platform 816 may abstract resources and functions to connect the computing device 802 with other computing devices. The platform 816 may also serve to abstract scaling of resources to provide a corresponding level of scale to encountered demand for the resources 818 that are implemented via the platform 816. Accordingly, in an interconnected device embodiment, implementation of functionality described herein may be distributed throughout the system 800. For example, the functionality may be implemented in part on the computing device 802 as well as via the platform 816 that abstracts the functionality of the cloud 814.
Although the invention has been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed invention.