Signal classification arises in a variety of applications. In signal classification, an input signal is received, and it is desired to determine to which of multiple classes the signal most likely belongs. For example, a simple classification task may be to automatically determine whether a received email is spam or is not spam. When an email is received, information about the email (e.g. the text of the email, the sender, an internet protocol address of the sender) may be processed using algorithms and models to classify the email as spam or as not spam.
Another example of signal classification relates to determining the identity of a speaker. A class may exist for each speaker of a set of speakers, and a model for each class may be created by processing speech samples of each speaker. To perform classification, a received speech signal may be compared to models for each class. The received signal may be assigned to a class based on a best match between the received signal and the class models.
In some instances, it may be desired to verify the identity of a speaker. A speaker may assert his identity (e.g., by providing a user name) and a speech sample. A model for the asserted identity may be obtained, and the received speech signal may be compared to the model. The classification task may be to determine whether the speech sample corresponds to the asserted identity.
In some instances, it may be desired to determine the identity of an unknown speaker using a speech sample of the unknown speaker. The speaker may be unknown, but it may be likely that the speaker is of a known set of speakers (e.g., the members of a household). The speech sample may be compared to models for each class (e.g., a model for each person in the household), and the classification task may be to determine which class best matches the speech sample or that no class matches the speech sample.
When performing signal classification, it is desired that the signal classification techniques have a low error rate, and the signal classification techniques described herein may have lower error rates than existing techniques.
The invention and the following detailed description of certain embodiments thereof may be understood by reference to the following figures:
Described herein are techniques for performing classification of signals, and
Computing device 110 may, in some implementations, perform classification as described by the flow chart of
At step 220, a sequence of input features is computed from the input signal. Each item in the sequence may be a single feature or multiple features, such as a feature vector. In the description below, a sequence of feature vectors will be used as an example, but the sequence of features need not necessarily be a sequence of feature vectors. Each item of the sequence may be referred to as a step of the sequence. In some implementations, computing the sequence of feature vectors may comprise computing feature vectors at regular time intervals, such as every 10 milliseconds (for higher-dimensional input signals, the sequence may progress along any suitable dimension).
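For illustration, the following sketch (in Python, with hypothetical names and parameter values) shows one way a sequence of feature vectors might be computed from an audio signal at regular time intervals; log-magnitude spectrum bins are used here only as a simple stand-in for harmonic amplitudes or mel-frequency cepstral coefficients:

```python
import numpy as np

def compute_feature_sequence(signal, sample_rate, step_ms=10, frame_ms=25):
    """Frame a 1-D audio signal every step_ms milliseconds and compute one
    feature vector per frame. Log-magnitude spectrum bins are used as a
    simple stand-in for harmonic amplitudes or MFCCs."""
    step = int(sample_rate * step_ms / 1000)
    frame_len = int(sample_rate * frame_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        features.append(np.log(spectrum + 1e-10))  # one feature vector per step
    return np.array(features)  # shape: (number_of_steps, number_of_features)
```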
For an audio signal, a feature vector may comprise harmonic amplitudes, mel-frequency cepstral coefficients, or any other suitable features. As an example, a feature vector of harmonic amplitudes may include an estimate of an amplitude of each harmonic in a portion of a signal. For a portion of the audio signal, harmonic amplitudes may be computed as follows: (i) estimate a pitch of the portion of the signal (optionally using a fractional chirp rate); (ii) estimate an amplitude of each harmonic of the portion of the signal where the first harmonic is at the pitch, the second harmonic is at twice the pitch, and so forth; and (iii) construct a vector of the estimated amplitudes. Each harmonic amplitude vector may be represented as a point in an n-dimensional space (e.g., as shown in
At step 230, a plurality of reference sequences of features is obtained where each reference sequence corresponds to a class. The number and type of classes may depend on a particular classification task. For example, for speaker classification, a class may exist for each known speaker. Each class may have one or more reference sequences. For example, for a speaker classification task, a speaker may be required to provide several examples of speaking a specific phrase, such as “knock knock who's there.” The several examples of how a single speaker speaks the phrase may provide information about variability in the speaker's voice. A reference sequence of features may be created for each recording of the phrase. The reference sequences of features may comprise the same type of features as used for the input sequence.
In
At step 240, the input signal is classified by comparing a trajectory of the input sequence with the trajectories of the reference sequences in a multi-dimensional space. A trajectory of a sequence of features is the path of the sequence through the multi-dimensional space. The trajectories may be compared using any suitable techniques, including any of the techniques described below. For example, a score (such as a probability measure, likelihood, or a metric) may be computed that measures similarity (e.g., a distance) of the input sequence to the reference sequences of each class, and the input signal may be classified by selecting a highest (or, depending on the implementation, a lowest) score. In some implementations, the reference sequences may be aligned with the input sequence (described in greater detail below) prior to comparing trajectories. For example, with a speaker classification task, it may be desired to remove the effect of different speaking rates and/or pauses in speech. As can be seen in
At step 250, a classification result is output. For example, the result may be the most likely class of the input signal and may include other statistics such as a probability measure or a confidence interval. The classification result may be output to recipient device 160.
Other configurations of source device 150, computing device 110, and recipient device 160 are possible. In some implementations, source device 150, computing device 110, and recipient device 160 may all be the same device. For example, a user could speak to a personal device to unlock the device with speaker verification, and the personal device may capture an audio signal, perform classification on the audio signal, and then use the results of the classification to decide whether to unlock the device.
Computing device 110 may include any components typical of a computing device, such as volatile or nonvolatile memory 111, one or more processors 112, and one or more network interfaces 113 (e.g., for communicating with source device 150 or recipient device 160). Computing device 110 may also include any input and output components, such as displays, keyboards, and touch screens. Computing device 110 may also include a variety of components or modules providing specific functionality, and these components or modules may be implemented in software, hardware, or a combination thereof. For example, computing device 110 may have a pre-processing component 121 for performing any needed operations on an input signal, such as segmentation, analog-to-digital conversion, encoding, or decoding. Computing device 110 may have an alignment component 122 that aligns reference sequences to an input sequence. Computing device 110 may also have a classification component 123 that classifies an input signal by comparing a trajectory of the input signal with trajectories of reference signals. Computing device 110 may have or may have access to a database of reference sequences 131 to be used in performing classification. It should be noted that although the pre-processing component 121, alignment component 122, and classification component 123 are shown in separate boxes on
Further details of the above techniques are now described, including segmentation of input signals, alignment of reference sequences with an input sequence, and additional details of classification.
In some implementations, the input signal may be processed before computing the input sequence of features. For example, where the input signal is an audio signal, it may be desired to remove portions of the audio signal before computing features. Portions that may be removed may include any of the following: portions that do not comprise speech, portions with significant noise, portions with a low signal-to-noise ratio, or portions corresponding to certain types of speech (e.g., it may be desired to remove unvoiced speech and compute features only for voiced speech). Any suitable techniques may be used to identify portions of an input signal to be removed. For other types of input signals (e.g., non-audio signals), less useful or less desirable portions may be similarly removed before computing features. The process of removing portions of the input signal may be referred to as segmentation and the result of segmentation may be referred to as a segmented input signal.
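As a rough illustration only, the following sketch drops low-energy frames from an audio signal before feature computation; the energy gate, threshold, and frame length are arbitrary assumptions, and an actual implementation might instead use a voice-activity or voicing detector:

```python
import numpy as np

def segment_signal(signal, sample_rate, frame_ms=25, energy_threshold=0.01):
    """Remove low-energy portions of an audio signal before computing features.
    This crude RMS-energy gate is only a stand-in for segmentation (removing
    silence or low-SNR portions); a real system might use a voice-activity
    detector or a voicing detector instead."""
    frame_len = int(sample_rate * frame_ms / 1000)
    kept_frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) >= energy_threshold:  # keep loud frames
            kept_frames.append(frame)
    return np.concatenate(kept_frames) if kept_frames else signal[:0]
```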
The segmentation process may be performed before or after computing a sequence of features from the input signal. For example, where the input signal is an audio signal, the segmentation may be performed on a time-series representation of the audio signal and the sequence of features may be computed from the segmented time-series signal. Alternatively, a sequence of features may be computed for the entire input signal, and the segmentation process may remove features from the sequence of features.
The above segmentation process may also be performed on a reference signal when computing a reference sequence of features. It may be desired to segment a reference signal for the same reasons as for segmenting an input signal. For example, for a reference signal that is an audio signal, it may be desired to remove silence from the reference signal when computing the reference sequence of features.
Other processing that may be performed includes aligning a reference sequence of features with an input sequence of features. Alignment may allow for a more accurate comparison of the input sequence of features with a reference sequence of features. For example, the alignment process may remove undesired differences due to scale, due to a speeding up or slowing down of a signal in time, or due to unexpected or undesired deviations in the input sequence or a reference sequence.
For example, consider classification of a speaker. The input signal and each reference signal may correspond to a person speaking a phrase, such as “knock knock who's there.” Different speakers may have different speaking rates or a single speaker may speak with a different speaking rate at different times. Speakers may also deviate from the fixed phrase, for example by stuttering (“knock kn-knock who's there”), saying different words (“knock knock who is there”), or leaving out words or portions of words (“knock who's there”). In some implementations, it may be desired to remove differences due to speaking rates and due to unintended or unexpected deviations in an input sequence or a reference sequence.
The alignment of a reference sequence of features with an input sequence of features may be performed using any suitable techniques. For example, the sequences may be aligned using an alignment graph, such as using Dijkstra's algorithm with an alignment graph.
To align the reference sequence with the input sequence, a distance (or some other metric) between the feature vectors of the reference sequence and the feature vectors of the input sequence may be computed. In some implementations, all possible distances may be computed (e.g., N times M total distances, where N and M are the lengths of the two sequences) and in other implementations fewer distances may be computed (e.g., by using dynamic programming algorithms). Any appropriate distance (or other similarity metric) may be used, such as a Euclidean distance, a measure of information, or an estimate of probabilistic likelihood. An alignment of a reference sequence with an input sequence may be chosen that maximizes similarity between feature vectors of the reference sequence and feature vectors of the input sequence, such as by minimizing aggregate distance or by otherwise optimizing the chosen metric.
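As one illustrative possibility (the choice of alignment technique is left open above, and an alignment graph with Dijkstra's algorithm would find an equivalent minimum-cost path), the following sketch uses a dynamic-programming alignment over pairwise Euclidean distances, with an assumed move set of match, insertion, and deletion:

```python
import numpy as np

def align_sequences(reference, inp):
    """Dynamic-programming alignment of two sequences of feature vectors
    (arrays of shape (N, d) and (M, d)). Computes the N x M pairwise Euclidean
    distances and finds a monotonic path of minimum aggregate distance from
    (0, 0) to (N-1, M-1)."""
    n, m = len(reference), len(inp)
    dist = np.linalg.norm(reference[:, None, :] - inp[None, :, :], axis=-1)
    cost = np.full((n, m), np.inf)
    cost[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1, j] if i > 0 else np.inf,                 # deletion
                cost[i, j - 1] if j > 0 else np.inf,                 # insertion
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # match
            )
            cost[i, j] = dist[i, j] + best_prev
    # Backtrack to recover the aligned index pairs.
    path, i, j = [], n - 1, m - 1
    while (i, j) != (0, 0):
        path.append((i, j))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in moves if p[0] >= 0 and p[1] >= 0),
                   key=lambda p: cost[p])
    path.append((0, 0))
    return path[::-1]  # list of (reference_step, input_step) pairs
```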
In performing an alignment, there may be feature vectors of the reference sequence that do not match any feature vectors of the input sequence and vice versa. For example, if the reference sequence corresponds to “knock knock who's there” and the input sequence has a stutter (“kn-knock”) then the feature vectors of the stutter in the input sequence may not match any feature vectors of the reference sequence. Depending on the frame of reference, this may be referred to as an “insertion” in the input sequence or a “deletion” in the reference sequence.
In the example of
In determining a best alignment, a path is determined from the top-left corner of the graph (corresponding to the beginnings of the sequences) to the bottom-right corner of graph (corresponding to the ends of the sequences). This path may optimize an overall similarity function of the path, such as by minimizing an average or a total distance, maximizing total joint likelihood, or maximizing a measure of mutual information.
The above alignment process may be performed for each reference sequence of a collection of reference sequences (e.g., for speaker recognition, the collection may include several reference sequences for each known speaker) so that each reference sequence is aligned with a received input sequence. After the reference sequences have been aligned, the trajectory of the input sequence may be compared with the trajectories of the reference sequences to perform classification of the input sequence.
Several variations of a trajectory classifier are described below with increasing complexity. For clarity of explanation, the initial description will assume that there are no excursions (in either the reference sequences or the input sequence). Afterwards, excursions will be addressed. Note that alignment is not required when performing classification and that classification may be performed using aligned or unaligned reference sequences.
In some implementations, a trajectory classifier may be implemented as a sequential process. Suppose we have J classes, and denote each class as ωj for j between 1 and J. Each class may have a prior probability that the input sequence is a member of the class, and this prior probability may be specified as Pr(ωj) for j between 1 and J. In some implementations, there are no prior probabilities for the classes, or the classes are assumed to have equal prior probabilities, and Pr(ωj) is not used (or all Pr(ωj) have equal values). Given a first feature vector of an input sequence, x1, the probability that the input sequence corresponds to class j may be specified as Pr(ωj|x1) and this probability may be determined using Pr(ωj). Given the second feature vector of an input sequence, x2, the probability that the input sequence corresponds to class j may be specified as Pr(ωj|x1, x2) and this probability may be determined using Pr(ωj|x1). More generally, given the first t feature vectors of the input sequence, the probability that the input sequence corresponds to class j may be specified as Pr(ωj|x1, . . . , xt-1, xt) and this probability may be determined using Pr(ωj|x1, . . . , xt-1). The probability Pr(ωj|x1, . . . , xt-1, xt) may be referred to as a posterior probability.
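One way to organize such a sequential computation is sketched below; the multiplicative update and the renormalization across classes are assumptions made for illustration and are not reproduced from the equations referenced later in this description:

```python
def sequential_posterior(priors, per_step_evidence):
    """Sequentially compute Pr(class | x1..xt). `per_step_evidence` is a list,
    one entry per step t, mapping each class to the evidence contributed by
    that step (e.g., a nearest-neighbor ratio as described below). The
    multiplicative update and renormalization are assumed for illustration."""
    posterior = dict(priors)
    history = [dict(posterior)]
    for evidence in per_step_evidence:
        unnormalized = {j: posterior[j] * evidence[j] for j in posterior}
        total = sum(unnormalized.values()) or 1.0
        posterior = {j: value / total for j, value in unnormalized.items()}
        history.append(dict(posterior))
    return history  # posterior over classes after each step
```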
The probability that the input sequence is a member of a class may further be determined using a probability density estimation procedure. One such procedure involves identifying, for each feature vector in the input sequence, nearest neighbors among the feature vectors in the set of reference sequences. Consider the sequence of feature vectors in
The radius of the n-sphere may be determined using any appropriate techniques. In some implementations, the radius of the sphere may be specified using distances between the input feature vector and reference feature vectors (among all classes). For example, for a specified parameter k, the radius may be set to the distance between the input feature vector and the kth closest reference feature vector. For example, if k is two, the radius may be set to the distance between the input feature vector and the second closest reference feature vector to the input feature vector. The number k may be chosen using any appropriate techniques. In some implementations, k may be selected as one less than the smaller of the total number of reference feature vectors and a specified maximum number.
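A minimal sketch of this radius selection, assuming a Euclidean distance and a hypothetical maximum-k parameter, might look as follows:

```python
import numpy as np

def neighborhood_radius(input_vec, reference_vecs, max_k=10):
    """Set the n-sphere radius to the distance from the input feature vector
    to its k-th closest reference feature vector, where k is one less than
    the smaller of the number of reference vectors and a specified maximum."""
    distances = np.linalg.norm(reference_vecs - input_vec, axis=1)
    k = max(min(len(reference_vecs), max_k) - 1, 1)
    radius = np.sort(distances)[k - 1]       # distance to the k-th closest vector
    return radius, distances <= radius       # radius and mask of neighbors inside it
```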
In some implementations, probability densities may be determined using other density estimation procedures, such as using a joint Gaussian distribution or an information-based approach.
Nearest neighbors may be used in a variety of ways when performing classification. In the following, several non-limiting examples will be provided for using nearest neighbors when performing classification, but the techniques described herein are not limited to these specific examples, and other implementations are possible.
The following equation gives one example of using nearest neighbors to determine a probability that an input sequence is a member of class j:
In this equation, T is the total number of steps in the input sequence, Rj indicates the number of reference sequences for class j, and kj,t indicates the number of reference sequences of class j that are nearest neighbors to the input feature vector at step t.
The probability that the input sequence is a member of class j is determined using the following ratio
for each step t. This ratio has its highest value when all of the nearest neighbors of the input feature vector at step t are members of class j. The ratio has its lowest value when none of the reference sequences of class j are among the nearest neighbors to the input feature vector at step t. For the example of
In some implementations, nearest neighbors may be determined using two steps of the sequences simultaneously. Let xt represent an input feature vector at step t and xt-1 represent an input feature vector at step t−1. An augmented input feature vector for step t may be constructed by concatenating xt-1 and xt. The augmented input feature vector at step t may be denoted as Xt. Where each of xt-1 and xt is of length N, Xt will have length 2N. Similarly, augmented reference vectors may be constructed from reference sequences. Let at represent a feature vector of a reference sequence at step t and let at-1 represent a feature vector of the same reference sequence at step t−1. An augmented reference feature vector for step t may be constructed by concatenating at-1 and at. The augmented reference feature vector at step t may be denoted as At. Where each of at-1 and at is of length N, At will have length 2N. Determining nearest neighbors for augmented feature vectors may be performed in a similar manner as determining nearest neighbors for unaugmented feature vectors.
Any number of feature vectors may be used to create augmented feature vectors. The number of feature vectors in an augmented feature vector may be specified by M, and such augmented feature vectors may be referred to as Mth-order augmented feature vectors. For example, a second-order augmented feature vector at step t may be constructed from the feature vectors at steps t and t−1, and a third-order augmented feature vector at step t may be constructed from the feature vectors at steps t, t−1, and t−2.
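For illustration, an Mth-order augmented feature vector might be constructed as follows, ordering the concatenation from the oldest step to step t (consistent with concatenating xt-1 and xt in the second-order case):

```python
import numpy as np

def augmented_vector(sequence, t, order):
    """Build an Mth-order augmented feature vector at step t by concatenating
    the feature vectors at steps t-(order-1), ..., t-1, t. Each base vector of
    length N yields an augmented vector of length order * N."""
    if t < order - 1:
        raise ValueError("not enough previous steps for this order")
    return np.concatenate([sequence[t - m] for m in range(order - 1, -1, -1)])

# Example: second-order augmented vector at step 3 is the concatenation of the
# feature vectors at steps 2 and 3 of the sequence.
```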
The following equation gives an example of using nearest neighbors with second-order augmented feature vectors to determine a probability that an input sequence is a member of class j:
In this equation, T, Rj, and kj,t have the same meanings as above. Further, in this equation, Kj,t indicates a number of nearest neighbors using second-order augmented feature vectors. In particular, Kj,t indicates a number of second-order augmented feature vectors for class j at step t that are nearest neighbors to the second-order augmented input vector at step t.
At the first step (t=1), augmented feature vectors are not used since there are no previous feature vectors available to create augmented features. For later steps, the probabilities are determined using the following ratio:
for each step t. The numerator of this ratio uses the number of second-order augmented reference feature vectors of class j that are nearest neighbors to the augmented input feature vector at step t. The denominator uses the number of unaugmented reference feature vectors of class j that are nearest neighbors to the unaugmented input feature vector at time t−1. This ratio represents the local probability density of class j reference feature vectors in a neighborhood about the input feature vector at time step t, given the proximity of those vectors to the input feature vector at time step t−1. This ensures that the density estimation of class j vectors at the location of the input feature vector is conditional on the historic path of the input feature vector prior to that time step. This provides an estimate of the similarity of the entire input feature path through the vector space to each class of reference feature paths, rather than the similarity of individual input feature vectors to a static series of vectors clustered in space.
More generally, the following equation gives an example of using nearest neighbors with Mth-order augmented feature vectors to determine a probability that an input sequence is a member of class j:
In this equation, T, Rj, and kj,t have the same meanings as above. Further, in this equation, Kj,tM is a generalization of Kj,t to other orders of augmented feature vectors. Kj,tM indicates a number of nearest neighbors using Mth-order augmented feature vectors. In particular, Kj,tM indicates a number of Mth-order augmented feature vectors for class j at step t that are nearest neighbors to the Mth-order augmented input vector at step t.
As in the previous example, augmented vectors are not used at the first step (t=1). For step 2 to step M−1, augmented feature vectors are used but the order of the augmented feature vectors is reduced. At step M, enough previous steps are available to use Mth-order augmented feature vectors in the computations. At step M and later steps, the probabilities are determined using the following ratio:
for each step t. The numerator of this ratio uses the number of Mth-order augmented reference feature vectors of class j that are nearest neighbors to the augmented input feature vector at step t. The denominator uses the number of (M−1)th-order augmented reference feature vectors of class j that are nearest neighbors to the (M−1)th-order augmented input feature vector at time t−1.
The above computations may be performed for each class. For example, where there are J classes, Pr(ωj|x1, . . . , xt-1, xt) may be computed for each value of j from 1 to J. The input signal may be classified by selecting a class corresponding to a highest probability.
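The following sketch pulls these pieces together for the second-order (M = 2) case. Because the referenced equations are not reproduced above, the per-step ratios (kj,t over Rj at the first step and Kj,t over kj,t-1 at later steps), the multiplicative combination with the priors, and the renormalization across classes are assumed forms used only for illustration; the sketch also assumes aligned reference sequences with no excursions:

```python
import numpy as np

def class_neighbor_count(center, candidates, labels, target_class, max_k=10):
    """Among the k nearest candidate vectors to `center`, count how many belong
    to `target_class`. k is chosen as one less than the smaller of the number
    of candidates and `max_k`, as in the radius discussion above."""
    distances = np.linalg.norm(candidates - center, axis=1)
    k = max(min(len(candidates), max_k) - 1, 1)
    nearest = np.argsort(distances)[:k]
    return sum(1 for i in nearest if labels[i] == target_class)

def classify_trajectory(input_seq, references, priors, max_k=10):
    """Second-order (M = 2) trajectory classifier sketch. `references` maps each
    class to a list of reference sequences already aligned to `input_seq`
    (same number of steps, no excursions). The per-step ratios, multiplicative
    update, and renormalization are assumed forms used for illustration."""
    classes = list(references.keys())
    posterior = {j: priors.get(j, 1.0 / len(classes)) for j in classes}
    labels = [c for c in classes for _ in references[c]]
    for t in range(len(input_seq)):
        step_vecs = np.array([ref[t] for c in classes for ref in references[c]])
        evidence = {}
        for j in classes:
            k_jt = class_neighbor_count(input_seq[t], step_vecs, labels, j, max_k)
            if t == 0:
                evidence[j] = k_jt / len(references[j])      # assumed: k_{j,1} / R_j
            else:
                aug_center = np.concatenate([input_seq[t - 1], input_seq[t]])
                aug_vecs = np.array([np.concatenate([ref[t - 1], ref[t]])
                                     for c in classes for ref in references[c]])
                K_jt = class_neighbor_count(aug_center, aug_vecs, labels, j, max_k)
                prev_vecs = np.array([ref[t - 1]
                                      for c in classes for ref in references[c]])
                k_prev = class_neighbor_count(input_seq[t - 1], prev_vecs,
                                              labels, j, max_k)
                evidence[j] = K_jt / max(k_prev, 1)          # ratio K_{j,t} / k_{j,t-1}
        total = sum(posterior[j] * evidence[j] for j in classes) or 1.0
        posterior = {j: posterior[j] * evidence[j] / total for j in classes}
    best = max(posterior, key=posterior.get)
    return best, posterior
```

In this sketch, the returned posterior plays the role of Pr(ωj|x1, . . . , xT) for each class, and the class with the highest value is selected, consistent with step 240 above.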
The techniques described above have not explicitly accounted for the possibility of excursions in the input sequence or reference sequences. When there are excursions, the computations may be modified as follows.
For an excursion in an input sequence, an aligned reference sequence may not have any feature vectors aligned to the excursion in the input sequence as described in
Consider the computation of
To account for excursions, this is changed to
where Rj,t indicates the number of reference sequences of class j that have a feature vector at step t. To compute kj,t and Rj,t, we need, for each reference sequence of class j, a feature vector from step t. Accordingly, for steps 1-3 and 6 of
Now consider the computation of
To compute Kj,t, we need, for each reference sequence of class j, a feature vector from step t and a feature vector from step t−1. To compute kj,t-1 for class j, we need a feature vector from step t−1. These cannot be computed at step 1 since there is no previous step. At steps 2 and 3, Kj,t and kj,t-1 will each be between 0 and 6 because all feature vectors are available. At step 4, Kj,t will be between 0 and 5 because reference sequence 6 does not have a feature vector at step 4, and kj,t-1 will be between 0 and 6 because all feature vectors are available at the previous step. At step 5, Kj,t will be between 0 and 3 because reference sequences 4-6 do not have feature vectors at step 5, and kj,t-1 will be between 0 and 5 because reference sequence 6 does not have a feature vector at the previous step. At step 6, Kj,t will be between 0 and 3 because reference sequences 4-6 do not have feature vectors at the previous step, and kj,t-1 will be between 0 and 3 for the same reason.
The same principles apply for the computation of
except that to compute Kj,tM, feature vectors are needed for the current step and the M−1 previous steps. Accordingly, for larger values of M, a single missing feature vector will have a greater effect since it will reduce the number of available reference sequences for a total of M steps.
The determination of the radius of the n-sphere may also take into account the number of available reference feature vectors at a step. For example, instead of determining k using the total number of reference feature vectors, it may be determined by using the total number of available reference feature vectors at a step.
The above addresses an excursion in an input sequence. Where there is an excursion in a reference sequence, the feature vectors corresponding to the excursion may be removed during the alignment process. Where the feature vectors corresponding to an excursion in a reference sequence are not removed (e.g., alignment is not performed) these feature vectors may simply be ignored or passed over when implementing the techniques described above.
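As an illustration of this bookkeeping, the following sketch counts, per class, the reference sequences that actually have feature vectors at the steps needed for a given order; it assumes (as a representation choice, not stated above) that aligned reference sequences mark steps with no aligned feature vector using None:

```python
def available_counts(aligned_refs, t, order=1):
    """Count, for each class, the reference sequences that have feature vectors
    at step t and at the (order - 1) preceding steps. Aligned reference
    sequences are assumed to use None at steps with no aligned feature vector
    (e.g., an excursion in the input sequence)."""
    counts = {}
    for class_label, sequences in aligned_refs.items():
        available = 0
        for seq in sequences:
            steps = range(t - order + 1, t + 1)
            if all(0 <= s < len(seq) and seq[s] is not None for s in steps):
                available += 1
        counts[class_label] = available  # equals R_{j,t} when order == 1
    return counts
```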
At block 910, an input sequence of feature vectors is computed from an input signal. The input signal may be any type of signal that may be classified. For example, the input signal may be an audio signal, an image, or a video. The feature vectors may include any type of feature vectors that may be computed from the input signal, including but not limited to harmonic amplitudes or mel-frequency cepstral coefficients. The sequence may span any appropriate dimension. For an audio signal, the sequence may span time, and feature vectors may be computed at regular time intervals, such as every 10 milliseconds. In some implementations, a segmentation operation may be performed on the input signal or the input sequence as described above.
At block 920, a plurality of reference sequences is obtained. Each reference sequence may be a sequence of feature vectors, such as any of the types of sequences of feature vectors described above for the input sequence. Each reference sequence may correspond to a class of a plurality of classes. In some implementations, a segmentation operation may have been performed on the reference sequences as described above.
At block 930, each reference sequence of the plurality of reference sequences is aligned with the input sequence as described above. In some implementations, block 930 may not be performed and block 940 may follow block 920.
Blocks 940 to 980 may be performed for one or more steps of the input sequence.
At block 940, a first center vector is obtained for a first step of the input sequence. The first center vector may comprise the feature vector of the input sequence at the first step. In some implementations, the first center vector may also comprise other feature vectors of the input sequence as described above. For example, the first center vector may correspond to the feature vector of the input sequence from step t concatenated with the feature vector of the input sequence from step t−1.
At block 950, a plurality of candidate vectors is obtained. Each candidate vector may be computed from a reference sequence and may comprise the feature vector of the corresponding reference sequence from step t. In some implementations, each candidate vector may comprise additional feature vectors. For example, a candidate vector may correspond to the feature vector of a reference sequence from step t concatenated with the feature vector of that reference sequence from step t−1.
At block 960, a plurality of the candidate vectors is selected as nearest neighbors to the center vector. Any of the techniques described above may be used to select the nearest neighbors. In some implementations, the nearest neighbors may consist of the k candidate vectors that are closest to the center vector according to some distance measure (e.g., Euclidean distance).
At block 970, a number of nearest neighbors that correspond to a first class is determined. For example, if there are 10 nearest neighbors, 8 of them may correspond to the first class and the other 2 may correspond to other classes. The number of nearest neighbors that correspond to other classes may also be determined. For example, if there are 10 nearest neighbors, 8 of them may correspond to the first class, 1 may correspond to a second class, and 1 may correspond to a third class. In some implementations, the number of nearest neighbors that correspond to the first class may correspond to Kj,t, as described above.
At block 980, a score is computed that indicates a similarity between the input sequence (or a portion of the input sequence that has been processed) and the first class. This score may be computed using any of the techniques described above. In some implementations, the score for the first step may be computed recursively using a score from the previous step. In some implementations, the score may be computed as Pr(ωj|x1, . . . , xt-1, xt) using any of the techniques described above (where the first class is ωj and the first step is step t). Other scores may also be computed that indicate similarities between the input sequence and other classes.
Blocks 940 to 980 may be performed iteratively for any number of steps of the input sequence. In some implementations, blocks 940 to 980 are performed for each step of the input sequence.
At block 990, it is determined that the input signal corresponds to (e.g., best matches) the first class. In some implementations, the score from the final iteration of blocks 940 to 980 may be used to determine that the input signal corresponds to the first class. For example, after the final iteration of blocks 940 to 980, scores may be computed indicating a similarity between the input signal and each class of a plurality of classes, and the class having a highest score of the plurality of scores may be selected as the first class.
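A hypothetical end-to-end use of the sketches above, for example for a household speaker-identification task, might look like the following; the audio signal, the aligned reference sequences, and the priors are assumed to have been prepared by the earlier steps, and the class names are placeholders:

```python
# Hypothetical glue code tying the earlier sketches to blocks 910-990.
input_sequence = compute_feature_sequence(audio, sample_rate)          # block 910
best_class, scores = classify_trajectory(input_sequence,
                                          aligned_references,          # blocks 920-930
                                          priors={"alice": 0.5, "bob": 0.5})
print(best_class, scores)                                              # block 990
```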
Depending on the implementation, blocks of any of the techniques described above may be performed in a different sequence, may be combined, may be split into multiple blocks, or may not be performed at all. The blocks may be performed by a general purpose computer, may be performed by a computer specialized for a particular application, may be performed by a single computer or processor, may be performed by multiple computers or processors, may be performed sequentially, or may be performed simultaneously.
The techniques described above may be implemented in hardware (e.g., field-programmable gate array (FPGA), application specific integrated circuit (ASIC)), in software, or a combination of hardware and software. The choice of implementing any portion of the above techniques in hardware or software may depend on the requirements of a particular implementation. A software module or program code may reside in volatile memory, non-volatile memory, RAM, flash memory, ROM, EPROM, or any other form of a non-transitory computer-readable storage medium.
Conditional language used herein, such as “can,” “could,” “might,” “may,” and “e.g.,” is intended to convey that certain implementations include, while other implementations do not include, certain features, elements, and/or steps. Thus, such conditional language indicates that features, elements, and/or steps are not required for some implementations. The terms “comprising,” “including,” “having,” and the like are synonymous, are used in an open-ended fashion, and do not exclude additional elements, features, acts, or operations. The term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is to be understood to convey that an item, term, etc. may be either X, Y or Z, or a combination thereof. Thus, such conjunctive language is not intended to imply that certain embodiments require at least one of X, at least one of Y and at least one of Z to each be present.
While the above detailed description has shown, described and pointed out novel features as applied to various implementations, it can be understood that various omissions, substitutions and changes in the form and details of the devices or techniques illustrated may be made without departing from the spirit of the disclosure. The scope of inventions disclosed herein is indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.