Signal recognition has traditionally been performed on signals arising from single domains, such as pictures or sounds. For example, recognizing a particular image of a person as a constituent of a given picture and recognizing a particular utterance of a speaker as a constituent of a given sound have typically been accomplished by separate analyses of pictures and sounds.
According to an embodiment of the disclosed subject matter, a method of determining the identity of a speaker includes reading a first video clip for training a neural network, the first video clip including a first audio content and a first video content, the first audio content including a prescribed utterance of a first speaker who is identified by a speaker identifier and the first video content including an image of the first speaker; extracting a first audio feature from the first audio content; extracting a first video feature from the first video content; obtaining, by the neural network, an authentication signature based on the first audio feature and the first video feature; storing the authentication signature and the speaker identifier that corresponds to the authentication signature in a memory; reading a second video clip including a second audio content and a second video content, the second audio content including an utterance of a second speaker who is not pre-identified and the second video content including an image of the second speaker; extracting a second audio feature from the second audio content; extracting a second video feature from the second video content; obtaining, by the neural network, a signature of the second speaker based on the second audio feature and the second video feature; determining, by the neural network, a difference between the signature of the second speaker and the authentication signature; and determining, by the neural network, whether the second speaker in the second video clip is the same as the first speaker in the first video clip based on the difference between the signature of the second speaker and the authentication signature.
According to an embodiment of the disclosed subject matter, an apparatus for determining the identity of a speaker in a video clip includes a memory and a processor communicably coupled to the memory. In an embodiment, the processor is configured to execute instructions to read a first video clip for training a neural network, the first video clip including a first audio content and a first video content, the first audio content including a prescribed utterance of a first speaker who is identified by a speaker identifier and the first video content including an image of the first speaker; extract a first audio feature from the first audio content; extract a first video feature from the first video content; obtain an authentication signature based on the first audio feature and the first video feature; store the authentication signature and the speaker identifier that corresponds to the authentication signature in the memory; read a second video clip including a second audio content and a second video content, the second audio content including an utterance of a second speaker who is not pre-identified and the second video content including an image of the second speaker; extract a second audio feature from the second audio content; extract a second video feature from the second video content; obtain a signature of the second speaker based on the second audio feature and the second video feature; determine a difference between the signature of the second speaker and the authentication signature; and determine whether the second speaker in the second video clip is the same as the first speaker in the first video clip based on the difference between the signature of the second speaker and the authentication signature.
According to an embodiment of the disclosed subject matter, a method of estimating the direction of a sound includes reading a first video clip for training a neural network, the first video clip including a first audio content and a first video content; extracting a first audio feature from the first audio content; extracting a first video feature from the first video content; determining, by the neural network, a label indicating at least a direction of a first sound from a sound source in the first video clip based on the first audio feature and the first video feature; storing the first audio feature and the first video feature corresponding to the label in a memory; reading a second video clip including a second audio content and a second video content, the second audio content including a second sound from the sound source, wherein the direction of the second sound is not pre-identified; extracting a second audio feature from the second audio content; extracting a second video feature from the second video content; and obtaining, by the neural network, a probable direction of the second sound based on a comparison of the second audio feature to the first audio feature and a comparison of the second video feature to the first video feature.
According to an embodiment of the disclosed subject matter, an apparatus for estimating the direction of a sound in a video clip includes a memory and a processor communicably coupled to the memory. In an embodiment, the processor is configured to execute instructions to read a first video clip for training a neural network, the first video clip including a first audio content and a first video content; extract a first audio feature from the first audio content; extract a first video feature from the first video content; determine a label indicating at least a direction of a first sound from a sound source in the first video clip based on the first audio feature and the first video feature; store the first audio feature and the first video feature corresponding to the label in a memory; read a second video clip including a second audio content and a second video content, the second audio content including a second sound from the sound source, wherein the direction of the second sound is not pre-identified; extract a second audio feature from the second audio content; extract a second video feature from the second video content; and obtain a probable direction of the second sound based on a comparison of the second audio feature to the first audio feature and a comparison of the second video feature to the first video feature.
According to an embodiment of the disclosed subject matter, means for determining the identity of a speaker are provided, which include means for reading a first video clip for training a neural network, the first video clip including a first audio content and a first video content, the first audio content including a prescribed utterance of a first speaker who is identified by a speaker identifier and the first video content including an image of the first speaker; means for extracting a first audio feature from the first audio content; means for extracting a first video feature from the first video content; means for obtaining an authentication signature based on the first audio feature and the first video feature; means for storing the authentication signature and the speaker identifier that corresponds to the authentication signature in a memory; means for reading a second video clip including a second audio content and a second video content, the second audio content including an utterance of a second speaker who is not pre-identified and the second video content including an image of the second speaker; means for extracting a second audio feature from the second audio content; means for extracting a second video feature from the second video content; means for obtaining a signature of the second speaker based on the second audio feature and the second video feature; means for determining a difference between the signature of the second speaker and the authentication signature; and means for determining whether the second speaker in the second video clip is the same as the first speaker in the first video clip based on the difference between the signature of the second speaker and the authentication signature.
According to an embodiment of the disclosed subject matter, means for estimating the direction of a sound are provided, which include means for reading a first video clip for training a neural network, the first video clip including a first audio content and a first video content; means for extracting a first audio feature from the first audio content; means for extracting a first video feature from the first video content; means for determining a label indicating at least a direction of a first sound from a sound source in the first video clip based on the first audio feature and the first video feature; means for storing the first audio feature and the first video feature corresponding to the label in a memory; means for reading a second video clip including a second audio content and a second video content, the second audio content including a second sound from the sound source, wherein the direction of the second sound is not pre-identified; means for extracting a second audio feature from the second audio content; means for extracting a second video feature from the second video content; and means for obtaining a probable direction of the second sound based on a comparison of the second audio feature to the first audio feature and a comparison of the second video feature to the first video feature.
Additional features, advantages, and embodiments of the disclosed subject matter may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary and the following detailed description are illustrative and are intended to provide further explanation without limiting the scope of the claims.
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated in and constitute a part of this specification. The drawings also illustrate embodiments of the disclosed subject matter and together with the detailed description serve to explain the principles of embodiments of the disclosed subject matter. No attempt is made to show structural details in more detail than may be necessary for a fundamental understanding of the disclosed subject matter and various ways in which it may be practiced.
It is desirable to recognize signals of different types in composite domains rather than separate domains for improved efficiency. Signals of different types in different domains may be recognized for various purposes, for example, to determine the identity of a person or to estimate the direction of a sound or the location of a speaker or sound source based on audio and video features extracted from a video clip that includes a soundtrack as well as a video content. Although various examples described below relate to recognition of audio and video signals in composite audio/video domains, the principles of the disclosed subject matter may be applicable to other types of signals indicative of measurable or quantifiable characteristics. For example, signals representing quantifiable characteristics based on sensory inputs, such as tactile, olfactory, or gustatory inputs, may also be analyzed according to embodiments of the disclosed subject matter. As alternatives or in addition, the principles of the disclosed subject matter may be applicable to signals produced by various types of electrical, mechanical or chemical sensors or detectors, such as temperature sensors, carbon dioxide detectors or other types of toxic gas detectors, infrared sensors, ultraviolet sensors, motion detectors, position sensors, accelerometers, gyroscopes, compasses, magnetic sensors, Reed switches, or the like.
In some implementations, recognition of signals from different types of sensors may be accomplished by a neural network. A sensor may generate an output that is indicative of a measured quantity. For example, a video camera may respond to received light over prescribed bands of sensitivity and provide a map of illumination data based on a sampling of the received light over space and time. Likewise, a microphone may respond to received sound over a frequency range and provide a map of perturbations in atmospheric pressure based on a sampling of the received sound over time. A stereo system of two or more microphones may provide a map of perturbations in atmospheric pressure based on a sampling of the received sound over space and time. Thus, the domain of the video camera is illumination over a region of space and time, and the domain of the stereo microphone system is atmospheric pressure perturbation over a region of space and time.
As a generalization, each sensor S may have its own domain D, such that its input to the neural network is S(D). The neural network may be trained to perform recognition of the signal S(D). The neural network NN may apply an activation function A to a linear combination of a data vector and a weight vector W to generate a result R:
R=NN[A(S(D)·W)]
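As a non-limiting illustration, the innermost operation of the expression above may be sketched in a few lines of Python; the choice of a logistic sigmoid for the activation function A and the example data and weight vectors are assumptions made only for this sketch:

    import numpy as np

    def activation(x):
        # Hypothetical choice of A for this sketch: a logistic sigmoid.
        return 1.0 / (1.0 + np.exp(-x))

    def recognize(sensor_data, weights):
        # S(D)·W: linear combination of the flattened sensor data vector and the weight vector.
        combined = np.dot(sensor_data.ravel(), weights)
        # A(S(D)·W): apply the activation function to obtain the result R.
        return activation(combined)

    # Example with a four-sample sensor reading and a matching weight vector.
    r = recognize(np.array([0.2, 0.5, 0.1, 0.7]), np.array([0.3, -0.2, 0.8, 0.1]))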
Assuming that a signal recognition system has a total number of i domains and a total number of j sensors, the domains may be denoted as D1, D2, . . . Di and the sensors may be denoted as S1, S2, . . . Sj. The result R may be considered as a composition of multiple neural networks each operating in a respective domain:
R=NN[D1]·NN[D2]· . . . ·NN[Di]
In addition or as an alternative, the result R may be formed by the operation of another neural network on the outputs of the individual neural networks in order to achieve a reduction in dimensionality for recognition:
R=NN1[NN2[D1], NN3[D2], . . . , NNj[Di]]
where each of NN1, NN2, . . . NNj is a unique neural network.
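As a non-limiting sketch of this arrangement, the following Python code (using the PyTorch library) builds one small network per domain and a further network that operates on their concatenated outputs; the two example domains, all layer sizes, and the class count are assumptions made only for illustration:

    import torch
    import torch.nn as nn

    class PerDomainNet(nn.Module):
        def __init__(self, in_dim, out_dim=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

        def forward(self, x):
            return self.net(x)

    class FusionNet(nn.Module):
        def __init__(self, domain_dims, num_classes):
            super().__init__()
            # One unique network per domain (NN2 ... NNj in the expression above).
            self.domain_nets = nn.ModuleList(PerDomainNet(d) for d in domain_dims)
            # NN1 operates on the concatenated per-domain outputs to reduce dimensionality.
            self.fusion = nn.Linear(16 * len(domain_dims), num_classes)

        def forward(self, domain_inputs):
            outputs = [net(x) for net, x in zip(self.domain_nets, domain_inputs)]
            return self.fusion(torch.cat(outputs, dim=-1))

    # Example: an audio domain with 128 features and a video domain with 512 features.
    model = FusionNet(domain_dims=[128, 512], num_classes=10)
    r = model([torch.randn(1, 128), torch.randn(1, 512)])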
According to embodiments of the disclosed subject matter, a single neural network may be trained for signal recognition in a composite domain even if signals of different types belong to different domains:
R=NN[D1, D2, . . . , Di]
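By contrast, a composite-domain sketch may feed a single network with the features of all domains at once, for example by concatenating them; again, the dimensions and layer sizes below are illustrative assumptions only:

    import torch
    import torch.nn as nn

    audio_dim, video_dim, num_classes = 128, 512, 10

    composite_net = nn.Sequential(
        nn.Linear(audio_dim + video_dim, 256),  # a single network sees both domains at once
        nn.ReLU(),
        nn.Linear(256, num_classes),
    )

    audio_features = torch.randn(1, audio_dim)
    video_features = torch.randn(1, video_dim)
    r = composite_net(torch.cat([audio_features, video_features], dim=-1))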
Two specific examples of signal recognition in the audio/video domains performed by an audio/video system are described below.
In the first example, the audio and video features are transmitted to a neural network to determine the identity of a person. In one implementation, speaker identification may involve three phases: a first phase of generating audio/video features from video clips that include prescribed utterances of one or more known speakers to train the neural network; a second phase of obtaining authentication signatures of the known speakers and storing them with their corresponding speaker identifiers; and a third phase of determining the identity of a human speaker who is not pre-identified in a non-training video clip by comparing the speaker's signature with the stored authentication signatures.
As used herein, “features” are efficient numerical representations of signals or characteristics thereof for training a neural network in one or more domains. An audio “feature” may be one of various expressions of a complex value representing an extracted audio signal in a normalized audio frame. For example, the feature may be an expression of a complex value with real and imaginary components, or with a magnitude and a phase. The magnitude may be expressed in the form of a linear magnitude, a log magnitude, or a log-mel magnitude based on the mel scale of pitch, for example.
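As a non-limiting sketch of these feature forms, the following Python code computes the complex spectrum of one windowed audio frame and expresses it as real/imaginary parts, as magnitude and phase, and as a log magnitude; the frame length, window, and random samples are assumptions made for illustration, and the mel filterbank step of a log-mel magnitude is only noted in a comment:

    import numpy as np

    frame = np.random.randn(1024)              # one normalized audio frame (placeholder samples)
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)     # complex values: real + 1j*imag

    real, imag = spectrum.real, spectrum.imag  # rectangular form
    magnitude = np.abs(spectrum)               # linear magnitude
    phase = np.angle(spectrum)                 # phase
    log_magnitude = np.log(magnitude + 1e-10)  # log magnitude (epsilon avoids log(0))
    # A log-mel magnitude would additionally pool the linear magnitudes through a
    # mel filterbank before taking the log (omitted here for brevity).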
After the extracted and time-aligned audio and video features are stored along with the speaker identifier as a label in block 220, a determination is made as to whether an additional video clip is available to be read in block 222. If it is determined that an additional video clip is available to be read for training the neural network in block 222, then the processes of extracting audio and video features from the additional video clip, time-aligning the extracted audio and video features, and storing the time-aligned audio and video features with the associated speaker identifier as a label in blocks 204-220 are repeated.
In some implementations, two or more video clips featuring the same speaker may be used to train the neural network for determining the identity of the speaker. For example, two or more training clips each featuring a slightly different speech and a slightly different pose of the same speaker may be provided to train the neural network to recognize or to determine the identity of the speaker who is not pre-identified in a video stream that is not part of a training video clip. In some implementations, additional video clips featuring different speakers may be provided. In these implementations, audio and video features may be extracted from the audio and video contents, time-aligned, and stored along with their associated speaker identifiers as labels in a table that includes training data for multiple speakers. In some implementations, more than one training video clip may be provided for each of the multiple speakers to allow the neural network to differentiate effectively and efficiently between the identities of multiple speakers in a video stream that is not part of a training video clip.
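For illustration only, such a training table might be organized as a list of labeled rows, one per clip or per time-aligned audio/video frame, with hypothetical speaker identifiers and truncated feature vectors:

    training_table = [
        {"speaker_id": "spk_001", "clip": "clip_a", "audio_feat": [0.12, -0.40, 0.07], "video_feat": [0.88, 0.05, -0.31]},
        {"speaker_id": "spk_001", "clip": "clip_b", "audio_feat": [0.10, -0.35, 0.02], "video_feat": [0.80, 0.11, -0.28]},  # second clip, same speaker
        {"speaker_id": "spk_002", "clip": "clip_c", "audio_feat": [-0.22, 0.18, 0.44], "video_feat": [0.05, -0.67, 0.21]},  # a different speaker
    ]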
If it is determined that no additional training video clip is to be read in block 222, the audio and video features and the associated labels in the table are passed to the neural network for training in block 224, and the first phase of training the neural network for identifying a human speaker concludes.
In some implementations, in order to generate additional data for training the neural network, the video features extracted from the video content of one speaker may be time-aligned with the audio features extracted from the audio content of another speaker to generate a new set of data with associated labels corresponding to the identity of the speaker who provided the video content and the identity of the other speaker who provided the audio content. Such new sets of data with their associated labels may be entered into a table for cross-referencing the identities of different speakers. By using these sets of data with cross-referencing of different speakers, the neural network may be trained to recognize which human utterance is not associated with a given video image, for example. In some implementations, time-alignment of audio and video features of different speakers may be achieved by using alignment algorithms such as hidden Markov models or dynamic time warping, which are known to persons skilled in the art. In some implementations, the neural network architecture may be a deep neural network with one or more locally connected network (LCN), convolutional neural network (CNN), or long short-term memory (LSTM) layers, or any combination thereof.
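As a non-limiting sketch of one such alignment technique, the following Python implementation of dynamic time warping returns a warping path that maps each audio feature frame to a video feature frame; the frame counts, feature dimension, and Euclidean local distance are assumptions made only for this sketch:

    import numpy as np

    def dtw_path(seq_a, seq_b):
        # Accumulated-cost matrix for aligning two feature sequences (rows are frames).
        n, m = len(seq_a), len(seq_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local frame distance
                cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        # Backtrack from the end to recover which frame of seq_a maps to which frame of seq_b.
        path, i, j = [(n - 1, m - 1)], n, m
        while (i, j) != (1, 1):
            step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
            path.append((i - 1, j - 1))
        return path[::-1]

    audio_frames = np.random.randn(40, 13)  # e.g., 40 audio feature frames
    video_frames = np.random.randn(25, 13)  # e.g., 25 video feature frames
    alignment = dtw_path(audio_frames, video_frames)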
The authentication signature of a given speaker may be stored in a template table for training the neural network. In one implementation, each authentication signature and its associated label, that is, the speaker identifier, may be stored as a key-value pair in a template table, as shown in block 310. The speaker identifier or label may be stored as the key and the authentication signature may be stored as the value in the key-value pair, for example. Multiple sets of key-value pairs for multiple speakers may be stored in a relational database. The authentication signatures and the labels indicating the corresponding speaker identities of multiple speakers may be stored in a database in various other manners as long as the authentication signatures are correctly associated with their corresponding labels or speaker identities.
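For illustration only, such a template table might be kept as a mapping from speaker identifier (the key) to a list of stored authentication signatures (the values); the identifiers and 16-bit signature strings below are hypothetical placeholders:

    template_table = {}

    def store_authentication_signature(speaker_id, signature_bits):
        # A speaker identifier may be associated with more than one signature.
        template_table.setdefault(speaker_id, []).append(signature_bits)

    store_authentication_signature("spk_001", "1011001110001011")
    store_authentication_signature("spk_002", "0100110001110100")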
After a given key-value pair is stored in block 310, a determination is made as to whether an additional video clip is available to be read in block 312. If an additional video clip is available to be read for obtaining an additional authentication signature, then the process steps in blocks 304-310 are repeated to obtain the additional authentication signature.
As described above, in the third phase, the trained neural network obtains a signature of a human speaker who is not pre-identified in a non-training video clip based on time-aligned audio and video features extracted from that video clip, and determines a difference, such as a Hamming distance, between the signature of the human speaker and an authentication signature stored in the template table.
In block 412, a determination is made as to whether the difference between the signature of the human speaker and the authentication signature is sufficiently small. As known to persons skilled in the art, the Hamming distance between two binary strings is zero if the two binary strings are identical to each other, whereas a large Hamming distance indicates a large number of mismatches between corresponding bits of the two binary strings. In some implementations, the determination of whether the difference between the signature of the human speaker and the authentication signature is sufficiently small may be based on determining whether the Hamming distance between the two signatures is less than or equal to a predetermined threshold distance. For example, if the signature of the human speaker and the authentication signature each comprise a 16-bit string, the difference between the two signatures may be deemed sufficiently small if the Hamming distance between the two strings is 2 or less.
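As a non-limiting sketch, the Hamming-distance test described above may be written as follows, using the 16-bit strings and the threshold of 2 from the example (the particular bit values are made up):

    def hamming_distance(bits_a, bits_b):
        # Number of positions at which two equal-length bit strings differ.
        assert len(bits_a) == len(bits_b)
        return sum(a != b for a, b in zip(bits_a, bits_b))

    speaker_signature = "1011001110001011"
    auth_signature = "1011001110001001"  # differs from the speaker signature in one bit

    is_sufficiently_small = hamming_distance(speaker_signature, auth_signature) <= 2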
If it is determined that the difference between the signature of the human speaker and the authentication signature is sufficiently small in block 412, then the identity of the human speaker in the non-training video clip may be determined based on a complete or at least a substantial match between the two signatures. In some implementations, an identity flag of the human speaker in the non-training video clip may be set as identity_flag=TRUE, and the identity of the human speaker may be set equal to the speaker identifier associated with the authentication signature having the smallest Hamming distance from the signature of the human speaker, that is, identity=template_speaker_id_with_min_dist, as shown in block 414. After the identity of the human speaker is determined in block 414, the process concludes in block 418. On the other hand, if it is determined that the difference between the signature of the human speaker and the authentication signature is not sufficiently small in block 412, then the identity flag may be set as identity_flag=FALSE, indicating a mismatch between the two signatures, as shown in block 416.
As described above, in some implementations, more than one authentication signature may be associated with a given speaker identifier in the template table. The signature of a human speaker in a non-training video clip may match one but not all of the authentication signatures associated with that speaker identifier stored in the template table. The identity of the human speaker may be set equal to that speaker identifier as long as one of the authentication signatures is a sufficiently close match to the signature of the human speaker.
If the template table stores authentication signatures associated with multiple speaker identifiers, the process of determining the difference between the signature of the human speaker and each of the authentication signatures stored in the template table in blocks 410 and 412 may be repeated until an authentication signature that has a sufficiently small difference from the signature of the human speaker is found and the identity of the human speaker is determined. For example, a determination may be made as to whether an additional authentication signature is available for comparison with the signature of the human speaker in block 420 if the current authentication signature is not a sufficiently close match to the signature of the human speaker. If an additional authentication signature is available, then the steps of determining the difference between the additional authentication signature and the signature of the human speaker in block 410 and determining whether the difference is sufficiently small in block 412 are repeated. If no additional authentication signature is available for comparison, the process concludes in block 418.
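Putting the steps of blocks 410-420 together, a non-limiting sketch of the identification loop might look as follows; it follows the minimum-distance variant described above, and the template entries, threshold, and flag names are illustrative assumptions:

    def hamming_distance(bits_a, bits_b):
        return sum(a != b for a, b in zip(bits_a, bits_b))

    def identify_speaker(speaker_signature, template_table, threshold=2):
        best_id, best_dist = None, None
        for speaker_id, signatures in template_table.items():
            for auth_signature in signatures:  # a speaker identifier may have several signatures
                d = hamming_distance(speaker_signature, auth_signature)
                if best_dist is None or d < best_dist:
                    best_id, best_dist = speaker_id, d
        if best_dist is not None and best_dist <= threshold:
            return {"identity_flag": True, "identity": best_id}  # template_speaker_id_with_min_dist
        return {"identity_flag": False, "identity": None}

    templates = {"spk_001": ["1011001110001011"], "spk_002": ["0100110001110100"]}
    result = identify_speaker("1011001110001001", templates)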
In the second example, the audio and video features are transmitted to a neural network to estimate the direction of arrival of a sound or the location of a sound source based on both the audio and video contents of a video clip. Although specific examples are described below for estimating the direction of arrival of human speech or the location of a human speaker based on audio and video contents, the principles of the disclosed subject matter may also be applicable to estimating the direction or location of other types of sound sources, such as sources of sounds made by animals or machines. In one implementation, the estimation of the direction of arrival of a speech or the location of a speaker may involve two phases, including a first phase of using audio and video features to train a neural network to estimate the direction of arrival of the speech or the location of the speaker, and a second phase of using the trained neural network to estimate the direction of arrival of a speech, or the location of a speaker, that is not pre-identified in a non-training video clip.
If it is determined that the video clip contains human speech in block 506, then a direction or location label may be assigned to the video clip, or at least to the speech portion of the video clip.
After the direction or location label is determined, time-aligned audio and video features may be extracted from the training video clip in block 512, and the time-aligned audio and video features in each audio/video frame may be stored with a corresponding direction or location label in a table in block 514. In some implementations, the time-aligned audio and video features and their corresponding labels may be stored as key-value pairs, in which the labels are the keys and the audio and video features are the values, in a relational database, for example. In some implementations, the direction label may indicate the azimuth and elevation angles of the direction of sound propagation in three-dimensional spherical coordinates. In addition or as an alternative, the location of the human speaker in a given time-aligned audio/video frame may be provided as a label. For example, the location of the speaker may be expressed as the azimuth angle, the elevation angle, and the distance of the speaker with respect to a reference point which serves as the origin in three-dimensional spherical coordinates. Other types of three-dimensional coordinates such as Cartesian coordinates or cylindrical coordinates may also be used to indicate the location of the speaker.
In some instances, the speaker may remain at a fixed location in the training video clip, such that the location of the speaker may be used as a reference point or the ground truth for the label. In other instances, the speaker may move from one position to another in the training video clip, and the audio and video features within each time-aligned audio/video frame may be associated with a distinct label. The varying directions of sound propagation or the varying locations of the sound source in the training video clip may be tracked over time by distinct direction or location labels associated with their respective time-aligned audio/video frames in the table generated in block 514. As described above, the direction or location labels and their corresponding audio and video features in time-aligned audio/video frames may be stored as key-value pairs in a template table or a relational database, or in various other manners as long as the labels are correctly associated with their corresponding audio/video frames.
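For illustration only, one hypothetical way to represent such a label and its associated time-aligned features, together with a conversion from the spherical-coordinate label to Cartesian coordinates, is sketched below; the field names, angle conventions, and values are assumptions made for this sketch:

    import math

    def spherical_to_cartesian(azimuth_deg, elevation_deg, distance_m):
        az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
        x = distance_m * math.cos(el) * math.cos(az)
        y = distance_m * math.cos(el) * math.sin(az)
        z = distance_m * math.sin(el)
        return x, y, z

    # Label (key) and time-aligned features (value) for one audio/video frame.
    label = {"azimuth_deg": 30.0, "elevation_deg": 10.0, "distance_m": 2.5}
    frame_entry = {
        "label": label,
        "audio_feat": [0.12, -0.40, 0.07],  # placeholder feature values
        "video_feat": [0.88, 0.05, -0.31],
    }
    cartesian_location = spherical_to_cartesian(**label)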
After the time-aligned audio and video features are extracted from the video clip in block 606, the audio and video features are passed through the neural network to obtain a maximum probability vector of the direction of the sound or speech, as shown in block 608. The maximum probability vector may be obtained by finding the closest match between the time-aligned audio and video features extracted from the non-training video clip and the time-aligned audio and video features stored with their corresponding direction or location labels during the training phase.
In embodiments in which the direction of arrival of the speech is to be estimated, the probability vector may be a two-dimensional vector with one dimension representing an azimuth and the other dimension representing an elevation in spherical coordinates. In such embodiments, the maximum probability vector may be indicative of the highest likelihood of an exact or at least the closest match between the actual direction of arrival of the speech and one of the direction labels stored in a table or database, based on comparisons of the time-aligned audio and video features extracted from the non-training video clip with the time-aligned audio and video features stored during the training phase.
In embodiments in which the location of the speaker is to be estimated, the probability vector may be a three-dimensional vector with one dimension representing an azimuth, another dimension representing an elevation, and yet another dimension representing a distance in spherical coordinates. In such embodiments, the maximum probability vector may be indicative of the highest likelihood of an exact or at least the closest match between the actual location of the speaker and one of the location labels stored in a table or database, based on comparisons of the time-aligned audio and video features extracted from the non-training video clip with the time-aligned audio and video features stored during the training phase.
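As a non-limiting sketch of this step, the network output may be treated as a grid of probabilities over discretized azimuth and elevation bins (with a third axis of distance bins for location estimation), and the most probable bin selected; the bin widths and the randomly generated probabilities below are placeholders for an actual network output:

    import numpy as np

    azimuth_bins = np.arange(0, 360, 10)     # 36 azimuth bins of 10 degrees each
    elevation_bins = np.arange(-90, 91, 10)  # 19 elevation bins of 10 degrees each

    probs = np.random.rand(len(azimuth_bins), len(elevation_bins))
    probs /= probs.sum()                     # stand-in for the network's output probabilities

    az_idx, el_idx = np.unravel_index(np.argmax(probs), probs.shape)
    estimated_direction = (azimuth_bins[az_idx], elevation_bins[el_idx])
    # For location estimation, a third axis of distance bins would be added and the
    # argmax taken over the three-dimensional grid in the same way.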
Embodiments of the presently disclosed subject matter may be implemented in and used with a variety of component and network architectures. For example, the neural network 16 may be implemented using a general-purpose computing device, such as a computer 20 in which a bus 21 interconnects major components including a central processor 24, memory components such as RAM, ROM, or flash memory, a fixed storage 23 such as a hard disk drive, and a network interface 29.
The bus 21 allows data communication between the central processor 24 and one or more memory components, which may include RAM, ROM, and other memory, as previously noted. Typically, RAM is the main memory into which an operating system and application programs are loaded. A ROM or flash memory component can contain, among other code, the Basic Input/Output System (BIOS), which controls basic hardware operation such as the interaction with peripheral components. Applications resident with the computer 20 are generally stored on and accessed via a computer-readable medium, such as a hard disk drive (e.g., fixed storage 23), an optical drive, floppy disk, or other storage medium.
The fixed storage 23 may be integral with the computer 20 or may be separate and accessed through other interfaces. The network interface 29 may provide a direct connection to a remote server via a wired or wireless connection. The network interface 29 may provide such connection using any suitable technique and protocol as will be readily understood by one of skill in the art, including digital cellular telephone, Wi-Fi, Bluetooth®, near-field, and the like. For example, the network interface 29 may allow the computer to communicate with other computers via one or more local, wide-area, or other communication networks, as described in further detail below.
Many other devices or components (not shown) may be connected in a similar manner (e.g., document scanners, digital cameras, and so on). Conversely, not all of the components described above need be present to practice the disclosed subject matter.
More generally, various embodiments of the presently disclosed subject matter may include or be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments also may be embodied in the form of a computer program product having computer program code containing instructions embodied in non-transitory or tangible media, such as floppy diskettes, CD-ROMs, hard drives, USB (universal serial bus) drives, or any other machine readable storage medium, such that when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. Embodiments also may be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, such that when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
In some configurations, a set of computer-readable instructions stored on a computer-readable storage medium may be implemented by a general-purpose processor, which may transform the general-purpose processor or a device containing the general-purpose processor into a special-purpose device configured to implement or carry out the instructions. Embodiments may be implemented using hardware that may include a processor, such as a general purpose microprocessor or an Application Specific Integrated Circuit (ASIC) that embodies all or part of the techniques according to embodiments of the disclosed subject matter in hardware or firmware. The processor may be coupled to memory, such as RAM, ROM, flash memory, a hard disk or any other device capable of storing electronic information. The memory may store instructions adapted to be executed by the processor to perform the techniques according to embodiments of the disclosed subject matter.
In some embodiments, the microphones 10a and 10b described above may be implemented as sensors of the kind disclosed herein.
In general, a “sensor” as disclosed herein may include multiple sensors or sub-sensors, such as where a position sensor includes both a global positioning sensor (GPS) as well as a wireless network sensor, which provides data that can be correlated with known wireless networks to obtain location information. Multiple sensors may be arranged in a single physical housing, such as where a single device includes movement, temperature, magnetic, or other sensors. Such a housing also may be referred to as a sensor or a sensor device. For clarity, sensors are described with respect to the particular functions they perform or the particular physical hardware used, when such specification is necessary for understanding of the embodiments disclosed herein.
A sensor may include hardware in addition to the specific physical sensor that obtains information about the environment.
Sensors as disclosed herein may operate within a communication network, such as a conventional wireless network, or a sensor-specific network through which sensors may communicate with one another or with dedicated other devices. In some configurations one or more sensors may provide information to one or more other sensors, to a central controller, or to any other device capable of communicating on a network with the one or more sensors. A central controller may be general- or special-purpose. For example, one type of central controller is a home automation network that collects and analyzes data from one or more sensors within the home. Another example of a central controller is a special-purpose controller that is dedicated to a subset of functions, such as a security controller that collects and analyzes sensor data primarily or exclusively as it relates to various security considerations for a location. A central controller may be located locally with respect to the sensors with which it communicates and from which it obtains sensor data, such as in the case where it is positioned within a home that includes a home automation or sensor network. Alternatively or in addition, a central controller as disclosed herein may be remote from the sensors, such as where the central controller is implemented as a cloud-based system that communicates with multiple sensors, which may be located at multiple locations and may be local or remote with respect to one another.
Moreover, the smart-home environment may make inferences about which individuals live in the home and are therefore users and which electronic devices are associated with those individuals. As such, the smart-home environment may “learn” who is a user (e.g., an authorized user) and permit the electronic devices associated with those individuals to control the network-connected smart devices of the smart-home environment, in some embodiments including sensors used by or within the smart-home environment. Various types of notices and other information may be provided to users via messages sent to one or more user electronic devices. For example, the messages can be sent via email, short message service (SMS), multimedia messaging service (MMS), unstructured supplementary service data (USSD), as well as any other type of messaging services or communication protocols.
A smart-home environment may include communication with devices outside of the smart-home environment but within a proximate geographical range of the home. For example, the smart-home environment may communicate information through the communication network or directly to a central server or cloud-computing system regarding detected movement or presence of people, animals, and any other objects, and may receive back commands for controlling the lighting accordingly.
The foregoing description, for purposes of explanation, has been presented with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit embodiments of the disclosed subject matter to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to explain the principles of embodiments of the disclosed subject matter and their practical applications, and thereby to enable others skilled in the art to utilize those embodiments as well as various embodiments with various modifications as may be suited to the particular use contemplated.