The present description relates generally to electronic devices including, for example, efficient embedding for acoustic models.
Audio classification models can be trained to classify general categories of sounds using training datasets gathered by hundreds, thousands, or potentially millions of devices. However, audio classification can be computationally expensive to run in real time, particularly, for example, when attempting to classify multiple different sounds in an acoustic environment on an ongoing basis.
Certain features of the subject technology are set forth in the appended claims. However, for purpose of explanation, several embodiments of the subject technology are set forth in the following figures.
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and can be practiced using one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
Acoustic models, such as audio classification models, can be trained for detecting specific sounds. These acoustic models may be trained to generate learned embeddings of audio inputs and to generate labels of the audio inputs using the learned embeddings. However, when monitoring for multiple different sounds in an acoustic environment is desired, for example, it can be inefficient (e.g., in terms of power usage, memory usage, and/or processing resources) for each of the various acoustic models to repeatedly generate embeddings for detecting the multiple different sounds.
Aspects of the subject technology provide an efficient embedding system for acoustic models. In one or more implementations, an embeddings cache is provided that stores embeddings that can be provided to various different classifier models. The embeddings may be learned embeddings generated using a trained embeddings model, and may each correspond to an input audio sample. In accordance with one or more implementations, the embeddings may be stored in connection with an encoded version (e.g., a hash) of the corresponding input audio sample.
In accordance with one or more implementations, when a new input audio sample is obtained, if the embedding for that new input sample exists in the cache, the cached embedding can be provided to the downstream classification models (also referred to herein as sound detection models or detection models) for subsequent classification operations. If no embedding exists in the cache for the new input sample, a new embedding can be generated (e.g., and stored in the embeddings cache in association with a hash of the input sample).
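By way of a non-limiting illustration, the cache lookup described above can be sketched as follows. The class name `EmbeddingsCache`, the function `toy_embedding`, and the choice of SHA-256 as the hash are assumptions made for this example only; the description does not specify a hash function, and a trained embedding model would take the place of the placeholder.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]


class EmbeddingsCache:
    """Maps a hash of an audio sample to its embedding (hypothetical helper)."""

    def __init__(self, embed: Callable[[bytes], Embedding]) -> None:
        self._embed = embed  # stand-in for a trained embedding model
        self._store: Dict[str, Embedding] = {}
        self.misses = 0  # counts how often the embedding model actually ran

    @staticmethod
    def encode(sample: bytes) -> str:
        # SHA-256 is an assumed choice; the description only says "e.g., a hash".
        return hashlib.sha256(sample).hexdigest()

    def get_or_compute(self, sample: bytes) -> Embedding:
        key = self.encode(sample)
        if key not in self._store:
            # No cached embedding for this sample: generate and store one.
            self.misses += 1
            self._store[key] = self._embed(sample)
        return self._store[key]


def toy_embedding(sample: bytes) -> Embedding:
    # Placeholder for a learned embedding: mean and peak of the byte values.
    return [sum(sample) / len(sample), float(max(sample))]


cache = EmbeddingsCache(toy_embedding)
first = cache.get_or_compute(b"\x01\x02\x03")
second = cache.get_or_compute(b"\x01\x02\x03")  # same sample: served from the cache
third = cache.get_or_compute(b"\x09\x09\x09")   # new sample: embedding is generated
```

In this sketch only the hash and the embedding are retained; the raw sample bytes are not stored.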
The network environment 100 includes electronic devices 102, 103, 104, 105, 106 and 107 (hereinafter “the electronic devices 102-107”), a local area network (“LAN”) 108, a network 110, and one or more servers, such as server 114.
In one or more implementations, one, two, or more than two (e.g., all) of the electronic devices 102-107 may be associated with (e.g., registered to and/or signed into) a common account, such as an account (e.g., user account) with the server 114. As examples, the account may be an account of an individual user or a group account. In one or more implementations, the devices can be registered to different user accounts and the user accounts themselves may be grouped or otherwise associated with one another (e.g., user accounts of a family). As illustrated in
In one or more implementations, the electronic devices 102-107 may form part of a connected home environment 116, and the LAN 108 may communicatively (directly or indirectly) couple any two or more of the electronic devices 102-107 within the connected home environment 116. Moreover, the network 110 may communicatively (directly or indirectly) couple any two or more of the electronic devices 102-107 with the server 114, for example, in conjunction with the LAN 108. Electronic devices such as electronic device 106 and electronic device 105 may communicate directly over a secure direct connection in some scenarios, such as when electronic device 106 is in proximity to electronic device 105. Although the electronic devices 102-107 are depicted in
In one or more implementations, the LAN 108 may include one or more different network devices and/or network media and/or may utilize one or more different wireless and/or wired network technologies, such as Ethernet, optical, Wi-Fi, Bluetooth, Zigbee, Powerline over Ethernet, coaxial, Z-Wave, cellular, or generally any wireless and/or wired network technology that may communicatively couple two or more devices.
In one or more implementations, the network 110 may be an interconnected network of devices that may include, and/or may be communicatively coupled to, the Internet. For explanatory purposes, the network environment 100 is illustrated in
One or more of the electronic devices 102-107 may be, for example, a portable computing device such as a laptop computer, a smartphone, a smart speaker, a peripheral device (e.g., a digital camera, headphones), a digital media player, a tablet device, a wearable device such as a smartwatch or a band, a connected home device, such as a wireless camera, a router and/or wireless access point, a wireless access device, a smart thermostat, smart light bulbs, home security devices (e.g., motion sensors, door/window sensors, etc.), smart outlets, smart switches, and the like, or any other appropriate device that includes and/or is communicatively coupled to, for example, one or more wired or wireless interfaces, such as WLAN radios, cellular radios, Bluetooth radios, Zigbee radios, near field communication (NFC) radios, and/or other wireless radios.
By way of example, in
In one or more implementations, one or more of the electronic devices 102-107 may provide a system for training a machine learning model using training data, where the trained machine learning model is subsequently deployed to that electronic device and/or another one of the electronic devices 102-107. Further, the electronic device 106 may provide one or more machine learning frameworks for training machine learning models and/or developing applications using such machine learning models. In an example, such machine learning frameworks can provide various machine learning algorithms and models for different problem domains in machine learning. In an example, one or more of the electronic devices 102-107 may include a deployed machine learning model that provides an output of data corresponding to a prediction or transformation or some other type of machine learning output.
As shown in
In one or more implementations, one or more of the electronic devices 102-107 may be configured to detect one or more specific sounds (e.g., the sound of the doorbell 123, a sound associated with an appliance 121, a sound of an object or device in operation or ceasing operation, a smoke alarm sound, a fire alarm sound, a carbon monoxide alarm sound, or the sound of a pet 125) and to generate an alert, a notification, or other output when a specific sound is detected. For example, one or more of the electronic devices 102-107 may include one or more machine-learning models trained as sound classifiers. For example, one or more of the electronic devices 102-107 may include a pre-trained general sound classifier trained at another device or server and deployed to the electronic device (e.g., for general detection of general sounds, and which may not be able to detect the specific sounds generated in a specific acoustic environment). As another example, one or more of the electronic devices 102-107 may use one or more detection models (also referred to as classification models or sound detection models), trained at that electronic device using audio samples obtained by that electronic device and/or one or more others of the electronic devices 102-107.
In one or more implementations, one or more of the electronic devices 102-107 may include an embedding model configured to generate learned embeddings from audio inputs to the embedding model. In one or more implementations, one or more of the electronic devices 102-107 may include an embeddings cache that stores one or more learned embeddings. The learned embeddings may be stored in connection with respective encoded versions of the audio inputs from which the learned embeddings were generated. By storing encoded versions of the audio inputs and/or learned embeddings of the audio inputs, without storing unencoded audio samples, the privacy of users of the electronic devices 102-107 and/or other persons in the environment of the electronic devices 102-107 can be protected. For example, the encoded versions of the audio samples and/or the learned embeddings can be unrecognizable to a human eye or ear, and thus cannot be used to identify individuals or voices that may be present in the acoustic environment of the electronic devices 102-107. In one or more implementations, learned embeddings may be generated by one or more of the electronic devices 102-107 and provided to one or more others of the electronic devices 102-107, as described in further detail hereinafter. In some aspects, to protect the user's privacy, the encoded version of the audio inputs and/or learned embeddings are only stored locally on electronic devices 102-107, without any back-ups to remote servers. Moreover, because the embedding model is specifically trained to classify sounds of objects, the embedding model that generates the learned embeddings may be unable to extract the identity of a speaker or spoken words (e.g., which could only be identified by a different kind of model with different training data and objectives), thereby further protecting the user's privacy.
In one or more implementations, the server 114 may be configured to perform operations in association with user accounts such as: storing data (e.g., user settings/preferences, files such as documents and/or photos, etc.) with respect to user accounts, sharing and/or sending data with other users with respect to user accounts, backing up device data with respect to user accounts, and/or associating devices and/or groups of devices with user accounts.
One or more of the servers such as the server 114 may be, and/or may include all or part of the device discussed below with respect to
The device 200 may include a processor 202, a memory 204, a communication interface 206, an input device 208, and an output device 210. The processor 202 may include suitable logic, circuitry, and/or code that enable processing data and/or controlling operations of the device 200. In this regard, the processor 202 may be enabled to provide control signals to various other components of the device 200. The processor 202 may also control transfers of data between various portions of the device 200. Additionally, the processor 202 may enable implementation of an operating system or otherwise execute code to manage operations of the device 200.
The memory 204 may include suitable logic, circuitry, and/or code that enable storage of various types of information such as received data, generated data, code, and/or configuration information. The memory 204 may include, for example, random access memory (RAM), read-only memory (ROM), flash, and/or magnetic storage.
In one or more implementations, in a case where the device 200 corresponds to one of the electronic devices 102-107, the memory 204 may store one or more sound detection models, encoded versions of audio inputs or sounds, learned embeddings of one or more audio inputs or sounds, and/or information associated with one or more user accounts for one or more applications and/or services, using data stored locally in memory 204. Moreover, the input device 208 may include suitable logic, circuitry, and/or code for capturing input, such as audio input, sound input, remote control input, touchscreen input, keyboard input, etc. The output device 210 may include suitable logic, circuitry, and/or code for generating notifications, alerts and/or other output, such as audio output, display output, light output, and/or haptic and/or other tactile output (e.g., vibrations, taps, etc.).
The communication interface 206 may include suitable logic, circuitry, and/or code that enables wired or wireless communication, such as between any of the electronic devices 102-107 and/or the server 114 over the network 110 (e.g., in conjunction with the LAN 108). The communication interface 206 may include, for example, one or more of a Bluetooth communication interface, a cellular interface, an NFC interface, a Zigbee communication interface, a WLAN communication interface, a USB communication interface, or generally any communication interface.
In one or more implementations, one or more of the processor 202, the memory 204, the communication interface 206, the input device 208, and/or one or more portions thereof, may be implemented in software (e.g., subroutines and code), may be implemented in hardware (e.g., an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a state machine, gated logic, discrete hardware components, or any other suitable devices) and/or a combination of both.
The pre-processing engine 300 may perform one or more pre-processing operations on the audio signal. For example, the pre-processing operations may include performing a frequency transform (e.g., a Fourier transform) that transforms an audio signal in the time domain to an audio signal in a frequency domain. In one or more implementations, the pre-processing operations may include combining frequency domain signals to generate a spectrogram.
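As a non-limiting sketch of the pre-processing described above, the following example transforms short time-domain frames into the frequency domain and stacks the resulting magnitude spectra into a spectrogram. The function names, the frame length, and the hop size are assumptions for illustration; a naive discrete Fourier transform is used only to keep the example self-contained, and a practical implementation would likely use a fast Fourier transform with a window function.

```python
import cmath
import math
from typing import List


def dft_magnitudes(frame: List[float]) -> List[float]:
    """Magnitudes of the non-negative-frequency bins of a naive O(n^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Stack per-frame spectra; rows are time frames, columns are frequency bins."""
    return [
        dft_magnitudes(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# A pure tone with 2 cycles per 8-sample frame should peak in frequency bin 2.
tone = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]
spec = spectrogram(tone, frame_len=8, hop=8)
peak_bins = [row.index(max(row)) for row in spec]
```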
In the example of
In the example of
In the example of
In one or more implementations, each of the detection models 402 may be configured to receive a number M of the N stored embeddings in the embeddings cache 404 and to generate a corresponding detection output based on the number M of the stored embeddings. In this way, real-time sound detection can be performed in parallel by multiple sound detectors (e.g., detection models 402) at the electronic device 106 using a desired number of recent embeddings, without each of the detection models having to compute the embeddings for the incoming real-time audio inputs.
In one or more implementations, when a new embedding is generated by the embedding model 400 for storage in the embeddings cache 404, the oldest stored embedding in the embeddings cache 404 may be deleted from the electronic device 106 and replaced by the new embedding. In this way, the electronic device 106 can, in a manner that protects the privacy of the user of the electronic device and/or any other people that may be in the vicinity of the electronic device 106, generate an embeddings cache 404 that may be used for efficient sound detection by multiple sound detectors (e.g., detection models 402) at the electronic device 106. For example, by storing the learned embeddings without storing the corresponding audio inputs, and by continually deleting older embeddings, the electronic device 106 can be configured to efficiently provide embeddings to the detection models without permanently storing user-identifiable data associated with audio inputs. Moreover, deleting the older embeddings can also be advantageous in freeing memory resources, which can also reduce power consumption and improve battery life.
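For purposes of illustration only, a cache of the N most recent embeddings that is shared by several detectors, each consuming its own number M of embeddings, might be sketched as follows. The detector names, thresholds, and the value N = 4 are hypothetical, and the toy detectors are simple functions rather than trained models. The detectors are invoked one after another here for brevity, whereas the description contemplates running them in parallel.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Embedding = List[float]
# Each detector is a pair (M, function); M >= 1 is how many recent embeddings it consumes.
Detector = Tuple[int, Callable[[List[Embedding]], bool]]

N = 4  # assumed cache size
cache: Deque[Embedding] = deque(maxlen=N)  # appending to a full deque drops the oldest entry


def run_detectors(detectors: Dict[str, Detector]) -> Dict[str, bool]:
    """Give each detector its own M most recent cached embeddings."""
    recent = list(cache)
    return {name: detect(recent[-m:]) for name, (m, detect) in detectors.items()}


detectors: Dict[str, Detector] = {
    "short_burst": (2, lambda embs: all(e[0] > 0.5 for e in embs)),
    "sustained": (4, lambda embs: sum(e[0] for e in embs) / len(embs) > 0.6),
}

# Five embeddings arrive; only the last N = 4 remain in the cache.
for value in [0.3, 0.0, 0.0, 1.0, 1.0]:
    cache.append([value])

outputs = run_detectors(detectors)
```

The embedding is computed once per input and read by every detector, which is the saving the passage above describes.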
In addition to, or alternatively to, simply providing a number M of the most recent embeddings stored in the embeddings cache 404 to multiple detection models 402, the electronic device 106 can also selectively provide stored embeddings to one or more detection models 402 at the device when the electronic device determines that an embedding of a new audio input or sound input already exists in the embeddings cache. This can be useful for real-time sound detection operations and/or offline sound detection operations, such as for detecting sounds in stored audio files (e.g., including video files with audio content).
As shown in
As shown, the embedding model 400 may generate a learned embedding of the audio input and provide the learned embedding to the detection model 402. As in
As shown in
In this way, the electronic device 106 can, in a manner that protects the privacy of the user of the electronic device and/or any other people that may be in the vicinity of the electronic device 106, generate an embeddings cache 404 that may be used for real-time and/or subsequent efficient sound detection by the electronic device 106. By storing the encoded versions of audio inputs in connection with the learned embeddings for those inputs, the electronic device 106 can be configured to efficiently provide embeddings to the detection model, when sounds for which embeddings have already been generated are received in future sound input by the input device 208.
As illustrated in
As shown, the pre-processing engine 300 may generate an encoded version of the second audio sample. The pre-processing engine 300 may provide the encoded version of the second audio sample to the comparator 403. In this example, the comparator 403 may compare the encoded version of the second audio sample with the encoded version of the first audio sample. For example, the comparator 403 may receive the encoded version of the second audio sample, obtain the encoded version of the first audio sample and any other encoded versions of other previous audio samples that are stored in the embeddings cache 404, and compare the encoded version of the second audio sample with the encoded version of the first audio sample and/or any other obtained encoded versions of other previous audio samples.
In the example of
In the example of
In the example of
Although the examples of
At block 1002, a device (e.g., electronic device 106) at which a learned embedding of a first audio sample is stored in connection with an encoded version of the first audio sample (e.g., in an embeddings cache such as embeddings cache 404), may generate an encoded version of a second audio sample. For example, the audio sample may be a sample sound input obtained by a microphone (e.g., a microphone 152) that is installed in the device or a microphone that is communicatively coupled (e.g., by a wired or wireless connection) to the device. In various use cases, the sound input may correspond to the sound of an appliance, a pet, a siren, an alarm, or another sound in an acoustic scene or environment around the electronic device. In various use cases, the sound input may include a sound that was previously enrolled for detection by the electronic device 106 by training a detection model at the electronic device. As an example, the encoded version of the first audio sample may be a hash of the first audio sample or any other encoding of the first audio sample. For example, generating the encoded version of the first audio sample may include generating a hash of the first audio sample. As an example, the encoded version of the second audio sample may be a hash of the second audio sample or any other encoding of the second audio sample. In various implementations, the sound input may be converted (e.g., transformed) into frequency space prior to encoding (e.g., hashing) of the sound input.
At block 1004, the device (e.g., a comparator 403) may compare the encoded version of the second audio sample with the encoded version of the first audio sample. For example, comparing the encoded version of the second audio sample with the encoded version of the first audio sample may include performing a hash comparison of the encoded version of the second audio sample and the encoded version of the first audio sample (e.g., by comparing corresponding values of hashes of the first audio sample and the second audio sample, and/or by determining whether the encoded version of the first audio sample and the encoded version of the second audio sample contain the same number of keys and whether each of one or more key-value pairs of the encoded version of the first audio sample is equal to the corresponding elements in the encoded version of the second audio sample).
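The key-count and key-value comparison mentioned above can be illustrated, under stated assumptions, by the following sketch. It assumes that each encoded version is represented as a mapping from keys to hash strings; the key names `band_low` and `band_high` and the function name `encodings_match` are hypothetical and are not taken from the description.

```python
from typing import Dict


def encodings_match(first: Dict[str, str], second: Dict[str, str]) -> bool:
    """Return True when both encodings have the same keys mapped to equal values."""
    # Different numbers of keys means the encodings cannot match.
    if len(first) != len(second):
        return False
    # Every key-value pair of the first encoding must appear in the second.
    return all(key in second and second[key] == value for key, value in first.items())


stored = {"band_low": "a3f1", "band_high": "09bc"}
same = {"band_high": "09bc", "band_low": "a3f1"}       # same pairs, different order
changed = {"band_low": "a3f1", "band_high": "ffff"}    # one value differs
shorter = {"band_low": "a3f1"}                         # fewer keys

results = [
    encodings_match(stored, same),
    encodings_match(stored, changed),
    encodings_match(stored, shorter),
]
```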
At block 1006, the device may, responsive to a determination that the encoded version of the second audio sample matches the encoded version of the first audio sample, provide the learned embedding of the first audio sample to a first machine learning model (e.g., a detection model 402) at the device (e.g., as discussed herein in connection with
At block 1008, responsive to a determination that the encoded version of the second audio sample is different from the encoded version of the first audio sample, the device may generate, using a second machine learning model (e.g., the embedding model 400), a learned embedding of the second audio sample and provide the learned embedding of the second audio sample to the first machine learning model (e.g., the detection model). In one or more implementations, the device may also store a label, generated by the first machine learning model, for the first audio sample in connection with the encoded version of the first audio sample. In one or more implementations, the stored label and the stored encoded version of the first audio sample can be used to identify another audio signal containing the same sound as the first audio sample, without again operating the detection model.
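As one hedged illustration of storing a label in connection with an encoded version, the following sketch returns a stored label when the hash of a new sample matches a stored hash, so that neither the embedding model nor the detection model is operated again. The class name `LabelCache`, the toy models, the threshold, and the use of SHA-256 are assumptions for this example.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]


class LabelCache:
    """Stores detection labels keyed by a hash of the audio sample (hypothetical helper)."""

    def __init__(
        self,
        embed: Callable[[bytes], Embedding],
        detect: Callable[[Embedding], str],
    ) -> None:
        self._embed = embed    # stand-in for the embedding model
        self._detect = detect  # stand-in for a detection model
        self._labels: Dict[str, str] = {}
        self.detector_runs = 0

    def label_for(self, sample: bytes) -> str:
        key = hashlib.sha256(sample).hexdigest()  # assumed encoding
        if key in self._labels:
            return self._labels[key]  # match: reuse the stored label
        self.detector_runs += 1
        label = self._detect(self._embed(sample))
        self._labels[key] = label
        return label


def toy_embed(sample: bytes) -> Embedding:
    return [float(max(sample))]


def toy_detect(embedding: Embedding) -> str:
    return "alarm" if embedding[0] > 100 else "background"


labels = LabelCache(toy_embed, toy_detect)
a = labels.label_for(b"\xff\x10")
b = labels.label_for(b"\xff\x10")  # identical sample: label comes from the cache
c = labels.label_for(b"\x01\x02")
```

An exact hash match only identifies byte-identical encodings, so this sketch applies to inputs whose encoded versions repeat exactly.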
In one or more implementations, the process 1000 may also include, responsive to the determination that the encoded version of the second audio sample is different from the encoded version of the first audio sample, storing the learned embedding of the second audio sample (e.g., in the embeddings cache 404), generating an encoded version of the second audio sample, and storing the learned embedding of the second audio sample in connection with the encoded version of the second audio sample (e.g., in the embeddings cache 404).
In one or more implementations, prior to generating the encoded version of the second audio sample at block 1002, the device may obtain, from a microphone of the device, the first audio sample, and generate, using the second machine learning model at the device, the learned embedding of the first audio sample. The device (e.g., pre-processing engine 300) may also generate the encoded version of the first audio sample. The device may also store the encoded version of the first audio sample and the learned embedding of the first audio sample (e.g., in an embeddings cache 404) at the device.
In one or more implementations, the process 1000 may also include, prior to generating the encoded version of the second audio sample at block 1002, providing the learned embedding of the first audio sample to a third machine learning model (e.g., the same or another detection model 402) at the device, and obtaining, by the device, a label (e.g., a detection output) for the first audio sample based on an output of the third machine learning model. For example, the third machine learning model may be the same as the first machine learning model or may be different from the first machine learning model. In one illustrative example, the first machine learning model may be a fire alarm detector (e.g., may be a neural network that has been trained as a fire alarm detector) and the third machine learning model may be a carbon monoxide alarm detector (e.g., may be a neural network that has been trained as a carbon monoxide alarm detector).
In one or more implementations, the process 1000 may also include deleting, after a period of time, the learned embedding of the first audio sample and the encoded version of the first audio sample from the device. For example, the device may include an embeddings cache, such as embeddings cache 404, that is managed as a rolling or loop buffer in which, once a predetermined number of learned embeddings are stored in the cache, a new incoming embedding causes the oldest embedding in the cache to be deleted from the cache. In another example, the device may include an embeddings cache, such as embeddings cache 404, that is managed as an LRU buffer in which, once a predetermined number of learned embeddings are stored in the cache, a new incoming embedding causes a least recently used embedding in the cache to be deleted from the cache. In this way, the privacy of the user of the electronic device and/or any persons in the vicinity of the electronic device during acquisition of audio samples can be protected by preventing long-term storage of user-identifiable information relating to audio samples (e.g., in addition to the privacy protections provided by storing the encoded versions and learned embeddings of audio samples, rather than storing the audio samples themselves).
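The LRU management described above can be sketched, as a non-limiting example, using an ordered mapping in which each lookup moves the entry to the most recently used position. The class name `LRUEmbeddingsCache`, the capacity of two entries, and the short string keys standing in for hashes are assumptions for illustration.

```python
from collections import OrderedDict
from typing import List, Optional

Embedding = List[float]


class LRUEmbeddingsCache:
    """Fixed-capacity cache that evicts the least recently used embedding."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._entries: "OrderedDict[str, Embedding]" = OrderedDict()

    def get(self, key: str) -> Optional[Embedding]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, embedding: Embedding) -> None:
        self._entries[key] = embedding
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # delete the least recently used entry

    def keys(self) -> List[str]:
        return list(self._entries)


lru = LRUEmbeddingsCache(capacity=2)
lru.put("hash_a", [0.1])
lru.put("hash_b", [0.2])
lru.get("hash_a")         # "hash_a" is now more recently used than "hash_b"
lru.put("hash_c", [0.3])  # cache is full, so "hash_b" is deleted
```

By contrast, a rolling buffer would have deleted "hash_a" here, because it was the oldest entry regardless of use.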
In one or more implementations, the process 1000 and/or another process performed by the device may include continually updating an embeddings cache (e.g., embeddings cache 404) at the electronic device to include a number N of recent learned embeddings generated by an embedding model (e.g., embedding model 400) at the device, and providing one or more of the N recent learned embeddings to multiple detection models (e.g., detection models 402) at the electronic device and/or another device (e.g., with or without storing and/or comparing encoded versions of the audio inputs from which the embeddings were generated).
As described above, one aspect of the present technology is the gathering and use of data available from specific and legitimate sources for generating and using an embeddings cache, such as for sound detection. The present disclosure contemplates that in some instances, this gathered data may include personal information data that uniquely identifies or can be used to identify a specific person. Such personal information data can include audio data, voice data, demographic data, location-based data, online identifiers, telephone numbers, email addresses, home addresses, encryption information, data or records relating to a user's health or level of fitness (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
The present disclosure recognizes that the use of such personal information data, in the present technology, can be used to the benefit of users. For example, the personal information data can be used for generating and using an embeddings cache, such as for sound detection. Accordingly, use of such personal information data may facilitate authentication operations. Further, other uses for personal information data that benefit the user are also contemplated by the present disclosure. For instance, health and fitness data may be used, in accordance with the user's preferences to provide insights into their general wellness, or may be used as positive feedback to individuals using technology to pursue wellness goals.
The present disclosure contemplates that those entities responsible for the collection, analysis, disclosure, transfer, storage, or other use of such personal information data will comply with well-established privacy policies and/or privacy practices. In particular, such entities would be expected to implement and consistently apply privacy practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. Such information regarding the use of personal data should be prominently and easily accessible by users, and should be updated as the collection and/or use of data changes. Personal information from users should be collected for legitimate uses only. Further, such collection/sharing should occur only after receiving the consent of the users or other legitimate basis specified in applicable law. Additionally, such entities should consider taking any needed steps for safeguarding and securing access to such personal information data and ensuring that others with access to the personal information data adhere to their privacy policies and procedures. Further, such entities can subject themselves to evaluation by third parties to certify their adherence to widely accepted privacy policies and practices. In addition, policies and practices should be adapted for the particular types of personal information data being collected and/or accessed and adapted to applicable laws and standards, including jurisdiction-specific considerations which may serve to impose a higher standard. For instance, in the US, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
Despite the foregoing, the present disclosure also contemplates embodiments in which users selectively block the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data. For example, in the case of generating and using an embeddings cache, such as for sound detection, the present technology can be configured to allow users to select to “opt in” or “opt out” of participation in the collection of personal information data during registration for services or anytime thereafter. In addition to providing “opt in” and “opt out” options, the present disclosure contemplates providing notifications relating to the access or use of personal information. For instance, a user may be notified upon downloading an app that their personal information data will be accessed and then reminded again just before personal information data is accessed by the app.
Moreover, it is the intent of the present disclosure that personal information data should be managed and handled in a way to minimize risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting data once it is no longer needed. In addition, and when applicable, including in certain health related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, when appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
Therefore, although the present disclosure broadly covers use of personal information data to implement one or more various disclosed embodiments, the present disclosure also contemplates that the various embodiments can also be implemented without the need for accessing such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data.
The bus 1108 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 1100. In one or more implementations, the bus 1108 communicatively connects the one or more processing unit(s) 1112 with the ROM 1110, the system memory 1104, and the permanent storage device 1102. From these various memory units, the one or more processing unit(s) 1112 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure. The one or more processing unit(s) 1112 can be a single processor or a multi-core processor in different implementations.
The ROM 1110 stores static data and instructions that are needed by the one or more processing unit(s) 1112 and other modules of the electronic system 1100. The permanent storage device 1102, on the other hand, may be a read-and-write memory device. The permanent storage device 1102 may be a non-volatile memory unit that stores instructions and data even when the electronic system 1100 is off. In one or more implementations, a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as the permanent storage device 1102.
In one or more implementations, a removable storage device (such as a floppy disk, flash drive, and its corresponding disk drive) may be used as the permanent storage device 1102. Like the permanent storage device 1102, the system memory 1104 may be a read-and-write memory device. However, unlike the permanent storage device 1102, the system memory 1104 may be a volatile read-and-write memory, such as random access memory. The system memory 1104 may store any of the instructions and data that one or more processing unit(s) 1112 may need at runtime. In one or more implementations, the processes of the subject disclosure are stored in the system memory 1104, the permanent storage device 1102, and/or the ROM 1110. From these various memory units, the one or more processing unit(s) 1112 retrieves instructions to execute and data to process in order to execute the processes of one or more implementations.
The bus 1108 also connects to the input and output device interfaces 1114 and 1106. The input device interface 1114 enables a user to communicate information and select commands to the electronic system 1100. Input devices that may be used with the input device interface 1114 may include, for example, microphones, alphanumeric keyboards, touchscreens, touchpads, and pointing devices (also called “cursor control devices”). The output device interface 1106 may enable, for example, the display of images generated by electronic system 1100. Output devices that may be used with the output device interface 1106 may include, for example, speakers, printers and display devices, such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flexible display, a flat panel display, a solid state display, a projector, a light source, a haptic component, or any other device for outputting information. One or more implementations may include devices that function as both input and output devices, such as a touchscreen. In these implementations, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in
In accordance with aspects of the disclosure, a method is provided that includes generating, by a device at which a learned embedding of a first audio sample is stored in connection with an encoded version of the first audio sample, an encoded version of a second audio sample; comparing, by the device, the encoded version of the second audio sample with the encoded version of the first audio sample; responsive to a determination that the encoded version of the second audio sample matches the encoded version of the first audio sample, providing the learned embedding of the first audio sample to a first machine learning model at the device; and responsive to a determination that the encoded version of the second audio sample is different from the encoded version of the first audio sample, generating, using the first machine learning model, a learned embedding of the second audio sample and providing the learned embedding of the second audio sample to the first machine learning model.
In accordance with aspects of the disclosure, an electronic device is provided that includes a memory storing a learned embedding of a first audio sample in connection with an encoded version of the first audio sample; and one or more processors configured to: generate an encoded version of a second audio sample; compare the encoded version of the second audio sample with the encoded version of the first audio sample; responsive to a determination that the encoded version of the second audio sample matches the encoded version of the first audio sample, provide the learned embedding of the first audio sample to a first machine learning model at the device; and responsive to a determination that the encoded version of the second audio sample is different from the encoded version of the first audio sample, generate, using the first machine learning model, a learned embedding of the second audio sample and provide the learned embedding of the second audio sample to the first machine learning model.
In accordance with aspects of the disclosure, a non-transitory computer-readable medium is provided storing instructions which, when executed by one or more processors, cause the one or more processors to perform operations that include generating, by a device at which a learned embedding of a first audio sample is stored in connection with an encoded version of the first audio sample, an encoded version of a second audio sample; comparing, by the device, the encoded version of the second audio sample with the encoded version of the first audio sample; responsive to a determination that the encoded version of the second audio sample matches the encoded version of the first audio sample, providing the learned embedding of the first audio sample to a first machine learning model at the device; and responsive to a determination that the encoded version of the second audio sample is different from the encoded version of the first audio sample, generating, using the first machine learning model, a learned embedding of the second audio sample and providing the learned embedding of the second audio sample to the first machine learning model.
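The match-or-regenerate flow recited in the preceding aspects can be illustrated with a minimal sketch. The sketch rests on assumptions that are not part of the disclosure: the “encoded version” of an audio sample is taken to be a SHA-256 digest of its raw bytes, and the names encode, EmbeddingCache, and toy_embed are hypothetical, with toy_embed standing in for the embedding stage of the first machine learning model.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]


def encode(sample: bytes) -> str:
    """Assumed 'encoded version' of an audio sample: a SHA-256 hex digest."""
    return hashlib.sha256(sample).hexdigest()


class EmbeddingCache:
    """Stores learned embeddings keyed by the encoded version of each sample."""

    def __init__(self, embed_fn: Callable[[bytes], Embedding]) -> None:
        self._embed_fn = embed_fn          # hypothetical embedding stage of the model
        self._store: Dict[str, Embedding] = {}
        self.misses = 0                    # number of times the model had to run

    def get_embedding(self, sample: bytes) -> Embedding:
        key = encode(sample)
        cached = self._store.get(key)
        if cached is not None:
            # Encoded versions match: reuse the stored embedding.
            return cached
        # Encoded versions differ: generate a new embedding and store it.
        self.misses += 1
        embedding = self._embed_fn(sample)
        self._store[key] = embedding
        return embedding


def toy_embed(sample: bytes) -> Embedding:
    """Placeholder embedding (mean byte value, length); not a real acoustic model."""
    n = max(len(sample), 1)
    return [sum(sample) / n, float(len(sample))]


if __name__ == "__main__":
    cache = EmbeddingCache(toy_embed)
    first = cache.get_embedding(b"\x01\x02\x03")
    second = cache.get_embedding(b"\x01\x02\x03")   # served from the cache
    print(first is second, cache.misses)
```

In this sketch, an exact digest match is only one possible notion of “matches”; an implementation could instead compare encoded versions under a similarity threshold, in which case the lookup would not be a plain dictionary access.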
Implementations within the scope of the present disclosure can be partially or entirely realized using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) encoding one or more instructions. The tangible computer-readable storage medium also can be non-transitory in nature.
The computer-readable storage medium can be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium can include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer-readable medium also can include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
Further, the computer-readable storage medium can include any non-semiconductor memory, such as optical disk storage, magnetic disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium can be directly coupled to a computing device, while in other implementations, the tangible computer-readable storage medium can be indirectly coupled to a computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
Instructions can be directly executable or can be used to develop executable instructions. For example, instructions can be realized as executable or non-executable machine code or as instructions in a high-level language that can be compiled to produce executable or non-executable machine code. Further, instructions also can be realized as or can include data. Computer-executable instructions also can be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, etc. As recognized by those of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions can vary significantly without varying the underlying logic, function, processing, and output.
While the above discussion primarily refers to microprocessor or multi-core processors that execute software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions that are stored on the circuit itself.
Those of skill in the art would appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. Various components and blocks may be arranged differently (e.g., arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
It is understood that any specific order or hierarchy of blocks in the processes disclosed is an illustration of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this application, the terms “base station”, “receiver”, “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms “display” or “displaying” mean displaying on an electronic device.
As used herein, the phrase “at least one of” preceding a series of items, with the term “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase “at least one of” does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicate words “configured to”, “operable to”, and “programmed to” do not imply any particular tangible or intangible modification of a subject, but, rather, are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation. Likewise, a processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and alike are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration”. Any embodiment described herein as “exemplary” or as an “example” is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, to the extent that the term “include”, “have”, or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for”.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more”. Unless specifically stated otherwise, the term “some” refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.