The subject matter disclosed herein generally relates to the processing of data. Specifically, the present disclosure addresses systems and methods to facilitate audio fingerprinting.
Audio information (e.g., sounds, speech, music, or any suitable combination thereof) may be represented as digital data (e.g., electronic, optical, or any suitable combination thereof). For example, a piece of music, such as a song, may be represented by audio data, and such audio data may be stored, temporarily or permanently, as all or part of a file (e.g., a single-track audio file or a multi-track audio file). In addition, such audio data may be communicated as all or part of a stream of data (e.g., a single-track audio stream or a multi-track audio stream).
Some embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.
Example methods and systems are directed to generating and utilizing one or more audio fingerprints. Examples merely typify possible variations. Unless explicitly stated otherwise, components and functions are optional and may be combined or subdivided, and operations may vary in sequence or be combined or subdivided. In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of example embodiments. It will be evident to one skilled in the art, however, that the present subject matter may be practiced without these specific details.
A machine (e.g., an audio processing machine) may form all or part of an audio fingerprinting system, and such a machine may be configured (e.g., by software modules) to generate one or more audio fingerprints of one or more segments of audio data. According to various example embodiments, the machine may access audio data to be fingerprinted and divide the audio data into segments (e.g., overlapping segments). For any given segment (e.g., for each segment), the machine may generate a spectral representation (e.g., spectrogram) from the segment of audio data; generate a vector (e.g., a sparse binary vector) from the spectral representation; generate an ordered set of permutations of the vector; generate an ordered set of numbers from the permutations of the vector; and generate a fingerprint of the segment of the audio data (e.g., a sub-fingerprint of the audio data).
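By way of illustration only, the division of audio data into overlapping segments might be sketched as follows. The function name `split_into_segments`, its parameters `segment_len` and `hop`, and the choice to drop trailing samples that do not fill a whole segment are assumptions of this sketch and are not taken from the description above.

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float],
                        segment_len: int,
                        hop: int) -> List[Sequence[float]]:
    """Return segments of `segment_len` samples, one starting every `hop` samples.

    Adjacent segments overlap whenever hop < segment_len. Trailing samples
    that do not fill a whole segment are dropped (an assumption of this sketch).
    """
    if segment_len <= 0 or hop <= 0:
        raise ValueError("segment_len and hop must be positive")
    last_start = len(samples) - segment_len
    # range() is empty when the audio is shorter than one segment.
    return [samples[start:start + segment_len]
            for start in range(0, last_start + 1, hop)]
```

For example, with `segment_len=4` and `hop=2`, each segment shares two samples with the segment that follows it.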
In addition, the machine (e.g., the audio processing machine) may form all or part of an audio identification system, and the machine may be configured (e.g., by software modules) to determine a likelihood that candidate audio data (e.g., an unidentified song submitted as a candidate to be identified) matches reference audio data (e.g., a known song). According to various example embodiments, the machine may access the candidate audio data and the reference audio data, and the machine may generate fingerprints from multiple segments of each. For example, the machine may generate first and second reference fingerprints from first and second segments of the reference audio data, and the machine may generate first and second candidate fingerprints from first and second segments of the candidate audio data. Based on these four fingerprints (e.g., based on at least these four fingerprints), the machine may determine a likelihood that the candidate audio data matches the reference audio data and cause a device (e.g., user device) to present the determined likelihood (e.g., as a response to a query from a user).
The database 115 may store one or more pieces of audio data (e.g., for access by the audio processing machine 110). The database 115 may store one or more pieces of reference audio data (e.g., audio files, such as songs, that have been previously identified), candidate audio data (e.g., audio files of songs having unknown identity, for example, submitted by users as candidates for identification), or any suitable combination thereof.
The audio processing machine 110 may be configured to access audio data from the database 115, from the device 130, from the device 150, or any suitable combination thereof. One or both of the devices 130 and 150 may store one or more pieces of audio data (e.g., reference audio data, candidate audio data, or both). The audio processing machine 110, with or without the database 115, may form all or part of a network-based system 105. For example, the network-based system 105 may be or include a cloud-based audio processing system (e.g., a cloud-based audio identification system).
Also shown in
Any of the machines, databases, or devices shown in
The network 190 may be any network that enables communication between or among machines, databases, and devices (e.g., the audio processing machine 110 and the device 130). Accordingly, the network 190 may be a wired network, a wireless network (e.g., a mobile or cellular network), or any suitable combination thereof. The network 190 may include one or more portions that constitute a private network, a public network (e.g., the Internet), or any suitable combination thereof. Accordingly, the network 190 may include one or more portions that incorporate a local area network (LAN), a wide area network (WAN), the Internet, a mobile telephone network (e.g., a cellular network), a wired telephone network (e.g., a plain old telephone system (POTS) network), a wireless data network (e.g., WiFi network or WiMax network), or any suitable combination thereof. Any one or more portions of the network 190 may communicate information via a transmission medium. As used herein, “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying instructions for execution by a machine, and includes digital or analog communication signals or other intangible media to facilitate communication of such software.
The audio processing machine 110 is shown as including a frequency module 210, a vector module 220, a scrambler module 230, a coder module 240, a fingerprint module 250, and a match module 260, all configured to communicate with each other (e.g., via a bus, shared memory, or a switch). Any one or more of the modules described herein may be implemented using hardware (e.g., a processor of a machine) or a combination of hardware and software. For example, any module described herein may configure a processor to perform the operations described herein for that module. Moreover, any two or more of these modules may be combined into a single module, and the functions described herein for a single module may be subdivided among multiple modules. Furthermore, according to various example embodiments, modules described herein as being implemented within a single machine, database, or device may be distributed across multiple machines, databases, or devices.
As shown by a curved arrow in the upper portion of
As shown by a curved arrow in the lower portion of
As shown in
As shown in
As shown in
As shown in
In operation 710, the frequency module 210 generates the spectral representation 320 of the segment 310 of the audio data 300. As noted above, the spectral representation 320 indicates energy values for a set of frequencies (e.g., frequency bins).
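As a simplified, non-limiting sketch of operation 710, the energy value for each frequency bin of a single segment could be computed with a direct discrete Fourier transform. The function name `spectral_energies` is hypothetical, and the use of one transform per segment (rather than a full spectrogram) and of squared magnitude as the energy value are assumptions of this sketch. A practical system would likely use a fast Fourier transform instead of this quadratic-time loop.

```python
import cmath
import math
from typing import List, Sequence


def spectral_energies(segment: Sequence[float]) -> List[float]:
    """Energy (squared magnitude) per frequency bin of a naive DFT.

    Only bins 0 .. N//2 are returned, because the spectrum of a
    real-valued signal is symmetric about N/2.
    """
    n = len(segment)
    energies = []
    for k in range(n // 2 + 1):
        # Correlate the segment with a complex sinusoid at bin k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(segment))
        energies.append(abs(coeff) ** 2)
    return energies
```

For instance, a segment holding exactly three cycles of a sine wave would be expected to show its largest energy value in bin 3.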
In operation 720, the vector module 220 generates the vector 400 from the spectral representation 320 generated in operation 710. As noted above, the vector 400 may be a sparse vector, binary vector, or both. Moreover, as described above with respect to
In operation 730, the scrambler module 230 generates the ordered set 410 of permutations of the vector 400. As noted above, with respect to
In operation 740, the coder module 240 generates the ordered set 420 of numbers from the ordered set 410 of permutations of the vector 400. As noted above with respect to
In operation 750, the fingerprint module 250 generates the fingerprint 560 of the segment 310 of the audio data 300. The generating of the fingerprint 560 may be based on the ordered set 420 of numbers generated in operation 740. As noted above with respect to
As shown in
In operation 810, the vector module 220 multiplies each energy value in the spectral representation 320 by a corresponding weight factor. The weight factor for an energy value may be determined based on a position (e.g., ordinal position) of the energy value's corresponding frequency (e.g., frequency bin) within a set of frequencies represented in the spectral representation 320. As noted above with respect to
In operation 812, the vector module 220 determines a representative group of highest energy values (e.g., top X energy values, such as the top 0.5% energy values or the top four energy values) from the upper portion 324 of the spectral representation 320 (e.g., weighted as described above with respect to operation 810). This may enable the vector module 220 to set this representative group of highest energy values to the single common non-zero value (e.g., 1) in generating the vector 400 in operation 720. In some example embodiments, operation 812 includes ranking energy values for frequencies at or above a predetermined threshold frequency (e.g., 1700 Hz) in the spectral representation 320 and determining the representative group from the upper portion 324 based on the ranked energy values.
In operation 814, the vector module 220 determines a representative group of highest energy values (e.g., top Y energy values, such as the top 0.5% energy values or the top six energy values) from the lower portion 322 of the spectral representation 320 (e.g., weighted as described above with respect to operation 810). This may enable the vector module 220 to set this representative group of highest energy values to the single common non-zero value (e.g., 1) in generating the vector 400 in operation 720. In certain example embodiments, operation 814 includes ranking energy values for frequencies below a predetermined threshold frequency (e.g., 1700 Hz) in the spectral representation 320 and determining the representative group from the lower portion 322 based on the ranked energy values.
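Operations 812 and 814 may be illustrated by the following minimal sketch, which assumes that the energy values have been flattened into a one-dimensional list ordered by frequency bin and have already been weighted. The names `top_indices`, `sparse_binary_vector`, `split_bin`, `top_lower`, and `top_upper` are hypothetical; `split_bin` stands in for the bin that corresponds to the predetermined threshold frequency.

```python
from typing import List, Sequence


def top_indices(values: Sequence[float], start: int, stop: int,
                count: int) -> List[int]:
    """Indices of the `count` largest values within values[start:stop]."""
    ranked = sorted(range(start, stop), key=lambda i: values[i], reverse=True)
    return ranked[:count]


def sparse_binary_vector(energies: Sequence[float], split_bin: int,
                         top_lower: int, top_upper: int) -> List[int]:
    """Set the top energy values of each portion to 1 and all others to 0.

    Bins below `split_bin` form the lower portion; bins at or above it
    form the upper portion.
    """
    vector = [0] * len(energies)
    for i in top_indices(energies, 0, split_bin, top_lower):
        vector[i] = 1
    for i in top_indices(energies, split_bin, len(energies), top_upper):
        vector[i] = 1
    return vector
```

Ranking the two portions separately keeps a loud low-frequency band from crowding out every high-frequency bin, which is one plausible reason for treating the portions independently.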
Operation 830 may be performed as part (e.g., a precursor task, a subroutine, or a portion) of operation 730, in which the scrambler module 230 generates the ordered set 410 of permutations of the vector 400. As noted above with respect to
One or both of operations 840 and 842 may be performed as part of operation 740, in which the coder module 240 generates the ordered set 420 of numbers from the ordered set 410 of permutations. In operation 840, the coder module 240 generates each number in the ordered set 420 of numbers based on a position (e.g., a frequency bin number) of an instance of the single common non-zero value (e.g., 1) within the corresponding permutation for that number. For example, the coder module 240 may generate each number in the ordered set 420 of numbers based on the lowest position (e.g., lowest frequency bin number) of any instance of the single common non-zero value (e.g., 1) within the corresponding permutation for the number that is being generated.
In operation 842, the coder module 240 calculates a remainder from a modulo operation performed on a numerical representation of the position (e.g., the frequency bin number) discussed above with respect to operation 840. For example, the coder module 240, in generating a number in the ordered set 420 of numbers, may calculate the remainder of a modulo 256 operation performed on the frequency bin number of the lowest frequency bin occupied by the single common non-zero value (e.g., 1) in the permutation that corresponds to the number being generated.
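A minimal sketch of operations 730, 740, 840, and 842 follows. It assumes that the ordered set of permutations is produced once from a seeded pseudo-random generator and reused for every segment, so that the resulting numbers are comparable across fingerprints; that assumption, the function names `make_permutations`, `permute`, and `code_numbers`, and the convention that entry i of a permuted vector is taken from position `perm[i]` are illustrative choices, not details stated above.

```python
import random
from typing import List, Sequence


def make_permutations(length: int, count: int, seed: int = 0) -> List[List[int]]:
    """Ordered set of `count` fixed pseudo-random permutations of range(length)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(count):
        perm = list(range(length))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute(vector: Sequence[int], perm: Sequence[int]) -> List[int]:
    """Entry i of the result is vector[perm[i]]."""
    return [vector[p] for p in perm]


def code_numbers(vector: Sequence[int], perms: Sequence[Sequence[int]],
                 modulus: int = 256) -> List[int]:
    """One number per permutation: lowest position holding a 1, modulo `modulus`.

    Raises ValueError if the vector contains no 1 at all.
    """
    numbers = []
    for perm in perms:
        permuted = permute(vector, perm)
        lowest = permuted.index(1)  # lowest position of the non-zero value
        numbers.append(lowest % modulus)
    return numbers
```

With a modulus of 256, each number fits in a single byte, so a fingerprint built from N permutations would occupy N bytes under this sketch.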
Operation 850 may be performed as part of operation 750, in which the fingerprint module 250 generates the fingerprint 560. In operation 850, the fingerprint module 250 stores the ordered set 420 of numbers in the database 115 with a reference to the timestamp 550 of the segment 310 of the audio data 300 (e.g., as discussed above with respect to
As shown in
In
Similarly, the candidate audio data 920 is shown as including segments 921, 922, 923, 924, and 925. Examples of the candidate audio data 920 include an audio file, an audio stream, or any portion thereof. Segments 921, 922, 923, 924, and 925 of the candidate audio data 920 are shown as overlapping segments 921-925. For example, the segments 921-925 may be half-second portions of the candidate audio data 920, and the segments 921-925 may overlap such that adjacent segments (e.g., segments 924 and 925) overlap each other by a sixteenth of a second (e.g., 512 audio samples, sampled at 8 kHz). In some example embodiments, a different amount of overlap is used (e.g., 448 milliseconds or 3584 samples, sampled at 8 kHz). As shown in
According to various example embodiments, an individual sub-fingerprint (e.g., fingerprint 560) represents a small time-domain audio segment (e.g., segment 310) and includes results of permutations (e.g., ordered set 420 of numbers) as described above with respect to
As shown in
As shown in
In operation 1110, which may be performed as part (e.g., a precursor task, a subroutine, or a portion) of operation 750, the fingerprint module 250 generates a first reference fingerprint (e.g., similar to the fingerprint 560) of a first reference segment (e.g., segment 911, which may be the same as the segment 310) of the reference audio data 910, which may be the same as audio data 300. The generating of the first reference fingerprint may be based on an ordered set of numbers (e.g., similar to the ordered set 420 of numbers).
In operation 1120, the fingerprint module 250 generates a second reference fingerprint (e.g., similar to the fingerprint 560) of a second reference segment (e.g., segment 914) of the reference audio data 910. This may be performed in a manner similar to that described above with respect to operation 1110. Accordingly, first and second reference fingerprints may be generated off-line and stored in the database 115 (e.g., prior to receiving any queries from users), and the first and second reference fingerprints may be accessed from the database 115 in response to receiving a query.
In operation 1130, the fingerprint module 250 accesses the candidate audio data 920 (e.g., from the database 115, from the device 130, from the device 150, or any suitable combination thereof). For example, the candidate audio data 920 may be accessed in response to a query submitted by the user 132 by the device 130. Such a query may request identification of the candidate audio data 920.
In operation 1140, the fingerprint module 250 generates a first candidate fingerprint (e.g., similar to the fingerprint 560) of a first candidate segment (e.g., segment 922) of the candidate audio data 920. This may be performed in a manner similar to that described above with respect to operation 1110.
In operation 1150, the fingerprint module 250 generates a second candidate fingerprint (e.g., similar to the fingerprint 560) of a second candidate segment (e.g., segment 925) of the candidate audio data 920. This may be performed in a manner similar to that described above with respect to operation 1120.
In operation 1160, the match module 260 determines a likelihood (e.g., a probability, a score, or both) that the candidate audio data 920 matches the reference audio data 910. This determination may be based on one or more of the following factors: the first candidate fingerprint (e.g., of the segment 922) matching the first reference fingerprint (e.g., of the segment 911); the second candidate fingerprint (e.g., of the segment 925) matching the second reference fingerprint (e.g., of the segment 914); the first reference segment (e.g., segment 911) preceding the second reference segment (e.g., segment 914); and the first candidate segment (e.g., segment 922) preceding the second candidate segment (e.g., segment 925). According to various example embodiments, the combination (e.g., conjunction) of one or more of these factors may be a basis for performing operation 1160. In some example embodiments, a further basis for performing operation 1160 is the reference time span 919 being equivalent to the candidate time span 929. In certain example embodiments, the further basis for performing operation 1160 is the reference time span 919 being distinct but approximately equivalent to the candidate time span 929 (e.g., within one segment, two segments, or ten segments).
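One hypothetical way to combine the factors of operation 1160 into a score is sketched below. The class `SubFingerprint`, the function `match_likelihood`, the use of exact equality to decide that two fingerprints match, and the particular scores returned (0.0, 0.5, and 1.0) are all placeholder assumptions chosen for illustration; the description above does not prescribe a scoring formula.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubFingerprint:
    numbers: Tuple[int, ...]  # ordered set of numbers for one segment
    position: int             # index of the segment within its audio data


def match_likelihood(ref1: SubFingerprint, ref2: SubFingerprint,
                     cand1: SubFingerprint, cand2: SubFingerprint,
                     tolerance: int = 0) -> float:
    """Score the four factors: two fingerprint matches and two orderings.

    Returns 0.0 unless both candidate fingerprints equal their reference
    counterparts and the first segment precedes the second in each audio
    data. Returns 1.0 when the time spans also agree to within `tolerance`
    segments, and a placeholder 0.5 otherwise.
    """
    both_match = (cand1.numbers == ref1.numbers
                  and cand2.numbers == ref2.numbers)
    ordered = (ref1.position < ref2.position
               and cand1.position < cand2.position)
    if not (both_match and ordered):
        return 0.0
    ref_span = ref2.position - ref1.position
    cand_span = cand2.position - cand1.position
    return 1.0 if abs(ref_span - cand_span) <= tolerance else 0.5
```

Setting `tolerance` to a value such as 1, 2, or 10 corresponds to treating time spans that differ by that many segments as approximately equivalent.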
In operation 1170, the match module 260 causes the device 130 to present the likelihood that the candidate audio data 920 matches the reference audio data 910 (e.g., as determined in operation 1160). For example, the match module 260 may communicate the likelihood (e.g., within a message or an alert) to the device 130 in response to a query sent from the device 130 by the user 132. The device 130 may be configured to present the likelihood as a level of confidence (e.g., a confidence score) that the candidate audio data 920 matches the reference audio data 910. Moreover, the match module 260 may access metadata that describes the reference audio data 910 (e.g., song name, artist, genre, release date, album, lyrics, duration, or any suitable combination thereof). Such metadata may be accessed from the database 115. The match module 260 may also communicate some or all of such metadata to the device 130 for presentation to the user 132. Accordingly, performance of one or more of operations 1110-1170 may form all or part of an audio identification service.
According to various example embodiments, one or more of the methodologies described herein may facilitate the fingerprinting of audio data (e.g., generation of a unique identifier or representation of audio data). Moreover, one or more of the methodologies described herein may facilitate identification of an unknown piece of audio data. Hence, one or more of the methodologies described herein may facilitate efficient provision of audio fingerprinting services, audio identification services, or any suitable combination thereof.
When these effects are considered in aggregate, one or more of the methodologies described herein may obviate a need for certain efforts or resources that otherwise would be involved in fingerprinting audio data and identifying audio data. Efforts expended by a user in identifying audio data may be reduced by one or more of the methodologies described herein. Computing resources used by one or more machines, databases, or devices (e.g., within the network environment 100) may similarly be reduced. Examples of such computing resources include processor cycles, network traffic, memory usage, data storage capacity, power consumption, and cooling capacity.
The machine 1200 includes a processor 1202 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a radio-frequency integrated circuit (RFIC), or any suitable combination thereof), a main memory 1204, and a static memory 1206, which are configured to communicate with each other via a bus 1208. The processor 1202 may contain microcircuits that are configurable, temporarily or permanently, by some or all of the instructions 1224 such that the processor 1202 is configurable to perform any one or more of the methodologies described herein, in whole or in part. For example, a set of one or more microcircuits of the processor 1202 may be configurable to execute one or more modules (e.g., software modules) described herein.
The machine 1200 may further include a graphics display 1210 (e.g., a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, a cathode ray tube (CRT), or any other display capable of displaying graphics or video). The machine 1200 may also include an alphanumeric input device 1212 (e.g., a keyboard or keypad), a cursor control device 1214 (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, an eye tracking device, or other pointing instrument), a storage unit 1216, an audio generation device 1218 (e.g., a sound card, an amplifier, a speaker, a headphone jack, or any suitable combination thereof), and a network interface device 1220.
The storage unit 1216 includes the machine-readable medium 1222 (e.g., a tangible and non-transitory machine-readable storage medium) on which are stored the instructions 1224 embodying any one or more of the methodologies or functions described herein. The instructions 1224 may also reside, completely or at least partially, within the main memory 1204, within the processor 1202 (e.g., within the processor's cache memory), or both, before or during execution thereof by the machine 1200. Accordingly, the main memory 1204 and the processor 1202 may be considered machine-readable media (e.g., tangible and non-transitory machine-readable media). The instructions 1224 may be transmitted or received over the network 190 via the network interface device 1220. For example, the network interface device 1220 may communicate the instructions 1224 using any one or more transfer protocols (e.g., hypertext transfer protocol (HTTP)).
In some example embodiments, the machine 1200 may be a portable computing device, such as a smart phone or tablet computer, and have one or more additional input components 1230 (e.g., sensors or gauges). Examples of such input components 1230 include an image input component (e.g., one or more cameras), an audio input component (e.g., a microphone), a direction input component (e.g., a compass), a location input component (e.g., a global positioning system (GPS) receiver), an orientation component (e.g., a gyroscope), a motion detection component (e.g., one or more accelerometers), an altitude detection component (e.g., an altimeter), and a gas detection component (e.g., a gas sensor). Inputs harvested by any one or more of these input components may be accessible and available for use by any of the modules described herein.
As used herein, the term “memory” refers to a machine-readable medium able to store data temporarily or permanently and may be taken to include, but not be limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, and cache memory. While the machine-readable medium 1222 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing the instructions 1224 for execution by the machine 1200, such that the instructions 1224, when executed by one or more processors of the machine 1200 (e.g., processor 1202), cause the machine 1200 to perform any one or more of the methodologies described herein, in whole or in part. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as cloud-based storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, one or more tangible data repositories in the form of a solid-state memory, an optical medium, a magnetic medium, or any suitable combination thereof.
Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.
Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A “hardware module” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.
In some embodiments, a hardware module may be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware module may include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware module may be a special-purpose processor, such as a field programmable gate array (FPGA) or an ASIC. A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware module may include software encompassed within a general-purpose processor or other programmable processor. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
Accordingly, the phrase “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. As used herein, “hardware-implemented module” refers to a hardware module. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where a hardware module comprises a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different special-purpose processors (e.g., comprising different hardware modules) at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.
Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented module” refers to a hardware module implemented using one or more processors.
Similarly, the methods described herein may be at least partially processor-implemented, a processor being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an application program interface (API)).
The performance of certain operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.
Some portions of the subject matter discussed herein may be presented in terms of algorithms or symbolic representations of operations on data stored as bits or binary digital signals within a machine memory (e.g., a computer memory). Such algorithms or symbolic representations are examples of techniques used by those of ordinary skill in the data processing arts to convey the substance of their work to others skilled in the art. As used herein, an “algorithm” is a self-consistent sequence of operations or similar processing leading to a desired result. In this context, algorithms and operations involve physical manipulation of physical quantities. Typically, but not necessarily, such quantities may take the form of electrical, magnetic, or optical signals capable of being stored, accessed, transferred, combined, compared, or otherwise manipulated by a machine. It is convenient at times, principally for reasons of common usage, to refer to such signals using words such as “data,” “content,” “bits,” “values,” “elements,” “symbols,” “characters,” “terms,” “numbers,” “numerals,” or the like. These words, however, are merely convenient labels and are to be associated with appropriate physical quantities.
Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” or the like may refer to actions or processes of a machine (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or any suitable combination thereof), registers, or other machine components that receive, store, transmit, or display information. Furthermore, unless specifically stated otherwise, the terms “a” or “an” are herein used, as is common in patent documents, to include one or more than one instance. Finally, as used herein, the conjunction “or” refers to a non-exclusive “or,” unless specifically stated otherwise.
This application is a continuation of U.S. patent application Ser. No. 18/049,882, filed Oct. 26, 2022, which is a continuation of U.S. patent application Ser. No. 16/926,286, filed Jul. 10, 2020, which is a continuation of U.S. patent application Ser. No. 16/270,113, filed Feb. 7, 2019, now U.S. Pat. No. 10,714,105, which is a continuation of U.S. patent application Ser. No. 15/008,042, filed Jan. 27, 2016, now U.S. Pat. No. 10,229,689, which is a continuation of U.S. patent application Ser. No. 14/107,923, filed Dec. 16, 2013, now U.S. Pat. No. 9,286,902, all of which are incorporated herein by reference in their entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | 18049882 | Oct 2022 | US
Child | 18500764 | | US
Parent | 16926286 | Jul 2020 | US
Child | 18049882 | | US
Parent | 16270113 | Feb 2019 | US
Child | 16926286 | | US
Parent | 15008042 | Jan 2016 | US
Child | 16270113 | | US
Parent | 14107923 | Dec 2013 | US
Child | 15008042 | | US