The disclosed embodiments relate generally to determining the language of media content, and, in particular, to using listenership to determine the language of media content that includes audio.
Access to electronic media, such as music and video content, has expanded dramatically over time. As a departure from physical media, media content providers stream media to electronic devices across wireless networks, improving the convenience with which users can digest and experience such content.
As it becomes easier for users to find content, media content providers can organize media content items and group related content items together in order to provide users with a convenient and straightforward way to find relevant content. In many cases, information included in metadata or text data corresponding to the media content can be searchable in order to identify media content relevant to a user's query. Additionally, it can be useful for the media content itself to be searchable. One method of providing access to information in the media content is by transcribing audio of the media content. For example, audio from a song, podcast, audiobook, or video may be transcribed into text, allowing information stored in the audio to be cataloged and queried. Additionally, when the language of the audio is known, the audio can be transcribed with improved accuracy. Conventional methods of determining a language of audio content include manual (e.g., human) labeling and natural language processing.
There is a need for systems and methods for determining a language of media content (also referred to herein as a media content item). This technical problem is exacerbated by incorrectly manually-labeled metadata, by incorrect determinations from natural language processing methods, and by the use of a language in a title or a description of media content that differs from the language of the audio in the media content itself.
Some embodiments described herein offer a technical solution to these problems by determining and updating metadata indicating the language of media content based on languages of listeners of the media content (e.g., using statistical methods). To do so, the systems and methods described herein determine a language for a media content item based on the languages of users that listen to the media content item. By determining a language of a media content item using information other than metadata provided by a creator (e.g., producer, author) of the media content item, the systems and methods mitigate the problem of incorrect language assignment due to human error or inaccuracies in natural language processing methods. Additionally, by providing an accurate language identifier associated with a media content item, transcription of the media content item can be performed with improved accuracy and fewer errors.
For example, a podcast may include a title and/or a description that is written in English, and/or metadata that specifies that the podcast is in English. However, the podcast may be in a language other than English, such as Dutch. While information in the metadata (e.g., the title and description) does not accurately reflect the language of the podcast, the language of a podcast is reflected in the languages of its listeners, since a listener of a podcast is likely to be able to understand the language of the podcast. Conversely, a person is unlikely to listen to a podcast in a language that they do not understand.
To that end, in accordance with some embodiments, a method is performed at an electronic device that is associated with a media-providing service. The electronic device has one or more processors and memory storing instructions for execution by the one or more processors. The method includes obtaining metadata for a collection of media content items that include audio. The metadata specifies, for a respective media content item of the collection of media content items, an initial value for a language of the audio. The method includes obtaining a listening history for a plurality of users of the media-providing service. The listening history specifies, for each respective user of the plurality of users, which media content items of the collection of media content items the respective user has listened to. For a first user of the plurality of users, the method includes determining one or more languages corresponding to the first user based on the initial values of the languages of the audio of the media content items that the first user has listened to. For the respective media content item of the collection of media content items, the method includes determining an updated value for the language of the audio based on the one or more languages corresponding to the users that have listened to the respective media content item.
In accordance with some embodiments, a computer system that is associated with a media-providing service includes one or more processors and memory storing one or more programs configured to be executed by the one or more processors. The one or more programs include instructions for obtaining metadata for a collection of media content items that include audio. The metadata specifies, for a respective media content item of the collection of media content items, an initial value for a language of the audio. The one or more programs further include instructions for obtaining a listening history for a plurality of users of the media-providing service. The listening history specifies, for each respective user of the plurality of users, which media content items of the collection of media content items the respective user has listened to. The one or more programs further include instructions for determining, for a first user of the plurality of users, one or more languages corresponding to the first user based on the initial values of the languages of the audio of the media content items that the first user has listened to. The one or more programs further include instructions for determining, for the respective media content item of the collection of media content items, an updated value for the language of the audio based on the one or more languages corresponding to the users that have listened to the respective media content item.
In accordance with some embodiments, a computer-readable storage medium has stored therein instructions that, when executed by a server system that is associated with a media-providing service, cause the server system to obtain metadata for a collection of media content items that include audio. The metadata specifies, for a respective media content item of the collection of media content items, an initial value for a language of the audio. The instructions further cause the server system to obtain a listening history for a plurality of users of the media-providing service. The listening history specifies, for each respective user of the plurality of users, which media content items of the collection of media content items the respective user has listened to. The instructions further cause the server system to determine, for a first user of the plurality of users, one or more languages corresponding to the first user based on the initial values of the languages of the audio of the media content items that the first user has listened to. The instructions further cause the server system to determine, for the respective media content item of the collection of media content items, an updated value for the language of the audio based on the one or more languages corresponding to the users that have listened to the respective media content item.
Thus, systems are provided with improved methods for determining the language of media content items that are provided by a media-providing service.
The embodiments disclosed herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Like reference numerals refer to corresponding parts throughout the drawings and specification.
Reference will now be made to embodiments, examples of which are illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide an understanding of the various described embodiments. However, it will be apparent to one of ordinary skill in the art that the various described embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
It will also be understood that, although the terms first, second, etc. are, in some instances, used herein to describe various elements, these elements should not be limited by these terms. These terms are used only to distinguish one element from another. For example, a first set of parameters could be termed a second set of parameters, and, similarly, a second set of parameters could be termed a first set of parameters, without departing from the scope of the various described embodiments. The first set of parameters and the second set of parameters are both sets of parameters, but they are not the same set of parameters.
The terminology used in the description of the various embodiments described herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various described embodiments and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
In some embodiments, an electronic device 102 is associated with one or more users. In some embodiments, an electronic device 102 is a personal computer, mobile electronic device, wearable computing device, laptop computer, tablet computer, mobile phone, feature phone, smart phone, digital media player, a speaker, television (TV), digital versatile disk (DVD) player, and/or any other electronic device capable of presenting media content (e.g., controlling playback of media items, such as music tracks, videos, etc.). Electronic devices 102 may connect to each other wirelessly and/or through a wired connection (e.g., directly through an interface, such as an HDMI interface). In some embodiments, an electronic device 102 is a headless client. In some embodiments, electronic devices 102-1 and 102-s are the same type of device (e.g., electronic device 102-1 and electronic device 102-s are both speakers). Alternatively, electronic device 102-1 and electronic device 102-s include two or more different types of devices.
In some embodiments, electronic devices 102-1 and 102-s send and receive media-control information through network(s) 112. For example, electronic devices 102-1 and 102-s send media control requests (e.g., requests to play music, movies, videos, or other media items, or playlists thereof) to media content server 104 through network(s) 112. Additionally, electronic devices 102-1 and 102-s, in some embodiments, also send indications of media content items to media content server 104 through network(s) 112. In some embodiments, the media content items are uploaded to electronic devices 102-1 and 102-s before the electronic devices forward the media content items to media content server 104.
In some embodiments, electronic device 102-1 communicates directly with electronic device 102-s (e.g., as illustrated by the dotted-line arrow), or any other electronic device 102.
In some embodiments, electronic device 102-1 and/or electronic device 102-s include a media application 222.
In some embodiments, the CDN 106 stores and provides media content (e.g., media content requested by the media application 222 of electronic device 102) to electronic device 102 via the network(s) 112. Content (also referred to herein as “media items,” “media content items,” and “content items”) is received, stored, and/or served by the CDN 106. In some embodiments, content includes audio (e.g., music, spoken word, podcasts, etc.), video (e.g., short-form videos, music videos, television shows, movies, clips, previews, etc.), text (e.g., articles, blog posts, emails, etc.), image data (e.g., image files, photographs, drawings, renderings, etc.), games (e.g., 2- or 3-dimensional graphics-based computer games, etc.), or any combination of content types (e.g., web pages that include any combination of the foregoing types of content or other content not explicitly listed). In some embodiments, content includes one or more audio media items (also referred to herein as “audio items,” “tracks,” and/or “audio tracks”).
In some embodiments, media content server 104 receives media requests (e.g., commands) from electronic devices 102. In some embodiments, media content server 104 and/or CDN 106 stores one or more playlists (e.g., information indicating a set of media content items). For example, a playlist is a set of media content items defined by a user and/or defined by an editor associated with a media-providing service. The description of the media content server 104 as a “server” is intended as a functional description of the devices, systems, processor cores, and/or other components that provide the functionality attributed to the media content server 104. It will be understood that the media content server 104 may be a single server computer, or may be multiple server computers. Moreover, the media content server 104 may be coupled to CDN 106 and/or other servers and/or server systems, or other devices, such as other client devices, databases, content delivery networks (e.g., peer-to-peer networks), network caches, and the like. In some embodiments, the media content server 104 is implemented by multiple computing devices working together to perform the actions of a server system (e.g., cloud computing).
In some embodiments, the electronic device 102 includes a user interface 204, including output device(s) 206 and/or input device(s) 208. In some embodiments, the input devices 208 include a keyboard, mouse, or track pad. Alternatively, or in addition, in some embodiments, the user interface 204 includes a display device that includes a touch-sensitive surface, in which case the display device is a touch-sensitive display. In electronic devices that have a touch-sensitive display, a physical keyboard is optional (e.g., a soft keyboard may be displayed when keyboard entry is needed). In some embodiments, the output devices (e.g., output device(s) 206) include an audio jack 250 (or other physical output connection port) for connecting to speakers, earphones, headphones, or other external listening devices and/or speaker 252 (e.g., speakerphone device). Furthermore, some electronic devices 102 use a microphone and voice recognition device to supplement or replace the keyboard. Optionally, the electronic device 102 includes an audio input device (e.g., a microphone 254) to capture audio (e.g., speech from a user).
Optionally, the electronic device 102 includes a location-detection device 207, such as a global navigation satellite system (GNSS) (e.g., GPS (global positioning system), GLONASS, Galileo, BeiDou) or other geo-location receiver, and/or location-detection software for determining the location of the electronic device 102 (e.g., module for finding a position of the electronic device 102 using trilateration of measured signal strengths for nearby devices).
In some embodiments, the one or more network interfaces 210 include wireless and/or wired interfaces for receiving data from and/or transmitting data to other electronic devices 102, a media content server 104, a CDN 106, and/or other devices or systems. In some embodiments, data communications are carried out using any of a variety of custom or standard wireless protocols (e.g., NFC, RFID, IEEE 802.15.4, Wi-Fi, ZigBee, 6LoWPAN, Thread, Z-Wave, Bluetooth, ISA100.11a, WirelessHART, MiWi, etc.). Furthermore, in some embodiments, data communications are carried out using any of a variety of custom or standard wired protocols (e.g., USB, Firewire, Ethernet, etc.). For example, the one or more network interfaces 210 include a wireless interface 260 for enabling wireless data communications with other electronic devices 102 and/or other wireless (e.g., Bluetooth-compatible) devices (e.g., for streaming audio data to the electronic device 102 of an automobile). Furthermore, in some embodiments, the wireless interface 260 (or a different communications interface of the one or more network interfaces 210) enables data communications with other WLAN-compatible devices (e.g., electronic device(s) 102) and/or the media content server 104 (via the one or more network(s) 112).
In some embodiments, electronic device 102 includes one or more sensors including, but not limited to, accelerometers, gyroscopes, compasses, magnetometers, light sensors, near field communication transceivers, barometers, humidity sensors, temperature sensors, proximity sensors, range finders, and/or other sensors/devices for sensing and measuring various environmental conditions.
Memory 212 includes high-speed random-access memory, such as DRAM, SRAM, DDR RAM, or other random-access solid-state memory devices; and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memory 212 may optionally include one or more storage devices remotely located from the CPU(s) 202. Memory 212, or alternately, the non-volatile solid-state storage devices within memory 212, includes a non-transitory computer-readable storage medium. In some embodiments, memory 212 or the non-transitory computer-readable storage medium of memory 212 stores the following programs, modules, and data structures, or a subset or superset thereof:
Memory 306 includes high-speed random-access memory, such as DRAM, SRAM, DDR RAM, or other random-access solid-state memory devices; and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memory 306 optionally includes one or more storage devices remotely located from one or more CPUs 302. Memory 306, or, alternatively, the non-volatile solid-state memory device(s) within memory 306, includes a non-transitory computer-readable storage medium. In some embodiments, memory 306, or the non-transitory computer-readable storage medium of memory 306, stores the following programs, modules, and data structures, or a subset or superset thereof:
In some embodiments, the media content server 104 includes web or Hypertext Transfer Protocol (HTTP) servers, File Transfer Protocol (FTP) servers, as well as web pages and applications implemented using Common Gateway Interface (CGI) script, PHP Hypertext Preprocessor (PHP), Active Server Pages (ASP), Hypertext Markup Language (HTML), Extensible Markup Language (XML), Java, JavaScript, Asynchronous JavaScript and XML (AJAX), XHP, Javelin, Wireless Universal Resource File (WURFL), and the like.
Each of the above identified modules stored in memory 212 and 306 corresponds to a set of instructions for performing a function described herein. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 212 and 306 optionally store a subset or superset of the respective modules and data structures identified above. Furthermore, memory 212 and 306 optionally store additional modules and data structures not described above. In some embodiments, memory 212 stores one or more of the above identified modules described with regard to memory 306. In some embodiments, memory 306 stores one or more of the above identified modules described with regard to memory 212.
In some embodiments, the method of determining a language of a media content item 124 is performed using Bayesian updating.
A model used to perform the Bayesian updating includes:
For the $i$th user, the distribution $u_i$ is expressed as $u_i \in \Delta^{k-1}$, where $\Delta^{k-1} = \{x \in [0,1]^k : \sum_{\ell=1}^{k} x_\ell = 1\}$. In other words, the distribution $u_i$ of languages for the $i$th user includes $k$ affinity values $x$, one affinity value for each language. Each affinity value can be any value from 0 to 1, and the sum of all affinity values for the $i$th user equals 1. In some embodiments, as reflected in the examples shown herein, an affinity value of zero indicates that a user does not know that language, and a non-zero affinity value indicates that the user knows (or has some familiarity with) that language. Thus, each user has a language distribution $u_i$ that includes $k$ affinity values. For example, for a distribution over 10 languages (e.g., $k = 10$), a user who knows 2 languages would have 10 affinity values, of which 2 would be non-zero and the other 8 would be zero.
For the $j$th media content item, the language is $v_j \in [k]$, where $[k] = \{1, 2, \ldots, k\}$ is the set of language indices. In other words, $v_j$ is a single value that corresponds to a specific language. For example, for the $j$th media content item, $v_j = \text{Spanish}$. The value of $v_j$ may also be a numerical value, such as $v_j = 1$ for Arabic or $v_j = 22$ for Spanish.
Thus, the set of distributions of the languages of all $n$ users is $u = (u_1, u_2, \ldots, u_n)$, and the set of languages of all $m$ media content items is $v = (v_1, v_2, \ldots, v_m)$. Initial values (e.g., a priori beliefs) of the set $u$ of language distributions over the $n$ users and of the set $v$ of languages over the $m$ media content items are dependent on user information (e.g., information from user profiles and/or user listening histories) and on metadata of the media content items, respectively.
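The shape of this model can be illustrated in code. The following is a minimal sketch, assuming a small illustrative set of $k = 3$ languages; the language names and helper function are assumptions for illustration only, and the sketch shows only the data structures, not the updating itself.

```python
import numpy as np

# Illustrative language index for k = 3; a real system would include many more.
LANGUAGES = ["English", "French", "Spanish"]
k = len(LANGUAGES)

def make_user_distribution(affinities):
    """Build u_i as a point on the simplex: k values in [0, 1] that sum to 1."""
    u = np.zeros(k)
    for language, x in affinities.items():
        u[LANGUAGES.index(language)] = x
    assert abs(u.sum() - 1.0) < 1e-9, "affinity values must sum to 1"
    return u

# A user who knows two of the three languages (two non-zero affinity values).
u_1 = make_user_distribution({"English": 0.6, "French": 0.4})

# v_j is a single language per media content item, stored here as an index.
v_1 = LANGUAGES.index("Spanish")
```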
In some embodiments, the initial language value vj for the jth media content item is determined using metadata associated with the media content item. For example, the language value vj for the jth media content item may be determined based on a language indicator that is provided by a creator of the media content item. In another example, the language value vj for the jth media content item may be determined based on natural language processing of information in the metadata (e.g., a title or description) of the media content item. In a third example, the language value vj for the jth media content item may be determined based on a location of a producer (e.g., production company) that is listed in the metadata associated with the media content item.
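As a rough illustration of how such an initial value might be derived, the sketch below checks the metadata signals in the order described above. The field names ("creator_language", "nlp_language", "producer_country") and the country-to-language table are assumptions for illustration, not a defined metadata schema.

```python
from typing import Optional

# Hypothetical mapping from a producer's country to a default language guess.
COUNTRY_DEFAULT_LANGUAGE = {"JP": "Japanese", "FR": "French", "BR": "Portuguese"}

def initial_language(metadata: dict) -> Optional[str]:
    # 1. A language indicator provided by the creator of the media content item.
    if metadata.get("creator_language"):
        return metadata["creator_language"]
    # 2. A language determined by natural language processing of the title/description.
    if metadata.get("nlp_language"):
        return metadata["nlp_language"]
    # 3. A default based on the location of the producer listed in the metadata.
    return COUNTRY_DEFAULT_LANGUAGE.get(metadata.get("producer_country"))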
In some embodiments, for a given user, the initial affinity values $x$ for each language are determined based on the languages of media content items that the user engages with (e.g., interacts with, listens to, plays, watches). For example, a user that listens to podcasts in English and movies in Tagalog may have affinity values that are non-zero for English and Tagalog (and affinity values of zero for other languages). In another example, a user that listens only to podcasts in Kazakh may have an affinity value of 1 for Kazakh (and affinity values of zero for other languages). In some embodiments, a user's affinity value for a language is determined based on a frequency (e.g., how often) or an amount (e.g., how much time) with which the user engages with media content items in that language. For example, a first user that listens to 10 podcasts in Arabic and 10 podcasts in French may have affinity values of 0.5 for Arabic and 0.5 for French. Alternatively, if the first user has listened to 70 hours of podcasts in Arabic in the past year compared to 30 hours of podcasts in French, the first user may have affinity values of 0.7 for Arabic and 0.3 for French.
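A minimal sketch of this computation, assuming listening amounts have already been aggregated per language (e.g., hours played), reproduces the 70/30 example above:

```python
def initial_affinities(amount_per_language):
    """Normalize per-language listening amounts into affinity values that sum to 1."""
    total = sum(amount_per_language.values())
    return {language: amount / total
            for language, amount in amount_per_language.items()}

print(initial_affinities({"Arabic": 70.0, "French": 30.0}))
# {'Arabic': 0.7, 'French': 0.3}
```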
In some embodiments, for a given user, the initial affinity values $x$ for each language are determined based on information in the user's profile. For example, a user who lists Belgium as his/her country may have initial affinity values of 0.5 for Flemish and 0.5 for French. In another example, a user who includes, in his/her user profile, that he/she would like recommendations of media content items that are in German (or that are produced by a specific production company that is known to be based in Germany) may have an affinity value of 1 for German. In a third example, a user who includes in his/her user profile that he/she knows (e.g., speaks) Russian may have an affinity value of 1 for Russian. More than one piece of information in the user's profile may be used to determine initial affinity values. For example, a second user who lists Belgium as his/her country and who speaks English and Japanese may have initial affinity values of 0.25 for each of French, Flemish, English, and Japanese. In some cases, different information in the user profile may have different weights. Following the example of the second user, the initial affinity values for the second user may instead be 0.2 for each of French and Flemish and 0.3 for each of English and Japanese in the case that languages identified in a user profile are weighted more heavily than inferred languages (e.g., languages inferred from country or location information), with the affinity values still summing to 1.
The distribution $u_i$ (which includes the affinity values $x$ for the languages) for each user and the language value $v_j$ for each media content item can be determined (e.g., computed, generated, calculated) using Bayesian updating and the following information:
Thus, knowledge of the playback data (also called listenership data, e.g., which users listen to which media content items) is essential to the determination (e.g., computation) of the languages of users and the languages of media content items. The playback data can be illustrated (or represented) using a bipartite graph, in which one set of nodes represents users, the other set of nodes represents media content items, and an edge connects a user to each media content item that the user listens to.
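One way to represent this bipartite graph in code is as an edge list with adjacency maps in both directions, since the updating needs both the items a user has heard (to update $u_i$) and the listeners of an item (to update $v_j$). The identifiers below are illustrative placeholders.

```python
# Edges of the bipartite graph: (user, media content item) playback pairs.
listens = [
    ("user_1", "podcast_a"),
    ("user_1", "podcast_b"),
    ("user_2", "podcast_a"),
]

items_of_user = {}   # user -> items the user listens to (used to update u_i)
users_of_item = {}   # item -> users who listen to it (used to update v_j)
for user, item in listens:
    items_of_user.setdefault(user, []).append(item)
    users_of_item.setdefault(item, []).append(user)
```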
Note that not all users 120 of the media-providing service and not all media content items provided by the media-providing service are shown. Thus, each user 120 may interact with more media content items 124 than shown.
In some embodiments, the initial language value of a respective media content item 124 is determined based on information in metadata associated with the respective media content item 124. For example, metadata of a respective media content item 124 may include information such as a title of the respective media content item 124, a description of the respective media content item 124, a producer or producing company of the respective media content item 124, and a language indicator for the respective media content item 124 that has been input by a creator or producer of the respective media content item 124 or that has been determined via natural language processing of information included in the metadata. For example, a media content item 124-2, which is a podcast, may include the name of a podcast producing company that is known to be based in Canada. Thus, the initial language value of the podcast may be determined to be English or French, or both (since English and French are the official languages of Canada). In this example, the initial language value is determined to be English. In another example, the creator of media content item 124-4 (e.g., author, producer, artist, person who uploads) may provide an indicator that the media content item 124-4 is in Chinese. Thus, the initial language of the media content item 124-4 is determined to be Chinese. In yet another example, a title of a media content item may include letters from the Russian alphabet and thus, the initial language value of that media content item may be determined to be Russian.
In some embodiments, as shown in language profiles 122-1 and 122-2, a user may be determined to know one or more languages and each language is associated with an affinity value that is representative of a frequency at which that user listens to or interacts with media content items in that language. For example, since 60% of the podcasts that user 120-1 listens to have an initial language that is English and 40% of the podcasts that user 120-1 listens to have an initial language that is French, the languages in the language profile 122-2 of user 120-1 have affinity values (0.6 for English, 0.4 for French, 0.0 for other languages) that are representative of the language distribution in a listening history of user 120-1 (e.g., 60% English and 40% French). Languages with an affinity value that is zero (e.g., 0.0) are not illustrated (e.g., only languages with affinity values that are non-zero are shown).
In some embodiments, a user 120 is determined to be a listener of a media content item 124 if the user 120 is subscribed to the media content item 124. For example, a user is considered to be a listener of a podcast if the user subscribes to a podcast and/or adds a podcast to a favorite list. In such cases, the user may be considered to be a listener of a podcast even if the user's listening history or listening pattern does not meet other thresholds (e.g., even if the user has not yet listened to at least 30 minutes of the podcast in the last month).
Similarly, listening history of the users 120 of the media-providing service may show that the majority (e.g., the greatest percentage) of listeners of media content item 124-4 are determined to know English. Thus, based on the languages of the listeners of media content item 124-4, the media content item 124-4 is determined to have an updated language that is English, and the language indicator 126-4 associated with media content item 124-4 is updated to “English”. In this case, for example, it may be that the creator of media content item 124-4 incorrectly indicated that this podcast is in Chinese.
The distribution of the languages of the listeners of a media content item 124 may change over time. Some examples of how or why the distribution of the languages of the listeners of a media content item 124 may change include: the addition of new listeners (e.g., new listeners of a podcast, new subscribers), the removal of listeners (e.g., listeners who stopped listening or unsubscribed), the addition or removal of users from the media-providing service, and users 120 of the media-providing service interacting with new media content items 124 or ceasing to interact with media content items 124, thereby causing their language profiles to change. Thus, any changes in the listenership of a media content item 124 or changes to a language profile 122 of a listener of a media content item 124 may prompt a language that is different from the initial language to be determined as the updated language of the media content item 124.
In performing the method 600, an electronic device obtains (620) metadata for a collection of media content items 124 that include audio. The metadata specifies, for a respective media content item 124 of the collection of media content items, an initial value for a language of the audio (e.g., as shown in language indicator 126).
The electronic device obtains (630) a listening history (e.g., via listening history module 240, from listening history database 334) for a plurality of users 120 of the media-providing service. The listening history specifies, for each respective user 120 of the plurality of users, which media content items 124 of the collection of media content items the respective user has listened to.
For a first user of the plurality of users, the electronic device determines (640) one or more languages corresponding to the first user (e.g., user 120-1) based on the initial values of the languages of the audio of the media content items 124 that the first user has listened to.
For the respective media content item 124 of the collection of media content items, the electronic device determines (650) an updated value for the language of the audio based on the one or more languages corresponding to the users 120 that have listened to (e.g., are listeners of) the respective media content item 124.
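Putting steps 620 through 650 together, a condensed sketch of this flow might look like the following. It substitutes a simple normalized count for the full Bayesian updating described earlier, so it should be read as an approximation of the flow, not the statistical machinery; the dictionary shapes are assumptions for illustration.

```python
from collections import Counter

def update_item_languages(item_metadata, listening_history):
    # (620) Obtain the initial language value for each item from its metadata.
    v = {item: meta["initial_language"] for item, meta in item_metadata.items()}

    # (630)-(640) For each user, derive a language distribution from the
    # initial languages of the items in that user's listening history.
    u = {}
    for user, items in listening_history.items():
        counts = Counter(v[item] for item in items)
        total = sum(counts.values())
        u[user] = {lang: n / total for lang, n in counts.items()}

    # (650) For each item, pick the language best supported by its listeners.
    updated = {}
    for item in item_metadata:
        votes = Counter()
        for user, items in listening_history.items():
            if item in items:
                for lang, affinity in u[user].items():
                    votes[lang] += affinity
        updated[item] = votes.most_common(1)[0][0] if votes else v[item]
    return updated
```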
In some embodiments, (621) the metadata includes text describing the respective media content item 124. For example, the metadata of a podcast may include a title and description of the podcast. In another example, the metadata of an audiobook may include the title and creator (e.g., producer, reader, author) of the audiobook.
In some embodiments, (622) the initial value for the language of the respective media content item 124 is based on natural language processing of the text describing the respective media content item 124. For example, the metadata may include a title of a podcast, such as “Welcome to ΩΓΣ.” Natural language processing of the title may determine (incorrectly or correctly) that the podcast is in Greek since it includes characters from the Greek alphabet. In another example, the metadata for a podcast may include a description, and natural language processing of the description may determine that the podcast is in English.
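A deliberately naive character-set heuristic of this kind is sketched below; as the "Welcome to ΩΓΣ" example shows, such a guess can be wrong, which is exactly the failure mode the listener-based updating corrects. The two script blocks checked here are illustrative.

```python
def guess_language_from_title(title):
    """Guess a language from the Unicode script of a title (illustrative only)."""
    if any("\u0370" <= ch <= "\u03ff" for ch in title):  # Greek and Coptic block
        return "Greek"
    if any("\u0400" <= ch <= "\u04ff" for ch in title):  # Cyrillic block
        return "Russian"
    return None

print(guess_language_from_title("Welcome to ΩΓΣ"))  # 'Greek' (possibly incorrect)
```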
In some embodiments, (623) the text includes a title of the respective media content item 124 and/or a description of the respective media content item 124.
In some embodiments, (624) the language of the audio is different from the language of the text describing the respective media content item 124. For example, a podcast may have audio that is in Japanese, but the title may be in English.
In some embodiments, (625) the initial value for the language of the respective media content item 124 is based on a country of origin of a producer of the respective media content item 124. For example, the metadata for a podcast may include information that the podcast was produced by Acme Podcast Corporation (a fictitious Japanese company). In such cases, the initial value for the language of the podcast may be determined to be Japanese.
In some embodiments, obtaining (630) a listening history for a plurality of users 120 of the media-providing service includes (631) determining the listening history for each respective user 120 of the plurality of users, including: (632) determining a total time duration that the respective user 120 has listened to the respective media content item 124 over a predetermined period of time and (634) comparing the total time duration to a threshold time duration. In accordance with a determination that the total time duration exceeds the threshold time duration (635), the electronic device determines that the respective user 120 is a listener of the respective media content item 124 and includes the respective media content item 124 in the listening history of the respective user 120. In accordance with a determination that the total time duration does not exceed the threshold time duration (636), the electronic device determines that the respective user 120 is not a listener of the respective media content item 124. For example, a user 120 is determined (e.g., considered) to be a listener of a media content item 124 (e.g., a podcast) if the user 120 has listened to at least 30 minutes of the podcast (across any number of episodes) within the calendar year (e.g., since 0:00:00 AM on Jan. 1, 2020).
In some embodiments, (633) the predetermined time period is a moving window that is based on a current time. For example, a user 120 is determined (e.g., considered) to be a listener of a media content item 124 (e.g., a podcast) if the user 120 has listened to at least 10 minutes of the podcast (across any number of episodes) within the last two months from a current date and time.
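A compact sketch of this listener test (steps 632 through 636), using the 10-minute threshold and two-month moving window from the example above, might be:

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=10)
WINDOW = timedelta(days=61)  # roughly two months

def is_listener(play_events, now):
    """play_events: (start_time, duration) pairs for one user and one item."""
    window_start = now - WINDOW  # (633) moving window anchored at the current time
    total = sum((duration for start, duration in play_events if start >= window_start),
                timedelta())     # (632) total listening time within the window
    return total > THRESHOLD     # (634)-(636) compare against the threshold
```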
In some embodiments, determining (640) one or more languages corresponding to the first user includes determining (642) a distribution over a set of languages, as described above with respect to the model used for the Bayesian updating.
In some embodiments, determining (640) one or more languages corresponding to the first user (e.g., user 120-1) includes assigning (644) a primary language to the first user based on the initial values of the languages of the audio of the media content items 124 (e.g., media content items 124-1, 124-2, and 124-3) that the first user has listened to.
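In code, assigning a primary language can be as simple as taking the argmax of the user's affinity values; a minimal sketch, using the 0.6/0.4 profile of user 120-1 from the earlier example:

```python
def primary_language(affinities):
    """Return the language with the largest affinity value."""
    return max(affinities, key=affinities.get)

print(primary_language({"English": 0.6, "French": 0.4}))  # 'English'
```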
In some embodiments, (652) the determination of the updated value for the language of the audio is further based on physical locations of the users 120 that have listened to the respective media content item 124. For example, the physical locations of users 120, or of the devices that the users use to interact with media content items 124, may be used to determine initial affinity values for those users. Thus, the languages determined for media content items 124 may be based in part on those physical locations.
In some embodiments, (654) the updated value for the language of the audio is a language that corresponds to a majority of users that have listened to the respective media content item 124.
In some embodiments, determining (650) an updated value for the language of the audio based on the one or more languages corresponding to the users 120 that have listened to the respective media content item 124 includes (656) determining a most common language among listeners of the respective media content item 124. The updated value for the language of the audio is determined based on the most common language. For example, a podcast called "Secrets to Great Italian Meals" may have 975 listeners. Of the 975 listeners, all 975 listeners have a non-zero affinity value for English, 900 listeners have a non-zero affinity value for Italian, and 100 listeners have a non-zero affinity value for Japanese. Since English is the most common language across all languages and across all listeners (in this case, all listeners are determined to know English) for this podcast, the most common language is determined to be English and thus, the podcast is determined to be in English.
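The counting in this example, in which each listener contributes one count per language for which they have a non-zero affinity, can be sketched as follows:

```python
from collections import Counter

def most_common_language(listener_affinities):
    """Count each language once per listener with a non-zero affinity for it."""
    counts = Counter(language
                     for affinities in listener_affinities
                     for language, affinity in affinities.items()
                     if affinity > 0)
    return counts.most_common(1)[0][0]
```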
In some embodiments, the electronic device also updates (660) the metadata for the respective media content item 124 in accordance with the updated value for the language of the audio. For example, when the language of a podcast is determined to be English (based on the method described above), the metadata of the podcast is updated so that its language indicator specifies English.
In some embodiments, the electronic device also transcribes (670) at least a portion of the audio based on the updated value for the language of the audio, associates (672) a transcription of the at least a portion of the audio with the respective media content item, and stores (674) the transcription for access by one or more users of the media-providing service. For example, after the language of a podcast has been updated to German, the podcast is transcribed into German text and stored so that a user 120 can access (e.g., open, read, edit) the transcribed text.
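A hedged sketch of steps 670 through 674 is shown below. The per-language model name and the transcribe_with_model stub are placeholders for a real speech-to-text backend, which this document does not specify.

```python
def transcribe_with_model(model_name, audio):
    """Placeholder for a real ASR backend; raises until one is wired in."""
    raise NotImplementedError

def transcribe_and_store(item_id, audio, language, store):
    model_name = f"speech-to-text-{language.lower()}"  # hypothetical per-language model
    text = transcribe_with_model(model_name, audio)    # (670) transcribe the audio
    store[item_id] = {"language": language,            # (672) associate with the item
                      "transcription": text}           # (674) store for user access
```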
As described above, in some embodiments, metadata corresponding to a media content item 124 may include contradictory information regarding language identification or language determination. For example, the title and description of a media content item may be in one language while the audio is in another.
As described above, in some embodiments, metadata corresponding to a media content item 124 may include incorrect or mislabeled information. For example, the metadata of a podcast may specify a language that differs from the language of the podcast's audio.
Similarly, the metadata of an audiobook may include incorrect or mislabeled language information, such as a language indicator that does not match the language of the audiobook's audio.
The graphical user interfaces 700-1 through 700-4 are displayed on an electronic device, such as a computer, a smart phone, tablet, etc. In some embodiments, the graphical user interfaces 700-1 through 700-4 are displayed as part of an application (such as an application on a phone, tablet, smart device, or a desktop application). In some other embodiments, the graphical user interfaces 700-1 through 700-4 are displayed as part of a web application that is launched in a web browser.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the described embodiments, with various modifications as are suited to the particular use contemplated.