The accompanying drawings illustrate a number of exemplary embodiments and are a part of the specification. Together with the following description, these drawings demonstrate and explain various principles of the instant disclosure.
Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the exemplary embodiments described herein are susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. However, the exemplary embodiments described herein are not intended to be limited to the particular forms disclosed. Rather, the instant disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.
Features from any of the embodiments described herein may be used in combination with one another in accordance with the general principles described herein. These and other embodiments, features, and advantages will be more fully understood upon reading the following detailed description in conjunction with the accompanying drawings and claims.
The present disclosure is generally directed to systems and methods for identifying engaging video content that can be served as an audio-only or audio-primary experience. For example, a video where a relationship expert gives advice to viewers may be engaging to listeners in an audio-only format (e.g., as a podcast) while a video where an automotive expert demonstrates how to install a headlight may not be engaging or valuable in an audio-only or audio-primary format. In some cases, a platform may have a rich library of video content (e.g., videos uploaded by users to a social media platform) that may include potential candidate videos for audio-only or audio-primary experiences related to and/or hosted by the platform. In some examples, the systems described herein may analyze videos in the library and identify candidate videos for an audio-primary experience via machine learning. In one example, a machine learning algorithm may initially use heuristics such as quantity of speech in a video, level of visual complexity, and/or topic. Over time, the machine learning algorithm may be trained to identify additional characteristics that are indicative of an engaging audio-primary experience. In one embodiment, the systems described herein may automatically edit videos to be more engaging, for example by removing lengthy pauses where no audio occurs. In some embodiments, the systems described herein may use user engagement metrics to determine whether a video was successfully identified as engaging for an audio-only experience.
In some embodiments, the systems described herein may improve the functioning of a computing device by enabling the computing device to identify candidate videos for audio-primary experiences. In one embodiment, the systems described herein may improve the functioning of a computing device by providing the computing device with videos suitable for an audio-primary experience in an audio and/or video player of the computing device. Additionally, the systems described herein may improve the fields of streaming video and/or streaming audio by automatically identifying candidate videos for audio-primary experiences, increasing the amount of content available for streaming audio services and/or streaming video services intended to function in the background.
The following will provide detailed descriptions of systems and methods for identifying candidate videos for audio experiences with reference to
In some embodiments, the systems described herein may identify candidate videos that will be presented as audio-primary experiences via interfaces on end-user devices.
Server 106 generally represents any type or form of backend computing device that may store, process, and/or analyze media files. Examples of server 106 may include, without limitation, application servers, database servers, media servers, and/or any other relevant type of server. Although illustrated as a single entity in
Computing device 102 generally represents any type or form of computing device capable of reading computer-executable instructions. For example, computing device 102 may represent an end-user computing device. Additional examples of computing device 102 may include, without limitation, a laptop, a desktop, a tablet, a smart television, a smartphone, a wearable device, a smart device, an embedded device (e.g., a media player in a vehicle), an artificial reality device, a personal digital assistant (PDA), etc.
Video 114 generally represents any type or form of digital media that includes non-static visual content as well as audio content. In some examples, video 114 may be a live-action video (as opposed to, e.g., an animated video created digitally). In some embodiments, video 114 may be a video created and/or uploaded by a user of a platform, such as a user of a social media platform. In some examples, video 114 may have various attributes that are manually assigned by the creator and/or detected automatically, such as the topic of video 114 and/or tags applied to video 114.
Audio content 116 generally refers to one or more audio tracks of a video, such as video 114. In some embodiments, audio content 116 may be stored as part of a digital file that represents video 114. Additionally or alternatively, audio content 116 may be stored as a separate file from visual content and/or other content of video 114.
Interface 118 generally represents any type or form of user interface and/or media player capable of presenting audio and/or video to a user. In some embodiments, interface 118 may be a video player that is capable of presenting videos for a standard video experience (e.g., where a user watches the video while listening to the audio) and/or an audio-primary user experience (e.g., where a user listens to the audio without continuously watching the video). Additionally or alternatively, interface 118 may be an audio player that presents audio but does not present video. In some embodiments, interface 118 may be a specialized interface designed to present video for audio-primary experiences (e.g., in the background of other applications).
Machine learning algorithm 120 generally represents any type or form of machine learning algorithm, model, and/or classification system. In one example, machine learning algorithm 120 may include a neural network. In some embodiments, machine learning algorithm 120 may be trained on a set of labeled data (e.g., videos labeled as suitable or not suitable for an audio-primary experience) before being used to classify unlabeled data. Additionally or alternatively, machine learning algorithm 120 may be pre-configured with heuristics with which to classify videos. In some embodiments, machine learning algorithm 120 may be trained during use via feedback about the accuracy of classifications performed by machine learning algorithm 120. In some embodiments, the systems described herein may supplement machine learning algorithm 120 with online learning (e.g., learning based on user engagement metrics).
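By way of illustration only, the following sketch shows one way a classifier could be trained on labeled videos and then updated during use. The disclosure does not specify a model; a small logistic classifier is substituted here for simplicity, and the two features (fraction of speech and a normalized visual-complexity value), the class name `OnlineLogisticClassifier`, and the training data are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticClassifier:
    """Hypothetical stand-in for machine learning algorithm 120."""

    def __init__(self, n_features: int, learning_rate: float = 0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return sigmoid(z)

    def update(self, features, label):
        # One stochastic-gradient step on the logistic loss. The same call
        # can serve for later feedback (e.g., engagement-derived labels).
        error = self.predict_proba(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error

    def fit(self, examples, epochs: int = 200):
        for _ in range(epochs):
            for features, label in examples:
                self.update(features, label)


# Made-up labeled data: (speech_fraction, visual_complexity) -> suitable?
labeled = [
    ((0.9, 0.2), 1),
    ((0.8, 0.3), 1),
    ((0.1, 0.9), 0),
    ((0.2, 0.8), 0),
]

clf = OnlineLogisticClassifier(n_features=2)
clf.fit(labeled)
```

In this sketch, initial training and later online learning share the `update` method, so a label derived from user engagement could be applied to an already-trained model without retraining from scratch.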
As illustrated in
As illustrated in
The term “audio-primary user experience” or “audio-primary experience” may generally refer to any interaction with audio content of a video in which presenting the audio content of the video to the user is the primary function of the interface, in place of or secondary to presenting visual content of the video. For example, the systems described herein may facilitate an audio-primary user experience by presenting the audio content of a video to a user via an audio player that does not present the video content (e.g., as an audio-only user experience). In another example, the systems described herein may facilitate an audio-primary user experience by presenting both the visual and audio content of a video in an interface that requires minimal interaction from the user, enabling the user to listen to the audio content of the video without continuously watching the visual content of the video and/or the interface presenting the video. In some examples, an interface for an audio-primary user experience may automatically play videos in a sequence, enabling a user to listen to videos while performing other activities without interrupting those activities to interact with the interface.
Identification module 108 may identify the video in a variety of ways and/or contexts. For example, identification module 108 may access a library of videos stored on and/or related to a platform (e.g., videos uploaded by users of a video hosting service, media streaming service, and/or social networking platform) and may identify videos in the library. In some embodiments, identification module 108 may identify candidate videos in a library of videos not previously categorized relative to suitability for audio-primary experiences. In other embodiments, identification module 108 may identify candidate videos in a library of videos that has been pre-screened for suitability in some way (e.g., by removing any videos with no audio, by including only videos with suitable distribution rights, etc.).
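As a minimal sketch of the pre-screening described above, the following filters a library down to videos that have audio and suitable distribution rights. The record type `VideoRecord` and its field names are hypothetical and are not drawn from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class VideoRecord:
    video_id: str
    has_audio: bool                # hypothetical field
    has_distribution_rights: bool  # hypothetical field


def prescreen(library):
    """Keep only videos that remain candidates after pre-screening."""
    return [
        video
        for video in library
        if video.has_audio and video.has_distribution_rights
    ]


library = [
    VideoRecord("a", has_audio=True, has_distribution_rights=True),
    VideoRecord("b", has_audio=False, has_distribution_rights=True),
    VideoRecord("c", has_audio=True, has_distribution_rights=False),
]
candidates = prescreen(library)
```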
At step 204, one or more of the systems described herein may determine, at least in part by analyzing the video via a machine learning algorithm, that the audio content of the video is suitable for the audio-primary user experience. For example, determination module 110 may, as part of server 106 in
Determination module 110 may determine that the audio content of the video is suitable in a variety of ways. For example, determination module 110 may determine that the audio content of the video is suitable based solely on the classification arrived at by the machine learning algorithm. In some examples, determination module 110 may apply one or more heuristics before or after analysis by the machine learning model, such as pre-emptively filtering out videos tagged with certain topics generally not suitable to audio-primary experiences (e.g., visual tutorials such as cooking, makeup, or automotive maintenance). In some embodiments, determination module 110 may incorporate manually applied tags and/or classifications by one or more analysts. For example, determination module 110 may flag a video for manual review and then collect metrics such as whether the audio content of the video was enjoyable, understandable, and/or engaging to the analyst.
In some embodiments, determination module 110 may use various heuristics to determine if a video is suitable for an audio-primary experience. In one embodiment, determination module 110 may use heuristics such as the topic or category of the video, the visual complexity of the video, and/or the amount of human speech in the audio of the video. For example, the systems described herein may have a list of topics that are generally not suitable for an audio-primary experience due to relying heavily on visual content (e.g., visual tutorials, cute animals, fashion, etc.), a list of topics that are sometimes suitable for an audio-primary experience and sometimes not (e.g., sports, theater, etc.), and/or a list of topics that are generally suitable for an audio-primary experience (e.g., relationship advice, talk shows, political commentary, etc.). In some embodiments, the systems described herein may filter first based on topic before filtering on other heuristics, such as whether the visual complexity falls below a predetermined threshold for visual complexity. The systems described herein may measure the visual complexity of a video by any appropriate method, such as the ratio of high-definition encoding to standard-definition encoding file sizes of the video. In some embodiments, the systems described herein may determine the quantity and/or percentage of human speech audible in the audio content and may only mark the video as suitable if the video meets a threshold for quantity of human speech. In some examples, the systems described herein may filter by the language of the speech, such as whether the speech is in English. Additionally or alternatively, the systems described herein may determine the quantity of music in the audio of a video and mark videos with a sufficiently high percentage of music (alone or in combination with speech) as suitable.
In some embodiments, the systems described herein may use additional information about the video, such as the title, description, tags, category, and/or publisher, to determine whether to categorize the music as background music (and therefore not count the music towards suitable audio content) or filler music (and therefore count the music towards suitable audio content).
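The heuristics described above may be sketched as a sequence of filters applied in order: topic first, then visual complexity, then speech and music content. The thresholds, the field names, and the assumption that a higher HD-to-SD file-size ratio indicates higher visual complexity are all illustrative choices rather than values taken from the disclosure. The speech and music fractions are assumed to cover non-overlapping portions of the audio.

```python
from dataclasses import dataclass

# Topic list taken from the examples in the text.
UNSUITABLE_TOPICS = {"visual tutorials", "cute animals", "fashion"}

MAX_COMPLEXITY_RATIO = 3.0  # assumed threshold on HD/SD file-size ratio
MIN_SPEECH_FRACTION = 0.6   # assumed threshold for speech content
MIN_AUDIO_FRACTION = 0.8    # assumed threshold for speech plus counted music


@dataclass
class VideoFeatures:
    topic: str
    hd_bytes: int
    sd_bytes: int
    speech_fraction: float
    music_fraction: float
    music_is_background: bool  # e.g., inferred from title, tags, publisher


def visual_complexity(features: VideoFeatures) -> float:
    """Ratio of HD to SD encoding size, used here as a complexity proxy."""
    if features.sd_bytes <= 0:
        raise ValueError("sd_bytes must be positive")
    return features.hd_bytes / features.sd_bytes


def is_suitable(features: VideoFeatures) -> bool:
    # 1. Filter on topic before applying any other heuristic.
    if features.topic in UNSUITABLE_TOPICS:
        return False
    # 2. Visual complexity must fall below the threshold.
    if visual_complexity(features) >= MAX_COMPLEXITY_RATIO:
        return False
    # 3. Enough speech alone is sufficient.
    if features.speech_fraction >= MIN_SPEECH_FRACTION:
        return True
    # 4. Otherwise count music only when it is not background music.
    counted_music = 0.0 if features.music_is_background else features.music_fraction
    return features.speech_fraction + counted_music >= MIN_AUDIO_FRACTION
```

Topics on the "sometimes suitable" list are not rejected at the first step in this sketch; they simply fall through to the remaining checks.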
In some embodiments, the systems described herein may use the above-described heuristics and/or other heuristics consecutively or concurrently (e.g., by weighting the heuristics) to determine if videos are suitable. For example, as illustrated in
In one example, the systems described herein may determine that videos 304, 306, and/or 310 are not suitable due to category, speech content, and/or complexity. For example, video 304 may portray a makeup tutorial that is difficult to follow without visual content, video 306 may be a discussion of a slideshow of dresses that has limited engagement value without being able to see the dresses, and/or video 310 may be a video of a puppy repeatedly falling over that is engaging to watch but not to listen to. In one example, the systems described herein may determine that video 308 is suitable due to the high speech content, low complexity, and placement in the “technology” category. For example, video 308 may feature a technology expert discussing the home network vulnerabilities posed by malicious toasters and other malware-infected smart devices and thus may be engaging as an audio-primary experience.
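As an alternative to applying heuristics one after another, the heuristics may be weighted and combined into a single score. The following is a sketch only: the weights, the threshold, the normalization of every input to the range 0 to 1, and the sample values for a technology talk and a puppy video are assumptions made for illustration and are not read from the drawings.

```python
# Assumed weights; they sum to 1 so the score also lies in [0, 1].
WEIGHTS = {"category": 0.4, "speech": 0.4, "simplicity": 0.2}
SUITABILITY_THRESHOLD = 0.6  # assumed cut-off


def suitability_score(category_score: float,
                      speech_fraction: float,
                      complexity: float) -> float:
    """Weighted combination of three heuristics, each normalized to [0, 1].

    Visual complexity counts against suitability, so it is inverted.
    """
    return (
        WEIGHTS["category"] * category_score
        + WEIGHTS["speech"] * speech_fraction
        + WEIGHTS["simplicity"] * (1.0 - complexity)
    )


def is_suitable(category_score: float,
                speech_fraction: float,
                complexity: float) -> bool:
    score = suitability_score(category_score, speech_fraction, complexity)
    return score >= SUITABILITY_THRESHOLD


# Hypothetical inputs: a speech-heavy technology talk and a puppy video.
tech_talk = is_suitable(category_score=1.0, speech_fraction=0.9, complexity=0.2)
puppy_clip = is_suitable(category_score=0.0, speech_fraction=0.05, complexity=0.5)
```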
At step 206, one or more of the systems described herein may present the audio content of the video to at least one user via an interface designed for the audio-primary user experience in response to determining that the audio content of the video is suitable for the audio-primary user experience. For example, presentation module 112 may, as part of server 106 in
Presentation module 112 may present the audio content in a variety of different ways. In one embodiment, presentation module 112 may present the audio content by making the audio content available for download to an end-user device which then presents the audio content via an interface. In some examples, presentation module 112 may present the audio content via an interface designed to present audio and/or video content to a user.
In some examples, presentation module 112 may present both visual and audio content, while in other examples, presentation module 112 may present only audio content and not visual content. For example, as illustrated in
In one example, the systems described herein may present video 402 as a standard video experience 404 that enables a user to watch visual content of video 402 while listening to audio content of video 402. For example, the systems described herein may present video 402 via a video player on a video streaming platform. In another example, the systems described herein may present video 402 as an audio-only experience 406 that enables a user to listen to audio content of video 402 without watching visual content of video 402. For example, the systems described herein may present video 402 via a podcast player or other audio player. In another example, the systems described herein may present video 402 via a background video experience 408 that enables a user to place the application and/or interface presenting video 402 in the background while another application is in the foreground (e.g., has focus and/or is visibly eclipsing the application and/or interface presenting video 402). In some examples, background video experience 408 may enable a user to switch between actively watching visual content and passively listening to audio content.
In one embodiment, the systems described herein may automatically play the audio and/or visual content of a new video after a previous video ends, without requiring user interaction to begin the new video. In some embodiments, the systems described herein may detect that a user has not interacted with a media presentation interface for a predetermined amount of time and/or videos (e.g., three minutes, five minutes, two videos, five videos, etc.) and may switch from presenting arbitrary videos (e.g., videos that may or may not be suitable for an audio-primary experience) to videos suitable for an audio-primary experience. Additionally or alternatively, the systems described herein may switch to presenting only videos suitable for an audio-primary experience in response to the state of the media presentation interface. For example, if the interface is in the background, the systems described herein may switch to presenting only videos suitable for an audio-primary experience. In some embodiments, the systems described herein may select suitable videos using the same algorithm used to select arbitrary videos (e.g., auto-playing sports videos if the user was watching sports videos, auto-playing relationship advice videos if the user was watching a relationship advice video, etc.).
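The switching behavior described above may be sketched as follows. The player-state fields, the limits of three minutes and two videos (taken from the examples in the text), and the idea that an existing recommender supplies a ranked queue of video identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

IDLE_SECONDS_LIMIT = 180.0  # e.g., three minutes without interaction
IDLE_VIDEOS_LIMIT = 2       # e.g., two videos without interaction


@dataclass
class PlayerState:
    seconds_since_interaction: float
    videos_since_interaction: int
    in_background: bool


def audio_primary_mode(state: PlayerState) -> bool:
    """True when only audio-suitable videos should be auto-played."""
    return (
        state.in_background
        or state.seconds_since_interaction >= IDLE_SECONDS_LIMIT
        or state.videos_since_interaction >= IDLE_VIDEOS_LIMIT
    )


def next_video(ranked_queue: Iterable[str],
               suitable_ids: Set[str],
               state: PlayerState) -> Optional[str]:
    """Pick the next video from a queue ordered by the usual recommender.

    The ranking is reused unchanged; in audio-primary mode, videos not
    marked suitable are skipped.
    """
    restrict = audio_primary_mode(state)
    for video_id in ranked_queue:
        if not restrict or video_id in suitable_ids:
            return video_id
    return None
```

Because the same ranked queue is used in both modes, a user who was watching sports videos would continue to receive sports videos, limited to those marked suitable once the interface is idle or in the background.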
In some embodiments, the systems described herein may monitor the interactions of at least one user with the video and, based on the presence and/or type of interaction, may mark the video as not suitable for the audio-primary user experience or as confirmed suitable for the audio-primary user experience. In one embodiment, the systems described herein may determine that any interaction with the interface indicates that the user is no longer passively listening to the video and thus the video is not suitable. Additionally or alternatively, the systems described herein may weight specific interactions, such as skipping the video, closing the interface, and/or choosing a different video, as negative interactions and may mark a video as not suitable if the video's score meets a threshold for negative interactions. In some embodiments, the systems described herein may compare user interactions with a video presented as part of an audio-primary experience with user interactions with the same video presented as part of a standard video experience to determine whether a video is suitable. For example, if 20% of users skip the video during a standard video experience and 22% skip the video during an audio-primary experience, the systems described herein may determine that users are interacting similarly with the video and the video is suitable for the audio-primary experience. However, if 20% of users skip the video during a standard video experience and 50% skip the video during an audio-primary experience, the systems described herein may determine that something about the video must not be engaging in an audio-primary experience and may mark the video as not suitable.
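The comparison of skip rates described above may be sketched as follows. The tolerance of ten percentage points and the interaction weights are assumptions; the text gives only the examples of 20% versus 22% (similar) and 20% versus 50% (not similar).

```python
MAX_SKIP_RATE_INCREASE = 0.10  # assumed tolerance, in absolute terms

# Assumed weights for negative interactions; unlisted events count as zero.
NEGATIVE_WEIGHTS = {"skip": 1.0, "close": 1.0, "choose_other": 0.5}


def skip_rate(skips: int, plays: int) -> float:
    if plays <= 0:
        raise ValueError("plays must be positive")
    return skips / plays


def confirmed_suitable(standard_skip_rate: float,
                       audio_primary_skip_rate: float,
                       tolerance: float = MAX_SKIP_RATE_INCREASE) -> bool:
    """Suitable if users skip about as often in the audio-primary experience."""
    return audio_primary_skip_rate - standard_skip_rate <= tolerance


def negative_score(events) -> float:
    """Sum of weighted negative interactions for one video."""
    return sum(NEGATIVE_WEIGHTS.get(event, 0.0) for event in events)


similar = confirmed_suitable(skip_rate(20, 100), skip_rate(22, 100))
dissimilar = confirmed_suitable(skip_rate(20, 100), skip_rate(50, 100))
```

Comparing against the same video's standard-experience skip rate, rather than against a fixed limit, keeps a video that is frequently skipped in any format from being attributed to the audio-primary presentation.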
In some embodiments, the systems described herein may improve the machine learning model over time based at least in part on interactions of users. For example, as illustrated in
In some embodiments, in addition to categorizing videos as suitable or not suitable, the systems described herein may categorize videos as suitable pending editing. For example, the systems described herein may detect that a percentage of the audio content that is suitable for the audio-primary user experience exceeds a minimum threshold for suitable audio but that another percentage of the audio content is not suitable and, in response, the systems described herein may categorize the video as suitable pending editing. In one example, audio content of a video may mostly consist of human speech but may have one or more periods of silence, static, white noise, and/or other non-speech background noise. In this example, the human speech may be audio content that is suitable and the silence or background noise may be audio content that is not suitable. In some embodiments, the systems described herein may flag the video for manual editing. Additionally or alternatively, the systems described herein may automatically edit the video to be suitable for the audio-primary user experience. For example, the systems described herein may remove the periods of silence and stitch together the remaining portions of the content, resulting in shorter content that is entirely suitable. In some embodiments, the systems described herein may substantially edit a video. For example, the systems described herein may split a longer video into multiple shorter videos that are suitable for an audio-primary experience. In one example, the systems described herein may cut a forty-minute-long video-blog into five two-minute-long highlight segments that are suitable for an audio-primary experience.
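The categorization and automatic editing described above may be sketched as follows, assuming that an upstream detector (not shown and hypothetical) has already labeled time ranges of the audio as speech, silence, or noise. The minimum fraction of 0.7 and the segment representation are illustrative assumptions.

```python
from dataclasses import dataclass

MIN_SUITABLE_FRACTION = 0.7  # assumed minimum threshold for suitable audio


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # "speech", "silence", "noise", ... (hypothetical labels)

    @property
    def duration(self) -> float:
        return self.end - self.start


def categorize(segments) -> str:
    total = sum(s.duration for s in segments)
    if total <= 0:
        return "not suitable"
    suitable = sum(s.duration for s in segments if s.label == "speech")
    if suitable == total:
        return "suitable"
    if suitable / total >= MIN_SUITABLE_FRACTION:
        return "suitable pending editing"
    return "not suitable"


def remove_unsuitable(segments):
    """Drop non-speech segments and stitch the rest into contiguous time."""
    edited, cursor = [], 0.0
    for segment in segments:
        if segment.label != "speech":
            continue
        edited.append(Segment(cursor, cursor + segment.duration, segment.label))
        cursor += segment.duration
    return edited


original = [
    Segment(0.0, 40.0, "speech"),
    Segment(40.0, 50.0, "silence"),
    Segment(50.0, 100.0, "speech"),
]
edited = remove_unsuitable(original)
```

In this example the original audio is 90% speech, so it is categorized as suitable pending editing; after the ten-second silence is removed, the stitched result is 90 seconds long and categorized as suitable.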
As described above, the systems and methods described herein may automatically select videos within a pre-existing video library that are suitable for an audio-primary user experience. In some examples, a media streaming service, social media platform, or other organization may have access to a large library of videos, some of which are only engaging when presented with visual content and others of which are engaging when a user passively listens to the audio content while occasionally glancing at or even entirely ignoring the visual content. For example, users may listen to videos in the background while exercising, driving, cooking, or performing other activities. By automatically identifying suitable videos for an audio-primary experience, the systems described herein may populate audio-primary services and/or interfaces with a rich media library. In some examples, the systems described herein may improve the user experience of users listening to videos in the background by providing the users with videos with engaging audio content.
Example 1: A computer-implemented method for identifying candidate videos for audio experiences may include (i) identifying a video with audio content that is a candidate for an audio-primary user experience that enables users to consume the video by listening to the audio content without watching visual content of the video, (ii) determining, at least in part by analyzing the video via a machine learning algorithm, that the audio content of the video is suitable for the audio-primary user experience, and (iii) presenting the audio content of the video to at least one user via an interface designed for the audio-primary user experience in response to determining that the audio content of the video is suitable for the audio-primary user experience.
Example 2: The computer-implemented method of example 1, where identifying the video with the audio content that is the candidate for the audio-primary user experience includes selecting the video from a library of user-uploaded videos on a platform that hosts the user-uploaded videos.
Example 3: The computer-implemented method of examples 1-2, where determining that the audio content of the video is suitable for the audio-primary user experience includes determining that the audio content includes an amount of human speech that meets a threshold for speech content.
Example 4: The computer-implemented method of examples 1-3, where determining that the audio content of the video is suitable for the audio-primary user experience includes determining that visual content of the video falls below a predetermined threshold for visual complexity.
Example 5: The computer-implemented method of examples 1-4, where determining that the audio content of the video is suitable for the audio-primary user experience includes identifying a category of the video and determining that the category of the video is suitable for the audio-primary user experience.
Example 6: The computer-implemented method of examples 1-5, where determining that the audio content of the video is suitable for the audio-primary user experience includes flagging the video for manual review.
Example 7: The computer-implemented method of examples 1-6, where the interface designed for the audio-primary user experience includes an audio player that presents the audio content of the video without visual content of the video.
Example 8: The computer-implemented method of examples 1-7, where the interface designed for the audio-primary user experience includes a background application configured to present the audio content of the video while the background application is not in the foreground of a user interface for a device.
Example 9: The computer-implemented method of examples 1-8, where presenting the audio content of the video to the at least one user includes monitoring interactions of the at least one user with the video to confirm that the audio content of the video is suitable for the audio-primary user experience.
Example 10: The computer-implemented method of examples 1-9 may further include detecting that the at least one user has performed an interaction with the video via the interface and, in response to detecting the interaction, marking the video as not suitable for the audio-primary user experience.
Example 11: The computer-implemented method of examples 1-10, where determining that the audio content of the video is suitable for the audio-primary user experience includes detecting that a percentage of the audio content that is suitable for the audio-primary user experience exceeds a minimum threshold for suitable audio but that another percentage of the audio content is not suitable and categorizing the video as suitable pending editing.
Example 12: The computer-implemented method of examples 1-11 may further include, in response to categorizing the video as suitable pending editing, automatically editing the video to be suitable for the audio-primary user experience.
Example 13: The computer-implemented method of examples 1-12, where categorizing the video as suitable pending editing includes detecting at least one period of silence within the audio content and further including automatically editing the video to remove a portion of the video comprising the at least one period of silence.
Example 14: A system for identifying candidate videos for audio experiences may include at least one physical processor and physical memory including computer-executable instructions that, when executed by the physical processor, cause the physical processor to (i) identify a video with audio content that is a candidate for an audio-primary user experience that enables users to consume the video by listening to the audio content without watching visual content of the video, (ii) determine, at least in part by analyzing the video via a machine learning algorithm, that the audio content of the video is suitable for the audio-primary user experience, and (iii) present the audio content of the video to at least one user via an interface designed for the audio-primary user experience in response to determining that the audio content of the video is suitable for the audio-primary user experience.
Example 15: The system of example 14, where identifying the video with the audio content that is the candidate for the audio-primary user experience includes selecting the video from a library of user-uploaded videos on a platform that hosts the user-uploaded videos.
Example 16: The system of examples 14-15, where determining that the audio content of the video is suitable for the audio-primary user experience includes determining that the audio content includes an amount of human speech that meets a threshold for speech content.
Example 17: The system of examples 14-16, where determining that the audio content of the video is suitable for the audio-primary user experience includes determining that visual content of the video falls below a predetermined threshold for visual complexity.
Example 18: The system of examples 14-17, where determining that the audio content of the video is suitable for the audio-primary user experience includes identifying a category of the video and determining that the category of the video is suitable for the audio-primary user experience.
Example 19: The system of examples 14-18, where determining that the audio content of the video is suitable for the audio-primary user experience includes flagging the video for manual review.
Example 20: A non-transitory computer-readable medium may include one or more computer-readable instructions that, when executed by at least one processor of a computing device, cause the computing device to (i) identify a video with audio content that is a candidate for an audio-primary user experience that enables users to consume the video by listening to the audio content without watching visual content of the video, (ii) determine, at least in part by analyzing the video via a machine learning algorithm, that the audio content of the video is suitable for the audio-primary user experience, and (iii) present the audio content of the video to at least one user via an interface designed for the audio-primary user experience in response to determining that the audio content of the video is suitable for the audio-primary user experience.
As detailed above, the computing devices and systems described and/or illustrated herein broadly represent any type or form of computing device or system capable of executing computer-readable instructions, such as those contained within the modules described herein. In their most basic configuration, these computing device(s) may each include at least one memory device and at least one physical processor.
In some examples, the term “memory device” generally refers to any type or form of volatile or non-volatile storage device or medium capable of storing data and/or computer-readable instructions. In one example, a memory device may store, load, and/or maintain one or more of the modules described herein. Examples of memory devices include, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Hard Disk Drives (HDDs), Solid-State Drives (SSDs), optical disk drives, caches, variations or combinations of one or more of the same, or any other suitable storage memory.
In some examples, the term “physical processor” generally refers to any type or form of hardware-implemented processing unit capable of interpreting and/or executing computer-readable instructions. In one example, a physical processor may access and/or modify one or more modules stored in the above-described memory device. Examples of physical processors include, without limitation, microprocessors, microcontrollers, Central Processing Units (CPUs), Field-Programmable Gate Arrays (FPGAs) that implement softcore processors, Application-Specific Integrated Circuits (ASICs), portions of one or more of the same, variations or combinations of one or more of the same, or any other suitable physical processor.
Although illustrated as separate elements, the modules described and/or illustrated herein may represent portions of a single module or application. In addition, in certain embodiments one or more of these modules may represent one or more software applications or programs that, when executed by a computing device, may cause the computing device to perform one or more tasks. For example, one or more of the modules described and/or illustrated herein may represent modules stored and configured to run on one or more of the computing devices or systems described and/or illustrated herein. One or more of these modules may also represent all or portions of one or more special-purpose computers configured to perform one or more tasks.
In addition, one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another. For example, one or more of the modules recited herein may receive image data to be transformed, transform the image data into a data structure that stores user characteristic data, output a result of the transformation to select a customized interactive ice breaker widget relevant to the user, use the result of the transformation to present the widget to the user, and store the result of the transformation to create a record of the presented widget. Additionally or alternatively, one or more of the modules recited herein may transform a processor, volatile memory, non-volatile memory, and/or any other portion of a physical computing device from one form to another by executing on the computing device, storing data on the computing device, and/or otherwise interacting with the computing device.
In some embodiments, the term “computer-readable medium” generally refers to any form of device, carrier, or medium capable of storing or carrying computer-readable instructions. Examples of computer-readable media include, without limitation, transmission-type media, such as carrier waves, and non-transitory-type media, such as magnetic-storage media (e.g., hard disk drives, tape drives, and floppy disks), optical-storage media (e.g., Compact Disks (CDs), Digital Video Disks (DVDs), and BLU-RAY disks), electronic-storage media (e.g., solid-state drives and flash media), and other distribution systems.
The process parameters and sequence of the steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed. The various exemplary methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or include additional steps in addition to those disclosed.
The preceding description has been provided to enable others skilled in the art to best utilize various aspects of the exemplary embodiments disclosed herein. This exemplary description is not intended to be exhaustive or to be limited to any precise form disclosed. Many modifications and variations are possible without departing from the spirit and scope of the instant disclosure. The embodiments disclosed herein should be considered in all respects illustrative and not restrictive. Reference should be made to the appended claims and their equivalents in determining the scope of the instant disclosure.
Unless otherwise noted, the terms “connected to” and “coupled to” (and their derivatives), as used in the specification and claims, are to be construed as permitting both direct and indirect (i.e., via other elements or components) connection. In addition, the terms “a” or “an,” as used in the specification and claims, are to be construed as meaning “at least one of.” Finally, for ease of use, the terms “including” and “having” (and their derivatives), as used in the specification and claims, are interchangeable with and have the same meaning as the word “comprising.”