A video analysis system manages and analyzes a data store of videos. The video analysis system may perform various tasks, including query retrieval, analysis of the videos, and the like. Typically, a user of the video analysis system submits a query requesting identification and retrieval of videos related to the query. For example, a user may submit a query to the video analysis system of “a guy in a red shirt playing tennis in a court” to request videos that include a man in a red shirt playing tennis in a court. The video analysis system performs a relevance analysis and identifies videos that include segments related to the query, for example, videos that include a man in a red shirt playing tennis in a court.
In one instance, the video analysis system performs the relevance analysis by applying one or more machine-learned video retrieval models to the query and the stored videos. Because the video retrieval model is a computer model, the retrieval may include errors. In particular, the video retrieval model may have discrepancies in accuracy when, for example, the content and complexity of the user queries vary. Thus, there is a need to update the video retrieval model to take into account the content and complexity of queries when performing the relevance analysis, such that the video analysis system can retrieve videos with higher accuracy.
The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
The video analysis system 130 is a system for providing various types of services related to videos to users of client devices 116A, 116B. For example, the video analysis system 130 may manage a collection of videos, where users of the video analysis system 130 can view, retrieve, and share videos managed by the video analysis system 130. The video analysis system 130 may have access to a significant number of videos that, when analyzed, provide the video analysis system 130 with insight. For example, the video analysis system 130 may have access to a significant number of reviews by online users that, when analyzed, provide the video analysis system 130 with insight into preferences of users.
The video analysis system 130 may receive requests from the client devices 116 to perform various types of tasks on the videos of the video analysis system 130. For example, the video analysis system 130 may perform sentiment classification on video reviews to identify positive or negative reviews for a video. Based on the classification results, the video analysis system 130 may also recommend videos to users associated with positive reviews. As another example, the video analysis system 130 may provide and present statistics on the videos.
In one embodiment, the video analysis system 130 receives one or more requests from users of client devices 116A, 116B. A request may include a query used by the video analysis system 130 to retrieve videos related to the query. For example, a user may submit a query to the video analysis system 130 to identify one or more videos of “a man in a red shirt playing tennis in a court.” The video analysis system 130 performs relevance analysis to identify videos that are determined to include a man in a red shirt playing tennis in a court. In one instance, the query may include text data, image data, video data, audio data, and the like. The video analysis system 130 may present the identified videos or particular segments of the videos that are relevant to the query.
In one embodiment, the video analysis system 130 performs the relevance analysis by applying one or more machine-learned video retrieval models to the query and the videos managed by the video analysis system 130. Responsive to receiving a user query, the video analysis system 130 obtains one or more video segments from videos. A segment may be a portion or snippet of a video. In one instance, the video retrieval model is configured to receive a pair of the user query and a respective video segment and generate a relevance score for the pair that indicates how relevant the video segment is to the user query. However, it is appreciated that in other instances, the video retrieval model may generate relevance scores for multiple video segments simultaneously with one forward pass of the model, and/or the video retrieval model may include one or more models that are each used to generate the relevance score for a respective video segment.
In one instance, a high relevance score indicates that the video segment should be retrieved for the query with a high likelihood, and a low relevance score indicates that the video segment should likely not be retrieved for the query. For example, a high relevance score indicates that the video retrieval model identified the video segment as being highly relevant to the user query. For example, the video segment may include entities (e.g., objects, persons, scenery) specified in the query, characteristics described in the query, and the like. In one instance, as described in more detail below, the video retrieval model is configured as a machine-learned model, including neural networks, embedding models, transformer-based architectures, and the like that are capable of analyzing the query and a respective video segment to determine their relevance with one another.
In one embodiment, the video analysis system 130 identifies a set of video segments determined to be relevant to the user query by applying a predetermined threshold to the relevance scores of the video segments. In such an instance, video segments that have predicted relevance scores above the predetermined threshold are determined to be relevant and are provided as a response to the user. For example, a query may request retrieval of videos having video segments with “a car and a bicycle on the road.” When the predicted relevance scores from the video retrieval model fall within the range [0, 1], the predetermined threshold may be set to 0.3. The video analysis system 130 obtains segments of videos. For a pair including the query and a respective video segment, the video analysis system 130 applies the machine-learned video retrieval model to generate a relevance score for the pair that indicates how relevant the video segment is to the user query.
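By way of illustration, the following is a minimal sketch of the scoring and fixed-threshold filtering described above. The scoring callable, segment fields, and example values are hypothetical stand-ins for the machine-learned video retrieval model and are not a prescribed implementation.

```python
from typing import Callable, Dict, List

def retrieve_with_fixed_threshold(
    score_fn: Callable[[str, Dict], float],
    query: str,
    segments: List[Dict],
    threshold: float = 0.3,
) -> List[Dict]:
    """Score each (query, segment) pair and keep segments whose relevance
    score meets or exceeds a predetermined threshold. `score_fn` stands in
    for the video retrieval model and is assumed to return a score in [0, 1]."""
    relevant = []
    for segment in segments:
        score = score_fn(query, segment)
        if score >= threshold:
            relevant.append({**segment, "relevance_score": score})
    return relevant

# Hypothetical usage with a placeholder scoring function.
def toy_score(query: str, segment: Dict) -> float:
    return segment["toy_score"]  # stand-in for a real model forward pass

segments = [
    {"video_id": "v1", "start_s": 0.0, "end_s": 8.0, "toy_score": 0.41},
    {"video_id": "v2", "start_s": 12.0, "end_s": 20.0, "toy_score": 0.22},
]
print(retrieve_with_fixed_threshold(toy_score, "a car and a bicycle on the road", segments))
```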
However, since the video retrieval model is machine-learned, errors may occur depending on the query and the predetermined threshold value set for selecting the video segments that are relevant to the query. In particular, the retrieval accuracy of the video retrieval model may vary depending on the content and complexity of the user queries. For example, when the complexity of the query is relatively high (e.g., many objects or characteristics specified in the query), video segments that are relevant may be associated with higher relevance scores compared to a query that is relatively simple (e.g., few objects or characteristics specified in the query), because a higher degree of precision is required for the query.
Therefore, when the predetermined threshold is a relatively high value, the video analysis system 130 may identify a relatively accurate group of video segments for complex queries, but a relatively inaccurate group of video segments for simpler queries, because the predetermined threshold is too high to capture all of the video segments that are relevant to the simpler query. Thus, there is a need to update the filtering threshold to take into account the content and complexity of queries when performing the relevance analysis, such that the video analysis system 130 can retrieve videos with higher accuracy.
Thus, in one embodiment, the video analysis system 130 trains and uses a machine-learned adaptive thresholding model configured to receive a query (e.g., text, image, video), and generate a predicted threshold value for a corresponding video retrieval model that indicates what value the filtering threshold should be for the query. The video analysis system 130 filters video segments for the query that are associated with relevance scores above the predicted threshold generated by the adaptive thresholding model. In one instance, the adaptive thresholding model is configured as a machine-learned model, including neural networks, embedding models, transformer-based architectures, and the like that are capable of generating a predicted threshold value given a query that includes, for example, text, images, videos, audio, or any other appropriate data modality.
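As a minimal sketch of how the predicted threshold could replace a fixed cutoff, the snippet below filters segments using a per-query threshold; the `predict_threshold_fn` callable is an assumed stand-in for the adaptive thresholding model, not a defined API.

```python
from typing import Callable, Dict, List

def retrieve_with_adaptive_threshold(
    score_fn: Callable[[str, Dict], float],
    predict_threshold_fn: Callable[[str], float],
    query: str,
    segments: List[Dict],
) -> List[Dict]:
    """Filter segments using a per-query threshold predicted by the adaptive
    thresholding model instead of a predetermined constant."""
    threshold = predict_threshold_fn(query)  # e.g., 0.18 for a simple query
    results = []
    for segment in segments:
        score = score_fn(query, segment)
        if score >= threshold:
            results.append({**segment, "relevance_score": score})
    return results
```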
Specifically, the video analysis system 130 receives a user query of “a can of soda on a table” requesting retrieval of videos or segments of videos that include a can of soda on a table. The video analysis system 130 obtains a set of video segments from videos. The video analysis system 130 applies a machine-learned video retrieval model 210 to each pair of the user query and a respective video segment to generate a relevance score for the video segment. As illustrated in
The video analysis system 130 also applies the adaptive thresholding model 220 to the query to generate a predicted threshold for the query. The predicted threshold value of 0.18 indicates that video segments having relevance scores equal to or above 0.18 should be filtered as being relevant to the user query. The video analysis system 130 filters video segments that are associated with relevance scores equal to or above the predicted threshold value of 0.18. All relevant Video Segments 1 through 4 are filtered in this process, resulting in an accurate relevance analysis. In contrast, if the threshold were a predetermined value of 0.3, the filtering process would result in only Video Segment 1 being determined as relevant, and would miss Video Segments 2 through 4 that are associated with lower relevance scores, resulting in increased error where relevant video segments are missed (“false negatives”).
On the other hand, for another query with relatively high complexity, the ideal threshold value may be higher (e.g., 0.42) than the predetermined threshold value of 0.3 to detect a high proportion of the video segments that are relevant to the query. In such an instance, the adaptive thresholding model 220 may generate a predicted threshold value of 0.42 to determine the video segments relevant to the query. If the video segments were filtered using the predetermined threshold value of 0.3, the filtering process would determine more video segments to be relevant than actually are, resulting in increased error where irrelevant video segments are filtered along with relevant video segments (“false positives”).
Thus, by training and deploying the machine-learned adaptive thresholding model, the video analysis system 130 may adaptively and dynamically adjust the filtering threshold for selecting relevant videos, enabling the video analysis system 130 to take into account the content and complexity of queries when determining which video segments are relevant. This allows the video analysis system 130 to reduce errors that may occur when the filtering threshold is kept at a static value regardless of the change in complexity and precision of the query.
Returning to the system environment 100 of
In one embodiment, a client device 116 executes an application allowing a user of the client device 116 to interact with the video analysis system 130. For example, a client device 116 executes a browser application to enable interaction between the client device 116 and the video analysis system 130 via the network 120. In another embodiment, the client device 116 interacts with the video analysis system 130 through an application programming interface (API) running on a native operating system of the client device 116, such as IOS® or ANDROID™.
The client device 116 allows users to perform various actions on the video analysis system 130. In one embodiment, users of client devices 116 can submit queries to the video analysis system 130 to retrieve video segments or videos that are relevant to the user query. For example, the client device 116 may generate an interface for the user to submit the query, and responsive to receiving the query from the user, provide the query to the video analysis system 130. In addition, responsive to receiving a response from the video analysis system 130 including a set of video segments or videos including those segments that are determined to be relevant to the query, the interface may present this information to the user.
The video store 360 is a storage system that includes a collection of videos. Each video may include a sequence of frames or images and encoding or decoding information for the video. In one embodiment, the video store 360 may be a local storage, but in other embodiments, may be stored on servers on a datacenter or cloud object storage associated with a virtual datacenter. In one embodiment, the videos of the video store 360 are indexed, such that features like title, description, a thumbnail image, a link to play the video, and metadata are stored in association with the video. Moreover, the indexing may mark one or more segments of the video that are each a collection of frames dedicated to, for example, the same scene, subject matter, and the like. Moreover, the videos may be stored based on the source of the videos across one or more datasets. As an example, a first collection of videos obtained from a video platform may be organized as Dataset A, a second collection of videos obtained from a website may be organized as Dataset B, and so on.
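Purely as an illustration of such an index entry, one indexed video might be represented as follows; all field names and values are assumptions rather than a prescribed schema.

```python
video_index_entry = {
    "video_id": "a1b2c3",
    "dataset": "Dataset A",  # source-based grouping of videos
    "title": "Weekend tennis match",
    "description": "Casual tennis game at a public court.",
    "thumbnail_url": "https://example.com/thumbs/a1b2c3.jpg",
    "playback_url": "https://example.com/play/a1b2c3",
    "metadata": {"duration_s": 312, "resolution": "1920x1080"},
    # Marked segments, each a collection of frames dedicated to one scene or subject.
    "segments": [
        {"segment_id": 0, "start_s": 0.0, "end_s": 45.5, "label": "warm-up"},
        {"segment_id": 1, "start_s": 45.5, "end_s": 210.0, "label": "first set"},
    ],
}
```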
The data management module 320 manages the training data store 365 for storing training datasets used to train the video retrieval model and/or the adaptive thresholding model described in conjunction with
Moreover, the data management module 320 may encode the query, the video segments, and the corresponding labels into a numerical form (e.g., a numerical vector) that can be processed by the video retrieval model. For example, the query, whether it is in the form of image, text, or video, may be mapped to a numerical embedding that represents the query as a vector in the latent space. As another example, a video segment may be mapped to a numerical embedding that represents the video segment in the latent space.
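The following is a hedged sketch of this encoding step: a text query and a video segment are each mapped to a fixed-length vector. The hashing-based text featurizer and mean-pooled frame features are toy stand-ins for whatever learned embedding models the system actually uses.

```python
import hashlib
import numpy as np

EMBED_DIM = 64  # illustrative latent dimensionality

def embed_text(text: str, dim: int = EMBED_DIM) -> np.ndarray:
    """Toy deterministic text embedding: hash tokens into a fixed-length,
    L2-normalized vector. A real system would use a learned text encoder."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def embed_segment(frame_features: np.ndarray) -> np.ndarray:
    """Toy segment embedding: mean-pool per-frame feature vectors of shape
    (num_frames, EMBED_DIM), then L2-normalize."""
    pooled = frame_features.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```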
In one embodiment, the data management module 320 further augments the training dataset by extracting augmented queries from the existing queries in the dataset and associating the augmented queries with video segments that are relevant to the original existing query. In particular, an augmented query derived from an existing query describes a subset of the objects, persons, or entities that can be inferred from the existing query. For example, the text for a query TA1 may be “A person in a mascot uniform trying to maneuver a bouncy house.” From this query, the data management module 320 or a human operator can infer example augmented queries TA1,aug1=“person,” because a person would be included in video segments that are relevant to query TA1; TA1,aug2=“person in mascot uniform,” because a person in a mascot uniform would be included in video segments that are relevant to query TA1; and TA1,aug3=“tall person,” because a tall person would most likely maneuver the bouncy house. For an example relevant video segment VA1,1 that is relevant to query TA1 (and therefore RA1,1=“1”), the data management module 320 may augment the training dataset by creating training instances (TA1,aug1, VA1,1, RA1,1), (TA1,aug2, VA1,1, RA1,1), and (TA1,aug3, VA1,1, RA1,1).
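The sketch below shows how augmented training instances of the form (augmented query, video segment, label) could be assembled from an existing instance. The augmented sub-queries are supplied as a list, since the passage above leaves open whether they are inferred automatically or by a human operator; the identifiers are hypothetical.

```python
from typing import List, Tuple

# A training instance is (query_text, segment_id, relevance_label).
TrainingInstance = Tuple[str, str, int]

def augment_instances(
    original: TrainingInstance,
    augmented_queries: List[str],
) -> List[TrainingInstance]:
    """Pair each augmented query with the segment and label of the original
    instance, mirroring the (T_aug, V, R) construction described above."""
    _query, segment_id, label = original
    return [(aug_query, segment_id, label) for aug_query in augmented_queries]

# Hypothetical usage based on the mascot-uniform example.
original = ("A person in a mascot uniform trying to maneuver a bouncy house", "V_A1_1", 1)
augmented = augment_instances(
    original, ["person", "person in mascot uniform", "tall person"]
)
```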
The data management module 320 may provide the training dataset for the video retrieval model to the training module 330, such that the video retrieval model is trained based on the training dataset including any augmented training instances. In one embodiment, with respect to the adaptive thresholding model, the training dataset also includes one or more training instances. A training instance for the adaptive thresholding model includes a respective query and a relevance score for the query generated by the trained video retrieval model when applied to the query and a respective video segment determined to be relevant for the query. For example, the training instances from Dataset A may be represented as (TAk, R′(TAk, VAk,j)), j∈S where R′(TAk, VAk,j) denotes the relevance score generated for the k-th query TAk and a video segment VAk,j where j denotes a segment in the set S of video segments that are known to be relevant to the query. The training dataset for the adaptive thresholding model may also include instances generated from augmented queries.
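A minimal sketch of assembling such training instances is given below: for each query, an arbitrary scoring callable standing in for the trained video retrieval model is applied to the segments already known to be relevant, and the resulting scores are stored alongside the query. The data layout is an assumption for illustration.

```python
from typing import Callable, Dict, List, Tuple

def build_threshold_training_set(
    score_fn: Callable[[str, str], float],
    relevant_segments_by_query: Dict[str, List[str]],
) -> List[Tuple[str, List[float]]]:
    """For each query T_k, compute R'(T_k, V_kj) for every segment V_kj known
    to be relevant, yielding (query, [relevance scores]) training instances."""
    dataset = []
    for query, segment_ids in relevant_segments_by_query.items():
        scores = [score_fn(query, segment_id) for segment_id in segment_ids]
        dataset.append((query, scores))
    return dataset
```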
The training module 330 trains one or more machine-learned models for the video analysis system 130. In one embodiment, the training module 330 trains the video retrieval model using the training dataset generated by the data management module 320. In one instance, the training module 330 trains the parameters of the video retrieval model by iteratively repeating a forward pass step and a backpropagation step. During the forward pass step, the training module 330 applies the video retrieval model to the query and video segment pair (TAk, VAk,j) for multiple instances in the training dataset to generate estimated relevance scores. The training module 330 computes a loss function indicating a difference between the estimated relevance scores and the known labels RAk,j for the training instances. The training module 330 performs the backpropagation step to update the parameters of the video retrieval model. This process is repeated until a convergence criterion is reached.
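As a generic sketch of this iterated forward-pass/backpropagation procedure, the snippet below uses PyTorch purely as an example framework (the passage above does not prescribe one) and represents the video retrieval model as a small network over concatenated query and segment embeddings; the architecture, loss, and epoch budget are all assumptions.

```python
import torch
import torch.nn as nn

class ToyRetrievalModel(nn.Module):
    """Toy stand-in for the video retrieval model: maps a (query embedding,
    segment embedding) pair to a relevance score in [0, 1]."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid()
        )

    def forward(self, query_emb, segment_emb):
        return self.net(torch.cat([query_emb, segment_emb], dim=-1)).squeeze(-1)

def train_retrieval_model(model, batches, num_epochs: int = 10, lr: float = 1e-3):
    """`batches` yields (query_emb, segment_emb, labels) tensors, where labels
    are the known relevance labels R_kj. A fixed epoch budget stands in for
    the convergence criterion described in the text."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(num_epochs):
        for query_emb, segment_emb, labels in batches:
            scores = model(query_emb, segment_emb)   # forward pass
            loss = loss_fn(scores, labels.float())   # difference from known labels
            optimizer.zero_grad()
            loss.backward()                          # backpropagation
            optimizer.step()
    return model
```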
In one embodiment, responsive to receiving a request from the data management module 320, the training module 330 generates a set of relevance scores for a query and one or more video segments that can be used to generate the training dataset for training the adaptive thresholding model. For example, the training module 330 may generate a set of relevance scores by applying the trained video retrieval model to a query and one or more video segments that are known to be relevant to the query in a given dataset, as described in conjunction with the data management module 320. The training module 330 may provide the relevance scores to the data management module 320.
During the forward pass step, the training module 330 applies the adaptive thresholding model 420 to the query of each training instance to generate an estimated threshold value for the instance. The training module 330 computes a loss function 490 indicating a difference between the estimated thresholds and the relevance scores R′(TAk, VAk,j) for the training instances. The training module 330 performs the backpropagation step to update the parameters of the adaptive thresholding model 420. This process is repeated until a convergence criterion is reached. In this manner, for a given query, the adaptive thresholding model 420 is trained to predict a filtering threshold that likely corresponds to relevance scores for video segments that are determined to be relevant to the query.
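A sketch of this training step, under the same illustrative PyTorch setup, is shown below. Reading the loss function 490 as a regression of the predicted threshold toward the relevance scores of known-relevant segments is one plausible interpretation, not the definitive implementation.

```python
import torch
import torch.nn as nn

class ToyThresholdModel(nn.Module):
    """Toy adaptive thresholding model: maps a query embedding to a single
    predicted filtering threshold in [0, 1]."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, query_emb):
        return self.net(query_emb).squeeze(-1)

def train_threshold_model(model, batches, num_epochs: int = 10, lr: float = 1e-3):
    """`batches` yields (query_emb, target_score) pairs, where target_score is
    a relevance score R'(T_k, V_kj) produced by the trained retrieval model
    for a segment known to be relevant to the query."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(num_epochs):
        for query_emb, target_scores in batches:
            predicted = model(query_emb)              # forward pass
            loss = loss_fn(predicted, target_scores)  # analogue of loss function 490
            optimizer.zero_grad()
            loss.backward()                           # backpropagation
            optimizer.step()
    return model
```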
As described in conjunction with
In such an embodiment, the data management module 320 may further generate a training dataset for such an adaptive thresholding model including one or more training instances. A training instance for the adaptive thresholding model 520 includes a respective query and one or more relevance scores for the user query that each correspond to a respective performance metric. In one instance, the relevance score for a respective performance metric (e.g., a recall level of 0.7) is generated by applying the trained video retrieval model 510 to video segments of a dataset, and identifying a relevance score that results in the desired performance metric when video segments are filtered according to the relevance score.
For example, in
During the forward pass step, the training module 330 applies the adaptive thresholding model 520 to the query of each training instance to generate one or more estimated threshold values for the instance. The training module 330 computes a loss function 590 indicating a difference between the estimated thresholds and the relevance scores given by R′C1, R′C2, R′C3 for the training instances. The training module 330 performs the backpropagation step to update the parameters of the adaptive thresholding model 520. This process is repeated until a convergence criterion is reached. In this manner, for a given query, the adaptive thresholding model 520 is trained to predict one or more filtering thresholds that likely correspond to different performance metrics for video segments in a dataset.
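Extending the sketch above to multiple target performance metrics, the model below emits one threshold per configured metric; the number of metrics and their interpretation (e.g., thresholds aimed at recall levels of 0.5, 0.7, and 0.9) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ToyMultiThresholdModel(nn.Module):
    """Predicts one filtering threshold per target performance metric."""
    def __init__(self, embed_dim: int = 64, num_metrics: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, num_metrics), nn.Sigmoid()
        )

    def forward(self, query_emb):
        return self.net(query_emb)  # shape: (batch, num_metrics)

def train_multi_threshold_model(model, batches, num_epochs: int = 10, lr: float = 1e-3):
    """`batches` yields (query_emb, target_scores), where target_scores has one
    column per metric, i.e., the scores R'_C1, R'_C2, R'_C3 that achieve each
    desired performance level when used as the filtering cutoff."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(num_epochs):
        for query_emb, target_scores in batches:
            loss = loss_fn(model(query_emb), target_scores)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```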
In this manner, the adaptive thresholding model can be trained to dynamically predict threshold values for the video retrieval model that take into account the content and complexity of the user query as well as different target performance metrics. Moreover, the adaptive thresholding model herein can be trained using information that can be readily obtained from the training dataset of the video retrieval model or readily generated from the trained video retrieval model (e.g., relevance scores generated by the video retrieval model). Thus, in one embodiment, the video analysis system 130 may not have to separately obtain annotated training datasets from human operators or other sources to train the adaptive thresholding model.
Returning to
The retrieval module 335 may select and/or filter a set of video segments associated with relevance scores equal to or above the one or more predicted threshold values. In one instance, responsive to receiving the one or more predicted threshold values from the adaptive thresholding model, the retrieval module 335 may calculate a range including the value (e.g., +/−5%, +/−10%, +/−20%) and select or filter the set of video segments that have relevance scores equal to or above the lower bound of the computed range as determined to be relevant to the query. The retrieval module 335 may provide the filtered set of video segments to the client device 116 that submitted the request as a response.
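A short sketch of this range-based selection follows; the tolerance value and the choice to compare scores against the lower bound of the computed range are assumptions made to keep the example concrete.

```python
from typing import Dict, List

def filter_by_threshold_range(
    scored_segments: List[Dict],
    predicted_threshold: float,
    tolerance: float = 0.10,  # e.g., a +/-10% band around the predicted value
) -> List[Dict]:
    """Compute a range around the predicted threshold and keep segments whose
    relevance scores are equal to or above the lower bound of that range."""
    lower_bound = predicted_threshold * (1.0 - tolerance)
    return [s for s in scored_segments if s["relevance_score"] >= lower_bound]
```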
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.