The present disclosure is related to multi-step localization in videos.
Recognized events in one or more videos may be described with a flow graph.
Then, an association of the flow graph to a previously-unseen video may be referred to as flow graph to video grounding. Flow graph to video grounding is the process of recognizing a particular ordering of events in the video as a topological sort of the corresponding steps in the flow graph. A brute force approach to flow graph to video grounding is to enumerate every topological sort of the flow graph and treat each topological sort as a candidate event sequence. Each candidate event sequence is then matched with the video, and the candidate event sequence that is closest by some measure, for example alignment cost, is selected as representing the actual sequence of events in the video. Brute force search is computationally inefficient.
Determining a sequence of events in a video is a computationally-expensive task. The brute force method is an inefficient inference method.
Recognizing events in a video stream is useful. For example, a personal assistant may process a query to find an event in a video; this is an example of content addressability. Also, a personal assistant may observe an on-going process and provide information on a next step. This is an example of determining candidate next steps. Computational efficiency is needed to provide these functions.
The present application solves the problem of flow graph to video grounding by providing a graphical structure called tSort and an algorithm for aligning a video with the tSort graph.
A first algorithm is provided for obtaining the tSort graph from a flow graph. The flow graph may be referred to as a first graph and the tSort graph may be referred to as a second graph.
A second algorithm is provided for grounding a video to the tSort graph. The second algorithm, in some embodiments, uses a dynamic programming recursion.
Provided is a method including obtaining a second graph, wherein the second graph is formed from a first graph, wherein the first graph is a flow graph and the second graph encodes topological sorts of the first graph; aligning the second graph to a video, wherein the aligning obtains a start time and an end time of an event in the video, wherein the one or more events are covered by the second graph; and performing a user-assistance function based on the start time and the end time of the event.
Also provided is an apparatus including one or more processors; and one or more memories storing instructions, wherein an execution by the one or more processors of the instructions is configured to cause the apparatus to: obtain a second graph, wherein the second graph is formed from a first graph, wherein the first graph is a flow graph and the second graph encodes topological sorts of the first graph; align the second graph to a video, wherein the aligning obtains a start time and an end time of an event in the video; and perform a user-assistance function based on the start time and the end time of the event.
Also provided herein is a non-transitory computer readable medium storing instructions for execution by a computer, wherein the execution by the computer of the instructions is configured to cause the computer to obtain a second graph, wherein the second graph is formed from a first graph, wherein the first graph is a flow graph and the second graph encodes topological sorts of the first graph; align the second graph to a video, wherein the aligning obtains a start time and an end time of an event in the video; and perform a user-assistance function based on the start time and the end time of the event.
Also provided herein is a method including obtaining a second graph, wherein the second graph is formed from a first graph, wherein the first graph is a flow graph and the second graph encodes topological sorts of the first graph; aligning the second graph to a video, wherein the aligning obtains a start time and an end time of an event in the video; receiving, from a user, a query as to the event in the video; cueing, based on the query and the second graph, the video to the start time of the event in the video; and displaying the video to the user beginning at the start time of the event.
Also provided herein is a method including obtaining a second graph, wherein the second graph is formed from a first graph, wherein the first graph is a flow graph and the second graph encodes topological sorts of the first graph; aligning the second graph to a video, wherein the aligning obtains a start time and an end time of an event in the video; observing one or more recent operations of a user; receiving a query from the user as to candidate next steps; and communicating, based on the query, the second graph and the one or more recent operations of the user, one or more candidate next steps to the user.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Thus, in some embodiments, a user-assistance function includes providing of content addressability to a user by receiving, from the user, a query as to the event in the video, cueing, based on the query and the second graph, the video to the start time of the event in the video, and displaying the video to the user beginning at the start time of the event.
Some embodiments also include obtaining the second graph. The second graph is formed from the first graph as mentioned above. The first graph is a flow graph and the second graph encodes topological sorts of the first graph. The embodiments include aligning the second graph to a video. The aligning obtains a start time and an end time of an event in the video. The embodiments include receiving, from the user, a query as to an event in the video. In some embodiments, cueing is performed, based on the query and the second graph, to cue the video to the start time of the event in the video and display the video to the user beginning at the start time of the event.
Also, as shown in
System 3-20 of
Thus, the user-assistance function in some embodiments is an on-line personal assistant which observes one or more recent operations of a user, receives a query from the user as to candidate next steps, and communicates, based on the query, the second graph and the one or more recent operations of the user, the one or more candidate next steps to the user.
As shown in
Overall with respect to
Formation of the first graph and the second graph will now be described.
The first graph, which is a flow graph, may be formed based on a set of text instructions. The formulation of a flow graph is well known and will not be described further.
Forming the second graph includes defining nodes in the second graph as a two-tuple (v, P), wherein v is a node in the second graph currently being processed and P is a set of nodes that has already been processed. The second graph may be referred to as a tSort graph. In some embodiments, forming the second graph further includes defining edges in the second graph as follows: (v, P) to (w, P′) is an edge in the second graph if and only if P′ = P ∪ {v}, ancestors of w are in P, and w is not already in P, wherein P and P′ are nodes in the second graph, v is a step that has been completed, w is not in P, and “∪” is a set operator indicating union.
An algorithmic representation of Algorithm 1 (a first algorithm) for forming the second graph is given in Table 1.
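The node and edge definitions above can be illustrated with a minimal Python sketch. This is not the listing of Table 1. The function names, the virtual source node `(None, frozenset())`, and the four-step example flow graph are assumptions introduced only for illustration. The sketch also checks the parents of w against P′ = P ∪ {v}, so that a step may directly follow its own parent; that is one reading of the edge condition, not a statement of the claimed method.

```python
from collections import deque

def build_tsort_graph(nodes, flow_edges):
    """Breadth-first construction of a tSort-style graph (hypothetical helper).

    nodes: list of step ids; flow_edges: list of (parent, child) pairs.
    Returns (source, edges), where edges maps each (v, P) node to its successors.
    """
    parents = {n: set() for n in nodes}
    for u, w in flow_edges:
        parents[w].add(u)

    source = (None, frozenset())      # assumed virtual start: nothing processed
    edges = {source: []}
    queue = deque([source])
    while queue:
        v, P = queue.popleft()
        done = P if v is None else P | {v}        # P' = P U {v}
        for w in nodes:
            # w is feasible if it is not yet processed and all parents are done
            if w not in done and parents[w] <= done:
                nxt = (w, done)
                edges[(v, P)].append(nxt)
                if nxt not in edges:
                    edges[nxt] = []
                    queue.append(nxt)
    return source, edges

def source_to_sink_paths(source, edges):
    """List the step orderings spelled out by every source-to-sink path."""
    out, stack = [], [(source, [])]
    while stack:
        node, steps = stack.pop()
        if not edges[node]:
            out.append(steps)
        for nxt in edges[node]:
            stack.append((nxt, steps + [nxt[0]]))
    return out

# Hypothetical flow graph: 1 -> 2 -> 3 and 4 -> 3.
source, edges = build_tsort_graph([1, 2, 3, 4], [(1, 2), (2, 3), (4, 3)])
orderings = sorted(source_to_sink_paths(source, edges))
```

For this small example, every source-to-sink path spells out one feasible ordering, and two orderings that share a prefix set (for example 1, 4 and 4, 1) merge into the same successor node, which is where the graph saves work relative to listing every ordering separately.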
Given the second graph S (see line 13 of Table 1), graph-to-signal matching is used to match a path through the second graph to the input signal, for example, a video. This matching is also called grounding. Regarding the matching, see for example,
Table 2 describes Algorithm 2. Algorithm 2 is a modification of the Drop-DTW algorithm. The Drop-DTW algorithm is described in U.S. application Ser. No. 17/563,813 filed Dec. 28, 2021 published as US Publication No. 2022/0237043 and assigned to the same assignee as the present application. U.S. application Ser. No. 17/563,813 is incorporated by reference herein.
Table 2 shows that Algorithm 2 aligns the second graph to the video by inferring an ordering of steps in the video by applying a dynamic programming recursion. As in Table 2, the dynamic programming recursion which is Algorithm 2 includes determining a plurality of costs associated with a plurality of two-tuples, a first two-tuple of the plurality of two-tuples corresponding to a pairing of a step encoding with a clip from the video, and tracing back among the plurality of two-tuples to find a minimum cost path.
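The recursion and trace-back described above can be illustrated with a simplified sketch. It is not Algorithm 2 and not Drop-DTW: it has no mechanism for dropping clips, it assumes every clip is assigned to exactly one step, and it assumes each step covers at least one contiguous run of clips. The function name, the hand-written tSort graph for two unordered steps `"a"` and `"b"`, and the cost values are all hypothetical.

```python
import math

def align_tsort_to_clips(source, edges, cost, num_clips):
    """Simplified DP over (tSort node, clip) pairs; illustrative only.

    cost[step][t] is an assumed dissimilarity between a step and clip t.
    Returns (total_cost, [(step, first_clip, last_clip), ...]).
    """
    preds = {n: [] for n in edges}
    for n, outs in edges.items():
        for m in outs:
            preds[m].append(n)
    steps = [n for n in edges if n != source]

    D, back = {}, {}
    for t in range(num_clips):
        for n in steps:
            if t == 0:
                # only nodes directly after the source may take the first clip
                prev, best = None, (0.0 if source in preds[n] else math.inf)
            else:
                # either stay in the same node or arrive from a predecessor
                cands = [n] + [p for p in preds[n] if p != source]
                prev = min(cands, key=lambda p: D[(p, t - 1)])
                best = D[(prev, t - 1)]
            D[(n, t)] = best + cost[n[0]][t]
            back[(n, t)] = prev

    # the path must end in a sink node at the last clip
    sinks = [n for n in steps if not edges[n]]
    node = min(sinks, key=lambda n: D[(n, num_clips - 1)])
    total = D[(node, num_clips - 1)]

    # trace back to recover the step label of every clip
    labels = []
    for t in range(num_clips - 1, -1, -1):
        labels.append(node[0])
        node = back[(node, t)]
    labels.reverse()

    segments = []
    for t, s in enumerate(labels):
        if segments and segments[-1][0] == s:
            segments[-1][2] = t
        else:
            segments.append([s, t, t])
    return total, [tuple(seg) for seg in segments]

# Hypothetical tSort graph for two steps with no ordering constraint.
src = (None, frozenset())
a0, b0 = ("a", frozenset()), ("b", frozenset())
b1, a1 = ("b", frozenset({"a"})), ("a", frozenset({"b"}))
tsort_edges = {src: [a0, b0], a0: [b1], b0: [a1], b1: [], a1: []}
clip_cost = {"a": [5, 4, 1, 0], "b": [0, 1, 5, 6]}
total, segments = align_tsort_to_clips(src, tsort_edges, clip_cost, 4)
```

In this toy input the cheapest path runs step b over clips 0 to 1 and step a over clips 2 to 3, so the ordering and the start and end clip of each step come out of the same trace-back.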
An overall algorithm flow 6-10 of an embodiment is shown in
Algorithm state 3 obtains an optimal step ordering by simultaneously inferring the actual ordering of steps and localizing them in the video. The output of overall algorithm flow 6-10 is then localization results without a need for step order annotation.
A brute force approach may enumerate all possible topological sorts on the flow graph and then use sequence to sequence matching, assigning a cost to each match. The sequence with the lowest cost is then output.
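For comparison, the brute force approach can be sketched in a few lines of Python. The function names and the toy `alignment_cost` are assumptions for illustration; in practice the cost would come from sequence to sequence matching against the video, which is not modeled here.

```python
from itertools import permutations

def topological_sorts(nodes, flow_edges):
    """Yield every ordering of nodes that respects all (parent, child) edges."""
    for order in permutations(nodes):
        pos = {n: i for i, n in enumerate(order)}
        if all(pos[u] < pos[w] for u, w in flow_edges):
            yield list(order)

def brute_force_ground(nodes, flow_edges, alignment_cost):
    """Score every topological sort and return the lowest-cost one."""
    return min(topological_sorts(nodes, flow_edges), key=alignment_cost)

# Hypothetical flow graph: 1 -> 2 -> 3 and 4 -> 3.
nodes, flow_edges = [1, 2, 3, 4], [(1, 2), (2, 3), (4, 3)]
all_sorts = list(topological_sorts(nodes, flow_edges))

# Toy stand-in for an alignment cost: mismatches against an observed order.
observed = [4, 1, 2, 3]
def alignment_cost(order):
    return sum(x != y for x, y in zip(order, observed))

best = brute_force_ground(nodes, flow_edges, alignment_cost)
```

The sketch filters all permutations, so its running time grows factorially with the number of steps, which is the inefficiency the tSort graph is meant to avoid.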
In an example, a recipe for making a cooled drink including jello may include the following events shown in Table 3 and illustrated by the flow graph in
Before discussing Table 3, some background is provided here. A flow graph represents a partial ordering of events; specifically, some events must be completed before another can begin. The events that must be completed before a specific event can begin are earlier in the partial order, that is, there is a path from such an ancestor event to this specific event in the flow graph.
As an example, consider that in
A sequence of steps which is in a feasible order is called a topological sort.
The intermediate progress through the set of events according to some topological sort can be described by a set of completed steps called a front, defined to be the minimal set such that all the completed steps are either in the front or are ancestors of the front. In the above case, the front is 2 and 4.
Embodiments keep track of this front as the procedure is executed in some (a priori unknown) feasible order.
Given the front, a feasible next step is any child node of the front such that all its parents are in the front. In the example with front 2 and 4, the only feasible next step is node 3.
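The front and the feasible next steps can be computed with a short sketch. The flow graph used here (1 → 2 → 3 and 4 → 3) is hypothetical and was chosen only so that it is consistent with the front {2, 4} and next step 3 mentioned above; the function names are likewise assumptions. The feasibility check is made against the full set of completed steps rather than the front alone, which gives the same answer in this example.

```python
def ancestors(node, parents):
    """All nodes from which `node` is reachable in the flow graph."""
    seen, stack = set(), list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen

def front(completed, parents):
    """Completed steps that are not an ancestor of another completed step."""
    covered = set()
    for n in completed:
        covered |= ancestors(n, parents)
    return completed - covered

def feasible_next(completed, parents):
    """Uncompleted steps whose parents have all been completed."""
    return {n for n in parents
            if n not in completed and parents[n] <= completed}

# Hypothetical flow graph, given as a map from each step to its parents.
parents = {1: set(), 2: {1}, 3: {2, 4}, 4: set()}
completed = {1, 2, 4}
current_front = front(completed, parents)
next_steps = feasible_next(completed, parents)
```

Step 1 is completed but does not appear in the front because it is an ancestor of step 2, which illustrates why the front is the minimal description of progress.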
Embodiments build a “tSort graph” (the second graph) in which there is a one-to-one correspondence between nodes and (feasible) fronts. In this graph, any feasible execution order is represented simply as a path from the source to the sink and, vice versa, any path from the source to the sink in the tSort graph corresponds to a feasible execution ordering.
Returning to Table 3, please also see
The flow graph in
However, the brute force approach is not feasible for even small sets of instructions.
Consider a flow graph with T sequential threads, with the number of nodes in the first thread being n1, the number of nodes in the second thread being n2, and so on up to nT. In an example, the total number of nodes in the flow graph is n1+n2+n3+ . . . +nT.
In Eqtn. 1, n! represents n factorial.
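To give a sense of the growth involved, the following sketch counts the orderings of T threads under the assumption that the threads are mutually independent chains, in which case the count is the multinomial coefficient (n1+n2+ . . . +nT)! / (n1! n2! . . . nT!). This is a standard combinatorial identity offered for illustration; it is an assumption that it matches the setting of Eqtn. 1, and the function name is hypothetical.

```python
from math import factorial

def num_interleavings(thread_lengths):
    """Count the orderings that interleave independent sequential threads.

    thread_lengths: [n1, n2, ..., nT], the number of nodes in each thread.
    """
    total = factorial(sum(thread_lengths))
    for n in thread_lengths:
        total //= factorial(n)    # exact: each partial quotient is an integer
    return total

small = num_interleavings([2, 1])
medium = num_interleavings([3, 3, 3])
```

Under this assumption, three threads of three steps each already admit 1680 orderings, each of which a brute force search would have to match against the video.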
The speedup using the tSort graph found by Algorithm 1 and matching using Algorithm 2 is shown, as an example, in Table 4.
On the upper left of
In the brute force approach, all topological sorts are found in the lower middle portion of
In an embodiment of the application, the second graph is found from the first graph using Algorithm 1 as shown in the top middle of
As shown in
Referring again to
Hardware for performing embodiments provided herein is now described with respect to
This application claims priority to U.S. Provisional Application No. 63/317,432, filed Mar. 7, 2022; the content of which is hereby incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
10423395 | Stanfill et al. | Sep 2019 | B2 |
10825227 | Amer et al. | Nov 2020 | B2 |
20150050006 | Sipe | Feb 2015 | A1 |
20170213089 | Chen | Jul 2017 | A1 |
20210271886 | Zheng et al. | Sep 2021 | A1 |
20220237043 | Hadji et al. | Jul 2022 | A1 |
20220300417 | Hajewski | Sep 2022 | A1 |
Entry |
---|
Huang et al., “Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos,” IEEE Computer Society, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5948-5957. (Year: 2018). |
Number | Date | Country | |
---|---|---|---|
20230282245 A1 | Sep 2023 | US |
Number | Date | Country | |
---|---|---|---|
63317432 | Mar 2022 | US |