Video tutorials have become an integral part of day-to-day life, especially in the context of the modern era of software applications. Generally, each software application has a variety of functionalities unique to the application. The goal of video tutorials is to help instruct, educate, and/or guide a user to perform tasks within the application. Conventionally, web browsers are used to access such tutorials. A query in the form of a question may be presented to a web search engine (e.g., GOOGLE®, BING®, YAHOO®, YOUTUBE®, etc.), and the search engine presents all possibly relevant tutorials to the user. Once located, the video tutorial is generally played or watched via the web browser. Oftentimes, however, the video tutorials can be entirely too long for accurate recall. As such, users can be forced to switch between the web browser playing the tutorial and the application about which the user is learning in order to follow the instructions and perform the associated task segment-by-segment. Such a workflow requires the user to stop and resume the video in the browser multiple times to perform the associated task in the application. In some cases, only a section of the video may be relevant to the user's inquiry. Here, a user must manually find the relevant portion of the video to perform the needed task.
The video tutorial relied on is generally desired to be application-specific for it to be useful. Moreover, software applications frequently come out with newer versions. The video tutorials relied on by the user are desired to correspond to the correct version of the application being used by the user. Because application-specific video tutorials are valuable, video tutorial systems may be used to provide step-by-step instructions in text, image, and/or other formats during or prior to application use. Video tutorial systems aim to assist users in learning how to use certain parts or functionalities of a product. Many video tutorial systems use a table of contents to provide instructions on use of applications for various tasks. Based on a user query, the video tutorial system presents a list of tutorials, video and/or text based, that may be relevant to the user query. However, current systems require users to manually navigate through the tutorials and/or the table of contents to first find the right tutorial, and then the right section of the tutorial, to perform the relevant task. Additionally, current systems require the user to leave the application to watch video tutorials in a web browser to learn and perform every step presented in the video tutorial. This process can be extremely time consuming and inefficient. It may take a user various attempts to both find the right instructions and perform them accurately.
Embodiments of the present invention are directed to an in-application (“in-app”) video navigation system in which a video span with an answer to a user's query is presented to the user within an application window. In this regard, a user may input a query (e.g., a natural language question via text, voice command, etc.) within an application. The query can be encoded into a query embedding in a vector space using a neural network. A database of videos may be searched from a data store including sentence-level and/or passage-level embeddings of videos. Top candidate videos may be determined such that the candidate videos include a potential answer to the query. For each of the candidate videos, an answer span within the video may then be determined based on a sentence-level encoding of the respective candidate video. The spans for each candidate video may then be scored in order to determine the highest scoring span. The highest scoring answer span can then be presented to the user in the form of an answer to the query. The answer span may be presented by itself or within the candidate video with markings within a timeline of the candidate video (e.g., highlighting, markers at the start and end of the span, etc.) pointing to the span within the video. As such, a user can be efficiently and effectively guided towards an answer to the query without having to leave the application or watch long videos to find a specific portion including the answer.
The present invention is described in detail below with reference to the attached drawing figures, wherein:
Conventional video tutorial systems utilize one or more of step-by-step guidance, gamification, and manual retrieval approaches to provide instructions on how to perform tasks in an application. For example, some conventional in-application (“in-app”) tutorial systems provide step-by-step instructions using text or images. This requires a user to peruse the text to find the relevant portion to use to perform a specific task. Additionally, this requires a user to read each step before performing it. Some other conventional in-app tutorial systems use gamification to teach users how to perform tasks. Typically, this includes pointing-based tutorials prior to application use. In other words, the gamification-based tutorials are training tutorials presented to a user prior to using an application. Pointing-based tutorials provide interactive instructions for a user by directing the user's attention, via pointing, to certain aspects of an application while providing blurbs explaining how the particular aspect may be used within the application. Some other tutorial systems use catalogues or lists of videos for a user to choose from when searching for a video tutorial to watch. These approaches require a user to manually navigate a list of videos or applications to find a relevant video. Such manual navigation is inefficient and time consuming for users, as users have to peruse a large database of tutorials to find the relevant tutorial. Oftentimes, a user may not be able to find relevant tutorials and/or portions of the relevant tutorials at all, leading to user frustration and decreased user satisfaction. Moreover, the conventional approaches do not allow a user access to relevant video tutorials while using the application, requiring a user to leave the application workspace to watch the tutorials. This further adds to user frustration as the user has to switch back and forth between the tutorial window and the application workspace to complete a task.
Embodiments of the present invention address the technical problem of providing video segments and/or spans in response to a user query or question within an application, such that a user may watch a relevant portion of a video tutorial while simultaneously performing the instructions in the application in the same window. In operation and at a high level, a neural network may be used to determine a video and an associated span within the video that answers a question asked by a user via a natural language text or voice query within an application. The neural network may identify a video tutorial and a span (i.e., a start and an end sentence) within the video tutorial that includes an answer to the question. To do so, the neural network may first retrieve top candidate videos from a video repository based on the question. In some embodiments, the neural network may also use context information, such as past queries, application version, etc., to retrieve top candidate videos. Next, within each of the top candidate videos, the neural network may determine a span that includes a potential answer to the question. The spans from each of the top candidate videos may then be ranked based on relevance to the query and the context information. The highest scored span and the associated video may then be presented to the user as an answer to the query.
In some embodiments, the video may be presented, via a user interface, with the span highlighted within the video, in the application itself. The user interface may allow the user to perform various functions, including pausing and/or resuming the video, navigating to different portions of the video, etc., within the application. The user may also navigate straight to the span with the relevant portion of the tutorial without having to watch the entire video from the beginning or search through a database or table of contents. In an embodiment, the user may also be presented with a table of contents associated with the video. This provides the user with an alternative way of navigating through the video.
Aspects of the technology disclosed herein provide a number of advantages over previous solutions. For instance, one previous approach involves a factoid question answering system that finds a word or a phrase in a given textual passage containing a potential answer to a question. The system works at the word level to find a sequence of words to answer a who/what/why question. However, generating an answer span containing a word or a phrase has a significant drawback when it comes to determining answer spans in a video tutorial. Video tutorials are based on the premise of answering “how to” questions, and a one-word or one-phrase answer may not be appropriate to present a user with instructions on how to perform a particular task. To avoid such constraints on the answers contained in video tutorials, the implementations of the technology described herein, for instance, systematically develop an algorithm to segment a video tutorial into individual sentences and consider all possible spans (i.e., starting sentence and ending sentence) within the video to determine the best possible span to answer a question or a query. The implementations of the present technology may allow for a sequence of sentences within a video tutorial to be an answer span, allowing the span to fully answer a question. Additionally, the implementations of the present technology may also take as input context information (e.g., past commands, program status, user information, localization, geographical information, etc.) to further refine the search for an accurate answer span and/or video in response to a query or question.
Some other previous work addressed the problem of providing summarized versions of news videos in the form of video clips. Sections of a news video are segmented into separate videos based on the topic of the news. However, segmenting a video into several parts based on topics has a significant drawback of assuming that the videos may only be divided based on the topics generated by the newscast. To avoid such constraints relating to pre-established segmentations, implementations of the technology described herein, for instance, systematically develop an algorithm that takes the entirety of a video tutorial at the individual sentence level and assesses each combination of starting and ending sentences within the video tutorial to determine the best span to answer the query or question. The algorithm used in the previous work does not allow for flexibility in answering new questions and does not use contextual information to find the correct video and span within the video to answer a user's query or question.
As such, the in-app video navigation system can provide an efficient and effective process that, as opposed to prior techniques, provides a user with a more relevant and accurate video span answering a query without the user having to leave the application. Although the description provided herein generally describes this technology in the context of in-application video tutorial navigation, it can be appreciated that this technology can be implemented in other video search contexts. For example, the technology described herein may be implemented to present video answer spans in response to a video search query within a search database (e.g., GOOGLE®, BING®, YAHOO®, YOUTUBE®, etc.), a website, etc. Specifically, the present technology may be used to provide specific video spans as answers to video queries in any number of contexts wherein a video search is conducted, such that a user may be presented with a video including an indicated video span to answer the query, generated and presented in a way similar to the in-application video navigation system technology described herein.
Having briefly described an overview of aspects of the present invention, various terms used throughout this description are provided. Although more details regarding various terms are provided throughout this description, general descriptions of some terms are included below to provide a clearer understanding of the ideas disclosed herein:
A query generally refers to a natural language text or verbal input (e.g., a question, a statement, etc.) to a search engine configured to perform, for example, a video search. As such, a query may refer to a video search query. The query can be in the form of a natural language phrase or a question. A user may submit a query through an application by typing in a text box or by using voice commands. An automated speech recognition engine may be used to recognize the voice commands.
A query or question embedding (or encoding) generally refers to encoding a query in a vector space. A query can be defined as a sequence of words. The query may be encoded in a vector space using a bidirectional long short-term memory layer algorithm.
A command sequence generally refers to a sequence of commands executed by a user while in or using the application (e.g., icons used from a tool bar, menus selected, etc.). The command sequence may also include additional context information, such as application status, user information, localization, geographical information, etc. The command sequence may be embedded as a command sequence encoding or embedding into a vector space. The command sequence encoding or embedding may refer to the last hidden vector in a vector space that represents the command sequence.
A sentence-level embedding, as used herein, refers to a vector representation of a sentence generated by an encoding into a joint vector space using a neural network. Generally, sentence-level embeddings may encode based on the meaning of the words and/or phrases in the sentence. By encoding a database of words and/or phrases into the vector space, a sentence can be encoded into a sentence-level embedding, and the closest embedded words, phrases, or sentences (i.e., nearest embeddings in the vector space) can be identified.
A passage-level embedding (encoding), as used herein, refers to an encoding of a sentence into a joint vector space such that the embedding takes into account all prior and subsequent sentences in a video transcript. By encoding a database of sentences into the joint vector space, a sentence can be encoded into the passage-level embedding, and the latent meaning of the sentence may be represented in the vector space for the passage.
A span generally refers to a section of a video defined by a starting sentence location and an ending sentence location within a transcript of a video. A span can be any sentence start and end pair within the video transcript. An answer span, as used herein, refers to a span that includes an answer to a user query. An answer span can be the span with the highest score within the video.
A span score generally refers to a probability of a span to include an answer to a query as compared to the other spans in a video. A video score, on the other hand, refers to a probability of a video to include an answer to a query as compared to all other videos in a video repository or data store.
Referring now to
Environment 100 includes a network 102, a client device 106, and a video navigation system 120. In the embodiment illustrated in
Video navigation system 120 generally determines an answering span within a video present in data store 104 that best answers a user's query. The video navigation system 120 may include a query retriever 122, a span determiner 124, and a video generator 126. In some examples, video navigation system 120 may be a part of the video module 114. In other examples, video navigation system 120 may be located in a remote server.
The data store 104 stores a plurality of videos. In some examples, data store 104 may include a repository of videos collected from a variety of large data collection repositories. Data store 104 may include tutorial videos for a variety of applications. The videos in data store 104 may be saved using an index sorted based on applications. The components of environment 100 may communicate with each other via a network 102, which may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
Generally, the foregoing process can facilitate generation of an answer span in response to a query within an application interface by searching within a data store of videos. By adopting an in-app approach to producing videos and/or specific spans within the videos to answer a user's query via machine learning techniques, there is no need for the user to leave an application to find an answer.
Application interface 110 presents a user with an answer(s) to a user provided query. In some embodiments, a query may be a question. The query may be a natural language query in the form of a textual query or a vocal query. Application interface 110 may receive a query from a user in the form of a text query via a keyboard or touchscreen of client device 106 or in the form of a voice command via speech recognition software of client device 106. Application interface 110 may use a query receiver, such as but not limited to query receiver 212 of
Application interface 110 may include an application workspace 112 and a video module 114. Application workspace 112 may provide an area within an application interface 110 for a user to interact with the application in use. Video module 114 may present a user with an answer span and/or an answering video in response to the user query. Video module 114 may display an answer span by itself or within a video with markings within a timeline of the video (e.g., highlighting, markers at the start and end of the span, etc.) pointing to the span within the video that includes a potential answer to the query. Video module 114 may receive the answer span and/or video from video navigation system 120 and present the answer span and/or video within the application interface 110 for further interaction by the user.
Video navigation system 120 is generally configured to receive a natural language query and determine an answer span that best answers the query. Video navigation system 120 may receive the query in the natural language form from the application interface 110. In some examples, video navigation system 120 may be a part of the video module 114. In other examples, video navigation system 120 may be located in a remote server, such that video module 114 and/or application interface 110 may communicate with video navigation system 120 via network 102. Video navigation system 120 may include a query retriever 122, a span determiner 124, and a video generator 126.
Query retriever 122 may retrieve or obtain a query from the application interface 110 and/or video module 114. Upon obtaining a query, the query may be converted to a vector representation, for example, by encoding the sequence of words in the query in a vector space. Query retriever 122 may encode the query in a vector space using a bidirectional long short-term memory layer algorithm as follows:
hq=biLSTMlast(q)
where hq is the last hidden vector for the query and q is the sequence of words in the query.
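By way of a non-limiting illustration, the query encoding described above might be sketched as follows; the PyTorch layers, vocabulary size, tokenization, and dimensions are assumptions made only for the sketch and are not part of the described system:

```python
# Minimal sketch of query encoding with a biLSTM (assumed PyTorch implementation).
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, query_length) integer word indices
        x = self.embed(token_ids)                  # (batch, length, embed_dim)
        _, (h_n, _) = self.bilstm(x)               # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states to form h_q,
        # the last hidden vector for the query.
        h_q = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return h_q

# Example usage with stand-in token ids for a short query.
encoder = QueryEncoder()
query_ids = torch.randint(0, 10000, (1, 6))
h_q = encoder(query_ids)                           # query embedding in the vector space
```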
Span determiner 124 may generally be configured to determine an answering span along with a video that includes the best potential answer to the query. Span determiner 124 may access the video repository in data store 104 to determine top candidate videos (i.e., a threshold number of top videos, e.g., top five, top ten, etc.) that include, or may include, an answer to the query. Further, span determiner 124 may determine spans, and respective scores, within each candidate video that include potential answers to the query. The span with the highest score may be determined to be the answer span by span determiner 124 as described in more detail below with respect to
Video generator 126 may be configured to generate a video or supplement a video to include an answer span indicator. This may be done by generating a timeline for the video with markings within the timeline (e.g., highlighting, markers at start and end of the span, etc.) pointing to the start and end locations of the answer span. The indication of an answer span including a start and an end location for the span within the corresponding video may be received by video generator 126 from span determiner 124. Video generator 126 may provide the indication of the video with answer span locations to the video module 114 for presentation to the user device via application interface 110, such that the user device may reference the video, for example, from the video repository in a data store and present the video with the associated answer span markings. The user may then interact with the video in the same window (i.e., application interface) as the application workspace. As such, environment 100 provides an in-app video navigation system where a user may watch a video tutorial while simultaneously applying the learned steps to the application without ever having to leave the application workspace. Additionally, video module 114 may also be configured to allow a user to navigate or control the presented video, such as pausing the video, resuming the video, jumping to another position within the video, etc. In some examples, a user may navigate the presented video using any number of keyboard shortcuts. In some other examples, a user may use voice commands to navigate the video.
Turning to
Video component 214 of video module 114 is generally configured to present a video with an answering span to the user via application interface 110. In some examples, video component 214 may obtain an indication of the video with answer span locations that includes a potential answer to the query. Video component 214 may further present the indication to the user device via application interface 110, such that the user device may reference the video, for example, from the video repository and present the video with the associated answer span markings. In some examples, video navigation system 120 may be a part of the video component 214. In other examples, video navigation system 120 may be located in a remote server, such that video module 114 may communicate with video navigation system 120 via network 102.
Table of contents component 216 may present a table of contents associated with the video that includes the answer span. Each video in data store 104 may include an associated table of contents that points to different topics covered at different sections of the video. In some examples, the table of contents information is saved in association with the corresponding video. The video may be manually segmented into topics. In other examples, any known method of automatically segmenting videos into individual topics may be used to generate the table of contents. Table of contents component 216 retrieves the table of contents associated with the video having the answer span and presents it to the user via application interface 110 of client device 106. Table of contents component 216 may allow a user to navigate the video by clicking on the topics or picking a topic using voice commands. This gives a user flexibility in navigating the video in two ways: one via the timeline and the marked answer span, and another through the table of contents.
Referring to
Video encoder 130 may include a sentence-level encoder 232, a passage-level encoder 234, and a span generator 236. A transcript of each video may be generated or obtained. Each video may be represented as individual sentences. This may be done by segmenting the transcript of the video into individual sentences using known sentence segmentation techniques. Sentence-level encoder 232 may be used to encode the individual sentences of the video transcript as sentence embedding vectors (i.e., S1, S2, S3 . . . Sn) in a vector space, which encode the meaning of the sentences. For example, referring briefly to
Further, the videos in data store 104 may further be encoded at a passage-level, by a passage-level encoder 234. The sentence encoding vectors may be leveraged to generate passage-level representations in the vector space. Long-term dependencies between a sentence and its predecessors may be determined to learn the latent meaning of each sentence. In some examples, two bidirectional long short-term memory (biLSTM) layers may be used to encode the transcript of the corresponding video, one for encoding individual sentences and another to encode passages. For each individual sentence, the sentence-level encoder 232 may take as input the sequence of the words in the sentence, and apply a biLSTM to determine the last hidden vector as follows:
hi=biLSTMlast(si) for i=1 . . . n
where hi is the last hidden vector for the i-th sentence si and n is the total number of sentences in the video. A second biLSTM may then be applied by passage-level encoder 234 to the last hidden vectors of the sentences to generate a passage-level encoding (i.e., hidden vectors over the sequence of sentences), as follows:
p=biLSTMall({h1,h2, . . . ,hn})
where p encodes all hidden vectors along the sequence of sentences in the transcript of the video. These hidden vectors may represent the latent meaning of each individual sentence (S1, S2, S3 . . . Sn) as passage-level encoding in a vector space.
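A minimal sketch of this two-level encoding, assuming PyTorch and pre-embedded word vectors for each sentence (the module names and dimensions are illustrative assumptions), might look like the following:

```python
# Sketch of the two-level encoding: a sentence-level biLSTM produces h_i for each
# sentence, and a second biLSTM over {h_1..h_n} yields passage-level vectors.
import torch
import torch.nn as nn

class TranscriptEncoder(nn.Module):
    def __init__(self, embed_dim=128, sent_hidden=256, pass_hidden=256):
        super().__init__()
        self.sentence_bilstm = nn.LSTM(embed_dim, sent_hidden,
                                       batch_first=True, bidirectional=True)
        self.passage_bilstm = nn.LSTM(2 * sent_hidden, pass_hidden,
                                      batch_first=True, bidirectional=True)

    def encode_sentence(self, word_vectors):
        # word_vectors: (1, words_in_sentence, embed_dim); keep the last hidden vector h_i
        _, (h_n, _) = self.sentence_bilstm(word_vectors)
        return torch.cat([h_n[0], h_n[1]], dim=-1)          # (1, 2 * sent_hidden)

    def forward(self, sentences):
        # sentences: list of (1, words_i, embed_dim) tensors, one per transcript sentence
        h = torch.stack([self.encode_sentence(s).squeeze(0) for s in sentences])
        # The second biLSTM over the sentence vectors gives passage-level encodings
        # p_1..p_n that capture dependencies between a sentence and its neighbors.
        p, _ = self.passage_bilstm(h.unsqueeze(0))           # (1, n, 2 * pass_hidden)
        return p.squeeze(0)                                  # (n, 2 * pass_hidden)

encoder = TranscriptEncoder()
fake_sentences = [torch.randn(1, 12, 128), torch.randn(1, 7, 128), torch.randn(1, 9, 128)]
passage_vectors = encoder(fake_sentences)                    # one vector per sentence
```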
Next, span generator 236 may be configured to compute embeddings of each possible span in the corresponding video. A span is represented in an index as (starting sentence location, ending sentence location). All possible spans, i.e., spans for each possible pair of two sentences, may be embedded in a vector space. As such, for a transcript of a video with n sentences, there are n*(n−1)/2 spans generated and embedded in a span vector space. All possible spans may be considered by concatenating all possible pairs of two sentences, using the following:
rij=[pi,pj] for i,j=1 . . . n
where [pi,pj] indicates a concatenation function, i is the starting sentence location and j is the ending sentence location. It should be understood that the entirety of the video (i.e., video transcript) may also be a span. In some examples, span embeddings for a span may be based on sentence-level and/or passage-level embeddings of their associated sentence pair. In such an example, the span embedding may leverage the latent meaning of the paired sentences from the sentence-level and/or passage-level embeddings to determine the meaning included in the span. These span embeddings may be saved in the data store 104 with the corresponding videos.
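The span enumeration may be illustrated with the following sketch; the 0-based indexing and the question of whether single-sentence spans (i equal to j) are included are assumptions of the sketch rather than requirements of the described system:

```python
# Illustrative span enumeration: concatenate passage-level vectors p_i and p_j
# for every starting/ending sentence pair.
import torch

def generate_span_embeddings(passage_vectors):
    """passage_vectors: (n, d) tensor; returns a dict mapping (i, j) -> r_ij."""
    n = passage_vectors.shape[0]
    spans = {}
    for i in range(n):
        for j in range(i, n):        # every start/end pair, including the full transcript
            r_ij = torch.cat([passage_vectors[i], passage_vectors[j]], dim=-1)
            spans[(i, j)] = r_ij     # r_ij = [p_i, p_j]
    return spans

passage_vectors = torch.randn(3, 512)              # e.g., output of a passage-level encoder
spans = generate_span_embeddings(passage_vectors)  # span embeddings keyed by location pair
```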
Turning now to
The candidate identifier 222 may generally be configured to identify and/or obtain top candidate videos that include a potential answer to the query. To do so, candidate identifier 222 may take as input span embeddings for each video in the video data store 104 and query embeddings generated by query retriever 122. In some examples, candidate identifier 222 may also take as input a sequence of commands executed by a user while in or using the application (e.g., icons used from a tool bar, menus selected, etc.) as context information. In some examples, additional context information, such as application status, user information, localization, geographical information, etc., may also be used as input by candidate identifier 222. The contextual information may be embedded as a command sequence encoding using another biLSTM layer to calculate the last hidden vector representing the contextual information in a vector space as follows:
c=biLSTMlast({c1, . . . ,cm})
where c is the command sequence embedding of the contextual information in the vector space, and m is the number of commands.
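The command sequence encoding can follow the same pattern as the query encoding above; the sketch below assumes a hypothetical command vocabulary and embedding size purely for illustration:

```python
# Encode the command-sequence context: embed each command, run a biLSTM over the
# sequence, and keep the last hidden vector c.
import torch
import torch.nn as nn

command_embed = nn.Embedding(500, 64)                     # hypothetical command vocabulary
command_bilstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)

command_ids = torch.tensor([[3, 17, 42]])                 # stand-in ids, e.g., crop -> select -> undo
_, (h_n, _) = command_bilstm(command_embed(command_ids))
c = torch.cat([h_n[0], h_n[1]], dim=-1)                   # command sequence embedding
```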
Candidate identifier 222 may identify top candidate videos using any neural network trained to find an answer to a query within transcripts. Top candidate videos are videos in the data store 104 most likely to include an answer to a query. Top candidate videos may be identified based on the query, and in some examples, the command sequence. In some examples, a machine-learning algorithm may be used to identify top candidate videos based on the query and/or the command sequence. In some other examples, candidate identifier 222 may identify top candidate videos from the data store 104 based on the distance of the sentence-level and/or passage-level embeddings from the combination of the query embedding and the command sequence embedding in a vector space. In some examples, candidate identifier 222 may retrieve top candidate videos based on their scores determined by any known machine learning technique. In some examples, a neural network may be used. The output of the machine learning technique and/or the neural network may include scores and/or probabilities for each video in the data store 104, the scores indicating the probability of an answer to the query being included in the particular video as compared to all other videos in the data store 104. Any known search technique may be used to determine top candidate videos. In one example, the ElasticSearch® technique may be used to retrieve the top candidate videos along with their corresponding scores. The top candidate videos and/or an indication of the top candidate videos with the corresponding scores may then be used by the span detector 224 to determine a best span that includes an answer to the query for each of the top candidate videos.
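One of the retrieval options mentioned above, scoring videos by embedding distance, might be sketched as follows; the assumption that video-level embeddings live in the same joint space as the concatenated query/command vector is made only for illustration, and a deployed system could instead delegate retrieval to a search service:

```python
# Sketch of embedding-distance retrieval of top candidate videos.
import torch
import torch.nn.functional as F

def retrieve_top_candidates(query_vec, command_vec, video_vecs, k=5):
    """video_vecs: (num_videos, d) precomputed video-level embeddings (assumed)."""
    context = torch.cat([query_vec, command_vec], dim=-1)           # combined context vector
    scores = F.cosine_similarity(context.unsqueeze(0), video_vecs)  # one score per video
    probs = F.softmax(scores, dim=0)                                # normalized video scores
    top = torch.topk(probs, k)
    return top.indices.tolist(), top.values.tolist()

query_vec = torch.randn(512)
command_vec = torch.randn(256)
video_vecs = torch.randn(100, 768)     # hypothetical precomputed video embeddings
ids, video_scores = retrieve_top_candidates(query_vec, command_vec, video_vecs, k=5)
```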
Span detector 224 may be configured to identify the best span for each of the top candidate videos that includes a potential answer to the user query. Span detector 224 may use a machine learning algorithm to identify the best span for each top candidate video. In some examples, a deep neural network may be used. The neural network may be trained using ground truth data generated manually, as discussed in more detail below. Span detector 224, for each of the top candidate videos, may take as input the query embedding generated by query retriever 122, the command sequence embedding generated by the candidate identifier 222, the passage-level embedding generated by the passage-level encoder 234, and/or all possible span embeddings for each span (i.e., starting sentence location, ending sentence location) generated by the span generator 236 corresponding to the associated top candidate video. A score for each span embedding may be calculated. A span score can be determined based on the probability of the span including the best possible answer to the query, in view of the contextual information, as compared to all other spans associated with the corresponding video. In one example, for each span, a 1-layer feed forward network may be used to combine the span embedding, the command sequence embedding, and the query embedding. A softmax may then be used to generate a normalized score for each span of the corresponding video. In some examples, leaky rectified linear units (ReLU) may be used as an activation function. In another example, a cross entropy function may be used as a loss function. The score for each span may be calculated as follows:
Scorespan,ij=softmax(FFNN([rij,hq,c]))
where FFNN is the feed forward network and Scorespan,ij is the score for the span (i, j), where i is the starting sentence location and j is the ending sentence location for the span. The span with the highest score may then be selected as the best span for that corresponding top candidate video. The best span for each of the top candidate videos, and its respective score, may be similarly calculated.
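An illustrative span scorer consistent with this description, using a small feed-forward network with a leaky ReLU followed by a softmax over all spans of one candidate video, might be sketched as follows; the hidden size and exact layer arrangement are assumptions:

```python
# Sketch of span scoring over the concatenation [r_ij, h_q, c] for one candidate video.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanScorer(nn.Module):
    def __init__(self, span_dim, query_dim, command_dim, hidden=256):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(span_dim + query_dim + command_dim, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, span_embeddings, h_q, c):
        # span_embeddings: (num_spans, span_dim); h_q: (query_dim,); c: (command_dim,)
        num_spans = span_embeddings.shape[0]
        context = torch.cat([h_q, c], dim=-1).expand(num_spans, -1)
        logits = self.ffnn(torch.cat([span_embeddings, context], dim=-1)).squeeze(-1)
        return F.softmax(logits, dim=0)    # normalized Score_span for each span of the video

scorer = SpanScorer(span_dim=1024, query_dim=512, command_dim=256)
span_scores = scorer(torch.randn(6, 1024), torch.randn(512), torch.randn(256))
```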
Next, span selector 226 may be configured to select or determine an answer span for the query, the answer span including the best potential answer to the query. Span selector 226 may receive as input top candidate video scores from candidate identifier 222 and their respective best span scores from the span detector 224. An aggregate score for each of the top candidate videos and their respective best spans may be calculated by combining the top candidate video score with its corresponding best span score. In one example, the aggregate score may be calculated as follows:
Scoreaggregate=Scorevideo*Scorespan
where Scorevideo is the score of the candidate video and Scorespan is the score of the best span in the associated candidate video. Span selector 226 may determine the answer span with the best potential answer to the query as the span with the highest aggregate score. In some examples, span selector 226 may determine the answer span as the span with the highest best span score. Span selector 226 may output an indication of the answer span as a location defined by (starting sentence location, ending sentence location).
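The aggregation and selection step can be sketched directly from the formula above; the candidate tuple layout is a hypothetical data shape used only for illustration:

```python
# Combine video and best-span scores, then pick the span with the highest aggregate score.
def select_answer_span(candidates):
    """candidates: list of (video_id, video_score, best_span, best_span_score)."""
    best = max(candidates,
               key=lambda cand: cand[1] * cand[3])   # Score_aggregate = Score_video * Score_span
    video_id, _, (start_sentence, end_sentence), _ = best
    return video_id, (start_sentence, end_sentence)

candidates = [("crop_tutorial.mp4", 0.62, (4, 9), 0.71),
              ("resize_tutorial.mp4", 0.21, (0, 3), 0.90)]
print(select_answer_span(candidates))                # ('crop_tutorial.mp4', (4, 9))
```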
Video generator 126 may be configured to identify an answer to be presented to the user based on the query. Video generator 126 may receive an indication of the answer span along with the associated video from span selector 226. A timeline for the video may be identified. The timeline can run from the beginning of the video to the end of the video. The answer span is indicated by a starting sentence location and an ending sentence location for the span within the transcript and/or the timeline of the video. The locations for the starting and ending sentences of the span may then be used by the video component 214 to provide markers for the span within the video timeline. The video component 214 and/or the video module 114 may receive an indication of the video and the span location. The video may be identified or retrieved based on the indication. A timeline may be associated with the video, and markers may be generated within the timeline to identify the answer span. The markers may include highlighting the span in the timeline, including a starting marker and an ending marker in the timeline, etc. It should be understood that any markings that may bring attention to the answer span may be used. In some examples, only the answer span may be presented to the user. The answer span, corresponding video, and/or the marked timeline may be sent to the video module 114 for presentation via the application interface 110 for further interaction by the user. Video module 114 may also receive voice or text commands from the user to navigate the video (e.g., pause the video, resume the video, jump to another position in the video, etc.). For example, a user may provide a command to start the video at the span starting location. In response, video module 114 may start the video from the span starting location.
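As one hedged illustration of converting an answer span into timeline markers, the sketch below assumes each transcript sentence carries start and end timestamps; the field names are hypothetical and not part of the described system:

```python
# Map an answer span (start/end sentence locations) onto video timeline markers.
def span_to_timeline_markers(transcript, span):
    """transcript: list of {"start": sec, "end": sec, "text": ...}; span: (i, j)."""
    start_sentence, end_sentence = span
    return {
        "marker_start_sec": transcript[start_sentence]["start"],
        "marker_end_sec": transcript[end_sentence]["end"],
        "highlight": True,   # e.g., highlight this region of the timeline
    }

transcript = [{"start": 0.0, "end": 6.5, "text": "Welcome to this tutorial."},
              {"start": 6.5, "end": 14.2, "text": "First, open the crop tool."},
              {"start": 14.2, "end": 21.0, "text": "Drag the handles to crop."}]
print(span_to_timeline_markers(transcript, (1, 2)))   # markers at 6.5s and 21.0s
```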
Turning now to
The video module 312 includes a video 314 and a table of contents 316. The video 314 is determined to include an answer span answering the query. The video 314 including the answering span may be determined by a video navigation system, such as but not limited to video navigation system 120 of
Now turning to
Turning now to
When a user query is received via a client device, such as but not limited to client device 106 of
In some examples, query encoding 518 and command sequence encoding 520 may be used to find top candidate videos using a neural network. For each of the top candidate videos, each of the possible spans generated during span generation 514 is scored based on the query encoding 518 and the command sequence encoding 520. The highest scoring spans for each candidate video are then scored against each other to find the best answer span. Span scoring 516 may use the video score generated by the neural network and the span score for each video to calculate an aggregate score for each of the highest scoring spans. The span with the highest aggregate score may be presented to the user as an answer to the query.
Generally, the foregoing process can facilitate presenting specific and efficient answer spans and/or videos inside an application interface in response to user queries. By adopting an in-app and span-based approach to producing answers to user queries, there is no need for the user to switch back and forth between an application and a web browser to learn to perform tasks within the application. These approaches also provide a user with an effective, efficient, and flexible way to access videos with clearly marked answers without the user having to search through long and arduous search results.
A machine learning model or a neural network may be trained to score spans based on a query. Span selector, such as but not limited to span selector 226 of
Embodiments of the present invention address such problems by describing a data collection framework that allows a crowdsourcing worker to effectively and efficiently generate ground truth data to train the machine learning model to score and provide answer spans within videos. First, parts of the video that can serve as a potential answer may be identified by a worker. For this, the crowdsourcing worker may read a transcript of the corresponding video and segment the transcript such that each segment can serve as a potential answer. The segments with potential answers may vary in granularity and may overlap.
Next, a different set of crowdsourcing workers may be utilized to generate possible questions that can be answered by each potential answer segment. In some examples, multiple questions may be generated for a single segment. The questions may then be used to train the machine learning model with the segments used as ground truth spans. Advantageously, context is provided to the workers prior to generating questions.
A tolerance accuracy metric may be used to evaluate the performance of the machine learning model prior to real-time deployment. The tolerance accuracy metric may indicate how far the predicted answer span is from the ground truth span. In one example, the predicted answer span may be determined as correct if the boundaries of the predicted and the ground truth span are within a threshold distance, k. For example, a predicted answer span may be determined as correct if both the predicted starting sentence location and the predicted ending sentence location are within the threshold distance k of the ground truth starting sentence location and the ground truth ending sentence location, respectively. Further, a percentage of questions with a correct prediction in a training question data set may be calculated.
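The tolerance accuracy metric might be computed as in the following sketch; the threshold k is a chosen parameter rather than a value specified herein:

```python
# A prediction counts as correct when both span boundaries fall within k sentences
# of the ground truth; the metric is the fraction of questions answered correctly.
def tolerance_accuracy(predictions, ground_truths, k=2):
    """predictions/ground_truths: lists of (start_sentence, end_sentence) pairs."""
    correct = 0
    for (ps, pe), (gs, ge) in zip(predictions, ground_truths):
        if abs(ps - gs) <= k and abs(pe - ge) <= k:
            correct += 1
    return correct / len(predictions)   # percentage of questions predicted correctly

print(tolerance_accuracy([(3, 8), (10, 15)], [(4, 9), (2, 6)], k=2))  # 0.5
```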
With reference now to
Turning initially to
Next, at block 606, an answer span including a best potential answer to the query is determined. The best potential answer may be the best potential answer to the question within the query related to the application. The answer span may be determined by a span selector, such as span selector 226 of
Turning now to
Having described an overview of embodiments of the present invention, an exemplary operating environment in which embodiments of the present invention may be implemented is described below in order to provide a general context for various aspects of the present invention. Referring now to
The invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a cellular telephone, personal data assistant or other handheld device. Generally, program modules including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The invention may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialty computing devices, etc. The invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With reference to
Computing device 800 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 800 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 800. Computer storage media does not comprise signals per se. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 812 includes computer-storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 800 includes one or more processors that read data from various entities such as memory 812 or I/O components 820. Presentation component(s) 816 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
I/O ports 818 allow computing device 800 to be logically coupled to other devices including I/O components 820, some of which may be built in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, touch pad, touch screen, etc. The I/O components 820 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs may be transmitted to an appropriate network element for further processing. An NUI may implement any combination of speech recognition, stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition (as described in more detail below) associated with a display of computing device 800. Computing device 800 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, touchscreen technology, and combinations of these, for gesture detection and recognition. Additionally, the computing device 800 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of computing device 800 to render immersive augmented reality or virtual reality.
Embodiments described herein support in-app video navigation based on a user query. The components described herein refer to integrated components of an in-app video navigation system. The integrated components refer to the hardware architecture and software framework that support functionality using the in-app video navigation system. The hardware architecture refers to physical components and interrelationships thereof and the software framework refers to software providing functionality that can be implemented with hardware embodied on a device.
The end-to-end software-based in-app video navigation system can operate within the in-app video navigation system components to operate computer hardware to provide in-app video navigation system functionality. At a low level, hardware processors execute instructions selected from a machine language (also referred to as machine code or native) instruction set for a given processor. The processor recognizes the native instructions and performs corresponding low level functions relating, for example, to logic, control and memory operations. Low level software written in machine code can provide more complex functionality to higher levels of software. As used herein, computer-executable instructions includes any software, including low level software written in machine code, higher level software such as application software and any combination thereof. In this regard, the in-app video navigation system components can manage resources and provide services for the in-app video navigation system functionality. Any other variations and combinations thereof are contemplated with embodiments of the present invention.
Having identified various components in the present disclosure, it should be understood that any number of components and arrangements may be employed to achieve the desired functionality within the scope of the present disclosure. For example, the components in the embodiments depicted in the figures are shown with lines for the sake of conceptual clarity. Other arrangements of these and other components may also be implemented. For example, although some components are depicted as single components, many of the elements described herein may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Some elements may be omitted altogether. Moreover, various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software, as described below. For instance, various functions may be carried out by a processor executing instructions stored in memory. As such, other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used in addition to or instead of those shown.
The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventor has contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
The present invention has been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which the present invention pertains without departing from its scope.
From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the system and method. It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is contemplated by and is within the scope of the claims.