This disclosure generally relates to video display techniques, and more particularly, to systems for displaying text captions in video.
Videos are widely shared among friends, families, colleagues, and other groups of people. Often, videos can more effectively convey a message than text or images. Yet videos can be enhanced by superimposing text on images within a video. Adding textual comments or captions in videos is becoming an increasingly popular way to more effectively convey a message with a video.
This disclosure describes techniques that include modifying original text associated with a sequence of images or a video sequence and overlaying the modified text as captions in the video sequence. In some examples, the original text associated with the video sequence may correspond to audio of speech spoken by a subject of the video, or by a narrator of the video, while the video sequence is captured. In such an example, the new, modified text captions may be included within one or more images within the video sequence by considering the timing and/or pacing of the original text relative to timestamps or events in the video sequence, or by considering the timing and/or pacing of the audio sounds corresponding to speech occurring within the video sequence.
In some examples, the modified text captions may be rendered and displayed within the video sequence in a manner that preserves alignment, synchronization, pace, and/or timing of speech by the original speaker, and/or preserves the flow of the original text associated with the video sequence. To preserve such attributes, timestamps may be associated with each of the words in the original text. Modified text captions may be presented as captions in the video sequence by giving preference to maintaining timestamps of the original words and/or timestamps of original words that correspond with new or modified words.
Techniques in accordance with one or more aspects of the present disclosure may enable technical advantages. For instance, generating an initial text caption based on text transcribed or converted from audio data may enable a user to more quickly generate the final set of captions. Also, providing an ability to edit transcribed captions may enable a user to efficiently fix errors in an audio transcription while still retaining appropriate timing of originally-captured audio (and video) events.
In some examples, this disclosure describes operations performed by a computing system in accordance with one or more aspects of this disclosure. In one specific example, this disclosure describes a system comprising a storage system and processing circuitry having access to the storage system, wherein the processing circuitry is configured to: receive audio data associated with a scene occurring over a time period, wherein the audio data includes data representing speech uttered during the time period; transcribe the audio data of the speech into text, wherein the text includes a sequence of original words; associate a timestamp with each of the original words during the time period; receive, responsive to user input, a sequence of new words; and associate a timestamp with each of the new words in the sequence of new words by using the timestamps associated with the original words to determine a corresponding time during the time period for each of the new words.
In another example, this disclosure describes a method comprising receiving, by a computing system, audio data associated with a scene occurring over a time period, wherein the audio data includes data representing speech uttered during the time period; transcribing, by the computing system, the audio data of the speech into text, wherein the text includes a sequence of original words; associating, by the computing system, a timestamp with each of the original words during the time period; receiving, by the computing system and responsive to user input, a sequence of new words; and associating, by the computing system, a timestamp with each of the new words in the sequence of new words by using the timestamps associated with the original words to determine a corresponding time during the time period for each of the new words.
In another example, this disclosure describes a computer-readable storage medium comprising instructions that, when executed, configure processing circuitry of a computing system to receive audio data associated with a scene occurring over a time period, wherein the audio data includes data representing speech uttered during the time period; transcribe the audio data of the speech into text, wherein the text includes a sequence of original words; associate a timestamp with each of the original words during the time period; receive, responsive to user input, a sequence of new words; and associate a timestamp with each of the new words in the sequence of new words by using the timestamps associated with the original words to determine a corresponding time during the time period for each of the new words.
The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
User interface device 119 may, in some examples, represent one or more input devices, one or more output devices, or one or more combined input and output devices. For instance, user interface devices 119 may include an input device such as a keyboard, mouse, microphone, or in general, any type of device capable of detecting input from a human or machine. User interface devices 119 may also include an output device, such as a display, speaker, tactile feedback device, or in general, any type of device capable of outputting information to a human or machine. Further, user interface device 119 may also be a combined input and output device, such as a presence-sensitive or touch-screen panel capable of both presenting images on the panel and detecting interactions (e.g., touch interactions) with the panel.
Computing system 110 is illustrated as being in communication, via network 104, with compute nodes 120A, 120B, and 120C (collectively “compute nodes 120” and representing any number of compute nodes). Each of compute nodes 120 may correspond to computing resources in any form. Each of compute nodes 120 may be a physical computing device or may be a component of a cloud computing system, server farm, and/or server cluster (or portion thereof) that provides services to client devices and other devices or systems. Accordingly, any of compute nodes 120 may represent physical computing devices, virtual computing devices, virtual machines, containers, and/or other virtualized computing devices.
Computing system 110 in
Network 104 may be the internet, or may include or represent any public or private communications network or other network. For instance, network 104 may be or may include a cellular, Wi-Fi®, ZigBee, Bluetooth, Near-Field Communication (NFC), satellite, enterprise, service provider, and/or other type of network enabling transfer of data between computing systems, servers, and computing devices. One or more of client devices, server devices, or other devices may transmit and receive data, commands, control signals, and/or other information across network 104 using any suitable communication techniques. Network 104 may include one or more network hubs, network switches, network routers, satellite dishes, or any other network equipment. Such devices or components may be operatively inter-coupled, thereby providing for the exchange of information between computers, devices, or other components (e.g., between one or more client devices or systems and one or more server devices or systems). Each of the devices or systems illustrated in
In
In some examples described herein, computing system 110 may capture images and audio of a scene, such as that depicted in
To effectively present modified captions in such a way, computing system 110 may generate and/or record timestamps of the beginning and ending of each word in the initial text captions, and use such timestamps to determine in which of the images in a sequence of images the new text captions will be included. In some examples, generating or recording timestamps may be part of capturing images and/or audio of a scene, as described above. The timestamps may be used to determine proper image placement or alignment of words in scenarios that may include replacing original words with new words, removing original words without replacement, inserting new words between original words, and other scenarios. Further, in some examples, computing system 110 may incorporate language translation techniques when modifying initial text captions. In such an example, computing system 110 may perform sequence alignment of words to synchronize a translated text caption with the original pace and timing of the speech by the original speaker in the original language.
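As an illustration only, the use of beginning and ending timestamps to select images can be sketched as follows. The sketch assumes a constant frame rate and frames indexed from zero; the names `TimedWord` and `frames_for_word` are hypothetical and do not appear in this disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def frames_for_word(word: TimedWord, fps: float, frame_count: int) -> range:
    """Return the indices of the frames shown while the word is spoken.

    Assumes frame i covers the interval [i / fps, (i + 1) / fps) and that
    frame_count is at least 1.
    """
    # Clamp the first frame so a word past the end maps to the final frame.
    first = min(max(0, math.floor(word.start * fps)), frame_count - 1)
    last = min(frame_count - 1, math.ceil(word.end * fps) - 1)
    # A zero-length word still gets at least one frame.
    return range(first, max(first, last) + 1)
```

Under these assumptions, a word spoken from 0.5 s to 1.0 s in a 10 frames-per-second recording would be captioned on frames 5 through 9.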
In
In the example of
In one or more of the examples illustrated in
Computing system 110 may transcribe or parse text from the audio data. For instance, again with reference to
Computing system 110 may add captions to the video sequence. For instance, continuing with the example and with reference to
In
Computing system 110 may enable user 101 to view or review video sequence 133. For instance, in some examples, video sequence 133 may be a video recording captured by user 101 (e.g., as a potential social media post), and user 101 may seek to review the video recording and any captions 131 displayed within video sequence 133. In one such example, user interface device 119 detects input that computing system 110 determines corresponds to a request to play or replay video sequence 133. Responsive to the request, computing system 110 accesses stored video sequence 133 and causes user interface device 119 to present video sequence 133 as a sequence of images. Video sequence 133 is presented (e.g., at a display) such that each of captions 131 is presented in the appropriately timed images, meaning that each of captions 131 is presented at the time generally corresponding to the time that audio was detected by audio device 118. For example, in a video depicting a scene where a subject of a video is speaking (e.g., user 101), speech audio is presented in the video so that the audio is heard at a time consistent with the times that the audio was recorded, and so that the mouth of the subject of the video moves appropriately when speech audio can be heard. Similarly, computing system 110 may present captions 131 at appropriate times, in appropriate images 132, so that captions 131 are being presented at least approximately when the subject of the video can be observed saying the spoken words.
Computing system 110 may respond to a request to modify video sequence 133. For instance, in an example that can be described in connection with
Accordingly, computing system 110 may detect input that it determines corresponds to a request to correct an improperly transcribed word in video sequence 133. For instance, with reference to
Computing system 110 may detect input removing a word. For instance, referring again to
Computing system 110 may detect input adding a word. For instance, again with reference to
Computing system 110 may associate each of the words within sequence of words 144 with one or more corresponding original words in sequence of words 141. For instance, with reference to
When a word is removed (e.g., sequence of words 142 to sequence of words 143), computing system 110 generally will have no new word to associate with the original word that was removed. In such an example (e.g., where a word is removed), the timestamp associated with the deleted word might not be used in the new text, and a caption might not be presented within images corresponding to the removed word.
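One way to picture how timestamps carry over when words are replaced or removed is the sketch below. It is an assumption-laden illustration, not the method of this disclosure: it uses Python's standard `difflib.SequenceMatcher` to match new words to original words, and the function name `carry_over_timestamps` and the sample sentence are invented for the example. Unchanged words keep their timestamps, replacement words share the time span of the words they replace, removed words contribute nothing, and inserted words are left without a timestamp.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Timed = Tuple[str, Optional[float], Optional[float]]  # (word, start, end)


def carry_over_timestamps(original: List[Tuple[str, float, float]],
                          new_words: List[str]) -> List[Timed]:
    """Assign timestamps to new_words based on matching original words."""
    matcher = SequenceMatcher(a=[w for w, _, _ in original],
                              b=new_words, autojunk=False)
    result: List[Timed] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words keep their original timestamps.
            result.extend(original[i1:i2])
        elif tag == "replace":
            # Replacement words evenly share the span of the replaced words.
            span_start, span_end = original[i1][1], original[i2 - 1][2]
            step = (span_end - span_start) / (j2 - j1)
            for k, word in enumerate(new_words[j1:j2]):
                result.append((word, span_start + k * step,
                               span_start + (k + 1) * step))
        elif tag == "insert":
            # Inserted words have no original counterpart yet.
            result.extend((word, None, None) for word in new_words[j1:j2])
        # tag == "delete": the removed word's timestamp is simply not used.
    return result
```

With a made-up original of "I umm like the bell" and an edited text of "I like the ball", the removed filler word leaves a gap in the captions, while "ball" takes over the time span of "bell".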
When a word is inserted or added (e.g., sequence of words 143 to sequence of words 144), computing system 110 may interpolate between the timestamps of the words surrounding the inserted word(s). Stated another way, computing system 110 may associate the inserted word(s) with one or more of the original words. For instance, computing system 110 may associate the inserted word(s) with the original word prior to the insertion, and in other examples, computing system 110 may associate the inserted word(s) with the original word following the insertion. In still other examples, computing system 110 may associate the inserted word(s) with original words prior to and after the insertion, if doing so more appropriately aligns with the flow of text in the original sequence of words. In general, computing system 110 may seek to associate the inserted words with original corresponding words to ensure that the flow of the text remains relatively consistent with the flow of text in the original sequence of words. In addition, computing system 110 may maintain the timestamps for the original word and fit in the new words so that the alignment is the same or approximately the same as the original text. In doing so, computing system 110 may give preference to maintaining timestamps of the original words, to the extent that such a preference tends to ensure that the flow of the text in the new sequence of words remains relatively consistent with the flow of text in the original sequence of words.
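A minimal sketch of the interpolation described above follows, assuming each word carries a single start time and that inserted words are marked with `None`. The function name `fill_inserted_times` and the choice of even (linear) spacing are assumptions made for illustration; original timestamps are left untouched, reflecting the stated preference for maintaining them.

```python
from typing import List, Optional, Tuple


def fill_inserted_times(words: List[Tuple[str, Optional[float]]],
                        clip_start: float,
                        clip_end: float) -> List[Tuple[str, float]]:
    """Give each inserted word (time None) an evenly interpolated time."""
    times = [t for _, t in words]
    filled: List[Tuple[str, float]] = []
    n = len(words)
    i = 0
    while i < n:
        if times[i] is not None:
            # Original words keep their timestamps.
            filled.append((words[i][0], times[i]))
            i += 1
            continue
        # Find the run of consecutive inserted words, indices [i, j).
        j = i
        while j < n and times[j] is None:
            j += 1
        # Bound the run by its timestamped neighbors, or by the clip edges.
        left = times[i - 1] if i > 0 else clip_start
        right = times[j] if j < n else clip_end
        gap = (right - left) / (j - i + 1)
        for k in range(i, j):
            filled.append((words[k][0], left + (k - i + 1) * gap))
        i = j
    return filled
```

For example, two words inserted between original words at 1.5 s and 3.0 s would be placed at 2.0 s and 2.5 s.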
Computing system 110 also determines that in sequence of words 144, no new word in sequence of words 144 corresponds to original word “umm” in sequence of words 141, which is presented within video sequence 133 (see
Computing system 110 generates video sequence 153 by overlaying words from sequence of words 144 on a sequence of images. For instance, computing system 110 overlays caption 151A on image 152A as illustrated in
Computing system 110 stores video sequence 153. For instance, in some examples, computing system 110 stores video sequence 153 within computing system 110. In other examples, computing system 110 stores video sequence 153 on one or more of compute nodes 120 by transmitting video sequence 153 over network 104.
In some examples, computing system 110 may make stored video sequence 153 available for retrieval and/or playback, such as in a social media post or as an on-demand video. In such an example, computing system 110 (or another computing system having access to stored video sequence 153) may later present video sequence 153 in response to user input. For instance, in some examples, user interface device 119 of computing system 110 may detect input that computing system 110 determines corresponds to a request to present video sequence 153. Computing system 110 accesses video sequence 153 (or retrieves video sequence 153 over network 104). Computing system 110 causes video sequence 153 to be displayed at user interface device 119 with concurrent audio presented by audio device 118.
In some examples, processors 212 and memory devices 214 may be integrated into a single hardware unit, such as a system on a chip (SoC). Each of processors 212 may comprise one or more of a multi-core processor, a controller, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), processing circuitry (e.g., fixed function circuitry, programmable circuitry, or any combination of fixed function circuitry and programmable circuitry) or equivalent discrete logic circuitry or integrated logic circuitry. Memory devices 214 may include any form of memory for storing data and executable software instructions, such as random-access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), and Flash memory.
Processors 212 and memory devices 214 may provide a computing platform for executing operating system 220. Operating system 220 may provide a multitasking environment for executing one or more modules. Such modules may include application engine 230, audio processing engine 231, video processing engine 232, translation engine 238, and/or user interface engine 239. Memory devices 214 may connect via one or more I/O interfaces 215 to internal or external systems and/or devices, such as one or more of audio sensors 216, image sensors 217, and/or displays 219. One or more I/O interfaces 215 may incorporate network interface hardware, such as one or more wired and/or wireless network interface controllers (NICs) for communicating via a communication channel or network (e.g., a packet-based network).
One or more audio sensors 216 may correspond to sensor 116 of
One or more image sensors 217 may correspond to camera 117 of
One or more displays 219 may correspond to user interface device 119 of
Operating system 220 may provide an operating environment for executing one or more modules or software components, which may include application engine 230, audio processing engine 231, video processing engine 232, translation engine 238, and/or user interface engine 239. Application engine 230 may include any appropriate application for which techniques described herein may be used, including a social media application, a social networking application, a video publishing, creating or editing application or otherwise.
Audio processing engine 231 may perform functions relating to processing speech data. Audio processing engine 231 may receive audio data that includes speech and may transcribe or parse speech, including individual words, from the audio data. Audio processing engine 231 may associate a timestamp with each of the words, thereby enabling modifications to the words while preserving aspects of the timing and spacing of the original words.
Video processing engine 232 may perform functions relating to processing a sequence of images. Video processing engine 232 may generate video sequences from a set of images. In generating such video sequences, video processing engine 232 may overlay captions on selected images within a sequence of images.
Translation engine 238 may perform functions relating to translating text from one language to another. Translation engine 238 may translate specific words or sequences of words in one language into specific words or sequences of words in another language. In performing translations of sequences of words, translation engine 238 may identify words or sets of words in a sequence of translated words that correspond to words or sets of words in the original sequence of words. Translation engine 238 may identify such corresponding words, enabling audio processing engine 231 to generate captions for a video sequence that preserves aspects of timing and spacing of the original, untranslated words.
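To make the idea of corresponding word sets concrete, the following sketch assigns time spans to translated words from a list of correspondence groups. The group format (a pair of index lists), the function name `align_translation`, and the English and French sample words are assumptions for illustration; this disclosure does not specify how a translation engine reports correspondences.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds
Group = Tuple[List[int], List[int]]  # (original indices, translated indices)


def align_translation(original_spans: List[Span],
                      translated: List[str],
                      groups: List[Group]
                      ) -> List[Optional[Tuple[str, float, float]]]:
    """Give each translated word the time span of its original word group.

    Translated words that appear in no group are left as None.
    """
    timed: List[Optional[Tuple[str, float, float]]] = [None] * len(translated)
    for orig_idx, trans_idx in groups:
        start = min(original_spans[i][0] for i in orig_idx)
        end = max(original_spans[i][1] for i in orig_idx)
        # Translated words in one group evenly share the group's span.
        step = (end - start) / len(trans_idx)
        for k, j in enumerate(trans_idx):
            timed[j] = (translated[j], start + k * step,
                        start + (k + 1) * step)
    return timed
```

In this sketch, two original words that translate to a single word (such as "I like" and "J'aime") yield one caption spanning the time of both original words.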
Modules illustrated in
Although certain modules, data stores, components, programs, executables, data items, functional units, and/or other items included within one or more storage devices may be illustrated separately, one or more of such items could be combined and operate as a single module, component, program, executable, data item, or functional unit. For example, one or more modules or data stores may be combined or partially combined so that they operate or provide functionality as a single module. Further, one or more modules may interact with and/or operate in conjunction with one another so that, for example, one module acts as a service or an extension of another module. Also, each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may include multiple components, sub-components, modules, sub-modules, data stores, and/or other components or modules or data stores not illustrated.
Further, each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may be implemented in various ways. For example, each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may be implemented as a downloadable or pre-installed application or “app.” In other examples, each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may be implemented as part of an operating system executed on a computing device.
In
Some words from sequence of words 341 have a direct word-to-word correspondence from sequence of words 342, such as “ball” and “balle”, and the captions presented in images 352 may include such words in sequence of words 342 at the respective times for the corresponding words in sequence of words 341. However, as is typical for a language translation, and as can be seen from
In accordance with one or more aspects of the present disclosure, computing system 210 may capture a sequence of images and accompanying audio data. For instance, in an example that can be described with reference to
Computing system 210 may generate data corresponding to sequence of words 341 of
Computing system 210 may detect input corresponding to a request to modify sequence of words 341. For instance, continuing with the example being described, I/O interface 215 detects input that operating system 220 and/or user interface engine 239 determines corresponds to signals generated by a keyboard, mouse, or other input device that may be operated by a user. Operating system 220 and/or user interface engine 239 determines that the signals correspond to interactions editing text (e.g., where the text corresponds to sequence of words 341) presented at display 219 or at another display device. In another example, I/O interface 215 may detect input that operating system 220 and/or user interface engine 239 determines corresponds to signals generated by a presence-sensitive panel or touchscreen associated with display 219. In such an example, operating system 220 and/or user interface engine 239 may further determine that such signals correspond to interactions editing text presented at display 219. In still other examples, I/O interface 215 may detect input that operating system 220 and/or user interface engine 239 determines corresponds to signals generated by a voice prompt user interface or other type of user interface or user interface device used by a user to modify sequence of words 341.
Computing system 210 may translate sequence of words 341 into translated sequence of words 342. For instance, again in the example and with reference
Computing system 210 may align the timing of the words in translated sequence of words 342 with one or more words in sequence of words 341. For instance, in the example being described with reference to
In some examples, including in translations where sequence of words 341 has been changed significantly, audio processing engine 231 might not attempt to create a direct word to word correspondence. In other words, when translated sequence of words 342 is substantially different than sequence of words 341, audio processing engine 231 might take a different approach to aligning translated sequence of words 342 to sequence of words 341. In one such example, audio processing engine 231 might simply determine the time period over which sequence of words 341 occurs, and also determine the pacing and/or spacing of words associated with sequence of words 341. Audio processing engine 231 may use such information to appropriately pace and/or space words in translated sequence of words 342 accordingly, and use information about the appropriate pace and/or spacing to generate captions for a video corresponding to
Computing system 210 may generate a sequence of images captioned with words from translated sequence of words 342. For instance, again referring to the example being described with reference to
Computing system 210 may present video sequence 353 in response to user input. For instance, in some examples, video sequence 353 may be stored at computing system 210 (e.g., in memory device 214). Video sequence 353 may be made available (e.g., by computing system 210) for playback (e.g., as a file for further editing, as a social media post, as an on-demand video, or otherwise). In such an example, user interface engine 239 of computing system 210 may detect input that application engine 230 determines corresponds to a request to present video sequence 353. Application engine 230 accesses video sequence 353 within memory device 214 and causes video sequence 353 to be presented at display 219, including the sequence of captions 351 as illustrated in
In
In the example of
In another example, computing system 210 may calculate, based on the transcribed speech from the original audio signal, an average or typical time that the speaker takes to say a word, and an average or typical amount of time that the speaker pauses between words. Computing system 210 may use this information to determine pacing and flow, and may present sequence of words 442 as a sequence of captions in images 452 using the determined pacing and flow. In such an example, the captions may span more or fewer images 452 than the original text, but the pacing and/or flow may be more aligned with the spoken words in the original audio.
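The pacing calculation just described might be sketched as follows, assuming the arithmetic mean is used as the "average or typical" value and that the original words are available as (start, end) pairs in seconds. The function name `pace_words` is hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def pace_words(original_spans: List[Tuple[float, float]],
               new_words: List[str],
               start: float) -> List[Tuple[str, float, float]]:
    """Lay out new_words using the speaker's average word and pause length."""
    avg_word = mean(end - begin for begin, end in original_spans)
    # A pause is the gap between one word's end and the next word's start.
    pauses = [nxt[0] - cur[1]
              for cur, nxt in zip(original_spans, original_spans[1:])]
    avg_pause = mean(pauses) if pauses else 0.0
    timed = []
    t = start
    for word in new_words:
        timed.append((word, t, t + avg_word))
        t += avg_word + avg_pause
    return timed
```

Because each new word is given the average duration and pause, a new sequence with more words than the original extends past the original end time, which corresponds to captions spanning more images.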
In the process illustrated in
Computing system 210 may transcribe the audio data into text (502). For example, again with reference to
Computing system 210 may associate a timestamp with each of the original words in the text (503). For example, audio processing engine 231 may determine a time associated with the start and end of each of the transcribed words in the text. Audio processing engine 231 may also determine information about the timing and/or duration of any pauses occurring between the words in the text. Audio processing engine 231 may store timestamps and other timing information associated with whether the audio of the speech includes a pause between any or all of the words spoken during the speech. In some examples, application engine 230 of computing system 210 may store timestamps and information about the timing and/or duration of such pauses in memory device 214.
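As a rough illustration of the timing information described above, the sketch below stores a start and end time for each transcribed word and derives the pauses between consecutive words. The record types `WordTiming` and `Pause`, the function `find_pauses`, and the 0.25-second threshold for counting a gap as a pause are all assumptions, not details from this disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTiming:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Pause:
    after_index: int  # index of the word that precedes the pause
    duration: float   # seconds


def find_pauses(words: List[WordTiming],
                min_duration: float = 0.25) -> List[Pause]:
    """Return the gaps between consecutive words that last long enough."""
    pauses = []
    for i in range(len(words) - 1):
        gap = words[i + 1].start - words[i].end
        if gap >= min_duration:
            pauses.append(Pause(after_index=i, duration=gap))
    return pauses
```

Records of this kind could then be stored alongside the transcribed text so that later edits can refer back to the original timing.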
Computing system 210 may process input to modify original words in the text (504). For example, still referring to
Computing system 210 may determine that further modifications are being made to the text (YES path from 505). For example, I/O interface 215 may detect further signals that operating system 220 determines correspond to further interactions editing the original words in the text. User interface engine 239 further modifies the original text in response to the further input. Eventually, computing system 210 determines that further modifications are no longer being made to the text (NO path from 505).
Computing system 210 may generate a new sequence of images with new words as captions (506). For example, and referring again to
For processes, apparatuses, and other examples or illustrations described herein, including in any flowcharts or flow diagrams, certain operations, acts, steps, or events included in any of the techniques described herein can be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the techniques). Moreover, in certain examples, operations, acts, steps, or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially. Further, certain operations, acts, steps, or events may be performed automatically even if not specifically identified as being performed automatically. Also, certain operations, acts, steps, or events described as being performed automatically may be alternatively not performed automatically, but rather, such operations, acts, steps, or events may be, in some examples, performed in response to input or another event.
For ease of illustration, only a limited number of devices (e.g., computing systems 110, 210, compute nodes 120, as well as others) are shown within the Figures and/or in other illustrations referenced herein. However, techniques in accordance with one or more aspects of the present disclosure may be performed with many more of such systems, components, devices, modules, and/or other items, and collective references to such systems, components, devices, modules, and/or other items may represent any number of such systems, components, devices, modules, and/or other items.
The Figures included herein each illustrate at least one example implementation of an aspect of this disclosure. The scope of this disclosure is not, however, limited to such implementations. Accordingly, other example or alternative implementations of systems, methods or techniques described herein, beyond those illustrated in the Figures, may be appropriate in other instances. Such implementations may include a subset of the devices and/or components included in the Figures and/or may include additional devices and/or components not shown in the Figures.
The detailed description set forth above is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a sufficient understanding of the various concepts. However, these concepts may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in the referenced figures in order to avoid obscuring such concepts.
Accordingly, although one or more implementations of various systems, devices, and/or components may be described with reference to specific Figures, such systems, devices, and/or components may be implemented in a number of different ways. For instance, one or more devices illustrated in the Figures herein (e.g.,
Further, certain operations, techniques, features, and/or functions may be described herein as being performed by specific components, devices, and/or modules. In other examples, such operations, techniques, features, and/or functions may be performed by different components, devices, or modules. Accordingly, some operations, techniques, features, and/or functions that may be described herein as being attributed to one or more components, devices, or modules may, in other examples, be attributed to other components, devices, and/or modules, even if not specifically described herein in such a manner.
Although specific advantages have been identified in connection with descriptions of some examples, various other examples may include some, none, or all of the enumerated advantages. Other advantages, technical or otherwise, may become apparent to one of ordinary skill in the art from the present disclosure. Further, although specific examples have been disclosed herein, aspects of this disclosure may be implemented using any number of techniques, whether currently known or not, and accordingly, the present disclosure is not limited to the examples specifically described and/or illustrated in this disclosure.
The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, DSPs, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. The term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry. A control unit comprising hardware may also perform one or more of the techniques of this disclosure.
Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure. In addition, any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware or software components or integrated within common or separate hardware or software components.
The techniques described in this disclosure may also be embodied or encoded in a computer-readable medium, such as a computer-readable storage medium, containing instructions. Instructions embedded or encoded in a computer-readable storage medium may cause a programmable processor, or other processor, to perform the method, e.g., when the instructions are executed. Computer readable storage media may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer readable media.
As described by way of various examples herein, the techniques of the disclosure may include or be implemented in conjunction with an artificial reality system. As described, artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., a virtual reality (VR), an augmented reality (AR), a mixed reality (MR), a hybrid reality, or some combination and/or derivatives thereof. Artificial reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some examples, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an artificial reality and/or used in (e.g., perform activities in) an artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers.
This application is a continuation application of and claims priority to U.S. patent application Ser. No. 16/601,102 filed on Oct. 14, 2019, which is hereby incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
Parent | 16601102 | Oct 2019 | US |
Child | 17170314 | US |