A transcript is a written rendering of dictated or recorded speech. Transcripts are used in many applications, including audio and video editing, closed-captioning, and subtitling. For these types of applications, transcripts include time codes that synchronize the written words with spoken words in the recordings.
Although the transcription quality of automated transcription systems is improving, they still cannot match the accuracy of professional transcribers or capture fine distinctions in meaning as well as they can. In addition to transcribing speech into written words, professional transcribers often also enter time codes into the transcripts in order to synchronize the transcribed words with the recorded speech. Although superior to automated transcription, manual transcription and time code entry is labor-intensive, time-consuming, and subject to error. Errors also can arise as a result of the format used to send transcripts to transcribers. For example, an audio recording may be divided into short segments and sent to a plurality of transcribers in parallel. Although this process can significantly reduce transcription times, it also can introduce transcription errors. In particular, when an audio segment starts indiscriminately in the middle of a word or phrase, or when there otherwise is insufficient audio context for the transcriber to infer the first one or more words, the transcriber will not be able to produce an accurate transcription.
A time-coded written transcript that is synchronized with an audio or video recording enables an editor or filmmaker to rapidly search the contents of the source audio or video based on the corresponding written transcript. Although text-based searching can allow a user to rapidly navigate to a particular spoken word or phrase in a transcript, such searching can be quite burdensome when many hours of source media content from different sources must be examined. In addition, to be most effective, there should be a one-to-one correlation between the search terms and the source media content; in most cases, however, the correlation is one-to-many resulting in numerous matches that need to be individually scrutinized.
Thus, there is a need in the art to reduce or eliminate transcription and time-coding errors in transcripts, and there is a need for a more efficient approach for finding the most salient parts of source media content transcripts.
This specification describes systems implemented by one or more computers executing one or more computer programs that can achieve highly accurate timing alignment between spoken words in an audio recording and the written words in the associated transcript, which is essential for a variety of transcription applications, including audio and video editing applications, audio and video search applications, and captioning and subtitling applications, to name a few.
Embodiments of the subject matter described herein can be used to overcome the above-mentioned limitations of prior approaches and thereby achieve the following advantages. For example, the disclosed systems and methods can substantially reduce the burden of identifying the best media content, discovering themes, and making connections between seemingly disparate source media. Embodiments of the subject matter described herein include methods, systems, apparatus, and tangible non-transitory carrier media encoded with one or more computer programs for providing the search and categorization tools needed to rapidly parse source media recordings using highlights, make connections (thematic or otherwise) between highlights, and combine highlights into a coherent and focused multimedia file.
Other features, aspects, objects, and advantages of the subject matter described in this specification will become apparent from the description, the drawings, and the claims.
In the following description, like reference numbers are used to identify like elements. Furthermore, the drawings are intended to illustrate major features of exemplary embodiments in a diagrammatic manner. The drawings are not intended to depict every feature of actual embodiments nor relative dimensions of the depicted elements, and are not drawn to scale.
A “computer” is any machine, device, or apparatus that processes data according to computer-readable instructions that are stored on a computer-readable medium either temporarily or permanently.
A “computer operating system” is a software component of a computer system that manages and coordinates the performance of tasks and the sharing of computing and hardware resources.
A “software application” (also referred to as software, an application, computer software, a computer application, a program, and a computer program) is a set of instructions that a computer can interpret and execute to perform one or more specific tasks.
A “data file” is a block of information that durably stores data for use by a software application.
The term “media” refers to a single form of information content, for example, audio or visual content. “Multimedia” refers to multiple forms of information content, such as audio and visual content. Media and multimedia typically are stored in an encoded file format (e.g., MP4).
A media “track” refers to one of multiple forms of information content in multimedia (e.g., an audio track or a video track). A media “sub-track” refers to a portion of a media track.
An “audio part” is a discrete interval that is segmented or copied from an audio track.
A “language element” is a discernably distinct unit of speech. Examples of language elements are words and phonemes.
“Time coding” refers to associating time codes with the words in a transcript of recorded speech (e.g., audio or video). Time coding may include associating time codes, offset times, or time scales relative to a master set of time codes.
“Forced alignment” is the process of determining, for each word in a transcript of an audio track containing speech, the time interval (e.g., start and end time codes) in the audio track that corresponds to the spoken word and its constituent phonemes.
A “tag” is an object that represents a portion of a media source. Each tag includes a unique category descriptor, one or more highlights, and one or more comments.
A “highlight” is an object that includes text copied from a transcript of a media source and the start and end time codes of the highlight text in the transcript.
A “comment” is an object that includes a text string that is associated with a portion of a transcript and typically conveys a thought, opinion, or reaction to the associated portion of the transcript.
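The tag, highlight, and comment objects defined above can be modeled with simple data structures. The following Python sketch is illustrative only; the field names and types are assumptions, not the application's actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Highlight:
    """Text copied from a transcript plus its start/end time codes (seconds)."""
    text: str
    start_time: float
    end_time: float

@dataclass
class Comment:
    """A text string associated with a portion of a transcript."""
    text: str
    transcript_start: float
    transcript_end: float

@dataclass
class Tag:
    """A unique category descriptor with its associated highlights and comments."""
    category: str
    highlights: List[Highlight] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)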
Alignment
For a variety of transcription applications, it is essential to have accurate timing alignment between the spoken words in an audio recording and the written words in the associated transcript. For example: (1) in audio and video editing applications, accurate timing alignment is required for edits to the transcript text to be accurately applied to the corresponding locations in the audio and video files; (2) in audio and video search applications, accurate timing alignment is required for word searches on the transcript to accurately locate the corresponding spoken words in the audio and video files; and (3) in captioning and subtitling applications, accurate timing alignment is required to avoid a disconcerting lag between the times when words are spoken in the audio and video files and the times when the corresponding transcript words are displayed. As explained in detail below, transcription accuracy can be improved by ensuring that transcribers are given sufficient preliminary audio context to comprehend the speech to be transcribed, and timing alignment can be improved by synchronizing multiple media source sub-tracks to a master transcript and by correcting timing errors (e.g., drift) that may result from using low quality or faulty recording equipment.
Exemplary Use Case
Source media recordings (also referred to herein as source media files, clips, or tracks) are obtained (
A master audio track is created from one or more audio recordings that are obtained from the source media files (
A transcription of the master audio track is procured (
Each audio segment transcript is force-aligned to the master audio track to produce a master transcript of language elements (e.g., words or phonemes) that are associated with respective time coding (
As explained above, “forced alignment” is the process of determining, for each language element (e.g., word or phoneme) of a transcript, the time interval (e.g., start and end times) in the corresponding audio recording containing the spoken text of the language element. In some examples, the force aligner component of the Gentle open source project (see https://lowerquality.com/gentle) is used to generate the time coding data for aligning the master transcript to the master audio track. In some embodiments, the Gentle force aligner includes a forward-pass speech recognition stage that uses 10 ms “frames” for phoneme prediction (in a hidden Markov model), and this timing data can be extracted along with the transcription results. The speech recognition stage has an explicit acoustic and language-modeling pipeline that allows for extracting accurate intermediate data-structures, such as frame-level timing. In operation, the speech recognition stage generates language elements recognized in the master audio track and times at which the recognized language elements occur in the master audio track. The recognized language elements in the master audio track are compared with the language elements in the individual transcripts to identify times at which one or more language elements in the transcripts occur in the master audio track. The identified times then are used to align a portion of the transcript with a corresponding portion of the master audio track. Other types of force aligners may be used to align the master transcript to the master audio track.
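The comparison between recognized language elements and transcript language elements described above can be illustrated with a simplified Python sketch that matches recognized words (carrying frame-level times) against the transcript words and transfers the recognized times onto the matched transcript words. This is not the Gentle implementation itself; it is a minimal illustration, and the input formats are assumptions.

import difflib
from typing import List, Optional, Tuple

def align_transcript(
    transcript_words: List[str],
    recognized: List[Tuple[str, float, float]],  # (word, start_sec, end_sec) from the recognizer
) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Attach recognizer timings to transcript words where the word sequences match."""
    rec_words = [w for w, _, _ in recognized]
    matcher = difflib.SequenceMatcher(
        a=[w.lower() for w in transcript_words],
        b=[w.lower() for w in rec_words],
    )
    aligned: List[Tuple[str, Optional[float], Optional[float]]] = [
        (w, None, None) for w in transcript_words
    ]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            word = transcript_words[block.a + k]
            _, start, end = recognized[block.b + k]
            aligned[block.a + k] = (word, start, end)
    # Unmatched words keep None timings and can be interpolated from their neighbors.
    return aligned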
After the time-coded master transcript is produced (
In some examples, video tracks are captured at the same time by different cameras that are focused on different respective speakers. In some of these examples, the force aligner splices in video content that is selected based on speaker labels associated with the master transcript. For example, the force aligner splices in a first video sub-track during times when a first speaker is speaking and splices in a second video sub-track during times when a second speaker is speaking, where the times when the different speakers are speaking are determined from the speaker labels and the time coding in the master track.
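The speaker-based splicing decision can be sketched as follows: given speaker-labeled, time-coded words from the master transcript and a mapping from speaker labels to camera sub-tracks, consecutive words from the same camera are merged into splice intervals. The data shapes and the fallback camera are assumptions for illustration.

from typing import Dict, List, Tuple

def speaker_splice_intervals(
    words: List[Tuple[str, str, float, float]],   # (word, speaker_label, start_sec, end_sec)
    camera_for_speaker: Dict[str, str],           # e.g., {"Speaker 1": "cam_A", "Speaker 2": "cam_B"}
) -> List[Tuple[str, float, float]]:
    """Merge consecutive same-speaker words into (camera, start, end) splice intervals."""
    intervals: List[Tuple[str, float, float]] = []
    for _, speaker, start, end in words:
        camera = camera_for_speaker.get(speaker, "cam_A")  # fall back to a default camera
        if intervals and intervals[-1][0] == camera:
            intervals[-1] = (camera, intervals[-1][1], end)  # extend the current interval
        else:
            intervals.append((camera, start, end))
    return intervals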
Thus, the composite video 30 shown in
Source media 52 may be uploaded to a server 54 from a variety of sources, including volatile and non-volatile memory, flash drives, computers, recording devices 56, such as video cameras, audio recorders, and mobile phones, and online resources 58, such as Google Drive, Dropbox, and YouTube. Typically, a client uploads a set of media files 51 to the server 54, which copies the audio tracks of the uploaded media files and arranges them into a sequence specified by the user to create a master audio track 60. The master audio track 60 is processed by an audio segmentation system 62, which divides the master audio track 60 into a sequence of audio segments 64, which typically have a uniform length (e.g., one minute to five minutes) and may or may not have padding (e.g., beginning and/or ending portions that overlap with adjacent audio segments). In the illustrated example, the audio segments 64 are stored in an audio segment database 66, where they are associated with identifying information, including, for example, a client name, a project name, and a date and time.
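The segmentation performed by the audio segmentation system 62 can be sketched as computing fixed-length segment boundaries with optional overlapping padding. The segment length and padding values below are illustrative assumptions.

from typing import List, Tuple

def segment_boundaries(
    total_sec: float,
    segment_sec: float = 120.0,   # uniform segment length (e.g., one to five minutes)
    padding_sec: float = 5.0,     # overlap with adjacent segments; 0 for no padding
) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, for each audio segment of the master track."""
    boundaries = []
    start = 0.0
    while start < total_sec:
        end = min(start + segment_sec, total_sec)
        padded_start = max(0.0, start - padding_sec)
        padded_end = min(total_sec, end + padding_sec)
        boundaries.append((padded_start, padded_end))
        start = end
    return boundaries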
In some examples, the audio segments 64 are transcribed by a transcription service 67. A plurality of professional transcribers 68 typically work in parallel to transcribe the audio segments 64 that are divided from a single master audio track 60. In other examples, one or more automated machine learning based audio transcription systems 70 are used to transcribe the audio segments 64 (see
After the transcripts 72 of the audio segments 64 have been transferred to the server 54, they may be stored in a transcript database 73. The transcripts are individually force-aligned to the master audio track 60 by a force aligner 74 to produce a master transcript 76 of time-coded language elements (e.g., words or phonemes), which may be stored in a master transcript database 78 (see
In some examples, the master audio track is divided into audio segments without any overlapping padding. In these examples, the force aligner starts force-aligning each master transcript segment to the master audio track at the beginning of each transcript segment.
As explained above, however, the lack of audio padding at the start and end portions of each audio segment can prevent transcribers from comprehending the speech to be transcribed, increasing the likelihood of transcription errors. Transcription accuracy can be improved by ensuring that transcribers are given sufficient audio context to comprehend the speech to be transcribed. In some examples, the master audio track is divided into audio segments with respective audio content that overlaps the audio content in the preceding and/or successive audio segments in the master audio track. In these examples, to avoid duplicating words, the force-aligner automatically starts force-aligning each master transcript segment to the master audio track at a time point in the master audio track that is offset from the beginning and/or end of the master transcript segment by an amount corresponding to the known length of the padding.
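The offset bookkeeping described above can be sketched as follows: each segment's word times are mapped from segment-relative times to master-track times, and words that fall inside the leading or trailing padding are discarded so that they are not duplicated across adjacent segments. The parameter names are assumptions for illustration.

from typing import List, Tuple

def map_segment_words_to_master(
    words: List[Tuple[str, float, float]],  # (word, start, end) relative to the padded segment
    segment_index: int,
    segment_sec: float,
    padding_sec: float,
) -> List[Tuple[str, float, float]]:
    """Convert segment-relative word times to master-track times, discarding
    words that were transcribed from the leading or trailing padding."""
    padded_start = max(0.0, segment_index * segment_sec - padding_sec)
    seg_start = segment_index * segment_sec        # unpadded segment start in the master track
    seg_end = seg_start + segment_sec              # unpadded segment end in the master track
    mapped = []
    for word, start, end in words:
        master_start = padded_start + start
        master_end = padded_start + end
        if master_start >= seg_start and master_end <= seg_end:
            mapped.append((word, master_start, master_end))
    return mapped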
Correcting Audio Drift
Some microphones exhibit imperfect timing, which results in the loss of synchronization between recordings over time. Referring to
In this example, the audio captured by the primary microphone (MIC 1) is the master audio track, which is transcribed into a sequence of one or more transcripts that are force-aligned to the master audio track to produce the master transcript of language elements and associated timing data. The audio captured by the secondary microphone is force-aligned to the master transcript to produce a set of time offsets between the words in the master transcript and the spoken words in the secondary audio track. As shown in
The computed time offsets to the master transcript can be used to splice in any media that is synchronized with the secondary audio track. In some examples, the linear best fit is used to determine the correct timing offsets for splicing in a video track synchronized with the secondary audio track. In some examples, the linear best fit is used to determine the correct time offsets in real time, subject to a maximum allowable drift threshold. For example, when patching in a video track that exhibits drift relative to the master transcript, frames in the video track can be skipped, duplicated, or deleted, or the timing of the video frames can be adjusted to reduce drift to a level that maintains the drift within the allowable drift threshold. For example, linear interpolation can be applied throughout an entire media chunk to realign the timing data and reduce the drift. In other examples, a video track that exhibits drift can be divided into a number of different parts, each of which is force-aligned to the master transcript and spliced in separately with a respective offset that ties each part to the master audio track.
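The linear best fit described above can be sketched with a least-squares fit of the measured offsets: given pairs of (master time, secondary-track time) for the same words produced by force alignment, the fitted line maps master-transcript times onto the drifting track's timeline. This sketch assumes NumPy is available; the example numbers are illustrative.

import numpy as np

def fit_drift(master_times: np.ndarray, secondary_times: np.ndarray):
    """Fit secondary_time ≈ slope * master_time + intercept (linear best fit)."""
    slope, intercept = np.polyfit(master_times, secondary_times, deg=1)
    return slope, intercept

def master_to_secondary(t_master: float, slope: float, intercept: float) -> float:
    """Map a master-transcript time to the corresponding time in the drifting track."""
    return slope * t_master + intercept

# Example: word timings (seconds) obtained by force-aligning both tracks to the master transcript.
master = np.array([1.0, 60.0, 300.0, 900.0])
secondary = np.array([1.0, 60.05, 300.25, 900.75])   # drifting slightly over time
slope, intercept = fit_drift(master, secondary)
print(master_to_secondary(450.0, slope, intercept))   # where in the secondary track to splice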
The approaches described above are robust and work under a variety of adverse conditions as a result of the very high accuracy of the process of force-aligning media tracks to the master transcript. For example, even if the microphones are very different (e.g., a clip-on microphone that records only one person's voice and another microphone that records audio from everything in the room), the clip-on microphone typically will capture enough words for timing data to be obtained and used to correct the drift. An advantage of this approach is that it accommodates a large disparity in microphone drift and audio quality: any recording in which speech is intelligible will work, and there is no need to maintain tone or use all of the audio channels (e.g., the audio channels could be highly directional). In this way, the approach of force-aligning all channels to the timing data associated with the master transcript offers many advantages over using acoustic signals directly. Even if the transcript is imperfect (e.g., an automated machine transcript), it is likely to be good enough to force-align audio tracks to the master transcript. For this particular application, a verbatim transcript is not required; the transcript only needs to have enough words and timing data for the force aligner to anchor into it.
As explained above, in some examples, a user optionally can specify for each audio recording a rank that will dictate precedence for the automated inclusion of one audio recording over other audio recordings that overlap a given interval in the master audio track. This feature may be useful in scenarios in which there are gaps in the primary audio recording captured by a dedicated high-quality microphone. In such cases, the gap can be filled-in with audio data selected based on the user designated ranking of the one or more microphones that recorded audio content overlapping the gap in coverage. In some examples, the ranking corresponds to designated quality levels of the recording devices used to capture the audio recordings.
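The rank-based gap filling can be sketched as selecting, among the recordings that overlap a gap in the primary audio, the one with the best user-assigned rank. The data shapes and the convention that a lower rank value means higher precedence are assumptions for illustration.

from typing import List, Optional, Tuple

def fill_gap(
    gap: Tuple[float, float],                          # (start, end) of the gap in master-track time
    recordings: List[Tuple[str, float, float, int]],   # (track_id, start, end, rank); lower rank = higher precedence
) -> Optional[str]:
    """Return the track_id of the highest-precedence recording that covers the gap, if any."""
    gap_start, gap_end = gap
    candidates = [
        (rank, track_id)
        for track_id, start, end, rank in recordings
        if start <= gap_start and end >= gap_end
    ]
    return min(candidates)[1] if candidates else None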
In an example, the current media track is a master audio track and the replacement media track is a higher quality audio track that includes a section that overlaps the master audio track. In another example, the current media track is a video track of a first speaker and the replacement media track is a video track of a second speaker that is selected based on a speaker label associated with the master transcript.
Editing Application
As explained above, even with transcripts that contain accurate timing data (e.g., time codes or offsets to master transcript timing data) that are synchronized with audio and video recordings, finding the best media content to use for a project involving many sources and many hours of recordings can be difficult and time-consuming. The systems and methods described herein provide the search and categorization tools needed to rapidly parse source media recordings using highlights, make connections (thematic or otherwise) between highlights, and combine highlights into a coherent and focused multimedia file. The ease and precision of creating a highlight can be the basis for notes, comments, discussion, and content discovery. In addition, these systems and methods support collaborative editing within projects, where users who are concurrently on the system can immediately see and respond to changes made and suggested by other users. In this way, these systems and methods can substantially reduce the burden of identifying the best media content, discovering themes, and making connections between seemingly disparate source media.
The media editing web-application 142 provides media editing services to remote users in the context of a network communications and computing infrastructure environment 144. Clients may access the web-application from a variety of different client devices 146, including desktop computers, laptop computers, tablet computers, mobile phones and other mobile clients. Users access the media editing service 140 by logging into the web site. In one example, the landing page 148 displays a set of projects 150 that are associated with the user. As explained in detail below, each project 150 may be edited and refined in a recursive process flow between different sections of the web-application that facilitate notes, comments, discussion, and content discovery, and enables users to quickly identify the most salient media content, discover themes, and make connections between highlights. In the illustrated embodiment, the main sections of the web-application are a source media page 152, a highlights page 154, and a composition page 156.
The user opens a project 150 by selecting a project from a set of projects that is associated with the user. This takes the user to the source media page 152 shown in
The source media page 152 includes an upload region 158 that enables a user to upload source media into the current project, either by dragging and dropping a graphical representation of the source media into the upload box 158 or by selecting the “Browse” link 160, which brings up an interface for specifying the source media to upload into the project. Any user in a project may upload source media for the project. Each source media may include multiple audio and video files. As explained above in connection with
After the user has uploaded source media to the project, the service server 54 may process the uploaded source media, as described above in connection with
Referring back to
In some examples, the media editing application 142 is configured to automatically populate the fields of each source media panel with metadata that is extracted from the corresponding source media file 51. In other examples, the user may manually enter the metadata into the fields of each source media panel 162, 163, 165, 167.
Each source media panel 162, 163, 165, 167 also includes a respective graphical interface element 176 that brings up an edit window 178 that provides access to an edit tool 180 that allows a user to edit the caption of the source media panel 162 and a remove tool 182 that allows a user to delete the corresponding source media from the project.
The image 164 of the first frame of the corresponding source media file is associated with a link that takes the user to a source media highlighting interface 220 that enables the user to create one or more highlights of the corresponding source media as described below in connection with
The source media page 152 also includes a search interface box 184 for inputting search terms to a search engine of the media editing web-application 142 that can find results in the text-based elements (e.g., words) in a project, including, for example, one or more of the transcripts, source media metadata, highlights, and comments. In some embodiments, the search engine operates in two modes: a basic word search mode, and an extended search mode.
The basic word search mode returns exact word or phrase matches between the input search words and the words associated with the current project. In some examples, the words associated with the current project form a corpus consisting of the intersection between the vocabulary of words in a dictionary and the words in the current project.
After performing the basic word search, the user has the option to extend the search to semantically related words in the dictionary. Therefore, in addition to finding exact-word matches, the search engine is able to find semantically-related results, using a word embedding model. In an embodiment of this process, only the vectors of words contained within the project are considered when computing a distance from a search term. In some examples, the search engine identifies terms that are similar to the input search terms using a word embedding model that maps search terms to word vectors in a word vector space. In some examples, the cosine similarity is used as the measure of similarity between two word vectors. The extended search results then are joined with the exact word match results, if any. In some use cases, this approach allows the user to isolate all conversational segments relating to a theme of interest, and navigate exactly to the relevant part of the video based on the precise timing alignment between the video and the words in the master transcript.
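The extended search mode can be illustrated with a short sketch: the search term is mapped to a word vector, cosine similarity is computed against only the words that occur in the project, and sufficiently similar words are joined with the exact-match result. The embedding table and the similarity threshold are assumptions for illustration.

import numpy as np
from typing import Dict, List, Set

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def extended_search(
    query: str,
    project_vocab: Set[str],               # words that actually occur in the project
    embeddings: Dict[str, np.ndarray],     # word -> vector (e.g., a pre-trained embedding table)
    threshold: float = 0.6,                # minimum cosine similarity (assumed value)
) -> List[str]:
    """Return the exact match (if any) plus semantically related project words, ranked by similarity."""
    results = [query] if query in project_vocab else []
    if query not in embeddings:
        return results
    q_vec = embeddings[query]
    scored = [
        (cosine_similarity(q_vec, embeddings[w]), w)
        for w in project_vocab
        if w != query and w in embeddings
    ]
    results += [w for score, w in sorted(scored, reverse=True) if score >= threshold]
    return results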
In the example shown in
Referring to
The media source highlighting interface 220 includes a media player pane 222 for playing video content of the selected media source and a transcript pane 224 for displaying the corresponding synchronized transcript 225 of the selected media source. The media source highlighting interface 220 also includes a progress bar 226 that shows the currently displayed frame with a dark line 228 and indicates the locations of respective highlights in the media source with shaded intervals 230 of the progress bar 226. Below the progress bar 226 is a header bar 227 that shows the name 232 (SpeakerID) of the current speaker at the current location in the transcript, the current playback time 234 in the media source, and a “Download Transcript” button 236 that enables the user to download a text document that contains the transcript 225 of the selected source media.
The media source highlighting interface 220 enables a user to create highlights of the selected source media. A highlight is an object that includes text copied from a transcript of a media source and the start and end time codes of the copied text. In some examples, the user creates a highlight by selecting text 238 in the transcript 225 displayed in the transcript pane 224 of the media source highlighting interface 220. The user may use any of a wide variety of input devices to select text in a transcript to create a highlight, including a computer mouse or track pad. In response to the user's selection of the text 238 shown in
Referring to
Referring to
In this way, the user can scroll through the media sources that are discovered in the search, play back individual ones of the source media files and their respective transcripts, and save tagged highlights of the source media in the search results without having to leave the current interface. At a high level, the fact that this textual search takes the user back to the primary source video is both valuable and unusual, due to the capacity of audio/video media to contain additional sentiment information that is not apparent in the transcript alone.
Referring back to
In an exemplary process flow, the user performs an iterative process that enables the user to quickly and efficiently isolate all conversational segments relating to a theme of interest and navigate exactly to the relevant part of the video. In such a process, the user starts off by searching for a word or phrase. The user examines the returned clips for relevance. The user then extends or broadens the search to related and suggested search terms. The user tags multiple clips with relevant themes or categories of interest. The user then edits individual clips to highlight a particular theme. The user can browse the clips and start playing the video at the exact point when the theme of interest begins. Now the user is ready to compile a single video for export consisting of segments related to the theme of interest.
Referring to
Referring back to
Referring to
The highlights sidebar 280 includes all of the highlights in the project, grouped by category. The user can drag and drop individual highlights or all the highlights associated with a selected category into the editing interface 350. An individual highlight or an entire group of highlights can be inserted into any location before, after, or between any of the highlights currently appearing in the editing interface 350.
The header bar 349 includes a title 380 for the current reel, an Add Title button 380 for editing the title 382 of each selected highlight in the current reel, a download button 384, and indications 386, 387 of the original length of the sequence of media files in the reel and the current length of the sequence of media files in the reel. In response to selection of the download button, all of the highlights are rendered into a single, continuous video, including corresponding edits, title pages, and optional burn-in captions as desired. Each highlight is represented in the editing interface 350 by a respective highlight panel 352. Each highlight panel 352 is an object that includes a respective image 354 of the first frame of the corresponding highlight, the name 356 (SpeakerID) of the speaker appearing in the highlight, indications 358 of the length and location of the highlight in the source media, a link 360 to the source media, the text of the highlight 362, a pair of buttons 364 for moving the associated highlight panel 352 forward or backward in the sequence of highlight panels, a closed captioning button 370 for turning on or off the appearance of closed captioning text in the playback pane 351, a toggle button 372 for expanding or collapsing cut edits in the transcript 362, and a delete button 374 for deleting the highlight and the associated highlight panel from the current reel.
As soon as one or more highlights are dragged and dropped into the editing interface 350, the web-application compiles the highlights into a concatenated sequence of media files. The highlights are played back according to the arrangement of highlights in the highlight panels 352. In one embodiment, the web-application concatenates the sequence of highlight panels 352 in the editing interface 350 from top to bottom. The sequence of media files can be played back by clicking the playback pane 351. Additionally, a Reel can be downloaded as rendered video by selecting the Download button 384. In this process, the web-application packages the concatenated sequence of media files into a single multimedia file.
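The compile-and-download step can be sketched with ffmpeg: each highlight's start and end time codes are used to cut a clip from its source media, and the clips are concatenated in panel order into a single multimedia file. This is an illustrative sketch rather than the web-application's actual rendering pipeline; it assumes the ffmpeg command-line tool is installed and that the re-encoded clips share compatible parameters.

import subprocess
import tempfile
from typing import List, Tuple

def render_reel(
    highlights: List[Tuple[str, float, float]],   # (source_path, start_sec, end_sec) in panel order
    output_path: str,
) -> None:
    """Cut each highlight from its source and concatenate the clips into one video file."""
    clip_paths = []
    for i, (src, start, end) in enumerate(highlights):
        clip = f"{tempfile.gettempdir()}/clip_{i}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(end - start), "-i", src,
             "-c:v", "libx264", "-c:a", "aac", clip],
            check=True,
        )
        clip_paths.append(clip)
    list_file = f"{tempfile.gettempdir()}/clips.txt"
    with open(list_file, "w") as f:
        f.writelines(f"file '{p}'\n" for p in clip_paths)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output_path],
        check=True,
    )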
If closed captioning is enabled, closed captioning text 390 will appear in the playback pane 351 synchronized with the words and phrases in the corresponding audio track. In particular, the web-application performs burn-in captioning using forced-alignment timing data so that each word shows up on-screen at the exact moment when it is spoken. In the editing interface 350, words and phrases in the text 362 of the highlight transcripts can be selected and struck out, resulting in a cut in the underlying audio and video multimedia. This allows the user to further refine the highlights to capture the precise themes of interest in the highlight. Indications of the struck out portions may or may not be displayed in the closed captioning or audio portions of the highlight. In the embodiment shown in
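The way word-level forced-alignment timings can drive both burn-in captions and strike-through cuts can be sketched as follows: each word carries a start and end time, struck-out words produce cut intervals in the underlying media, and the remaining words become caption events that appear exactly when spoken. The data shapes are assumptions for illustration.

from typing import List, Tuple

Word = Tuple[str, float, float, bool]   # (text, start_sec, end_sec, struck_out)

def captions_and_cuts(
    words: List[Word],
) -> Tuple[List[Tuple[str, float, float]], List[Tuple[float, float]]]:
    """Return (caption events, cut intervals) from word-level forced-alignment timings."""
    captions: List[Tuple[str, float, float]] = []
    cuts: List[Tuple[float, float]] = []
    for text, start, end, struck in words:
        if struck:
            # Merge adjacent struck-out words into a single cut interval.
            if cuts and abs(cuts[-1][1] - start) < 0.05:
                cuts[-1] = (cuts[-1][0], end)
            else:
                cuts.append((start, end))
        else:
            captions.append((text, start, end))   # word appears on screen exactly when spoken
    return captions, cuts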
In some embodiments, the user can apply typographical emphasis to one or more words in a highlight transcript, and the web-application will interpret the typographical emphasis as an instruction to automatically apply a media effect that is synchronized with the playback of the composite multimedia file. In the example shown in
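Interpreting bold emphasis as a synchronized volume increase can be sketched as follows, assuming the pydub library is available: the bolded word intervals (taken from the forced-alignment timing data) are boosted in volume in the corresponding audio track. The gain value is an assumed example.

from typing import List, Tuple
from pydub import AudioSegment

def apply_bold_volume_boost(
    audio_path: str,
    bold_intervals: List[Tuple[float, float]],   # (start_sec, end_sec) of bolded words
    gain_db: float = 6.0,                        # assumed boost amount
) -> AudioSegment:
    """Raise the volume of the audio during each bolded word interval."""
    audio = AudioSegment.from_file(audio_path)
    for start, end in bold_intervals:
        start_ms, end_ms = int(start * 1000), int(end * 1000)
        boosted = audio[start_ms:end_ms].apply_gain(gain_db)
        audio = audio[:start_ms] + boosted + audio[end_ms:]
    return audio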
Referring to
In addition to the above-described web application, there is an example mobile-first version of the web application that supports many of the same features (e.g., search, strike-through editing, and burn-in downloads) from a touchscreen-enabled, processor-operated mobile device. The text-based editing capabilities allow for extremely rapid and precise edits, even with the mobile form factor.
A user may interact (e.g., input commands or data) with the computer apparatus 420 using one or more input devices 430 (e.g., one or more keyboards, computer mice, microphones, cameras, joysticks, physical motion sensors, and touch pads). Information may be presented through a graphical user interface (GUI) that is presented to the user on a display monitor 432, which is controlled by a display controller 434. The computer apparatus 420 also may include other input/output hardware (e.g., peripheral output devices, such as speakers and a printer). The computer apparatus 420 connects to other network nodes through a network adapter 436 (also referred to as a “network interface card” or NIC).
A number of program modules may be stored in the system memory 424, including application programming interfaces 438 (APIs), an operating system (OS) 440 (e.g., the Windows® operating system available from Microsoft Corporation of Redmond, Wash. U.S.A.), software applications 441 including one or more software applications programming the computer apparatus 420 to perform one or more of the steps, tasks, operations, or processes of the systems described herein, drivers 442 (e.g., a GUI driver), network transport protocols 444, and data 446 (e.g., input data, output data, program data, a registry, and configuration settings).
Examples of the subject matter described herein, including the disclosed systems, methods, processes, functional operations, and logic flows, can be implemented in data processing apparatus (e.g., computer hardware and digital electronic circuitry) operable to perform functions by operating on input and generating output. Examples of the subject matter described herein also can be tangibly embodied in software or firmware, as one or more sets of computer instructions encoded on one or more tangible non-transitory carrier media (e.g., a machine readable storage device, substrate, or sequential access memory device) for execution by data processing apparatus.
The details of specific implementations described herein may be specific to particular embodiments of particular inventions and should not be construed as limitations on the scope of any claimed invention. For example, features that are described in connection with separate embodiments may also be incorporated into a single embodiment, and features that are described in connection with a single embodiment may also be implemented in multiple separate embodiments. In addition, the disclosure of steps, tasks, operations, or processes being performed in a particular order does not necessarily require that those steps, tasks, operations, or processes be performed in the particular order; instead, in some cases, one or more of the disclosed steps, tasks, operations, and processes may be performed in a different order or in accordance with a multi-tasking schedule or in parallel.
Outline of Related Subject Matter
The following is an outline of related subject matter.
1. A computer-implemented method of parsing and synthesizing spoken media sources to create multimedia for a project, comprising:
displaying one of the spoken media sources in a media player in a first pane of a first interface and a respective synchronized transcript of the spoken media source in a second pane of the first interface;
creating a highlight for the spoken media source, wherein the creating comprises associating the highlight with a text string excerpt from the respective synchronized transcript and a tag labeled with a respective category descriptor;
repeating the displaying and the creating for one or more of the spoken media sources, wherein each tag is associated with a unique category descriptor and one or more highlights;
displaying the highlights in a first pane of a second interface, wherein displaying the highlights comprises presenting at least portions of the respective text string excerpts of the highlights grouped according to their associated tags, wherein each group is labeled with the category descriptor for the associated tag;
associating selected ones of the highlights with a second pane of the second interface in a sequence, and automatically concatenating clips of the spoken media sources corresponding to and synchronized with the selected highlights according to the sequence; and
displaying the sequence of concatenated clips of the spoken media sources in a media player in a third pane of the second interface synchronized with displaying the text string excerpts in the second pane of the second interface.
2. The method of claim 1, wherein each highlight is displayed in a respective highlight panel in the first pane of the second interface.
3. The method of claim 2, wherein the highlight panels displayed in the first pane of the second interface are listed alphabetically by category descriptor.
4. The method of claim 2, wherein each highlight panel displayed in the first pane of the second interface comprises a respective tag category descriptor associated with a respective link to a third interface for displaying all highlights associated with the project.
5. The method of claim 1, further comprising displaying in a third pane of the first interface a set of one or more highlight panels each of which comprises a respective text string excerpt derived from a transcript currently displayed in the second pane of the first interface.
6. The method of claim 5, wherein each highlight panel in the third pane of the first interface is linked to a respective text string excerpt in the transcript currently displayed in the second pane of the first interface.
7. The method of claim 6, wherein selection of the highlight presents a view of the respective text string excerpt in the transcript in the second pane.
8. The method of claim 6, wherein each highlight panel in the third pane of the first interface is linked to a third interface for displaying all highlights associated with the project.
9. The method of claim 1, wherein the associating comprises dragging a selected highlight from the first pane of the second interface and dropping the selected highlight into the second pane of the second interface.
10. The method of claim 9, wherein each highlight in the second pane in the second interface is displayed in a highlight panel comprising a respective link to the respective spoken media source and the respective text string excerpt.
11. The method of claim 10, wherein selection of the respective link displays the respective media source in the media player in the first pane of the first interface time-aligned with the respective text string excerpt in the respective synchronized transcript.
12. The method of claim 1, further comprising: generating subtitles comprising words from the text string excerpts synchronized with speech in the sequence of concatenated clips; and displaying the subtitles over the sequence of concatenated clips in the second pane of the second interface.
13. The method of claim 12, further comprising automatically replacing text deleted from one or more of the highlighted text strings with a deleted text marker, and displaying the deleted text marker in the subtitles displayed in the second pane of the second interface.
14. The method of claim 12, further comprising, responsive to the deletion of text from the one or more of the highlighted text strings, automatically deleting a segment of audio and video content in the sequence of concatenated clips that is force-aligned with the deleted text.
15. The method of claim 1, further comprising applying typographical emphasis to one or more words in the text string excerpts, and automatically applying a media effect synchronized with playback of the sequence of concatenated clips in the second pane of the second interface.
16. The method of claim 15, wherein applying the typographical emphasis comprises applying bold emphasis to the one or more words from the text string excerpts, and wherein applying the media effect comprises automatically applying a volume increase effect synchronized with playback of the sequence of concatenated clips in the second pane of the second interface.
17. The method of claim 1, further comprising receiving a search term in a search box of the first interface, searching exact word matches to the received search term in a corpus comprising words from the transcripts of all spoken media sources associated with the project, and using a word embedding model to expand the search results to words from the transcripts that match search terms that are similar to the received search terms.
18. The method of claim 1, further comprising in a search pane of the first interface:
receiving a search term entered in a search box and, in response, matching the search term to exact word or phrase matches in a corpus comprising all words in a dictionary that intersect with words associated with the project; and
presenting, in a results pane, one or more extracts from each of the transcripts that comprises exact word or phrase matches to the search term.
19. The method of claim 18, wherein each of the extracts is presented in the first interface in a respective panel that comprises a respective link to a start time in the respective media source.
20. The method of claim 18, further comprising identifying search terms that are similar to the received search terms using a word embedding model that maps search terms to word vectors in a word vector space and returns one or more similar search terms in the corpus that are within a specified distance from the received search term in the word vector space.
21. The method of claim 20, wherein the presenting comprises presenting the one or more similar search terms for selection, and in response to selection of one or more of the similar search terms presenting one or more respective extracts from one or more of the transcripts comprising one or more of the selected similar search terms.
22. The method of claim 18, further comprising:
switching from the first interface to a fourth interface; and
responsive to the switching, automatically presenting in the fourth interface the search box and the results pane in the same state as they were in the first interface before switching.
23. The method of claim 22, wherein the fourth interface comprises an interface element for uploading spoken media sources for the project, and a set of panels each of which is associated with a respective uploaded spoken media source and a link to the first interface.
24. Apparatus comprising a memory storing processor-readable instructions, and a processor coupled to the memory, operable to execute the instructions, and based at least in part on the execution of the instructions operable to perform operations comprising:
displaying one of the spoken media sources in a media player in a first pane of a first interface and a respective synchronized transcript of the spoken media source in a second pane of the first interface;
creating a highlight for the spoken media source, wherein the creating comprises associating the highlight with a text string excerpt from the respective synchronized transcript and a tag labeled with a respective category descriptor;
repeating the displaying and the creating for one or more of the spoken media sources, wherein each tag is associated with a unique category descriptor and one or more highlights;
displaying the highlights in a first pane of a second interface, wherein displaying the highlights comprises presenting at least portions of the respective text string excerpts of the highlights grouped according to their associated tags, wherein each group is labeled with the category descriptor for the associated tag;
associating selected ones of the highlights with a second pane of the second interface in a sequence, and automatically concatenating clips of the spoken media sources corresponding to and synchronized with the selected highlights according to the sequence; and
displaying the sequence of concatenated clips of the spoken media sources in a media player in a third pane of the second interface synchronized with displaying the text string excerpts in the second pane of the second interface.
25. A computer-readable data storage apparatus comprising a memory component storing executable instructions that are operable to be executed by a computer, wherein the memory component comprises:
executable instructions to display one of the spoken media sources in a media player in a first pane of a first interface and a respective synchronized transcript of the spoken media source in a second pane of the first interface;
executable instructions to create a highlight for the spoken media source, wherein the creating comprises associating the highlight with a text string excerpt from the respective synchronized transcript and a tag labeled with a respective category descriptor;
executable instructions to repeat the displaying and the creating for one or more of the spoken media sources, wherein each tag is associated with a unique category descriptor and one or more highlights;
executable instructions to display the highlights in a first pane of a second interface, wherein displaying the highlights comprises presenting at least portions of the respective text string excerpts of the highlights grouped according to their associated tags, wherein each group is labeled with the category descriptor for the associated tag;
executable instructions to associate selected ones of the highlights with a second pane of the second interface in a sequence, and automatically concatenating clips of the spoken media sources corresponding to and synchronized with the selected highlights according to the sequence; and
executable instructions to display the sequence of concatenated clips of the spoken media sources in a media player in a third pane of the second interface synchronized with displaying the text string excerpts in the second pane of the second interface.