This invention relates to obtaining data related to audio-video content in situations in which such data may already be available, but not linked to the audio-video content.
Audio-video content such as television programmes and the like may be distributed to consumers via a variety of media. Traditional linear television programmes are broadcast over a broadcast network, whether terrestrial, cable or satellite, and may be consumed by users on a variety of receiver devices. Such audio-video content includes the audio and video components and may also include additional content such as subtitles, audio description and other additional data. Such audio-video content may also be distributed via additional routes, in particular via on-demand services such as those delivered over the internet. In the process of repurposing audio-video content for other distribution channels, though, additional content such as subtitles, audio description or other data or metadata is not routinely copied, with the result that such additional content is not available in the version distributed by the additional distribution route. Similar issues can occur when audio-video content is re-versioned for broadcast.
We have appreciated the need for data related to audio-video content, in particular supplementary data such as subtitles, to be easily retrievable for use with a complete or partial copy of that audio-video content.
The present invention provides a system and method for retrieving supplementary data related to audio-video content for a complete or partial copy of that audio-video content, using a signature of the audio component of the audio-video content and matching that signature against a reference signature to extract supplementary data for the audio-video content.
The embodiment of the invention preferably uses a feature extraction technique that produces a signature of the audio component. The signature may be variable in length so as to allow matching of an entire television programme, a portion of a programme or an edited version, thereby allowing an edited version of a programme to be matched against the original programme.
The signature (which may also be referred to as a fingerprint) may be produced by any process or function which reduces the amount of information and complexity in the representation of the data while retaining a low likelihood of correlation with a signature from an unrelated set of data and ensuring repeatable correlation when comparing similar signals. In the embodiment, the signature preferably retains a temporal resolution of around a second so that the location of the matching audio can be determined to a similar resolution.
The search strategy may be to match using audio fingerprints alone, but preferably additional data is used to direct the search, in particular a date-time field that specifies a date related to the audio-video content, such as the date it was first broadcast, which can then be used to direct the order of the search against the database containing the audio features and the supplementary data. Such an approach is particularly applicable to topical audio-video content presented on websites shortly before or after the date of first broadcast over a traditional channel. Other data, such as a category, channel or keywords, is also preferably used to direct the search.
The invention will be described in more detail by way of example with reference to the accompanying drawings.
The main embodiment that will be described is a method and system for obtaining subtitles that have already been created for one copy of audio-video content, which may be referred to as a reference copy, and providing them for another copy of that audio-video content. In the example given, the reference copy of the audio-video content may be an originally broadcast version sent over a communication channel such as terrestrial, satellite or cable. The second copy of the same audio-video content may be the whole or a portion of that content made available via the internet, such as on a website, dedicated online player or similar. An embodiment operates the approach by searching for the same video clip in an archive or library of broadcast content for which subtitles have already been authored. These subtitles are then reformatted into a form suitable for that clip, including error corrections and retiming.
This approach could also be used to retrieve any other data which is stored alongside the audio and video content, such as timing triggers for an interactive app or a list of shot changes or script information describing the scene or the characters/actors in the scene. Such data may generally be referred to as supplementary data.
The search is done using available data about the clip, and its audio and/or video content. The data in a web page may include the programme title and date of broadcast, or other text that suggests its approximate time and date of broadcast, which can narrow the search down to a single day or programme. It may even contain the start time of the clip. The method enables the location of corresponding supplementary data to be found, even without supporting data, provided it exists in the archive/library.
The preferred method is to perform a comparison between the video clip audio signal and the contents of the archive/library. The audio signal is preferred as the search method because it relates directly to the main example of supplementary data, namely words in the subtitles, whereas the video may have been reused with a different soundtrack, which would contain different words to the clip being searched. In the case of data that relates to the video signal, such as audio description, the video signal would be preferred.
Refinements to this method can be applied to speed up the process and make the implementation practicable. These include methods such as using fingerprints generated from the audio and/or video content of the clip and a database of fingerprints for the archive/library. It may also be possible to use speech to text technology to generate semantic data from the clips to directly interrogate an archive of subtitles files to find a match.
The embodiment focuses on matching the audio extracted from a web clip to audio stored in an archive. The example implementation does this by generating a fingerprint for the audio of all broadcast content and storing it in a database, alongside the matching subtitles, making it possible to rapidly search the database for matching clips and provide the matching subtitles. However, this search could also use video fingerprints and any other metadata to identify when and where the clip was (or will be) broadcast. The preferred embodiment also includes various search heuristics to identify the fingerprint match and novel approaches to tidying up and retiming the subtitles to improve their quality, particularly if the subtitles were generated live.
An overview of the main components of a system embodying the invention will first be described, followed by an explanation of the process for creating a fingerprint database and then a description of the process for searching the database.
A server is used to maintain and update a database of fingerprints. The system preferably maintains such a database for all content broadcast on traditional linear broadcast channels, at least for the main channels. As new content is broadcast on the conventional channels, the matching audio is cached and a fingerprint is generated providing a small representation of the audio file (approximately 2% of the size of the equivalent MP3 file). This fingerprint is then written to the database along with the matching subtitles that were broadcast live. Any other metadata (such as channel, programme name and so on) is also stored in the database.
The server provides a search mechanism, which can be invoked by providing a URL from a website whose web clips are identified by unique URL addresses. Once the search starts it visits the URL, downloads the video clip from the page, extracts the audio and generates a fingerprint representation for the clip. It also extracts metadata from the web page (such as the creation time and date of the page, whether it is news, and so on). This metadata is used to create a heuristic search through the database, by weighting each of the database entries based on time and date, channel and keywords found. The search algorithm then does a brute-force search through the database, performing a cross-correlation between the clip fingerprint and each entry in the database. Two thresholds are specified in the parameters: if the correlation error is below the first threshold, the entry is assumed to be a highly likely match and an immediate result is returned. If no such entry is found but the lowest error is below the second threshold once the search has completed, that entry is believed to be a very likely match and its result is returned. A match will not be found if the clip has been edited together from a number of separate excerpts, as the fingerprint will not be continuous within the reference database.
In order to allow edited clips to be found, the algorithm firstly searches for the beginning of the web clip. Once it finds a match, it grows the fingerprint length until it no longer matches, signifying that an edit has occurred. The search then starts again from the point of the known edit to identify the subtitles for several sub-clips. The retrieved subtitles may then be cleaned up, such as by removing the repeated words present in live subtitle streams. Natural language processing may be used to identify the most likely start and end of the required subtitles and to correct incorrect words where the subtitler provided a correction. Colours are also corrected and subtitle files are generated. Finally, a web-based editor provides full editorial control to correct any errors that occurred in the live subtitles. It also includes other tools, such as a one-click manual retiming tool and the ability to import text from other sources, such as transcripts and original scripts (in case no matches were found). It also includes functionality to pass the edited subtitles and audio through our phonetic retiming service (running on the same server) to provide an autonomous retiming tool. The web editor also provides warnings to indicate whether the current subtitles meet guidelines, such as showing the reading rate in words per minute for each subtitle.
A variety of different fingerprint techniques will now be discussed.
The embodiment of the invention is arranged to create a fingerprint of audio-video content whenever an update to a database of audio-video content is detected. In this way, the system may be used to automatically create a fingerprint database from continually delivered audio-video content. In particular, the system can monitor content broadcast on a range of channels and automatically record the associated fingerprints and supplementary data such as subtitles to a database. At step 10 of
The downloaded audio file may then be removed at step 24 and a file containing the subtitle data (in a format such as SRT or TTML) is downloaded at step 26. At step 28, if all programmes have been completed, the fingerprint and subtitle data are stored in a database at step 30 along with any other data available, such as a programme name, date of broadcast and so on. If at step 28 it is determined that there are further programmes to be fingerprinted, the process reverts to step 20 and the next audio file is downloaded.
Using the above approach, the database 30 may be created which contains a fingerprint for each portion of audio-video content (programme), the accompanying subtitles including timestamps for the presentation of the subtitles, a channel on which the programme is broadcast, a programme name, programme ID and other data that may be useful in the directed search that will be described later.
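By way of illustration only, such a database record could be represented with a schema along the following lines in Python using SQLite; the storage technology and the column names are assumptions of this sketch, chosen to mirror the fields listed above rather than prescribed by the method.

```python
import sqlite3

# Illustrative schema only; the actual storage technology and field names
# are not prescribed by the method described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS programmes (
    programme_id   TEXT PRIMARY KEY,
    name           TEXT,
    channel        TEXT,
    broadcast_date TEXT,   -- date/time of broadcast
    fingerprint    BLOB,   -- compact audio fingerprint for the whole programme
    subtitles      TEXT    -- subtitle data with presentation timestamps (e.g. SRT or TTML)
);
"""

def create_database(path: str = "fingerprints.db") -> sqlite3.Connection:
    """Create (or open) the fingerprint database used in these sketches."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```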
A fingerprinting technique (such as ‘FPCALC’) is used in order to create a repeatable set of features that represent the audio in a compressed format. Using a fingerprinting technique reduces the size of the representation, which reduces the time and processing required to complete the search and the storage for the data. For the purpose of locating the subtitle content a granularity of 1 to 2 seconds is appropriate. The fingerprinting technique used may be any of a variety of commercially available algorithms for reducing an audio track to features.
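As one example of such a commercially available technique, Chromaprint's fpcalc tool can be driven from Python roughly as shown below to obtain a compact integer fingerprint; the use of fpcalc's raw output mode and the parsing shown are assumptions about one particular tool rather than a requirement of the method.

```python
import subprocess

def fingerprint_audio(audio_path: str) -> list[int]:
    """Run fpcalc in raw mode and return the fingerprint as a list of integers.
    Assumes Chromaprint's fpcalc is installed and that its -raw option emits a
    FINGERPRINT= line of comma-separated integers."""
    output = subprocess.run(["fpcalc", "-raw", audio_path],
                            capture_output=True, text=True, check=True).stdout
    for line in output.splitlines():
        if line.startswith("FINGERPRINT="):
            return [int(value) for value in line.split("=", 1)[1].split(",")]
    raise ValueError("no fingerprint found in fpcalc output")
```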
The feature extraction preferably provides a time varying output that varies with the time period of the audio-video content thereby allowing alignment between the reference database and a sample audio-video content so as to find matching parts of the content and also the matching timing for subtitles.
A database interrogation service is implemented and used to interrogate the database in order to identify where and when a video clip was broadcast, by rapidly matching its audio track to the audio stored in the database using the stored fingerprints. The service is invoked by providing a URL (for example: http://www.bbc.co.uk/news/uk-england-28685666); all such web clips can be identified by their unique URL address.
The service then retrieves the video that is hosted at the URL along with its relevant metadata and performs a search against the database; if a match is found, the matching subtitles are returned as a subtitle file (such as an STL or TTML file) to match the video clip. Caching is also used to improve performance at each stage of the process.
The search method shown in
The database is then ordered by an appropriate heuristic depending upon the information available from the URL for searching. For example, if the URL provides keywords related to the audio-video content, the database may be ordered by keyword. If the URL contains a date on which the audio-video content was first broadcast, the database may be ordered by date and/or the channel on which the content was broadcast. A combination of such heuristics may be used to increase the likelihood that the fingerprint will be found quickly. At step 50, each fingerprint is loaded in turn and cross-correlated at step 52 with the fingerprint generated from the content at step 44. If this generates a score below a threshold, the relative offset between the database fingerprint and the fingerprint of the content being analysed may be generated. If the score is above the threshold, the next fingerprint is selected from the database and the cross-correlation performed again, so that the fingerprints are retrieved in turn in the order defined by the heuristic until a match is found. When the match is found and the time within the fingerprint calculated, this is compared to a broadcast time at step 56, the subtitles are retrieved and realigned using the offset time at step 58, and they are written to the metadata file at step 62 for return to the user that requested the subtitles.
The directed search process will now be described in more detail with reference to some examples.
The video clip is first demuxed (using a tool such as 'ffmpeg') to extract its audio track. A fingerprint matching the audio file is then created and stored. A pass is made through the database, cross-correlating the clip fingerprint with each entry in the database. A confidence value is calculated for every comparison and compared to two threshold values, lower values indicating a closer match. If the confidence value is less than the first threshold, we are certain to have found a match and the search is stopped. If the confidence value is higher than the first threshold but less than the second, we continue searching to the end in case we find a better match. If we find a match we can extract the matching subtitles from the database and return them to the user.
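A minimal sketch of this extraction and two-threshold decision logic in Python is given below; the ffmpeg options are standard ones for producing a mono audio track, while the correlate() callable and the threshold values are placeholders rather than those of the actual implementation.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Demux the clip's audio track with ffmpeg (-vn drops the video stream,
    -ac 1 and -ar 16000 give a mono track at a fixed sample rate)."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", "16000", audio_path], check=True)

def two_threshold_search(clip_fp, entries, correlate, certain=0.05, likely=0.15):
    """Stop immediately on a near-certain match, otherwise keep the best candidate
    and accept it after the full pass if it is below the second threshold.
    correlate() returns (error, offset); the threshold values are illustrative."""
    best = None
    for entry in entries:                      # entries pre-ordered by the heuristic
        error, offset = correlate(clip_fp, entry.fingerprint)
        if error < certain:
            return entry, offset               # highly likely match: return at once
        if best is None or error < best[2]:
            best = (entry, offset, error)
    if best is not None and best[2] < likely:
        return best[0], best[1]                # very likely match after the full pass
    return None, None                          # no match found
```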
A database search can be run in one of several modes as determined by the user when starting the service. Examples of operation include:
The search algorithm does a brute-force search through the database, performing a cross-correlation between the clip fingerprint and each entry in the database. Our cross-correlation algorithm performs a sliding dot product comparison between the clip sub-fingerprint f(t) and each database entry fingerprint g(t):
(f ⋆ g)(t) = Σ_τ f(τ) g(t + τ), equivalently f ⋆ g = f(−t) ∗ g, where ⋆ denotes cross-correlation and ∗ denotes convolution.
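An unoptimised realisation of this sliding dot product in Python might look as follows; treating the fingerprints as plain numeric sequences and reporting one minus the normalised dot product as the error score are assumptions of this sketch. A function of this shape could serve as the correlate() callable in the earlier two-threshold search sketch.

```python
import math

def sliding_correlation(f, g):
    """Slide the clip fingerprint f along the database fingerprint g and return
    (best_error, best_offset). The error is 1 minus the normalised dot product,
    so 0 means an identical window and larger values mean a worse match."""
    best_error, best_offset = float("inf"), 0
    norm_f = math.sqrt(sum(x * x for x in f)) or 1.0
    for offset in range(len(g) - len(f) + 1):
        window = g[offset:offset + len(f)]
        dot = sum(a * b for a, b in zip(f, window))
        norm_w = math.sqrt(sum(x * x for x in window)) or 1.0
        error = 1.0 - dot / (norm_f * norm_w)
        if error < best_error:
            best_error, best_offset = error, offset
    return best_error, best_offset
```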
To enable the database to be searched in near real-time, a heuristics model is used to search the most likely areas of the database first. The metadata stored when the file is downloaded is used to create a search through the database, by weighting each of the database entries based on the time the clip was uploaded, the most likely channels and the keywords found on the page.
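One possible weighting scheme is sketched below; the particular weights, and the assumption that each database entry exposes broadcast_date, channel and name fields, are illustrative choices rather than the actual heuristic.

```python
def order_entries(entries, page_date=None, likely_channels=(), keywords=()):
    """Order database entries so that the most promising ones are correlated first."""
    def weight(entry):
        w = 0.0
        if page_date is not None:
            days = abs((entry.broadcast_date - page_date).days)
            w += max(0.0, 10.0 - days)          # favour broadcasts close to the page date
        if entry.channel in likely_channels:
            w += 5.0                            # favour the channels the site usually clips
        w += sum(1.0 for k in keywords if k.lower() in entry.name.lower())
        return w
    return sorted(entries, key=weight, reverse=True)
```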
As the subtitles stored in the database come from the live broadcast, it may be necessary to reconstruct them from the live format into a pseudo-prepared block format. A request is made to the database for all of the subtitles stored between the start and end times of the programme. An offset is provided to ensure we retrieve subtitles from earlier and later than the clip, deliberately returning too many subtitles at both the beginning and the end, so that we can intelligently trim the subtitles later and ensure we have not missed the start or end point.
The subtitles are initially converted to a single text string, where timing is maintained on a per word basis. Each word is extracted from the live subtitle file and stored in a subtitle object alongside its broadcast timing. Due to the nature of live subtitles, once each line is completed, it is resent as a block. The system simply ignores anything that is not a single word.
Once a text string made up from the live subtitles is retrieved, the system will try to find the most likely start and end point for the text; this generally relies upon starting at the beginning of a sentence and ending on a punctuation mark. Starting at the point that the database interrogation identified as the start time, we search forwards and backwards until we find the beginning of a sentence, denoted by both a capital letter and the previous sentence ending with a punctuation mark. Similarly, at the end of the subtitle file we look for the closest point at which a sentence ends and then drop the words that fall outside of this. A parameter in the reconstruction process allows the user to specify that they want to look further forwards or backwards, forcing the service to include more or fewer sentences.
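A simplified sketch of this trimming step is shown below; it searches only backwards for the sentence start and treats '.', '!' and '?' as terminal punctuation, both of which are simplifying assumptions relative to the behaviour described above.

```python
import re

def trim_to_sentences(words, start_index, end_index):
    """Trim a list of (word, time) pairs so that the text begins at the start of a
    sentence and ends on terminal punctuation."""
    def starts_sentence(i):
        # A sentence start: the word begins with a capital letter and the previous
        # word (if any) ends with terminal punctuation.
        if not words[i][0][:1].isupper():
            return False
        return i == 0 or re.search(r"[.!?]$", words[i - 1][0]) is not None

    def ends_sentence(i):
        return re.search(r"[.!?]$", words[i][0]) is not None

    start = start_index
    while start > 0 and not starts_sentence(start):
        start -= 1
    end = end_index
    while end < len(words) - 1 and not ends_sentence(end):
        end += 1
    return words[start:end + 1]
```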
The text string also maintains the original colours that were broadcast live. As it is likely the video has been clipped from the middle of an article, the subtitles are likely to start on the wrong colour. We therefore re-index the colours back into the standard order of white, yellow, green then blue by creating a lookup table.
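The lookup table could be built along the following lines; mapping colours in order of first appearance and wrapping around if more than four distinct colours occur are assumptions of this sketch.

```python
STANDARD_COLOURS = ["white", "yellow", "green", "blue"]

def build_colour_lookup(broadcast_colours):
    """Map colours, in the order they first appear in the clipped subtitles,
    back onto the standard speaker-colour order."""
    seen = []
    for colour in broadcast_colours:
        if colour not in seen:
            seen.append(colour)
    # First distinct colour becomes white, second yellow, and so on.
    return {c: STANDARD_COLOURS[i % len(STANDARD_COLOURS)] for i, c in enumerate(seen)}
```

For example, if the first speaker in the clip was broadcast in green, green is remapped to white, the next new colour to yellow, and so on.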
The subtitles are reconstructed into blocks by adhering to a maximum characters-per-line rule. By default this is set at 38 characters, as recommended in some industry guidelines, although it can be overridden by the user. Each word is appended to a subtitle line until the next word would force the character count over the limit, at which point a break is inserted and we start to fill the next line. Additional rules also avoid leaving orphaned words, by forcing a newline if we receive a punctuation mark over halfway through the subtitle line, and breaks are always forced when there is a change of speaker. Each line is then assembled into a two-line pair to complete the block.
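The core greedy line-filling rule might be sketched as follows; the orphan-word and speaker-change rules mentioned above are omitted for brevity.

```python
def build_blocks(words, max_chars=38):
    """Fill lines greedily up to max_chars characters, then pair consecutive
    lines into two-line subtitle blocks."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            lines.append(current)       # next word would exceed the limit: break the line
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return [lines[i:i + 2] for i in range(0, len(lines), 2)]
```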
Finally, the subtitles are retimed. As the subtitles have been clipped from a subtitle file representing the entire programme, each of the timings relates to a position within the programme. We subtract the found start time from each subtitle in order to realign it with the clip. We also make the assumption that short clips are always cut relatively tightly to the dialogue. We have therefore found reasonable results by scaling our subtitles to fit the video clip. This is done by pulling the first subtitle to the beginning of the clip and pushing the last subtitle so that it ends at the end of the clip. Each subtitle in between is then scaled to fit, maintaining the time ratio from broadcast.
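A sketch of this shift-and-scale retiming in Python is given below; subtitle entries are assumed here to be (start, end, text) tuples with times in seconds.

```python
def retime(subtitles, clip_start, clip_duration):
    """Shift subtitle times so they are relative to the clip, then scale them so
    the first subtitle starts at 0 and the last ends at the clip duration,
    preserving the relative spacing from broadcast."""
    shifted = [(s - clip_start, e - clip_start, text) for s, e, text in subtitles]
    first_start = shifted[0][0]
    span = shifted[-1][1] - first_start or 1.0   # avoid division by zero
    scale = clip_duration / span
    return [((s - first_start) * scale, (e - first_start) * scale, text)
            for s, e, text in shifted]
```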
The resulting subtitles and additional metadata are then written to the database cache in a format (such as JSON or XML). This enables the export of the subtitles to be provided in any standard format (such as TTML or STL).
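As one concrete example of export to a standard format, the reconstructed blocks could be serialised to SRT as follows; SRT is chosen here simply because it is mentioned earlier as a common subtitle format, and the entry layout is an assumption of the sketch.

```python
def to_srt(subtitles):
    """Serialise (start, end, [line, line]) entries, with times in seconds,
    as an SRT-formatted string."""
    def timestamp(seconds):
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, lines) in enumerate(subtitles, start=1):
        blocks.append(f"{i}\n{timestamp(start)} --> {timestamp(end)}\n" + "\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```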
The approach we have described may operate for complete audio-video programmes or portions of audio-video programmes. To do this, instead of comparing the entire signature of a programme against the entire signature of a corresponding programme in the database, portions of signatures may be compared as described below.
Instead of searching for the entire fingerprint, the system searches only for the first 5 seconds. If a match is found, the fingerprint is grown in one-second increments and compared to the database until the fingerprint no longer matches. This is then interpreted as the boundary at the point at which the video was edited, as shown in
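A sketch of this grow-and-compare loop is given below; the match() callable, which is assumed to search the reference database for a fingerprint slice and return where it was found (or None), and the use of seconds as the unit of fingerprint length are assumptions of this sketch.

```python
def find_segments(clip_fp, match, seed_len=5, step=1):
    """Locate successive sub-clips of an edited clip within the reference database.
    Returns (clip_position, length, found) triples, where 'found' is whatever the
    match() callable returns for the matched slice."""
    segments = []
    start = 0
    while start < len(clip_fp):
        found = match(clip_fp[start:start + seed_len])
        if found is None:
            start += step                       # no match for this seed; skip forward
            continue
        length = seed_len
        # Grow the matched region until it no longer matches, signifying an edit point.
        while (start + length < len(clip_fp)
               and match(clip_fp[start:start + length + step]) is not None):
            length += step
        segments.append((start, length, found))
        start += length                         # restart the search from the edit point
    return segments
```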
Priority application: GB 1418261.2, filed October 2014.
International application: PCT/GB2015/053026, filed 14 October 2015 (WO).