The present invention relates to a method for generating caption files, and more particularly to a method for generating caption files through the URL of an AV platform.
The current way for an audio-video (AV) platform to generate a caption file is to have a person listen to the audio directly and transcribe it verbatim to form a caption file, which is then played together with the video.
This manual method is inefficient and cannot form caption files in real time. For users of audio-video platforms, it cannot provide real-time assistance.
Today AI (Artificial Intelligence) is in common use. It is very convenient for users of an audio-video platform if AI methods (such as artificial neural networks) are applied to the platform to generate audio caption files.
The object of the present invention is to provide a method for generating caption files through the URL of an AV platform, so as to form caption files for audio-video files effectively and in real time. The method of the present invention is described below.
An automatic speech recognition (ASR) server according to the present invention first parses the URL description given by the user and finds the relevant audio-video platform, then sends an HTTP request to the web application interface provided by the web server of the audio-video platform to obtain an HTTP reply from the web server.
The ASR server then parses the content of the HTTP reply to obtain the URL of an AV (Audio-Video) file, and downloads the AV file.
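By way of illustration, the above steps can be sketched in Python as follows. This is only a minimal sketch, assuming the requests library and a hypothetical platform whose HTTP reply embeds a direct .mp4 URL; the function name fetch_av_file and the regular expression are illustrative, since each real platform formats its replies differently.

    import re
    import requests

    def fetch_av_file(page_url: str, out_path: str = "video.mp4") -> str:
        # Send an HTTP request to the platform's web server and obtain the reply.
        reply = requests.get(page_url, timeout=30)
        reply.raise_for_status()

        # Parse the reply content for the URL of an AV file. The pattern is
        # illustrative only; real platforms embed media URLs in various ways.
        match = re.search(r'https?://[^"\']+\.mp4', reply.text)
        if match is None:
            raise ValueError("no AV file URL found in the HTTP reply")

        # Download the AV file in chunks.
        with requests.get(match.group(0), stream=True, timeout=30) as media:
            media.raise_for_status()
            with open(out_path, "wb") as f:
                for chunk in media.iter_content(chunk_size=65536):
                    f.write(chunk)
        return out_path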
It then extracts an audio track from the AV file to obtain audio samples, sends them to a speech recognition system for processing, and generates a caption file.
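Extracting the audio track can be sketched as follows, assuming the ffmpeg command-line tool is installed; the 16 kHz mono WAV output is a common, though here merely assumed, input format for speech recognition.

    import subprocess

    def extract_audio(av_path: str, wav_path: str = "audio.wav") -> str:
        # Extract the audio track from the AV file as 16 kHz mono WAV.
        subprocess.run(
            ["ffmpeg", "-y",
             "-i", av_path,   # input AV file
             "-vn",           # drop the video stream
             "-ac", "1",      # mix down to mono
             "-ar", "16000",  # resample to 16 kHz
             wav_path],
            check=True,
        )
        return wav_path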
The speech recognition system includes a pre-processing step for audio, a step for extracting speech feature parameters, a phoneme recognition step, and a sentence decoding step. Artificial neural networks are used in both the phoneme recognition step and the sentence decoding step.
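The four steps can be outlined structurally as below; the stage bodies are placeholders standing in for the real pre-processing, feature extraction, phoneme recognition, and sentence decoding components, and the later sketches illustrate individual stages.

    from typing import List

    def preprocess(samples: List[float]) -> List[float]:
        return samples   # placeholder: e.g. framing and windowing of the audio

    def extract_features(frames: List[float]) -> List[float]:
        return frames    # placeholder: e.g. Short-Time Fourier Transform

    def recognize_phonemes(features: List[float]) -> List[str]:
        return []        # placeholder: acoustic model (artificial neural network)

    def decode_sentence(pinyins: List[str]) -> str:
        return ""        # placeholder: language model (artificial neural network)

    def recognize(samples: List[float]) -> str:
        # Chain the four steps: audio samples in, caption text out.
        return decode_sentence(recognize_phonemes(extract_features(preprocess(samples))))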
A sentence breaking mechanism in the speech recognition system 3 is described below.
Thereafter a Short-Time Fourier Transform 54 is applied to obtain a Spectrogram 55; this step extracts the speech feature parameters. Feature parameters are used to express the characteristics of a material or phenomenon. Take Chinese pronunciation as an example: a Chinese syllable can be cut into two parts, i.e. an initial and a final. The Short-Time Fourier Transform 54 is applied to the two parts to obtain the Spectrogram 55 and the feature values [V1, V2, V3, . . . , Vn].
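A minimal sketch of this feature extraction step, assuming a 16 kHz mono signal and the scipy library (the window length and hop size are assumed values, not taken from the invention):

    import numpy as np
    from scipy.signal import stft

    def spectrogram_features(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
        # Short-Time Fourier Transform with 25 ms windows and a 10 ms hop.
        _, _, Zxx = stft(audio, fs=sample_rate, nperseg=400, noverlap=240)
        # The spectrogram magnitudes give one feature vector per time frame,
        # i.e. the feature values [V1, V2, V3, . . . , Vn].
        return np.abs(Zxx)

    # Example: one second of a 440 Hz tone.
    t = np.arange(16000) / 16000.0
    print(spectrogram_features(np.sin(2 * np.pi * 440 * t)).shape)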
The speech recognition system 3 has two major models, i.e. an acoustic model 56 and a language model 57, described below.
The phoneme recognition module 58 performs recognition for Chinese by initials and finals (the counterparts of consonants and vowels in English), and inputs [V1, V2, V3, . . . , Vn] into the acoustic model 56 to obtain a pinyin sequence [C1, C2, C3, . . . , Cn]. The acoustic model 56 is an artificial neural network.
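The mapping from feature values to a pinyin sequence can be illustrated with the toy classifier below. The single random, untrained layer and the three-entry label set are stand-ins for the actual acoustic model 56, which is a trained deep neural network.

    import numpy as np

    PINYINS = ["ma", "hua", "teng"]  # toy label set; a real one covers all pinyin

    rng = np.random.default_rng(0)
    W = rng.normal(size=(201, len(PINYINS)))  # 201 frequency bins, as in the STFT sketch
    b = np.zeros(len(PINYINS))

    def acoustic_model(features: np.ndarray) -> list:
        # features: (frequency bins, time frames); one decision per frame.
        logits = features.T @ W + b
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        # Pick the most probable pinyin per frame -> [C1, C2, C3, . . . , Cn].
        return [PINYINS[i] for i in probs.argmax(axis=1)]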
The sentence decoding module 59 includes a language dictionary 60 and the language model 57. Since each pinyin in Chinese may represent different words, the language dictionary 60 is used to spread [C1, C2, C3, . . . , Cn] into a two-dimensional sequence as below.
For example, [ma, hua, teng] can be spread into a two-dimensional sequence of 3×n:
ma → [马, 麻, . . .]
hua → [化, 花, . . .]
teng → [腾, 疼, 藤, . . .]
where each row lists the candidate Chinese characters sharing one pinyin.
The above two-dimensional sequence of 3×n is inputted into the language model 57, which judges the result to be 马化腾 instead of 麻花疼 or 麻花藤, so as to form a final output [A1, A2, A3, . . . , An], i.e. the caption file 4. The language model 57 is an artificial neural network.
马化腾 is a Chinese name with the pinyin (ma hua teng); he ranked 20th on Forbes' 2019 Billionaires List, with assets reaching 38.8 billion U.S. dollars.
麻花疼 means (hemp flower pain) and 麻花藤 means (hemp flower rattan); both have the pinyin (ma hua teng), but neither has a special meaning.
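The spreading and judging can be illustrated with the toy decoder below. The abridged dictionary and the hand-filled score table merely stand in for the language dictionary 60 and the neural-network language model 57.

    from itertools import product

    # Language dictionary: pinyin -> candidate characters (abridged).
    DICTIONARY = {
        "ma":   ["马", "麻"],
        "hua":  ["化", "花"],
        "teng": ["腾", "疼", "藤"],
    }

    # Toy scores standing in for neural-network sentence probabilities.
    SCORES = {"马化腾": 0.90, "麻花疼": 0.04, "麻花藤": 0.03}

    def decode_sentence(pinyins: list) -> str:
        # Spread each pinyin into its candidates (the 3 x n sequence), then
        # let the "language model" judge every combination and keep the best.
        candidates = [DICTIONARY[p] for p in pinyins]
        sentences = ["".join(chars) for chars in product(*candidates)]
        return max(sentences, key=lambda s: SCORES.get(s, 0.0))

    print(decode_sentence(["ma", "hua", "teng"]))  # -> 马化腾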
The scope of the present invention is defined by the following claims and is not limited by the above embodiments.