SYSTEM AND METHOD FOR CONVERTING AUDIO-TO-TEXT WITH DELAY

Information

  • Patent Application
  • Publication Number
    20230162737
  • Date Filed
    November 21, 2022
  • Date Published
    May 25, 2023
Abstract
Described herein is a system and method for generating text caption information for an audio-video (AV) signal, the system and method comprising: receiving an AV signal; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.
Description
BACKGROUND OF THE INVENTION
Technical Field

The embodiments described herein relate generally to audio systems, and more particularly to systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people.


Background Art

Oftentimes, when live captions are used by deaf or hard of hearing people, there is a noticeable delay or disconnect between the video and the transcribed audio. This disconnect can be of little consequence if two people are merely talking to each other in the video, but if other video is displayed, such as sports or other footage of events happening, the delay between the video and the live captions can be very disconcerting.


Accordingly, a need has arisen for systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people.


SUMMARY

It is an object of the embodiments to substantially solve at least the problems and/or disadvantages discussed above, and to provide at least one or more of the advantages described below.


It is therefore a general aspect of the embodiments to provide systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people that will obviate or minimize problems of the type previously described.


This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.


Further features and advantages of the aspects of the embodiments, as well as the structure and operation of the various embodiments, are described in detail below with reference to the accompanying drawings. It is noted that the aspects of the embodiments are not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.


According to a first aspect of the embodiments, a method for generating text caption information for an audio-video (AV) signal is provided, the method comprising: receiving an AV signal; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.


According to the first aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.


According to the first aspect of the embodiments, the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.


According to the first aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; and comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.


According to the first aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; and comparing the generated text captions with the AI-developed list of text to improve accuracy of the generated text captions.


According to the first aspect of the embodiments, the AI programming techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio, resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.
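By way of a non-limiting, hedged illustration of this kind of recurrent denoiser, the following PyTorch sketch shows a GRU predicting per-band gains that suppress non-voice energy in a magnitude spectrogram. The architecture, the layer sizes, and the `denoise` helper are assumptions invented for the example; they are not the RNNoise implementation or a trained network.

```python
# Illustrative sketch only (assumed architecture): a small GRU predicts
# per-band suppression gains, in the spirit of RNNoise-style denoising.
import torch
import torch.nn as nn

class RecurrentDenoiser(nn.Module):
    def __init__(self, n_bands: int = 257, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_bands, hidden, batch_first=True)
        self.gain = nn.Linear(hidden, n_bands)     # per-band gains in [0, 1]

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (frames, bands) magnitude spectrogram
        h, _ = self.gru(mag)
        return mag * torch.sigmoid(self.gain(h))   # 0 suppresses, 1 keeps

def denoise(wave: torch.Tensor, model: RecurrentDenoiser, n_fft: int = 512):
    """Mask the STFT magnitudes and resynthesize the (cleaner) waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    masked = model(mag.T).T                        # (bands, frames) <-> (frames, bands)
    return torch.istft(torch.polar(masked, phase), n_fft, window=window)
```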


According to a second aspect of the embodiments, a system for generating text caption information for an audio-video (AV) signal is provided, the system comprising: an audio-video (AV) signal receiver; at least one processor that is part of the AV signal receiver; a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method that comprises: receiving an AV signal at the AV signal receiver; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.


According to the second aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.


According to the second aspect of the embodiments, the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.


According to the second aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; and comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.


According to the second aspect of the embodiments, the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; and comparing the generated text captions with the AI-developed list of text to improve accuracy of the generated text captions.


According to the second aspect of the embodiments, the AI programming techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio, resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.


According to the second aspect of the embodiments, the AV signal receiver, the at least one processor, and the memory are part of an audio video display device.





BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects and features of the embodiments will become apparent and more readily appreciated from the following description of the embodiments with reference to the following figures. Different aspects of the embodiments are illustrated in reference figures of the drawings. It is intended that the embodiments and figures disclosed herein are to be considered to be illustrative rather than limiting. The components in the drawings are not necessarily drawn to scale, emphasis instead being placed upon clearly illustrating the principles of the aspects of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the several views.



FIG. 1 illustrates a functional block diagram of an audio-to-text conversion and audio-video signal delay circuit for use in an audio-video playback device or system, according to aspects of the embodiments.



FIG. 2 illustrates a flow chart of a method for converting audio-to-text and adding delay to the audio-video signal using the audio-to-text conversion and audio-video signal delay circuit shown in FIG. 1 according to aspects of the embodiments.



FIG. 3 illustrates a block diagram of the major components of a personal computer (PC), server, laptop, personal electronic device (PED), personal digital assistant (PDA), tablet (e.g., iPad), or any other computer/processor (hereinafter, "processing device") suitable for use to implement the method shown in FIG. 2 for converting audio-to-text and adding delay to the audio-video signal using the audio-to-text conversion and audio-video signal delay circuit shown in FIG. 1 according to aspects of the embodiments.



FIG. 4 illustrates a network system within which the system and method for substantially automatically converting audio-to-text with a delay, using the audio-to-text conversion and audio-video signal delay circuit shown in FIG. 1, can be used according to aspects of the embodiments.





DETAILED DESCRIPTION

The embodiments are described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the inventive concept are shown. In the drawings, the size and relative sizes of layers and regions may be exaggerated for clarity. Like numbers refer to like elements throughout. The embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concept to those skilled in the art. The scope of the embodiments is therefore defined by the appended claims. The detailed description that follows is written from the point of view of a control systems company, so it is to be understood that generally the concepts discussed herein are applicable to various subsystems and not limited to only a particular controlled device or class of devices, such as audio networks, but can be used in virtually any type of audio playback system.


Reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment is included in at least one embodiment of the embodiments. Thus, the appearance of the phrases "in one embodiment" or "in an embodiment" in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.


The different aspects of the embodiments described herein pertain to the context of systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people, but are not limited thereto, except as may be set forth expressly in the appended claims.


For 40 years Crestron Electronics Inc. has been the world's leading manufacturer of advanced control and automation systems, innovating technology to simplify and enhance modern lifestyles and businesses. Crestron designs, manufactures, and offers for sale integrated solutions to control audio, video, computer, and environmental systems. In addition, the devices and systems offered by Crestron streamline technology, improving the quality of life in commercial buildings, universities, hotels, hospitals, and homes, among other locations. Accordingly, the systems, methods, and modes described herein can improve audio systems as discussed below.


The systems, methods, and modes described herein substantially alleviate the problems of delays between video and live captions for deaf and/or hard of hearing people.


In the following detailed description, references are made to the accompanying drawings that form a part hereof, and in which are shown by way of illustrations, specific embodiments, or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the spirit or scope of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.


While some embodiments will be described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a personal computer, those skilled in the art will recognize that aspects may also be implemented in combination with other program modules.


The following is a list of the elements of the figures in numerical order:

100 Audio Video Delay (AVD) Circuit
102 Audio Extractor Device
104 Audio Video Receiver
106 Caption Generating Device
108 Delay Device
110 Combiner/Re-combiner
112 Clock
114 Processor
116 Memory
118 Audio Video Delay & Captioning Software Application (AVDC App)
120 Audio Video (AV) Display
122 Network
124 Cloud Based Digital Audio Video Sources
126 Other Digital Audio Video Sources
128 Analog Audio Video Sources
130 Analog Audio Video Receiver & Analog-to-Digital Converter Processing
200 Method for Generating Captions for Video and Delaying the Video to Ensure Synchronized Captions and Video
202-210 Steps of Method 200
300 Processing Device
304 Microprocessor Internal Memory
306 Computer Operating System (OS)
308 Internal Data/Command Bus (Bus)
312 Read-Only Memory (ROM)
314 Random Access Memory (RAM)
316 Printed Circuit Board (PCB)
318 Hard Disk Drive (HDD)
320 Universal Serial Bus (USB) Port
322 Ethernet Port
324 Video Graphics Array (VGA) Port or High Definition Multimedia Interface (HDMI)
326 Compact Disk (CD)/Digital Video Disk (DVD) Read/Write (RW) (CD/DVD/RW) Drive
328 Floppy Diskette Drive (FDD)
330 Integrated Display/Touchscreen (Laptop/Tablet etc.)
332 Wi-Fi Transceiver
334 BlueTooth (BT) Transceiver
336 Near Field Communications (NFC) Transceiver
338 Third Generation (3G), Fourth Generation (4G), Fifth Generation (5G), Long Term Evolution (LTE) (3G/4G/5G/LTE) Cellular Transceiver
340 Communications Satellite/Global Positioning System (Satellite) Transceiver
342 Mouse
344 Scanner/Printer/Fax Machine
346 Universal Serial Bus (USB) Cable
348 High Definition Multi-Media Interface (HDMI) Cable
350 Ethernet Cable (CAT5)
352 External Memory Storage Device
354 Flash Drive Memory
356 CD/DVD Diskettes
358 Floppy Diskettes
360 Keyboard
364 Antenna
366 Shell/Box
402 Modulator/Demodulator (Modem)
404 Wireless Router
406 Internet Service Provider (ISP)
408 Server/Switch/Router
410 Internet
412 Cellular Service Provider
414 Cellular Telecommunications Service Tower (Cell Tower)
416 Satellite System Control Station
418 Global Positioning System (GPS) Station
420 Satellite (Communications/GPS)
422 Mobile Electronic Device (MED)/Personal Electronic Device (PED)
424 Plain Old Telephone Service (POTS) Provider
518 Equalizer
520 Amplifier(s)
522 Loudspeaker(s)
524 Microphone (Mic)
526 Digital Input(s)
528 Analog Input(s)


Used throughout the specification are several acronyms, the meanings of which are provided as follows:

3G Third Generation
4G Fourth Generation
5G Fifth Generation
APB NW Audio Playback Network
API Application Programming Interface
App Executable Software Programming Code/Application
ASIC Application Specific Integrated Circuit
BIOS Basic Input/Output System
BT BlueTooth
CD Compact Disk
CRT Cathode Ray Tube
DVD Digital Video Disk
EEPROM Electrically Erasable Programmable Read Only Memory
FDD Floppy Diskette Drive
FPGA Field Programmable Gate Array
GAN Global Area Network
GPS Global Positioning System
GUI Graphical User Interface
HDD Hard Disk Drive
HDMI High Definition Multimedia Interface
ISP Internet Service Provider
LCD Liquid Crystal Display
LED Light Emitting Diode Display
LTE Long Term Evolution
MODEM Modulator-Demodulator
NFC Near Field Communications
OS Operating System
PC Personal Computer
PED Personal Electronic Device
POTS Plain Old Telephone Service
PROM Programmable Read Only Memory
RAM Random Access Memory
ROM Read-Only Memory
RW Read/Write
USB Universal Serial Bus
UV Ultraviolet Light
UVPROM Ultraviolet Light Erasable Programmable Read Only Memory
VGA Video Graphics Array


Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those of skill in the art can appreciate that different aspects of the embodiments can be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and comparable computing devices. Aspects of the embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.


Aspects of the embodiments can be implemented as a computer-implemented process (method), a computing system, or as an article of manufacture, such as a computer program product or computer readable media. The computer program product can be a computer storage medium readable by a computer system and encoding a computer program that comprises instructions for causing a computer or computing system to perform example process(es). The computer-readable storage medium is a computer-readable memory device. The computer-readable storage medium can for example be implemented via one or more of a volatile computer memory, a non-volatile memory, a hard drive, a flash drive, a floppy disk, or a compact disk, and comparable hardware media.


Throughout this specification, the term “platform” can be a combination of software and hardware components for providing share permissions and organization of content in an application with multiple levels of organizational hierarchy. Examples of platforms include, but are not limited to, a hosted service executed over a plurality of servers, an application executed on a single computing device, and comparable systems. The term “server” generally refers to a computing device executing one or more software programs typically in a networked environment. More detail on these technologies and example operations is provided below.


A computing device, as used herein, refers to a device comprising at least a memory and one or more processors that includes a server, a desktop computer, a laptop computer, a tablet computer, a smart phone, a vehicle mount computer, or a wearable computer. A memory can be a removable or non-removable component of a computing device configured to store one or more instructions to be executed by one or more processors. A processor can be a component of a computing device coupled to a memory and configured to execute programs in conjunction with instructions stored by the memory. Actions or operations described herein may be executed on a single processor, on multiple processors (in a single machine or distributed over multiple machines), or on one or more cores of a multi-core processor. An operating system is a system configured to manage hardware and software components of a computing device that provides common services and applications. An integrated module is a component of an application or service that is integrated within the application or service such that the application or service is configured to execute the component. A computer-readable memory device is a physical computer-readable storage medium implemented via one or more of a volatile computer memory, a non-volatile memory, a hard drive, a flash drive, a floppy disk, or a compact disk, and comparable hardware media that includes instructions thereon to automatically save content to a location. A user experience can be embodied as a visual display associated with an application or service through which a user interacts with the application or service. A user action refers to an interaction between a user and a user experience of an application or a user experience provided by a service that includes one of touch input, gesture input, voice command, eye tracking, gyroscopic input, pen input, mouse input, and keyboards input. An application programming interface (API) can be a set of routines, protocols, and tools for an application or service that allow the application or service to interact or communicate with one or more other applications and services managed by separate entities.


While example implementations are described using audio networks herein, embodiments are not limited to such applications. For example, aspects of the embodiments can be employed in stand-alone audio systems, such as a room in a building where audio can be played through a dedicated system not connected to any network, and further can be used with any personal audio/video device. Any time audio/video is received for viewing by a user, whether in or through a network or not, systems, methods, and modes of the aspects of the embodiments can substantially alleviate the problems of delays between video and live captions for deaf and/or hard of hearing people.


Technical advantages exist for substantially alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people when using the aspects of the embodiments. Such technical advantages can include, but are not limited to, communicating more effectively with a greater number of people.


Aspects of the embodiments address a need that arises from the very large scale of operations created by networked computing and cloud-based services that cannot be managed by humans. The actions/operations described herein are not a mere use of a computer, but address results of a system that is a direct consequence of software used as a service, such as audio network communication services offered in conjunction with communications.




FIGS. 1-4 illustrate various aspects of systems, methods, and modes for alleviating the problems of delays between video and live captions for deaf and/or hard of hearing people, which can be used in an audio network on or with one or more computing devices, including, according to certain aspects of the embodiments, use of the internet or other similar networks. Further, such systems, methods, and modes can be used with personal communications devices.


The automatic transcription of audio, and then delaying the video such that the video and transcribed audio are substantially aligned, provides a practical, technical solution to the problem of transcribed audio that is mismatched in time with its related video; as those of skill in the art can appreciate, the aspects of the embodiments have no "analog equivalent," as the embodiments reside solely or substantially in the physical device or computer domain. That is, transcribing audio substantially automatically and substantially instantaneously, and aligning it with related video by delaying the original audio and video, can be done with one or more computing devices, including, according to certain aspects of the embodiments, use of the internet or other similar networks. The systems, methods, and modes of the aspects of the embodiments for transcribing audio from an audio/video signal have always meant, and continue to mean, using practical, non-abstract physical devices.


The technological improvement of the aspects of the embodiments resides at least in the ability to quickly and easily alleviate the problems of delays between video and live captions for deaf and/or hard of hearing people by delaying the video while the audio signal is transcribed, and then aligning the two within an audio system using sophisticated computer hardware.




By way of example, using a Crestron touchscreen communication device in a conference call with one person who is hearing impaired, the audio, or the audio and video, can be delayed by up to 500 milliseconds so that the audio portion can be processed through a voice recognition system and an audio-to-text conversion system, and the resultant text can be displayed for the benefit of the hearing impaired person. Other security features can be included, such as encryption and lists of authorized recipients, among others. In addition, such a system can be implemented on cell phones and practically any type of personal communication device.



FIG. 1 illustrates a functional block diagram of audio-to-text conversion and audio-video signal delay circuit (audio-video delay (AVD) circuit) 100 for use in an audio-video playback device or system, such as a personal communication device (e.g., a phone, laptop, or any other personal electronic device (PED)), according to aspects of the embodiments.


According to aspects of the embodiments, AVD circuit 100 implements steps for receiving an audio-video signal, extracting audio from the audio-video signal, time stamping both the extracted audio and the audio-video signal, generating captions from the extracted audio and converting the text to a video text signal, delaying the video for a duration substantially equal to the time it takes to generate the captions from the extracted audio to ensure that the captions are substantially synchronized with the video when recombined, recombining the video text signal and delayed audio-video signal based on their respective time stamps, and displaying the recombined audio-video signal and video text signal.
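The step sequence above can be pictured as a small, synchronous software pipeline. The sketch below is only an illustration of that data flow; the `Stamped` record and the `transcribe` placeholder are assumed names invented for the example, not part of the disclosure.

```python
# Illustrative data-flow sketch of AVD circuit 100; helper names are assumed.
import time
from dataclasses import dataclass

@dataclass
class Stamped:
    timestamp: float   # shared clock value (clock 112)
    payload: object    # audio samples, AV frames, or caption text

def transcribe(audio) -> str:
    """Placeholder for caption generator 106 (e.g., a speech-to-text engine)."""
    return "caption text"

def process_segment(av_segment, audio_segment):
    ts = time.monotonic()                       # one time stamp for both signals
    stamped_av = Stamped(ts, av_segment)        # AV receiver 104 output
    stamped_audio = Stamped(ts, audio_segment)  # audio extractor 102 output

    # While caption generator 106 works, the AV signal waits in delay 108;
    # in this synchronous sketch the call itself provides that delay.
    caption = Stamped(ts, transcribe(stamped_audio.payload))

    # Recombiner 110: match on time stamps, then hand off to AV display 120.
    assert stamped_av.timestamp == caption.timestamp
    return stamped_av, caption
```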


As a person of ordinary skill in the art (POSITA) will be able to appreciate following this discussion, the functional block diagram components shown in FIG. 1 can, in general, be implemented as software components, hardware components, or any combination thereof. Furthermore, in the following discussion, a POSITA can appreciate that any of the signals that are shown and discussed can be, for the most part, in analog form, or digital form, or any combination thereof. In general, as a POSITA can appreciate, the only signals that must be analog are those transmitted to loudspeakers used in AV display 120, discussed below.


AVD circuit 100 comprises audio extractor 102, audio-video (AV) receiver 104, caption generator 106, delay 108, combiner/recombiner 110, clock 112, at least one processor 114, memory 116, audio video delay & captioning software application (AVDC App) 118, AV display 120, network 122, cloud based digital sources of AV signals 124, other digital AV signal sources 126, analog AV signals 128, and analog AV receiver & analog-to-digital converter processing circuitry 130, the lattermost of which converts received analog AV signals to digital AV signals for processing within AVD circuit 100, according to aspects of the embodiments.


AV display 120 comprises one or more currently available displays (e.g., liquid crystal display (LCD) panels, light emitting diode (LED) displays, plasma panel displays, and the like), and can further include one or more AV receivers, audio amplifiers, loudspeakers, digital signal processors (DSPs), and digital-to-analog converters (DACs), among other analog and/or digital signal processing devices. According to further aspects of the embodiments, AVD circuit 100 can be part of AV display 120, either as a separate hardware/software component, or as an integrated part of the existing circuitry of AV display 120. In this case, both analog and digital signals (124, 126, 128) can be directly received by AV display 120.


For the purposes of this discussion, each block in the block diagram of FIG. 1 will be discussed as if it were a physical circuit or device; however, as discussed above, each block in the diagram of AVD circuit 100 can be implemented in hardware, or in software as part of AVDC App 118, or in any combination thereof. Further, if each block were constructed as a physical device, AVDC App 118 would coordinate operation and signal flow between such physical devices.


In FIG. 1, a combined audio video (AV) signal is received by audio extractor 102 and AV receiver 104 substantially simultaneously. Sources of AV signals include cloud based AV sources 124 that can be transmitted through network 122, as well as other digital AV signals 126 (e.g., digital video disk (DVD) players, and the like), and analog AV signal sources 128, which are received and processed by analog AV receiver & ADC processing circuit 130 to create digital AV signals that are then input to audio extractor 102. Network 122 can be virtually any type of network, including but not limited to a local area network (LAN), global area network (GAN), the internet, among other types of networks. Accessible through network 122 are one or more cloud based digital streaming audio/video sources 124.


The audio component that is present in the received digital AV signal is extracted by audio extractor 102, and the audio is then time stamped in audio extractor device 102. The AV signal is time stamped in AV receiver 104. The extracted audio is sent to caption generator 106, wherein the audio transcription process occurs. The original AV signal is delayed in delay device 108; according to aspects of the embodiments, the delay used in delay device 108 needs to be at least as long as it takes to caption the audio in caption generator 106. It can be longer, but it must be at least that long; otherwise, there can be a mismatch between the captioned audio and the video that is displayed. Because all of the signals possess original time stamp information, they can be recombined with substantially no mismatch at all.
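In software, delay device 108 can be imagined as a buffer that holds each time stamped AV segment until the caption carrying the same time stamp is ready, which automatically makes the imposed delay at least as long as the captioning time. The sketch below rests on that assumption and reuses the `Stamped` records from the earlier pipeline sketch; it is not the disclosed circuit.

```python
# Illustrative delay buffer (assumed design, not the disclosed circuit).
from collections import deque

class DelayBuffer:
    def __init__(self):
        self.pending = deque()                  # time stamped AV segments, in order

    def push_av(self, stamped_av):
        self.pending.append(stamped_av)

    def pop_matching(self, stamped_caption):
        # Release the oldest AV segment only when its time stamp equals the
        # caption's; otherwise keep holding it, so the delay is always at
        # least as long as the captioning time.
        if self.pending and self.pending[0].timestamp == stamped_caption.timestamp:
            return self.pending.popleft(), stamped_caption
        return None                             # segment stays delayed for now
```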


As a POSITA can appreciate, it typically takes about 25% of the video length to generate a suitable caption for the video segment; therefore, if the received AV signal is broken into ten second lengths, it will take about 2.5 seconds to accurately caption each segment. According to aspects of the embodiments, metadata that accompanies the video can be used to more accurately caption the video by providing information beforehand regarding the video, and therefore limiting the expected vocabulary or word lists to be used to generate the caption. Artificial intelligence and machine learning programming techniques can be used as well. By way of non-limiting example, if a program received for delayed captioning were directed towards the subject matter of cake making, then if caption generator 106 "heard" or recognized the word "sweet," it would most likely have that word in its metadata list, and not the word "suite," which, of course, refers to rooms. As those of skill in the art can further appreciate, artificial intelligence (AI) programming techniques can be incorporated such that AVDC App 118 and caption generator 106 can review each previous word or phrase and use that information to generate the word it hears as being the most likely used word in that particular scenario. Other uses of AI can include AI-based noise suppression, which uses techniques such as Recurrent Neural Networks that are trained to suppress non-voice audio, resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity. Performing this prior to transcription will improve the transcription accuracy. RNNoise is a software implementation of such a capability.
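As a concrete, hedged rendering of the "sweet" versus "suite" example, the snippet below rescores a recognized word against a metadata-derived word list. The homophone table and the selection rule are assumptions invented for illustration, not a disclosed algorithm.

```python
# Illustrative homophone rescoring against a metadata-derived word list.
HOMOPHONES = {"suite": ["sweet", "suite"], "sweet": ["sweet", "suite"]}

def bias_word(recognized: str, domain_words: set) -> str:
    """Prefer the candidate spelling that appears in the program's word list."""
    for candidate in HOMOPHONES.get(recognized, [recognized]):
        if candidate in domain_words:
            return candidate
    return recognized

# A cake-making program's metadata suggests "sweet," not "suite":
print(bias_word("suite", {"cake", "sugar", "sweet", "oven"}))  # -> "sweet"
```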


Recombination of the captioned audio (a video signal comprising text only), i.e., the output of caption generator 106, with the delayed AV signal, i.e., the output of delay 108, occurs in combiner/recombiner 110 (hereinafter referred to as recombiner 110). Because the original AV signal was delayed as a combined signal, the output of recombiner 110 is a delayed version of the original AV signal, which incurs no "slippage" between the audio and the video it is associated with; and the caption text signal, because of its time stamp, can be matched to the delayed AV signal such that the caption text substantially matches the video information from which the audio was originally extracted in audio extractor 102.


AVD circuit 100 can also be referred to as a processing device. A processing device is generally a server, computer, laptop, or the like, and includes at least one display (not shown), a keyboard (which can be separate or integrated into the display), a mouse, and/or other devices commonly associated with known processor based devices. The processing device includes at least one microprocessor 114, memory 116, and AVDC App 118. AVDC App 118 can also include a portion that generates user interfaces, such as graphical user interfaces (GUIs), through which AVD circuit 100 can be managed.



FIG. 2 illustrates a flow chart of method 200 for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100 according to aspects of the embodiments. Method 200 can be generally performed by AVDC App 118, stored in memory 116, and executed by microprocessor 114, the steps of storing and execution being known to a person of ordinary skill in the art. Or, as discussed above, some or all of blocks 102-112 can be physical devices controlled by AVDC App 118, stored in memory 116, and executed by microprocessor 114, according to aspects of the embodiments.


Method 200 begins with method step 202. In fulfillment of the dual purposes of clarity and brevity, the source of the AV signal (digital or analog) is not discussed, as the processing that occurs applies equally to analog sourced signals and digital signals, with the exception of converting analog signals to digital signals, which has been discussed in detail above in regard to FIG. 1. Thus, method 200 is described with the AV signal received at audio extractor 102 and AV receiver 104 being a digital AV signal. In method step 202 the AV signal is received at audio extractor 102 and AV receiver 104, and is time stamped in both 102 and 104.


In method step 204, the audio portion of the AV signal that was time stamped is extracted by audio extractor 102; according to aspects of the embodiments, the time stamp is still attached to the audio portion.


In method step 206, the time stamped AV signal is received by delay 108, and a delay is added, Δτ. Substantially simultaneously, the time stamped audio portion of the received AV signal is received by caption generator 106, and captioning of the audio signal begins. According to aspects of the embodiments, a predetermined time length of audio signal is loaded into the caption generator (by way of a non-limiting example, about 10 seconds of audio signal), and text generation occurs in the manner described above. As further described above, it can take about 25% of the duration of the audio signal to generate the caption text; therefore, if the duration of the audio signal is about 10 seconds, and it takes 2.5 seconds to generate the caption text from the audio, the length of the delay Δτ imposed by delay 108 is also about 2.5 seconds. The text that emerges from caption generator 106 is a video signal, but contains text only.
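A short worked example of this sizing, using the roughly 25% captioning-time ratio stated above (a real system would measure this ratio rather than assume it):

```python
# Worked example only; the 25% ratio is taken from the discussion above.
SEGMENT_SECONDS = 10.0       # predetermined length of each audio segment
CAPTION_RATIO = 0.25         # captioning takes about 25% of the segment length

delta_tau = SEGMENT_SECONDS * CAPTION_RATIO   # delay imposed by delay 108
print(f"Delay the AV signal by at least {delta_tau:.1f} s")   # -> 2.5 s
```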


In method step 208 both the video text signal from caption generator 106 and the delayed AV signal from delay 108 - each with a time stamp - are output.


In method step 210 the output video text signal from caption generator 106 and the delayed AV signal from delay 108 - each with a time stamp - are received by combiner 110. Combiner 110 verifies that the time stamps are substantially similar and then combines the two signals.
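The "substantially similar" comparison in combiner 110 might look like the following sketch; the tolerance value is an assumption chosen for illustration and is not specified by the disclosure.

```python
# Hedged sketch of combiner 110's time stamp check (assumed tolerance).
TOLERANCE_SECONDS = 0.040    # roughly one video frame at 25 frames per second

def stamps_match(av_ts: float, caption_ts: float) -> bool:
    """True when two time stamps are 'substantially similar'."""
    return abs(av_ts - caption_ts) <= TOLERANCE_SECONDS
```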


In method step 212, the combined text captioned AV signal that has been delayed is output and received by display 120, wherein further audio and video signal processing can occur prior to being displayed on a display and the audio broadcast by one or more loudspeakers.



FIG. 3 illustrates a block diagram of the major components of a personal computer (PC), server, laptop, personal electronic device (PED), personal digital assistant (PDA), tablet (e.g., iPad), or any other processing device/computer, such as AVD circuit 100 (hereinafter, "processing device 300"), suitable for use to implement method 200, among others, for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100 according to aspects of the embodiments.


Processing device 300 includes microprocessor 114, with memory 116, within which is stored AVDC App 118; in regard to FIG. 3, memory 116 can take the form of microprocessor internal memory 304, hard disk drive (HDD) 318, random access memory (RAM) 314, and read only memory (ROM) 312, as described in greater detail below.


Processing device 300 comprises, among other items, shell/box 366, integrated display/touchscreen 330 (though not used in every application of the computer), internal data/command bus (bus) 308, printed circuit board (PCB) 316, and one or more processors 114, with processor internal memory 304 (which can be typically ROM and/or RAM). Those of ordinary skill in the art can appreciate that in modern computer systems, parallel processing is becoming increasingly prevalent; whereas a single processor would have been used in the past to implement many or at least several functions, it is more common currently to have a single dedicated processor for certain functions (e.g., digital signal processors), and therefore there can be several processors, acting in serial and/or parallel, as required by the specific application. Processing device 300 further comprises multiple input/output ports, such as universal serial bus (USB) ports 320, Ethernet ports 322, and video graphics array (VGA) ports/high definition multimedia interface (HDMI) ports 324, among other types. Further, processing device 300 includes externally accessible drives such as compact disk (CD)/digital versatile disk (DVD) read/write (RW) (CD/DVD/RW) drive 326, and floppy diskette drive (FDD) 328 (though less used currently, some computers still include this type of interface). Processing device 300 still further includes wireless communication apparatus, such as one or more of the following: Wi-Fi transceiver 332, BlueTooth (BT) transceiver 334, near field communications (NFC) transceiver 336, third generation (3G)/fourth generation (4G)/long term evolution (LTE)/fifth generation (5G) cellular transceiver 338, communications satellite/global positioning system (satellite) transceiver 340, and antenna 364.


Internal memory that is located on PCB 316 itself can comprise HDD 318 (these can include conventional magnetic storage media, but, as is becoming increasingly more prevalent, can include flash drive memory 354, among other types), ROM 312 (these can include electrically erasable programmable ROMs (EEPROMs), ultraviolet erasable PROMs (UVPROMs), among other types), and RAM 314. Usable with USB port 320 is flash drive memory 354, and usable with CD/DVD/RW drive 326 are CD/DVD diskettes (CD/DVD) 356 (which can be both readable and writeable). Usable with FDD 328 are floppy diskettes 358. External memory storage device 352 can be used to store data and programs external to processing device 300, and can itself comprise another HDD 318, flash drive memory 354, among other types of memory storage. External memory storage device 352 is connectable to processing device 300 via universal serial bus (USB) cable 346. Each of the memory storage devices, or the memory storage media (116, 318, 312, 314, 352, 354, 356, and 358, among others), can contain parts or components, or in its entirety, the executable software programming code or application that has been termed AVDC App 118 according to aspects of the embodiments, which can implement part or all of the portions of method 200, among other methods not shown, described herein.


In addition to the above described components, processing device 300 also comprises keyboard 360, external display 120, printer/scanner/fax machine 344, and mouse 342 (although not technically part of processing device 300, the peripheral components shown in FIG. 3 (352, 120, 360, 342, 354, 356, 358, 346, 350, 344, and 348) are adapted for use with processing device 300 such that, for purposes of this discussion, they shall be considered as being part of processing device 300). Other cable types that can be used with processing device 300 include RS-232, among others, not shown, that can be used for one or more of the connections between processing device 300 and the peripheral components described herein. Keyboard 360 and mouse 342 are connectable to processing device 300 via USB cable 346, and external display 120 is connectible to processing device 300 via VGA cable/HDMI cable 348. Processing device 300 is connectible to network 122 via Ethernet port 322 and Ethernet cable 350, via a router and modulator-demodulator (MODEM) and internet service provider, none of which are shown in FIG. 3. All of the immediately aforementioned components (324, 352, 120, 360, 342, 354, 356, 358, 346, 350, and 344) are known to those of ordinary skill in the art, and this description includes all known and future variants of these types of devices.


External display 120 can be any type of currently available display or presentation screen, such as liquid crystal displays (LCDs), light emitting diode (LED) displays, plasma displays, cathode ray tubes (CRTs), among others (including touch screen displays). In addition to user interface mechanisms such as mouse 342, processing device 300 can further include a microphone, touch pad, joystick, touch screen, voice-recognition system, among other interactive inter-communicative devices/programs, which can be used to enter data and voice, all of which are currently available; thus a detailed discussion thereof has been omitted in fulfillment of the dual purposes of clarity and brevity.


As mentioned above, processing device 300 further comprises a plurality of wireless transceiver devices, such as Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, satellite transceiver 340, and antenna 364. While each of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340 has its own specialized function, each can also be used for other types of communications, such as accessing a cellular service provider (not shown), accessing network 122 (which can include the Internet), texting, and emailing, among other types of communications and data/voice transfers/exchanges, as known to those of skill in the art. Each of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340 includes a transmitting and receiving device and a specialized antenna, although in some instances one antenna can be shared by one or more of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340. Alternatively, one or more of Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, and satellite transceiver 340 will have a specialized antenna, such as satellite transceiver 340, to which is electrically connected at least one antenna 364.


In addition, processing device 300 can access network 122 (of which the Internet can be a part, as shown and described in FIG. 4 below), either through a hard wired connection such as Ethernet port 322, as described above, or wirelessly via Wi-Fi transceiver 332, cellular transceiver 338, and/or satellite transceiver 340 (and their respective antennas), according to aspects of the embodiments. Processing device 300 can also be part of a larger network configuration as in a GAN (e.g., the internet), which ultimately allows connection to various landlines.


According to further aspects of the embodiments, integrated display/touchscreen 330, keyboard 360, mouse 342, and external display 120 (if in the form of a touch screen), can provide a means for a user to enter commands, data, digital, and analog information into the processing device 300. Integrated and external displays 330, 120 can be used to show visual representations of acquired data, and the status of applications that can be running, among other things.


Bus 308 provides a data/command pathway for items such as: the transfer and storage of data/commands between processor 114, Wi-Fi transceiver 332, BT transceiver 334, NFC transceiver 336, cellular transceiver 338, satellite transceiver 340, integrated display 330, USB port 320, Ethernet port 322, VGA/HDMI port 324, CD/DVD/RW drive 326, FDD 328, and processor internal memory 304. Through bus 308, data can be accessed that is stored in processor internal memory 304. Processor 114 can send information for visual display to either or both of integrated and external displays 330, 120, and the user can send commands to computer operating system (OS) 306, which can reside in processor internal memory 304 of processor 114, or in any of the other memory devices (356, 358, 318, 312, and 314).


Processing device 300, and either internal memories 304, 312, 314, and 318, or external memories 352, 354, 356, and 358, can be used to store computer code that, when executed, implements method 200, as well as other methods not shown and discussed, for generating captions for video and delaying the video to ensure substantially synchronized captions and video, according to aspects of the embodiments. Hardware, firmware, software, or a combination thereof can be used to perform the various steps and operations described herein. According to aspects of the embodiments, AVDC App 118 for carrying out the above discussed steps can be stored and distributed on multi-media storage devices such as devices 318, 312, 314, 354, 356, and/or 358 (described above), or other forms of media capable of portably storing information. Storage media 354, 356, and/or 358 can be inserted into, and read by, devices such as USB port 320, CD/DVD/RW drive 326, and FDD 328, respectively.


As also will be appreciated by one skilled in the art, the various functional aspects of the embodiments can be embodied in a wireless communication device, a telecommunication network, or as a method or computer program product. Accordingly, aspects of the embodiments can take the form of an entirely hardware embodiment or an embodiment combining hardware and software aspects. Further, aspects of the embodiments can take the form of a computer program product stored on a computer-readable storage medium having computer-readable instructions embodied in the medium. Any suitable computer-readable medium can be utilized, including hard disks, CD-ROMs, DVDs, optical storage devices, or magnetic storage devices such as a floppy disk or magnetic tape. Other non-limiting examples of computer-readable media include flash-type memories or other known types of memories.


Further, those of ordinary skill in the art in the field of the aspects of the embodiments can appreciate that such functionality can be designed into various types of circuitry, including, but not limited to, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and microprocessor based systems, among other types. A detailed discussion of the various types of physical circuit implementations does not substantively aid in an understanding of the aspects of the embodiments, and as such has been omitted for the dual purposes of brevity and clarity. However, the systems and methods discussed herein can be implemented as discussed and can further include programmable devices.


Such programmable devices and/or other types of circuitry as previously discussed can include a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit. The system bus can be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. Furthermore, various types of computer readable media can be used to store programmable instructions. Computer readable media can be any available media that can be accessed by the processing unit. By way of example, and not limitation, computer readable media can comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile as well as removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROMs, DVDs or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by the processing unit. Communication media can embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and can include any suitable information delivery media.


The system memory can include computer storage media in the form of volatile and/or nonvolatile memory such as ROM and/or RAM. A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements connected to and between the processor, such as during start-up, can be stored in memory. The memory can also contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit. By way of non-limiting example, the memory can also include an operating system, application programs, other program modules, and program data.


The processor can also include other removable/non-removable and volatile/nonvolatile computer storage media. For example, the processor can access a hard disk drive that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive that reads from or writes to a removable, nonvolatile magnetic disk, and/or an optical disk drive that reads from or writes to a removable, nonvolatile optical disk, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM and the like. A hard disk drive can be connected to the system bus through a non-removable memory interface such as an interface, and a magnetic disk drive or optical disk drive can be connected to the system bus by a removable memory interface, such as an interface.


Aspects of the embodiments discussed herein can also be embodied as computer-readable codes on a computer-readable medium. The computer-readable medium can include a computer-readable recording medium and a computer-readable transmission medium. The computer-readable recording medium is any data storage device that can store data which can be thereafter read by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROMs and generally optical data storage devices, magnetic tapes, flash drives, and floppy disks. The computer-readable recording medium can also be distributed over network coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion. The computer-readable transmission medium can transmit carrier waves or signals (e.g., wired, or wireless data transmission through the Internet). Also, functional programs, codes, and code segments to, when implemented in suitable electronic hardware, accomplish or support exercising certain elements of the appended claims can be readily construed by programmers skilled in the art to which the aspects of the embodiments pertains.


The disclosed aspects of the embodiments provide a system and method for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100, according to aspects of the embodiments, on one or more computers or processing devices 300. It should be understood that this description is not intended to limit aspects of the embodiments. On the contrary, aspects of the embodiments are intended to cover alternatives, modifications, and equivalents, which are included in the spirit and scope of the aspects of the embodiments as defined by the appended claims. Further, in the detailed description of the aspects of the embodiments, numerous specific details are set forth to provide a comprehensive understanding of the claimed aspects of the embodiments. However, one skilled in the art would understand that various aspects of the embodiments can be practiced without such specific details.



FIG. 4 illustrates network system 122 within which the system and method for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100 can be used, according to aspects of the embodiments. Much of the infrastructure of network system 122 shown in FIG. 4 is or should be known to those of skill in the art, so, in fulfillment of the dual purposes of clarity and brevity, a detailed discussion thereof shall be omitted.


According to aspects of the embodiments, a user of the above-described system and method can store AVDC App 118 on their processing device 300 as well as on a mobile electronic device (MED)/PED 422 (hereinafter referred to as “PEDs 422”). PEDs 422 can include, but are not limited to, so-called smart phones, tablets, personal digital assistants (PDAs), notebook and laptop computers, and essentially any device that can access the internet and/or cellular phone service, or that can facilitate transfer of the same type of data in either a wired or wireless manner.


PED 422 can access cellular service provider 412 either through a wireless connection (cell tower 414) or via a wireless/wired interconnection (a “Wi-Fi” system that comprises, e.g., modem 402, wireless router 404, internet service provider (ISP) 406, and internet 410). Although not shown, those of skill in the art can appreciate that internet 410 comprises various different types of communications cables, servers/routers/switches 408, and the like, wherein data/software/applications of all types are stored in memory within or attached to servers or other processor-based electronic devices, including, for example, AVDC App 118 within a computer/server that can be accessed by a user of AVDC App 118 on their PED 422 and/or processing device 300. As those of skill in the art can further appreciate, internet 410 can include access to “cloud” computing service(s) and devices, wherein the cloud refers to the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each location being a data center.


Further, PED 422 can include NFC, “Wi-Fi,” and Bluetooth (BT) communications capabilities as well, all of which are known to those of skill in the art. To that end, network system 122 further includes, as many homes (and businesses) do, one or more computers or processing devices 300 that can be connected to wireless router 404 via a wired connection (e.g., through modem 402) or via a wireless connection (e.g., Bluetooth). Modem 402 can be connected to ISP 406 to provide internet-based communications in the appropriate format to end users (e.g., processing device 300), and to take signals from the end users and forward them to ISP 406.


PEDs 422 can also access global positioning system (GPS) satellite 420, which is controlled by GPS station 418, to obtain positioning information (which can be useful for different aspects of the embodiments), or PEDs 422 can obtain positioning information via cellular service provider 412 using cellular tower(s) (cell tower) 414 according to one or more methods of position determination. Some PEDs 422 can also access communications satellites 420 and their respective satellite communication systems control stations 416 (the satellite in FIG. 4 is shown common to both communications and GPS functions) for near-universal communications capabilities, albeit at a much higher cost than conventional “terrestrial” cellular services. PEDs 422 can also obtain positioning information when near or internal to a building (or arena/stadium) through the use of one or more NFC/BT devices. FIG. 4 also illustrates other components of network system 122, such as plain old telephone service (POTS) provider 424.


According to further aspects of the embodiments, and as described above, network system 122 also contains other types of servers/devices that can include processing device 300, wherein one or more processors, together with currently available technology such as memory, data and instruction buses, and other electronic devices, can store and implement code that implements the system and method for generating captions for video and delaying the video to ensure substantially synchronized captions and video within AVD circuit 100, according to aspects of the embodiments.


According to further aspects of the embodiments, additional features and functions of inventive embodiments are described herein below, wherein such descriptions are to be viewed in light of the above noted detailed embodiments as understood by those skilled in the art.


As described above, an encoding process is discussed specifically in reference to FIG. 2, although such delineation is not meant to be, and should not be taken as, limiting, as additional methods according to aspects of the embodiments have been described herein. The encoding processes as described are not meant to limit the aspects of the embodiments, or to suggest that the aspects of the embodiments must be implemented following the encoding processes. The purpose of the encoding processes as described is to facilitate the understanding of one or more aspects of the embodiments and to provide the reader with one of many possible implementations of the processes discussed herein. FIG. 2 illustrates a flowchart of various steps performed during the encoding process, but such encoding processes are not limited thereto. The steps of FIG. 2 are not intended to completely describe the encoding processes, but only to illustrate some of the aspects discussed above.


This application may contain material that is subject to copyright, mask work, and/or other intellectual property protection. The respective owners of such intellectual property have no objection to the facsimile reproduction of the disclosure by anyone as it appears in published Patent Office file/records, but otherwise reserve all rights.


It should be understood that this description is not intended to limit the embodiments. On the contrary, the embodiments are intended to cover alternatives, modifications, and equivalents, which are included in the spirit and scope of the embodiments as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth to provide a comprehensive understanding of the claimed embodiments. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.


Although the features and elements of aspects of the embodiments are described as being in particular combinations, each feature or element can be used alone, without the other features and elements of the embodiments, or in various combinations with or without other features and elements disclosed herein.


This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.


The above-described embodiments are intended to be illustrative in all respects, rather than restrictive, of the embodiments. Thus, the embodiments are capable of many variations in detailed implementation that can be derived from the description contained herein by a person skilled in the art. No element, act, or instruction used in the description of the present application should be construed as critical or essential to the embodiments unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items.


All United States patents and applications, foreign patents, and publications discussed above are hereby incorporated herein by reference in their entireties.


Industrial Applicability

To solve the aforementioned problems, the aspects of the embodiments are directed towards systems, methods, and modes for receiving an audio-video signal; extracting audio from the audio-video signal; time stamping both the extracted audio and the audio-video signal; generating captions from the extracted audio and converting the text to a video text signal; delaying the video for a duration substantially equal to the time it takes to generate the captions from the extracted audio, to ensure that the captions are substantially synchronized with the video when recombined; recombining the video text signal and the delayed audio-video signal based on their respective time stamps; and displaying the recombined audio-video signal and video text signal.
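

By way of illustration only, the following minimal sketch shows one possible software realization of the steps enumerated above; it is not the claimed implementation. All names and values in it (e.g., speech_to_text, render_text_video, the two-second segment duration, and the one-segment caption latency) are hypothetical placeholders.

```python
"""Minimal sketch (illustrative only, not the claimed implementation) of the
caption-delay pipeline: time stamp, partition, caption, delay, and recombine
by time stamp. All names and parameter values here are assumptions."""

from collections import deque
from dataclasses import dataclass

SEGMENT_SECONDS = 2.0         # "first predetermined duration" (assumed value)
CAPTION_LATENCY_SEGMENTS = 1  # video delay, substantially similar to the
                              # time needed to generate captions (assumed)

@dataclass
class AVSegment:
    pts: float    # presentation time stamp shared by the extracted audio
                  # and the received AV signal
    video: object
    audio: object

def speech_to_text(audio) -> str:
    """Placeholder for any caption engine; returns dummy text here."""
    return "[caption text]"

def render_text_video(caption: str, pts: float) -> dict:
    """Placeholder: convert caption text to a video text signal carrying
    the same time stamp as the audio it was generated from."""
    return {"pts": pts, "text": caption}

def caption_delay_pipeline(segments):
    """Yield (delayed AV segment, matching video text signal) pairs."""
    video_delay = deque()   # delay line for the received AV signal
    pending_captions = {}   # video text signals keyed by time stamp
    for seg in segments:
        # Generate captions for the current audio segment; the shared time
        # stamp is what lets the combiner re-align the two signals later.
        pending_captions[seg.pts] = render_text_video(
            speech_to_text(seg.audio), seg.pts)
        video_delay.append(seg)
        # Release video only after the caption latency has elapsed, then
        # combine the video and the caption carrying the same time stamp.
        # (Flushing the tail of the delay line at end-of-stream is omitted.)
        if len(video_delay) > CAPTION_LATENCY_SEGMENTS:
            delayed = video_delay.popleft()
            yield delayed, pending_captions.pop(delayed.pts)

if __name__ == "__main__":
    feed = [AVSegment(pts=i * SEGMENT_SECONDS, video=None, audio=None)
            for i in range(4)]
    for seg, text in caption_delay_pipeline(feed):
        print(f"display video @ {seg.pts:.1f}s with {text}")
```

In a real-time system the caption engine would run asynchronously and the delay would be set to its measured latency, but the combining step is unchanged: the video text signal and the delayed AV signal are matched purely on their shared time stamps, which is why both signals are stamped before any processing begins.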


Alternate Embodiments

Alternate embodiments may be devised without departing from the spirit or the scope of the different aspects of the embodiments.

Claims
  • 1. A method for generating text caption information for an audio-video (AV) signal, the method comprising: receiving an AV signal; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.
  • 2. The method according to claim 1, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.
  • 3. The method according to claim 2, wherein the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.
  • 4. The method according to claim 1, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.
  • 5. The method according to claim 1, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the AI-developed list of text to improve accuracy of the generated text captions.
  • 6. The method according to claim 5, wherein the AI techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio, resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.
  • 7. A system for generating text caption information for an audio-video (AV) signal, the system comprising: an audio-video (AV) signal receiver; at least one processor that is part of the AV signal receiver; a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method that comprises: receiving an AV signal at the AV signal receiver; extracting audio from the AV signal to form an extracted audio signal; time stamping both the extracted audio signal and the received AV signal; partitioning the extracted audio signal into a first predetermined duration segment of extracted audio signal; generating text captions from the partitioned extracted audio signal over a first duration, and converting the same to a video text signal, with the same time stamp as the extracted audio signal and received AV signal; delaying the received AV signal by an amount of time substantially similar to the first duration; combining the time stamped video text signal and the delayed time stamped received AV signal based on the time stamps; and outputting the combined time stamped video text signal and the time stamped received AV signal to a display.
  • 8. The system according to claim 7, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: comparing the generated text captions with a list of text obtained by a source of the AV signal to improve accuracy of the generated text captions.
  • 9. The system according to claim 8, wherein the list of text obtained by the source of the AV signal comprises text associated with the subject matter of the AV signal.
  • 10. The system according to claim 7, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: obtaining metadata from the AV signal; generating a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the generated list of text based on the obtained metadata to improve accuracy of the generated text captions.
  • 11. The system according to claim 7, wherein the step of generating text captions from the partitioned extracted audio signal over a first duration further comprises: using artificial intelligence programming techniques to develop a list of text that substantially matches the subject matter of the AV signal based on the obtained metadata; comparing the generated text captions with the AI-developed list of text to improve accuracy of the generated text captions.
  • 12. The system according to claim 11, wherein the AI programming techniques comprise: Recurrent Neural Networks that are trained to suppress non-voice audio, resulting in significantly improved voice signal-to-noise ratio (SNR) and clarity.
  • 13. The system according to claim 7, wherein the AV signal receiver, at least one processor and memory are part of an audio video display device.
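
By way of further illustration of the caption-accuracy refinement recited in claims 2 through 5 and 8 through 11, the following hypothetical sketch compares generated captions against a list of text associated with the subject matter of the AV signal (e.g., a term list derived from the signal's metadata, such as a program-guide entry). The use of difflib as the matcher, the cutoff value, and the sports-broadcast example are illustrative assumptions; the claims do not prescribe any particular matching algorithm.

```python
"""Illustrative sketch only; not the claimed implementation. Compares raw
generated captions against subject-matter text to improve accuracy, as in
claims 2-5. The matcher (difflib) and all example data are assumptions."""

import difflib

def refine_captions(caption_words, subject_terms, cutoff=0.8):
    """Snap near-miss caption words to close subject-matter terms."""
    refined = []
    for word in caption_words:
        # A mis-recognized word (e.g., a proper noun during a sports
        # broadcast) is replaced by the closest known subject-matter term;
        # words with no close match are passed through unchanged.
        match = difflib.get_close_matches(word, subject_terms, n=1,
                                          cutoff=cutoff)
        refined.append(match[0] if match else word)
    return refined

# Hypothetical example: metadata indicates a football broadcast.
terms = ["touchdown", "quarterback", "interception"]
print(refine_captions(["tuchdown", "by", "the", "quarterbach"], terms))
# -> ['touchdown', 'by', 'the', 'quarterback']
```

Any comparable fuzzy-matching or language-model technique (including the AI-developed lists of claims 5 and 11) could play the same role; what the claims require is only that the generated captions be compared against subject-matter text to improve their accuracy.
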
PRIORITY INFORMATION

The present application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 63/282,320, filed Nov. 23, 2021, the entire contents of which are expressly incorporated herein by reference.

Provisional Applications (1)
Number Date Country
63282320 Nov 2021 US