BOOSTING, CORRECTING, AND BLOCKING TO PROVIDE IMPROVED TRANSCRIBED AND TRANSLATED RESULTS OF CLOUD-BASED MEETINGS

Information

  • Patent Application
  • Publication Number
    20240194193
  • Date Filed
    November 12, 2023
  • Date Published
    June 13, 2024
Abstract
A method for managing a cloud-based meeting involving multiple languages. The method includes assigning one or more glossaries to a first speaker, wherein the one or more glossaries for the first speaker configure one or more servers to boost, filter, or replace transcribed terms transcribed from speech of the first speaker; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into text in the language of the first speaker; and generating first altered text by altering, according to the one or more glossaries, the text in the language of the first speaker.
Description
FIELD OF THE INVENTION

This disclosure is generally related to improving transcription and language translation of spoken content in meetings.


BACKGROUND

Globalization has led large companies to have employees in many different countries. Large business entities; law, consulting, and accounting firms; and non-governmental organizations (NGOs) are now global in scope and have physical presences in many countries. Persons affiliated with these institutions often speak many different languages and must communicate with each other regularly, frequently exchanging confidential information. Conferences and meetings involving many participants are routine and can involve persons speaking and exchanging material in multiple languages.


Automated transcription or speech recognition technology can be inaccurate if words are not spoken clearly, signal quality is low, the noise level is high, or the words spoken sound similar to another word or words (homophones). For example, in a cloud-based system, a poor internet connection can garble the input speech, leading to poor speech recognition (transcription).


Furthermore, offensive words, terms, or phrases that are undesirable for others to read or listen to may be spoken in the various languages. The offensive words may be swear words or vulgarities in the various languages. Other offensive words can be culturally offensive words, terms, or phrases such as ethnic slurs, religious slurs, or racist words. Some medical words can be offensive depending upon the context of a meeting. Participants may not desire reading or hearing offensive words, terms, or phrases. Additionally, words in one language may seem innocuous but, when translated into another language, lead to offensive words, terms, or phrases that people do not want to see or read.


It is desirable to improve transcription results and translation results. It is also desirable to avoid displaying offensive words, terms, or phrases in multi-language meetings and to avoid further propagating them in translations of transcripts and in speech synthesis of the translated transcripts.


SUMMARY

The disclosed embodiments are generally summarized by the claims that follow below. However, in some aspects, the techniques described herein relate to a method carried out by one or more servers for managing a cloud-based meeting involving multiple languages, the method including: receiving, from a first client device, a preselection of a language for a first speaker; assigning one or more glossaries to the first speaker, wherein the one or more glossaries for the first speaker configure one or more servers to boost, filter, or replace transcribed terms transcribed for the first speaker; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into text in the language of the first speaker in a transcript; and generating first altered text by altering, according to the one or more glossaries, the text of the transcript in the language of the first speaker.


In some aspects, the techniques described herein relate to a method, wherein the one or more glossaries includes a set of languages expected to be spoken during the cloud-based meeting.


In some aspects, the techniques described herein relate to a method, wherein the one or more glossaries includes N languages of common languages that are spoken in the world, N being a positive integer.


In some aspects, the techniques described herein relate to a method, wherein the one or more glossaries includes a plurality of lists for each language in the set of languages, and each list of the plurality of lists for each language is at least one of: a boost list including one or more of spoken words, spoken terms, spoken phrases, spoken passages, spoken names, spoken abbreviations with expansions, and spoken acronyms with expansions expected to be spoken during a meeting that can be used for replacement in the transcript to improve recognition results; a block list including one or more of offensive spoken words, offensive spoken terms, offensive spoken phrases, and offensive spoken passages expected to be spoken during a meeting that can be used to filter out from the transcript to avoid offending readers and listeners of speech synthesis; or a correction list including one or more typed error words, terms, phrases and passages and their replacement for correcting common mistakes in a transcript or a translated transcript.
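For illustration only, the three list types recited above could be organized per language in a structure such as the following sketch (the names `Glossary` and `LanguageLists` are hypothetical and not part of the claims):

```python
from dataclasses import dataclass, field


@dataclass
class LanguageLists:
    # Boost list: terms expected in the meeting, usable to bias or
    # replace recognition results in the transcript.
    boost: set = field(default_factory=set)
    # Block list: offensive terms to filter out of the transcript.
    block: set = field(default_factory=set)
    # Correction list: common transcription mistakes mapped to replacements.
    corrections: dict = field(default_factory=dict)


@dataclass
class Glossary:
    # One set of lists per language expected in the cloud-based meeting,
    # keyed by a language code such as "en" or "fr".
    languages: dict = field(default_factory=dict)

    def for_language(self, lang: str) -> LanguageLists:
        # Create the per-language lists on first access.
        return self.languages.setdefault(lang, LanguageLists())
```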


In some aspects, the techniques described herein relate to a method, wherein each list of the plurality of lists for each language is an initial list with standard words, terms, or phrases to improve speech recognition or transcription results.


In some aspects, the techniques described herein relate to a method, wherein at least one list of the plurality of lists for each language is a user edited list to improve speech recognition or transcription results, and wherein the one or more glossaries are stored in a server associated with at least one login identification.


In some aspects, the techniques described herein relate to a method, wherein the generating of the altered text includes identifying one or more offensive words, terms, or phrases in text by using the one or more glossaries; and replacing the one or more offensive words, terms, or phrases with non-letter characters.
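The replacement step described above can be pictured as a simple masking pass over the text (a hedged sketch: the word-boundary matching and the asterisk mask character are illustrative choices, not requirements of the claims):

```python
import re


def mask_offensive(text: str, block_list: set, mask_char: str = "*") -> str:
    """Replace each blocked word with non-letter characters of equal length."""
    for word in block_list:
        # Match whole words only, case-insensitively.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda m: mask_char * len(m.group(0)), text)
    return text
```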


In some aspects, the techniques described herein relate to a method, further including transmitting, to the first client device, the altered text for display in a speech bubble.


In some aspects, the techniques described herein relate to a method, further including: receiving, from a second client device, a preselection of a language for a second speaker; assigning one or more glossaries to the second speaker, wherein the one or more glossaries for the second speaker configure the one or more servers to boost, filter, or replace transcribed terms transcribed for the second speaker; receiving, from a microphone at the second client device, second audio content which originated from the second speaker; transcribing the second audio content for the second speaker into text in the language of the second speaker; and generating second altered text by altering, according to the one or more glossaries, the text in the language of the second speaker.


In some aspects, the techniques described herein relate to a method carried out by one or more servers for managing a cloud-based meeting involving multiple languages, the method including: receiving, from a first client device, a preselection of a first language for a first speaker; receiving, from a second client device, a preselection of a second language for a first listener/reader, the second language differing from the first language; assigning a first glossary to the first speaker and a second glossary to the first listener/reader, wherein the first glossary and the second glossary configure one or more servers to boost, filter, or replace transcribed terms transcribed for the first speaker and translated terms translated for the first listener/reader; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into transcribed text in the first language of the first speaker; translating the transcribed text in the first language into translated transcribed text in the second language for the first listener/reader into a transcript; generating first altered text by altering, according to a first glossary, the transcribed text in the first language of the first speaker; and generating second altered text by altering, according to a second glossary, the translated transcribed text in the second language for the first listener/reader.


In some aspects, the techniques described herein relate to a method, wherein the first glossary and the second glossary includes a set of languages expected to be spoken during the cloud-based meeting.


In some aspects, the techniques described herein relate to a method, wherein each of the first glossary and the second glossary includes N languages of common languages that are spoken in the world, N being a positive integer.


In some aspects, the techniques described herein relate to a method, wherein each of the first glossary and the second glossary includes a plurality of lists for each language in the set of languages, and each list of the plurality of lists for each language is at least one of: a boost list including one or more of spoken words, spoken terms, spoken phrases, spoken passages, spoken names, spoken abbreviations with expansions, and spoken acronyms with expansions expected to be spoken during a meeting that can be used for replacement in the transcript to improve recognition results; a block list including one or more of offensive spoken words, offensive spoken terms, offensive spoken phrases, and offensive spoken passages expected to be spoken during a meeting that can be used to filter out from the transcript to avoid offending readers and listeners of speech synthesis; or a correction list including one or more typed error words, terms, phrases and passages and their replacement for correcting common mistakes in a transcript or a translated transcript.


In some aspects, the techniques described herein relate to a system for managing a cloud-based meeting involving multiple languages, the system including: at least one server device including a processor device and a memory device coupled to the processor device, wherein the memory device stores a first glossary, a second glossary, and an application that configures the server device to perform: receiving, from a first client device, a preselection of a first language for a first speaker; assigning a first glossary to the first speaker, wherein the first glossary configures one or more servers to boost, filter, or replace transcribed terms transcribed for the first speaker; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into transcribed text in the first language of the first speaker; and generating first altered text by altering, according to the first glossary, the transcribed text in the first language of the first speaker.


In some aspects, the techniques described herein relate to a system for managing a cloud-based meeting, wherein: the application further configures the server device to perform: receiving, from a second client device, a preselection of a second language for a first listener/reader, the second language differing from the first language; assigning a second glossary to the first listener/reader, wherein the second glossary configures one or more servers to boost, filter, or replace translated terms translated for the first listener/reader; translating the transcribed text in the first language into translated transcribed text in the second language for the first listener/reader; and generating second altered text by altering, according to the second glossary, the translated transcribed text in the second language for the first listener/reader.


In some aspects, the techniques described herein relate to a method carried out by one or more servers for building one or more glossaries for a cloud-based meeting for differing spoken languages, the method including: generating a set of languages to be spoken during a cloud-based meeting between human attendees; and generating a plurality of lists for each language in the set of languages, wherein the plurality of lists configures the one or more servers to boost, filter, or replace transcribed terms in a transcript that was transcribed for a speaker in the cloud-based meeting.


In some aspects, the techniques described herein relate to a method wherein each list of the plurality of lists for each language is one of: a boost list including one or more of spoken words, spoken terms, spoken phrases, spoken passages, spoken names, spoken abbreviations with expansions, and spoken acronyms with expansions expected to be spoken during a meeting that can be used for replacement in the transcript to improve recognition results; a block list including one or more of offensive spoken words, offensive spoken terms, offensive spoken phrases, and offensive spoken passages expected to be spoken during a meeting that can be used to filter out from the transcript to avoid offending readers and listeners of speech synthesis; or a correction list including one or more typed error words, terms, phrases and passages and their replacement for correcting common mistakes in a transcript or a translated transcript.


In some aspects, the techniques described herein relate to a method, wherein the one or more glossaries includes a set of languages expected to be spoken during the cloud-based meeting.


In some aspects, the techniques described herein relate to a method, wherein the one or more glossaries includes N languages of common languages that are spoken in the world, N being a positive integer.


In some aspects, the techniques described herein relate to a method, wherein each list of the plurality of lists for each language is an initial list with standard words, terms, or phrases to improve speech recognition or transcription results.


In some aspects, the techniques described herein relate to a method, wherein at least one list of the plurality of lists for each language is a user edited list to improve speech recognition or transcription results, and wherein the one or more glossaries are stored in a server associated with at least one login identification.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1A is a block diagram of a system of using a cloud structure in real time speech transcription and translation involving a plurality of participants some of which can speak and read a different language from others.



FIG. 1B is a block diagram of a client-server system of using a cloud structure to provide real time speech transcription and translation into multiple languages.



FIG. 1C is a block diagram of a client device.



FIG. 1D is a block diagram of a server system device.



FIG. 1E is a block diagram of a client-server system of using a cloud structure to provide real time speech transcription, language translation into multiple languages, and speech synthesis into digital audio.



FIG. 2 is an example of a running meeting transcript.



FIGS. 3A-3D are conceptual diagrams of capturing spoken words (speech) in a first language, generating transcripts, translating transcripts, and generating spoken words (speech) in a second language that a participant can listen to.



FIG. 3E is a block diagram depiction of the services provided to the multiple participants in a conference meeting.



FIG. 4 is a conceptual diagram of language transcription and multi-language translation of spoken content.



FIGS. 5A-5C are diagrams of graphical user interfaces displayed on a monitor or display device to support the transcription and translation client-server system.



FIG. 6 is a conceptual diagram of a conference between a speaker/host participant in one room and listener participants in a remotely located room using the transcription and translation client-server system.



FIG. 7A is a diagram of a graphical user interface displayed on a monitor or display device to support switching between spoken languages in the transcription and translation client-server system.



FIG. 7B is a diagram of a graphical user interface displayed on a monitor or display device to support entry of the preselected languages to be used by a speaker during the conference or session.



FIGS. 8A-8C illustrate the formation of preliminary transcripts of different languages spoken by three different speakers and the formation of a master speaker transcript for each.



FIG. 9 illustrates the formation of a compiled session transcript for a virtual meeting or audio-video conference.



FIG. 10A illustrates a glossary with boost lists and block lists of words, terms or phrases in one or more languages.



FIG. 10B illustrates a glossary with boost lists, block lists, and correction lists of words, terms or phrases in one or more languages.



FIG. 11A illustrates an example block list in the English language.



FIG. 11B illustrates an example block list in the French language.



FIG. 12A illustrates an example correction list in the English language.



FIG. 12B illustrates an example correction list in the French language.



FIG. 13A illustrates an example boost list in the English language.



FIG. 13B illustrates an example boost list in the French language.



FIG. 14 illustrates a diagram of a graphical user interface displayed on a monitor or display device that supports the improved transcription and translation using the block lists and the correction lists in the glossary for the various languages.





DETAILED DESCRIPTION

In the following detailed description of the disclosed embodiments, numerous specific details are set forth in order to provide a thorough understanding. However, it will be obvious to one skilled in the art that the disclosed embodiments can be practiced without these specific details. In other instances, well known methods, procedures, components, and subsystems have not been described in detail so as not to unnecessarily obscure aspects of the disclosed embodiments.


The embodiments disclosed herein include methods, apparatus, and systems for near-instantaneous translation of spoken voice content in many languages in settings involving multiple participants, who themselves often speak many different languages. A voice translation can be accompanied by a text transcription of the spoken content. As a participant hears the speaker's words in the language of the participant's choice, text of the spoken content is displayed on the participant's viewing screen in the language of the participant's choice. In an embodiment, the text can be simultaneously displayed for the participant in both the speaker's own language and in the language of the participant's choice.


Features are also provided herein that can enable participants to access a transcript as it is being dynamically created while presenters or speakers are speaking. Participants can provide contributions including summaries, annotations, and highlighting to provide context and broaden the overall value of the transcript and conference. Participants can also selectively submit corrections to material recorded in transcripts. Nonverbal sounds occurring during a conference are additionally identified and added to the transcript to provide further context.


A participant chooses the language he or she wishes to hear and view transcriptions, independent of a language the presenter has chosen for speaking. Many parties, both presenters and participants, can participate using various languages. Many languages can be accommodated simultaneously in a single group conversation. Participants can use their own chosen electronic devices without having to install specialized software.


The systems and methods disclosed herein use advanced natural language processing (NLP) and artificial intelligence to perform transcription and language translation. The speaker speaks in his/her chosen language into a microphone connected to a device running iOS, Android, or another operating system. The speaker's device and/or a server (e.g., server device) executes an application with the functionality described herein. Software associated with the application transmits the speech to a cloud platform.


The transcribing and translating system is an on-demand system. That is, as a presentation or meeting is progressing, a new participant can join the meeting in progress. The cloud platform includes at least one server (e.g., server device) that can start up transcription engines and translation engines on demand. Artificial intelligence (natural language processing) associated with the server software translates the speech into many different languages. The server software provides the transcript services and translation services described herein.


Participants join the session using an attendee application provided herein. Attendees select their desired language to read text and listen to audio. Listening attendees receive translated text and translation audio of the speech as well as transcript access support services in near real time in their own selected language.


Functionality is further provided that can significantly enhance the quality of translation and therefore the participant experience and overall value of the conference or meeting. Intelligent back-end systems can improve translation and transcription by selectively using multiple translation engines, in some cases simultaneously, to produce a desired result. Translation engines are commercially available, accessible on a cloud-provided basis, and can be selectively drawn upon to contribute. The system can use two or more translation engines simultaneously depending upon one or more factors. These one or more factors can include the languages of speakers and attendees, the subject matter of the discussion, the voice characteristics, demonstrated listening abilities and attention levels of participants, and technical quality of transmission. The system can select one, two, or more translation engines for use. One translation engine can function as a primary source of translation while a second translation engine is brought in as a supplementary source to confirm translation produced by the first engine. Alternatively, a second translation engine can be brought in when the first translation engine encounters difficulty. In other embodiments, two or more translation engines can simultaneously be used to perform full translation of the different languages into which transcribed text is to be translated and audible content generated.
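The primary/supplementary arrangement described above could be realized along the following lines (a hedged sketch: the engine interfaces and the confidence threshold are hypothetical stand-ins for calls to commercial translation APIs):

```python
def translate_with_fallback(sentence, primary, supplementary,
                            confidence_threshold=0.8):
    """Translate with a primary engine; consult a supplementary engine
    when the primary reports difficulty (low confidence).

    Each engine is a callable returning (translated_text, confidence).
    """
    result, confidence = primary(sentence)
    if confidence >= confidence_threshold:
        # Primary engine is confident; no need for the supplementary engine.
        return result
    # Primary engine encountered difficulty: bring in the supplementary
    # engine and keep whichever result carries the higher confidence.
    alt_result, alt_confidence = supplementary(sentence)
    return alt_result if alt_confidence > confidence else result
```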


Functionality provided herein that executes in the cloud, on the server, and/or on the speaker's device can instantaneously determine which translation and transcript versions are more accurate and appropriate at any given point in the session. The system can toggle between the multiple translation engines in use to produce the best possible result for speakers and participants based on their selected languages, the other factors listed above, and their transcript needs.


A model of translation can effectively be built based on the specific factors mentioned above, as well as the number and location of participants and the complexity and confidentiality of the subject matter, and further based on the strengths and weaknesses of the available translation engines. The model can be built and adjusted on a sentence-by-sentence basis and can dynamically choose which translation engine or combination thereof to use.


Context can be established and dynamically adjusted as a meeting session proceeds. Context of captured and translated material can be carried across speakers and languages and from one sentence to the next. This action can improve quality of translation, support continuity of a passage, and provide greater value, especially to participants not speaking the language of a presenter.


Individual portions (e.g., sentences) of captured speech are not analyzed and translated in isolation from one another but instead in context of what has been said previously. As noted, carrying of context can occur across speakers such that during a session, for example a panel discussion or conference call, context can be carried forward, broadened out, and refined based on the spoken contribution of multiple speakers. The system can blend the context of each speaker's content into a single group context such that a composite context is produced of broader value to all participants.


A glossary of terms can be developed during a session or after a session. The glossary can draw upon a previously created glossary of terms. The system can adaptively change a glossary during a session. The system can detect and extract key terms and keywords from spoken content to build and adjust the glossary.
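Detecting key terms from spoken content to grow the glossary could be sketched as follows (a naive frequency heuristic for illustration only; the disclosure contemplates NLP-based extraction, and the names here are hypothetical):

```python
from collections import Counter


def extract_key_terms(transcript_text, stop_words, min_count=3):
    """Return candidate glossary terms: words repeated at least
    min_count times in the transcript, excluding stop words."""
    words = [w.strip(".,!?;:").lower() for w in transcript_text.split()]
    counts = Counter(w for w in words if w and w not in stop_words)
    return {word for word, count in counts.items() if count >= min_count}
```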


The glossary and contexts developed can incorporate preferred interpretations of some proprietary or unique terms and spoken phrases and passages. These can be created and relied upon in developing context, creating transcripts, and performing translations for various audiences. Organizations commonly create and use acronyms and other terms to facilitate and expedite internal communications. Glossaries for specific participants, groups, and organizations could therefore be built, stored and drawn upon as needed.


Services are provided for building transcripts as a session is ongoing and afterward. Transcripts are created and can be continuously refined during the session. Transcript text is displayed on monitors of parties in their chosen languages. Transcript text of the session can be finalized after the session has ended.


The transcript can rely on previously developed glossaries. In an embodiment, a first transcript of a conference can use a glossary appropriate for internal use within an organization, and a second transcript of the same conference can use a general glossary more suited for public viewers of the transcript.


Systems and methods also provide for non-verbal sounds to be identified, captured, and highlighted in transcripts. Laughter and applause, for example, can be identified by the system and highlighted in a transcript, providing further context.


In an embodiment, a system for using cloud structures in real time speech and translation involving multiple languages is provided. The system comprises a processor (e.g., processor device), a memory (e.g., memory device or other type of storage device), and an application stored in the memory that when executed on the processor receives audio content in a first spoken language from a first speaking device. The system also receives a first language preference from a first client device, the first language preference differing from the spoken language. The system also receives a second language preference from a second client device, the second language preference differing from the spoken language. The system also transmits the audio content and the language preferences to at least one translation engine. The system also receives the audio content from the engine translated into the first and second languages and sends the audio content to the client devices translated into their respective preferred languages.


The application selectively blends translated content provided by the first translation engine with translated content provided by the second translation engine. It blends such translated content based on factors comprising at least one of the first spoken language and the first and second language preferences, subject matter of the content, voice characteristics of the spoken audio content, demonstrated listening abilities and attention levels of users of the first and second client devices, and technical quality of transmission. The application dynamically builds a model of translation based at least upon one of the preceding factors, based upon locations of users of the client devices, and based upon observed attributes of the translation engines.


In another embodiment, a method for using cloud structures in real time speech and translation involving multiple languages is provided. The method comprises a computer receiving a first portion of audio content spoken in a first language. The method also comprises the computer receiving a second portion of audio content spoken in a second language, the second portion spoken after the first portion. The method also comprises the computer receiving a first translation of the first portion into a third language. The method also comprises the computer establishing a context based on at least the first translation. The method also comprises the computer receiving a second translation of the second portion into the third language. The method also comprises the computer adjusting the context based on at least the second translation.


Actions of establishing and adjusting the context are based on factors comprising at least one of subject matter of the first and second portions, settings in which the portions are spoken, audiences of the portions including at least one client device requesting translation into the third language, and cultural considerations of users of the at least one client device. The factors further include cultural and linguistic nuances associated with translation of the first language to the third language and translation of the second language to the third language.


In yet another embodiment, a system for using cloud structures in real time speech and translation involving multiple languages and transcript development is provided. The system comprises a processor, a memory, and an application stored in the memory that when executed on the processor receives audio content comprising human speech spoken in a first language. The system also translates the content into a second language and displays the translated content in a transcript displayed on a client device viewable by a user speaking the second language.


The system also receives at least one tag in the translated content placed by the client device, the tag associated with a portion of the content. The system also receives commentary associated with the tag, the commentary alleging an error in the portion of the content. The alleged error can concern at least one of translation, contextual issues, and idiomatic issues. The system also corrects the portion of the content in the transcript in accordance with the commentary. The application verifies the commentary prior to correcting the portion in the transcript.
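The verify-then-correct flow described above can be pictured in miniature (a hedged sketch: the tag-keyed transcript, the `verify` callable, and the `suggested_text` field are hypothetical simplifications):

```python
def apply_correction(transcript, tag, commentary, verify):
    """Correct the tagged portion of a transcript only after the
    commentary alleging an error has been verified.

    transcript: dict mapping tag -> portion text.
    commentary: dict carrying at least a 'suggested_text' replacement.
    verify: callable(original_text, commentary) -> bool.
    """
    original = transcript.get(tag)
    if original is None or not verify(original, commentary):
        # Unverified or unknown tag: leave the transcript unchanged.
        return False
    transcript[tag] = commentary["suggested_text"]
    return True
```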


Referring now to FIG. 1A, a block diagram of a transcribing and translating system 10 is shown with four participants 11-14 in communication with a cloud structure 110. Each of the four participants can speak a different language (language 1 through language 4) or one or more can speak the same language while a few speak a different language. A first participant 11 is a speaker while the other three participants 12-14 are listeners. If a different participant speaks, the other three participants become listeners. That is, each participant can both be a speaker and a listener. For ease in explanation, we consider the first participant to be the speaker and the other participants listeners. The plurality of participants are part of a group in a meeting or conference to communicate with each other. Some or all of the participants can participate locally or some or all can participate remotely as part of the group.


Very low latency by the software application in delivering voice transcription and language translation enables conferences to progress naturally, as if attendees were together in a single venue. The transcription and translation are near instantaneous. Once a speaker finishes a sentence, it is translated. The translation can introduce a slight, and in many cases imperceptible, delay before a listener can hear the sentence in his/her desired language with text to speech conversion. Furthermore, speaking by a speaker often occurs faster than a recipient can read the translated transcript of that speech in his/her desired language. Because of lag effects associated with waiting until a sentence is finished before it can be translated and presented in the chosen language of a listening participant, the speech as heard by the listener in his/her desired language can be sped up slightly so that it seems synchronized. The speed of text to speech conversion is therefore adaptive for better intelligibility and user experience. The speed of speech can be adjusted in either direction (faster or slower) to adjust for normalcy and the tenor of the interaction. The speaking rate can be adjusted for additional reasons. A "computer voice" used in the text to speech conversion can naturally speak faster or slower than the presenter. The translation of a sentence can include more or fewer words to be spoken than in the original speech of the speaker. In any case, the system ensures that the listener does not fall behind because of these effects.
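The adaptive text to speech speed described above can be sketched as a simple rate calculation. The function below is an illustrative sketch, not taken from the disclosure: the clamping range and the rule of fitting the synthesized speech into roughly the speaker's elapsed time are assumptions, and a real system would obtain the durations from its audio and synthesis engines.

```python
def playback_rate(speaker_duration_s, synthesized_duration_s,
                  min_rate=0.8, max_rate=1.25):
    """Choose a text-to-speech playback rate so the synthesized
    translation fits roughly in the time the speaker took.

    A rate > 1.0 speeds the computer voice up; < 1.0 slows it down.
    The rate is clamped (assumed bounds) so speech stays intelligible.
    """
    if speaker_duration_s <= 0 or synthesized_duration_s <= 0:
        return 1.0  # no usable timing information; play at normal speed
    rate = synthesized_duration_s / speaker_duration_s
    return max(min_rate, min(max_rate, rate))
```

For example, if the translation would take 6 seconds to speak but the original sentence took 5 seconds, the voice is played at 1.2x so the listener does not fall behind.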


The system can provide quality control and assurance. The system monitors the audio level and audio signals for intelligibility of input. If the audio content is too loud or too soft, the system can generate a visual or audible prompt to the speaker to change his/her speaking volume or another aspect of interaction with his/her client electronic device, such as the distance from a microphone. The system is also configured to identify audio that is not intelligible, is spoken in the wrong language, or is overly accented. The system can use heuristics, or rules of thumb, that have proven successful in the past at maintaining quality. The heuristics can prove sufficient to reach an immediate goal of an acceptable transcription and translations thereof. Heuristics can be generated based on confidence levels on interactive returns of a speaker's previous spoken verbiage.
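As an illustrative sketch of the audio-level check, the function below computes the RMS level of captured samples and returns a prompt for the speaker when the level falls outside a usable range. The thresholds and prompt messages are assumptions for illustration, not values from the disclosure.

```python
import math

def audio_prompt(samples, low_rms=0.05, high_rms=0.7):
    """Return a prompt for the speaker if the captured audio is too
    soft or too loud, else None.

    `samples` are floats in [-1, 1]; the RMS thresholds are
    illustrative assumptions, not values from the disclosure.
    """
    if not samples:
        return "no audio detected"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < low_rms:
        return "please speak louder or move closer to the microphone"
    if rms > high_rms:
        return "please speak more softly or move away from the microphone"
    return None  # level is acceptable
```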


The cloud structure 110 provides real time speech transcription and translation involving multiple languages according to an embodiment of the present disclosure. FIG. 1A depicts the cloud structure 110 having at least one software application with artificial intelligence being executed to perform speech transcription and language translation. When participant 1, the speaker, speaks in his/her chosen language, language 1, it is transcribed and translated in the cloud for the benefit of the other participants 12-14 into the selected language (language 2 through language 4) of those participants so they can read the translated words and sentences associated with the language 1 of the spoken speech of participant 1.


Referring now to FIG. 1B, a block diagram of a transcribing and translating system 100 is shown using the cloud structure 110 in real time speech and translation involving multiple languages, context setting, and transcript development features in accordance with an embodiment of the present disclosure. The transcribing and translating system 100 uses advanced natural language processing (NLP) with artificial intelligence to perform transcription and translation.



FIG. 1B depicts components and interactions of the clients and the one or more servers of the system 100. In a cloud structure 110, one or more servers 102A-102B can be physical or virtual with the physical processors located anywhere in the world. One server 102A can be geographically located to better serve the electronic client devices 106A-106C while the server 102B can be geographically located to better serve the electronic client device 106D. In this case the servers 102A-102B are coupled in communication together to support the conference or meeting between the electronic devices 106A-106D.


The system 100 includes one or more translation and transcription servers 102A-102B executing one or more copies of the translation and transcription application 104A-104B. For brevity, the translation and transcription server 102A-102B can simply be referred to herein as the server 102 and the translation and transcription application 104A-104B can be simply referred to as the application 104. The server 102 executes the application 104 to provide much of the functionality described herein.


The system 100 further includes client devices 106A-106D, with one referred to as a speaker (host) device 106A and the others as listener (attendee) client devices 106B-106D. These components can be identical, as the speaker device 106A and client devices 106B-106D can be interchangeable as the roles of their users change during a meeting or conference. A user of the speaker device 106A can be a speaker (host) or conference leader on one day and on another day can be an ordinary attendee (listener). The roles of the users can also change during the progress of a meeting or conference. For example, the device 106B can become the speaker device while the device 106A can become a listener client device. The speaker device 106A and client devices 106B-106D have different names to distinguish their users, but their physical makeup can be the same, such as a mobile device or desktop computer with hardware functionality to perform the tasks described herein.


The system 100 also includes the attendee application 108A-108D that executes on the speaker device 106A and client devices 106B-106D. As speaker and participant roles can be interchangeable from one day to the next as described briefly above, the software executing on the speaker device 106A and client devices 106B-106D is the same or similar depending on whether a person is a speaker or participant. When executed by the devices 106A-106D, the attendee application 108A-108D can provide the further functionality described herein (e.g., a graphical user interface).


On-Demand System

The transcribing and translating system 100 is an on-demand system. In the cloud 110, the system 100 includes a plurality of computing resources including computing power with physical resources widely dispersed and with on-demand availability. As a presentation or meeting is progressing, a new participant can join the presentation or meeting in progress and obtain transcription and translation on demand in his or her desired language. The system 100 does not need advance knowledge of the language spoken or the user desired languages into which the translation is to occur. The cloud platform includes at least one server that can start up transcription engines and translation engines on demand. As shown in FIG. 1B, the cloud 110 includes translation engines 112A-112D and transcription engines 113A-113D that can be drawn upon by the server application 104A,104B and the attendee applications 108A-108D executing on the client devices 106A-106D. The system can start up a plurality of transcription engines 113A-113D and translation engines 112A-112D upon demand by the participants as they join a meeting.


Typically, one transcription engine 113A-113D per participant is started up as shown. If each participant speaks a different language, then typically, one translation engine 112A-112D per participant is started up as shown. The translation engine adapts to the input language that is currently being spoken and transcribed. If another person speaks a different language, the translation adapts to the different input language to maintain the same output language desired by the given participant.
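The on-demand startup of engines can be sketched as a pool that creates one transcription engine per participant and shares one translation engine per requested output language. The class and the engine-name strings below are hypothetical placeholders for whatever engine processes the cloud platform actually starts.

```python
class EnginePool:
    """Sketch of on-demand engine startup: one transcription engine
    per participant; translation engines shared per target language.
    Engine names are illustrative placeholders."""

    def __init__(self):
        self.transcription = {}   # participant id -> engine
        self.translation = {}     # target language -> engine

    def join(self, participant, target_language):
        # Start a transcription engine for this participant if needed.
        self.transcription.setdefault(
            participant, f"transcribe[{participant}]")
        # Reuse an existing translation engine for this output language.
        self.translation.setdefault(
            target_language, f"translate[->{target_language}]")
        return (self.transcription[participant],
                self.translation[target_language])
```

Two participants who both select Spanish, for example, would share a single translation engine while each keeping their own transcription engine.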


Client-Server Devices

Referring now to FIG. 1C, an instance of a client electronic device 106 is shown for the client electronic devices 106A-106D of FIG. 1B. The client electronic device can be a mobile device, tablet, or laptop or desktop computer. The electronic device includes a processor 151 and a memory 152 (e.g., memory device or other type of storage device) coupled to the processor 151. The processor 151 executes the operating system (OS) and the attendee application 108.


The speaker speaks in his/her chosen language into a microphone 154 connected to the client device 106. The client device 106 executes the attendee application 108 to process the speech spoken into the microphone into audio content. The client electronic device 106 further includes a monitor 153 or other type of viewing screen to display the translated transcript text of the speech in the user's chosen language. The translated transcript text of the speech can be displayed within a graphical user interface (GUI) 155 displayed by the monitor 153 of the client electronic device 106.


Referring now to FIG. 1D, an instance of a server system 102 is shown for the one or more servers 102A-102B of FIG. 1B. The server system 102 comprises a processor 171, and a memory 172 or other type of data storage device coupled to the processor 171. The translation and transcription application 104 is stored in the memory 172 and executed by the processor 171. The translation and transcription application 104 can start up one or more transcription engines 113 in order to transcribe one or more speakers' spoken words and sentences (speech) in their native language and can start up one or more translation engines 112 to translate the transcription into one or more foreign languages of readers and listeners of a text to speech service. Many languages can be accommodated simultaneously in a single group conversation. Accordingly, a plurality of translation engines 112 can operate in parallel to simultaneously translate the speaker's transcription into a plurality of different languages that are selected by the readers/listeners.


Models

A translation model 132 and a transcription model 133 are dynamically built by the translation and transcription application 104 and can be stored in the memory 172. The translation model 132 and the transcription model 133 are for the specific meeting session of services provided to the participants shown by FIGS. 3A-3E. The translation model (model of translation) 132 and the transcription model 133 can be based on the locations of users of the client devices, and on observed attributes of the translation engines and the transcription engines (e.g., selected reader/listener languages, spoken languages, and translations made between languages). Additional factors that can be used by the models are at least one of the first spoken language and the first and second language preferences, the subject matter of the content of speech/transcription (complexity, confidentiality), voice characteristics of the spoken audio content, demonstrated listening abilities and attention levels of users of the first and second client devices, technical quality of transmission, and strengths and weaknesses of the transcription and translation engines. The models are dynamic in that they adapt as participants join and/or drop out of the meeting, as different languages are spoken or selected to provide different services, and as other factors change. The models can be built and adjusted on a sentence by sentence basis. The models can dynamically choose which translation and transcription engines to use in order to support the meeting and the participants. In other words, these are models of the system that can learn as the meeting is started and as the meeting progresses.
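As a minimal sketch of how a dynamic model might choose among engines, the function below picks the engine with the best running average of per-sentence confidence scores accumulated as the meeting progresses. The scoring scheme is an assumption; the disclosure lists the factors the models consider but does not specify a ranking algorithm.

```python
def choose_engine(engines, observed_scores):
    """Pick the engine with the best running quality score.

    `observed_scores` maps engine name -> list of per-sentence
    confidence values (an assumed representation) accumulated as
    the meeting progresses.  Engines with no history score 0.0.
    """
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0
    return max(engines, key=lambda e: mean(observed_scores.get(e, [])))
```

Because the scores are updated sentence by sentence, the selection can shift during the meeting, which mirrors the models being adjusted on a sentence by sentence basis.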


Context and Glossaries

The context of spoken content in a meeting, which clarifies meaning, can be established from the first few sentences that are spoken and translated. The context can be established from what is being spoken as well as the environment and settings in which the speaker is speaking. The context can be established from one or more of the subject matters being discussed, the settings in which the sentences or other parts are spoken, the audience to which the sentences are being spoken (e.g., the requests for translations into other languages on client devices) and cultural considerations of the users of the client devices. Further context can be gathered from the cultural and linguistic nuances associated with the translations between the languages.


The context can be dynamically adjusted as a meeting session proceeds. The context of the captured, transcribed, and translated material can be carried across speakers, languages, and from one sentence to the next. This action of carrying the context can improve the quality of a translation, support the continuity of a passage, and provide greater value, especially to listening participants that do not speak or understand the language of a presenter/speaker.


As discussed herein, individual portions (e.g., sentences, words, phrases) of captured and transcribed speech are not analyzed and translated in isolation from one another. Instead, the transcribed speech is translated in the context of what has been said previously. As noted, the carrying of the context of speeches can occur across speakers during a meeting session. For example, consider a panel discussion or conference call where multiple speakers often make speeches or presentations. The context, the meaning of the spoken content, can be carried forward, broadened out, and refined based on the spoken contribution of the multiple speakers. The system can blend the context of each speaker's content into a single group context such that a composite context is produced of broader value to all participants. The one or more types of context 134 can be stored in memory 172 or other storage device that can be readily updated.


For a meeting session, the system can build one or more glossaries 135 of terms for specific participants, groups, and organizations that can be stored in memory 172 or other storage device of the server 102 as is shown in FIG. 1D. Organizations commonly create and use acronyms and other terms to facilitate and expedite internal communications. Glossaries of these terms for specific participants, groups, and organizations can therefore be built, stored, and drawn upon as needed. The system can detect and extract key terms and keywords from spoken content to build and adjust the glossaries.


A glossary of terms can be developed during a session or after a session. The glossary can draw upon a previously created glossary of terms. The system can adaptively change a glossary during a session.
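Key-term extraction for building a glossary could, in a minimal sketch, keep the most frequent non-stopword tokens of a session transcript. A production system would use NLP term extraction; simple frequency counting and the small stopword list below are illustrative stand-ins.

```python
from collections import Counter

# Assumed, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def extract_key_terms(transcript_text, top_n=5):
    """Build candidate glossary terms from a session transcript by
    keeping the most frequent non-stopword tokens (a stand-in for
    real NLP term extraction)."""
    words = [w.strip(".,").lower() for w in transcript_text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]
```

Run over a transcript in which an organization-specific acronym recurs, the acronym surfaces at the top of the candidate list and can be added to that organization's glossary.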


The glossaries 135 and contexts 134 developed can incorporate preferred interpretations of some proprietary or unique terms and spoken phrases and passages. These can be created and relied upon in developing context, creating transcripts, and performing translations for various audiences.


The transcript can rely on previously developed glossaries. In an embodiment, a first transcript of a conference can use a glossary (private glossary) appropriate for internal use within an organization. A second transcript of the same conference can use a general glossary (public glossary) more suited for public viewers of the transcript of the conference.
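Applying a private or public glossary to transcribed text can be sketched as a term-by-term substitution, in the spirit of the boost/filter/replace behavior summarized in the abstract. The action names and the "[redacted]" marker below are illustrative assumptions; boosting, which would bias the recognizer toward a term before transcription rather than alter text afterward, is noted in a comment but not modeled.

```python
def apply_glossary(text, glossary):
    """Alter transcribed text according to a glossary (sketch).

    Each glossary entry maps a term to an (action, value) pair:
    ("replace", new_term) substitutes a preferred term, and
    ("block", None) redacts it, e.g. for a public transcript.
    (A "boost" action would instead bias the recognizer toward the
    term before transcription and is not shown here.)
    """
    out = []
    for word in text.split():
        action, value = glossary.get(word, ("keep", None))
        if action == "replace":
            out.append(value)
        elif action == "block":
            out.append("[redacted]")
        else:
            out.append(word)
    return " ".join(out)
```

A private glossary might replace a mis-transcribed homophone with the organization's preferred term, while a public glossary for the same conference blocks confidential project names.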


Services

Referring now to FIGS. 3A-3D, a speaker speaks in his/her chosen language (e.g., language 1) into a microphone 154 connected to the client device 106. The microphone device 154 forms audio content (e.g., a speech signal) from the spoken language. The audio content spoken in the first language (language 1) is sent to the server 102 in the cloud 110.


The server 102 in the cloud provides a transcription service converting the speech signal from a speaker into transcribed words of a first language. A first transcription engine 113A can be called to transcribe the first attendee (speaker) associated with the electronic device 106A. If other attendees speak, additional transcription engines 113B-113D can be called up by the one or more servers and used to transcribe their respective speech from their devices 106B-106D in their respective languages.


For the client device 106B, the server 102 in the cloud further provides a translation service by a first translation engine 112A to convert the transcribed words in the first language into transcribed words of a second language differing from the first language. Additional server translation engines 112B-112C can be called on demand, if different languages are requested by other attendees at their respective devices 106C-106D of the group meeting. If a plurality of client devices 106B-106C request the same language translation of the transcript, only one translation engine need be called into service by the server and used to translate the speaker transcript. The translated transcript in the second language can be displayed on a monitor M.


In FIG. 3D, an attendee may desire to listen to the translated transcript in the second language as well. In which case, a text to speech service can be used with the translated transcribed words in the second language to provide a speech signal. The speech signal can drive a loudspeaker 354 to generate spoken words from the translated transcript in the second language. In some embodiments a client electronic device 106 with a loudspeaker can provide the text to speech service and generate a speech signal. In other embodiments, the server 102 can call up a text to speech engine with a text to speech service and generate a speech signal for the loudspeaker 354 of a client electronic device 106.


Referring now to FIG. 3E, a block diagram is shown of the services being provided by the client server system to each attendee in a group meeting. The services allow each attendee to communicate in their own respective language in the group meeting with the other attendees, who can understand different languages. Each attendee can have their own transcript service to transcribe their audio content into text of their selected language. Each attendee can have their own translate service to translate the transcribed text of others into their selected language so that it can be displayed on a monitor M and read by the respective attendee in their selected language. Each attendee can have their own text to speech (synthesis) service to convert the translated transcribed text in their selected language into audio content that can be played by a loudspeaker and listened to by the respective attendee in their selected language.


Referring now to FIG. 4, a conceptual diagram of the transformation process by the system is shown. The spoken content in a meeting conference is transformed into a transcription of text and then undergoes multi-language translation into a plurality of transcriptions in different languages representing the spoken content.


The audio content 401 is spoken in a first language, such as English. While speech recognition applications typically work word by word, voice transcription of speech into a text format works on more than one word at a time, such as phrases, based on the context of the meeting. For example, speech to text recognizes the portions 402A-402 of the audio content 401 as each respective word of the sentence, Eat your raisins out-doors on the porch steps. However, transcription works on converting the words into proper phrases of text based on context. For example, the phrase 404A of words Eat your raisins is transcribed first, then the phrase 404B out-doors is transcribed, and then the phrase 404C on the porch steps is transcribed into text. The entire sentence is checked for proper grammar and sentence structure. Corrections are made to the sentence as needed and the text of the sentence is fully transcribed for display on one or more monitors M of participants who desire to read the first language, English. For example, participants that selected the first language English to read would directly 413, without language translation, each have a monitor or display device to display a speech bubble 410 with the sentence "Eat your raisins out-doors on the porch steps". However, participants that selected a different language to read need further processing of the audio content 401 that was transcribed into a sentence of text in the first language, such as English.


A plurality of translations 412A-412C of the first language (English) transcript are made for a plurality of participants that want to read a plurality of different languages (e.g., Spanish, French, Italian) that differ from the first language (e.g., English) that was spoken by the first participant/speaker. A first translation 412A translates the first transcript in the first language into the second language generating a second transcript 420A of text in the second language. Assuming Spanish was selected to be read, a monitor or display device displays a speech bubble 420A of the sentence of translated transcribed text such as "Coma sus pasas al aire libre en los escalones del porche". Simultaneously for another participant, translation 412B of the first transcript in the first language into a third language generates a third transcript 420B of text in the third language. Assuming French was selected to be read, a monitor or display device displays a speech bubble 420B of the sentence of translated transcribed text such as "Mangez vos raisins secs à l'extérieur sur les marches du porche". Simultaneously for another participant, translation 412C of the first transcript in the first language into a fourth language generates a fourth transcript 420C of text in the fourth language. Assuming Italian was selected to be read, a monitor or display device displays a speech bubble 420C of the sentence of translated transcribed text such as "Mangia l'uvetta all'aperto sui gradini del portico".


The server identifies ends of sentences in order to make final translations. Once a speaking participant finishes speaking a sentence and it is transcribed into text of his/her native language, it can be translated into the other languages that are selected by the participants. That is, in one embodiment, the translation from one language to another works on an entire sentence at a time based on the context of the meeting. In another embodiment, the translation process can chunk a sentence into multiple phrases of a plurality of words and separately translate the multiple phrases, particularly if the sentence is very long or words are spoken slowly. To avoid significant lag, a translation engine may work on phrases of a sentence in parallel without waiting for a translation of an earlier phrase. With the translated chunks of the sentence, a final translation can then be made over the entire sentence based on the context of the meeting, checking the accuracy of the phrase translations and updating as needed.
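The chunked translation of a long sentence can be sketched as follows. The fixed word-count chunking and the caller-supplied translate_phrase function are assumptions for illustration; a real engine would chunk on phrase boundaries, run the phrase translations in parallel, and make a context-aware final pass.

```python
def translate_sentence(sentence, translate_phrase, chunk_size=3):
    """Chunk a sentence into phrases of `chunk_size` words, translate
    the phrases independently (in a real system, in parallel), then
    join them for the final whole-sentence result.

    `translate_phrase` is a caller-supplied phrase translator; the
    chunking rule is an illustrative assumption.
    """
    words = sentence.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    translated = [translate_phrase(c) for c in chunks]
    # A final pass over the whole sentence could correct phrase-level
    # errors using meeting context; here we simply join the chunks.
    return " ".join(translated)
```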


Other participants can speak and use a language different from the first language. For example, the participant that selected the second language, such as Spanish, can speak. This audio content 401 is spoken in the second language. Speech to text recognizes the portions 402A-402 of the audio content 401 as each respective word of the sentence, and it is transcribed into the second language. The other participants will then desire translations from the text of the second language into text of their respective selected languages. The system adapts to the user that is speaking and makes translations for those that are listening in different languages. Assuming each participant selects a different language to read, each translation engine 112A-112D shown in FIG. 1B adapts to the plurality (e.g., three) of languages that can be spoken to translate the original transcription from and into their respective selected language for reading.


With a translated transcript of text, each participant can choose to hear the sentence in the speech bubble in their selected language. A text to speech service can generate the audio content. The audio content can then be processed to drive a loudspeaker so the translation of the transcript can be listened to as well.


Graphical User Interfaces

Referring now to FIG. 5A, the attendee client application 108 generates a graphical user interface (GUI) 155 that is displayed on a monitor or display device 153 of the electronic device 106. The GUI 155 includes a language selector menu 530 from which to select the desired language the participant wants to read and optionally listen to as well. A mouse, a pointer or other type of GUI input device can be used to select the menu 530 and display a list of a plurality of languages from which one can be selected. The GUI 155 can further include one or more control buttons 510A-510D that can be selected with a mouse, a pointer or other type of GUI input device. The one or more control buttons 510A-510D and the menu 530 can be arranged together in a control panel portion 501A of the GUI 155.


A display window portion 502A of the GUI 155 receives a plurality of speech bubbles 520A-520C, each displaying one or more translated transcribed sentences for reading by a participant in his/her selected language. The speech bubbles can display text transcribed and translated from speech spoken by the same participant or from speech spoken by two or more participants. Regardless of the language that is spoken by the two or more participants, the text is displayed in the language selected by the user.


The speech bubbles can be selected by the user and then highlighted or tagged, such as shown by the tag 550 applied to speech bubble 520B in FIG. 5A. The one or more control buttons 510A-510D can be used to control how the user interacts with the GUI 155.



FIGS. 5B-5C illustrate other user interfaces that can be supported by the system 100.


Tags, Highlights, Annotations and Running Meeting Transcripts


FIG. 2 illustrates an example of a running meeting transcript 200. The entire spoken audio content captured during a meeting session is transformed into text 202-205 by the speech to text service of a transcription engine. The text 203-205 is further translated by each translation engine of the system if multiple speakers using different languages are involved. Some text 202, if already in the desired language of the transcript, need not be translated by a translation engine. The transcript text is translated in real time and displayed in speech bubbles on client devices in their requested language.
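The rule that text already in the viewer's language bypasses translation can be sketched as below. The (language, text) entry format and the caller-supplied translate function are illustrative assumptions.

```python
def render_transcript(entries, desired_language, translate):
    """Build a running transcript in the viewer's language,
    translating only entries spoken in a different language.

    Each entry is an assumed (language, text) pair; `translate` is a
    caller-supplied function taking (text, source, target).
    """
    lines = []
    for language, text in entries:
        if language == desired_language:
            lines.append(text)  # already in the desired language
        else:
            lines.append(translate(text, language, desired_language))
    return lines
```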


Participants can interact with the transcript 200 through the speech bubbles displayed on their display devices. The participants can quickly tag the translated transcript text with one or more tags 210A-210B as shown in FIG. 2. Using the software executed on their devices, participants can also submit annotations 211 to their running meeting transcript 200 to highlight portions of a meeting. The submitted annotations can summarize, explain, add to, and question portions of transcribed text.


Multiple final meeting transcripts can be generated based on a meeting that can have a confidential nature to it. In which case, a first transcript of the meeting conference can use a glossary (private glossary) appropriate for internal use within an organization. A second transcript of the same meeting conference can use a general glossary (public glossary) more suited for public viewers of the transcript.


When a participant, whether speaker or listener, sees what he/she believes is a translation or other error (e.g., contextual issue or idiomatic issue) in the transcript, the participant can tag or highlight the error for later discussion and correction. Participants are enabled, as the session is ongoing and translation is taking place on a live or delayed basis, to provide tagging of potentially erroneous words or passages. The participant can also enter corrections to the transcript during the session. The corrections can be automatically entered into an official or secondary transcript. Alternatively, the corrections can be held for later review and official entry into the transcript by others, such as the host or moderator.
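The tag-and-correct workflow, with corrections held for later review by a host or moderator rather than applied immediately, might be sketched as follows. The class and method names are hypothetical.

```python
class TranscriptReview:
    """Sketch of the correction workflow: participant tags and
    proposed corrections are collected during a live session, and a
    moderator later approves them into the official transcript."""

    def __init__(self, transcript):
        self.transcript = list(transcript)  # list of sentences
        self.pending = []                   # (index, correction, note)

    def tag(self, index, correction, note=""):
        # Record a proposed correction without altering the transcript.
        self.pending.append((index, correction, note))

    def approve_all(self):
        # Moderator accepts every pending correction.
        for index, correction, _ in self.pending:
            self.transcript[index] = correction
        self.pending.clear()
        return self.transcript
```

An automatic-entry embodiment would simply apply the correction in `tag` instead of queueing it.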


Transcripts can be developed in multiple languages as speakers make presentations and participants provide comments and corrections. The software application can selectively blend translated content provided by one translation engine with translated content provided by other translation engines. During a period of the meeting conference, one translation engine can translate better than the other translation engines based on one or more factors. The application can selectively blend translated content based on the first spoken language, the language preferences, subject matter of the content, voice characteristics of the spoken audio content, demonstrated listening abilities and attention levels of users at their respective client devices, and the technical quality of transmission.
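Selective blending of translated content from several engines could, in a minimal sketch, pick the higher-confidence candidate at each sentence position. The data shape and the per-sentence confidence scores are assumptions; the disclosure lists the blending factors but not a concrete algorithm.

```python
def blend_translations(candidates):
    """Blend output from several translation engines by picking, for
    each sentence position, the candidate with the highest confidence.

    `candidates` maps engine name -> list of (sentence, confidence)
    pairs; this representation is an illustrative assumption.
    """
    n = min(len(sentences) for sentences in candidates.values())
    blended = []
    for i in range(n):
        best = max(candidates.values(), key=lambda sents: sents[i][1])
        blended.append(best[i][0])
    return blended
```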


Participants can annotate transcripts while the transcripts are being created. Participants can mark or highlight sections of a transcript that they find interesting or noteworthy. A real time running summary (running meeting transcript) can be generated for participants unable to devote full attention to a conference. For example, participants can arrive late or be distracted by other matters during the meeting conference. The running summary (running meeting transcript) can allow them to review what was missed before they arrived or while they were distracted.


The system can be configured by authorized participants to isolate selected keywords to capture passages and highlight other content of interest. When there are multiple speakers, for example during a panel discussion or conference call, the transcript can identify the speaker of translated transcribed text. Summaries limited to a particular speaker's contribution can be generated while other speakers' contributions may not be included or can be limited in selected transcriptions.
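Keyword isolation and per-speaker summaries can be sketched as a filter over a speaker-attributed transcript. The (speaker, sentence) entry format and the substring keyword match are assumptions for illustration.

```python
def speaker_summary(transcript, speaker=None, keywords=None):
    """Filter a running transcript down to one speaker's contribution
    and/or to sentences containing configured keywords.

    Each entry is an assumed (speaker, sentence) pair; keyword
    matching is a simple case-insensitive substring test.
    """
    keywords = [k.lower() for k in (keywords or [])]
    out = []
    for who, sentence in transcript:
        if speaker is not None and who != speaker:
            continue  # limit to the selected speaker's contribution
        if keywords and not any(k in sentence.lower() for k in keywords):
            continue  # keep only passages containing a keyword
        out.append((who, sentence))
    return out
```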


User Interfaces with Dual/Switchable Translations


Systems and methods described herein provide for listener verification of translation of content spoken in a first language displayed in a text format of the translated content in a second language of the listener's choosing. A speaker of content in the first language can have his/her content translated for the benefit of an audience that wishes to hear and read the content in a chosen second language. While the speaker is speaking in his/her own language and the spoken content is being translated on a live basis, the spoken content is provided in translated text form in addition to the translated audio.


The present disclosure concerns the translation of the content spoken in the first language into translated text in the second language, and situations in which the text in the second language translation may not be clear or otherwise understandable to the listener/reader. The system 100 further provides the listener/reader a means to select the translated and displayed text and be briefly provided a view of the text in the spoken or first language. The listener/reader can thus get clarification of what the speaker said in the speaker's own language, as long as the listener/reader can read in the speaker's language.


Referring now to FIG. 5B, the speaker and the listener/reader (participants) can view a graphical user interface (GUI) 155 on a monitor 153 of an electronic device. The electronic device can be a mobile device, tablet, or laptop or desktop computer, for example. The speaker's spoken content is viewable in the speaker's language in a first panel 502B of the interface, for example a left-hand panel or pane. The first panel 502B illustrates a speech bubble 520D with an untranslated transcription. The content is then displayed as translated text 520E in the listener's language in a second panel 502A of the interface, for example a right-hand panel or pane.


In one embodiment, the left-hand panel 502B may not be viewable by the listener/reader to avoid confusion, such as shown by FIG. 5A. In another embodiment, one of the control buttons 510A-510D can be used to view the left-hand panel 502B, particularly when a listener becomes a speaker in the meeting, such as when asking questions or becoming the host.


As the speaker speaks in a first language (e.g., French), the system can segment the speaker's spoken content into logical portions, for example individual sentences or small groups of sentences. If complete sentences are not spoken, utterances can be translated. The successive portions of the spoken content can be displayed as text in the listener's chosen language (e.g., English) in speech cells or bubbles 520A-520D of the listener's panel 502A.
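The segmentation of spoken content into logical portions described above can be sketched as a simple sentence splitter. The following is a minimal illustration only, assuming punctuation-based sentence boundaries; the function name, grouping rule, and regular expression are illustrative assumptions, not part of the disclosed system:

```python
import re

def segment_speech(transcribed_text: str, max_sentences: int = 2) -> list[str]:
    """Split a running transcription into bubble-sized portions:
    individual sentences or small groups of sentences."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', transcribed_text.strip())
    sentences = [s for s in sentences if s]
    # Group up to `max_sentences` sentences per speech bubble or cell.
    return [' '.join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]
```

If no sentence-ending punctuation has yet arrived (an incomplete utterance), the whole fragment is returned as a single portion, mirroring the handling of utterances described above.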


The listener can also audibly hear the translated content in his/her chosen language while he/she sees the translated content in text form in the successive bubbles 520A-520D. If the listener is briefly distracted from listening to the spoken translation, he/she can read the successive bubbles 520A-520D to catch up or get a quick summary of what the speaker said. In situations wherein the listener may not be proficient at understanding the audible translation, having the displayed text of the translation can help in understanding the audible translation. For example, if the participants in a room (meeting) insist on everyone using the same translated language for the audible content, a language in which a listener is not proficient, having the displayed text of the translation can help understanding the audible translation.


There may be instances in which a listener is not certain he/she correctly heard what a speaker said. For example, an audio translation may not come through clearly due to lengthy transmission lines and/or wireless connectivity issues. As another example, the listener may have been distracted or may have muted the audio portion of the translated content. As yet another example, the listener may be in a conference room with other persons listening to the presenter on a speaker phone for all to hear, where all of the other participants speak only the translated or second language but the one listener does not. With both the translated panel 502A and the untranslated panel 502B of text displayed by the GUI 155, a listener can read and better understand the translated content in his/her selected displayed language while he/she audibly hears the translated content in a different language.


Referring now to FIG. 5C, instead of the side-by-side panels 502A-502B shown in FIG. 5B, the system can provide an alternate method of showing the untranslated content of a speech bubble 520A-520C. In this case, a listener (participant) who needs clarification can click on or otherwise select the bubble or cell that displays the portion of content about which he/she seeks clarification. For example, the listener (participant) selects the speech bubble 520A in FIG. 5A that is in a translated language (e.g., English—“Translated transcription is viewable here in this panel or window.”) selected by the listener (participant) while the speaker is speaking in a different language (e.g., French). When the listener does so, the speech bubble or cell 520A briefly transforms (switches) from the translated transcribed content (e.g., English) to the transcribed content in the speaker's language (e.g., French), such as shown by the speech bubble 520A′ (“La transcription traduite est visible ici dans ce panneau ou cette fenêtre.”) shown in FIG. 5C.


Referring now to FIG. 5A, an alternate embodiment is shown, instead of the speech bubble or cell 520A transforming into the speech bubble 520A′ shown in FIG. 5C. The listener (participant) selects the speech bubble 520A in FIG. 5A such that a second speech bubble or cell 521A briefly appears nearby in the panel 502A of the user interface 155. The bubble or cell 521A displays the transcribed text content in the speaker's language (e.g., French). The second cell 521A in the panel 502A of the user interface 155 can be displayed for a predetermined period of time (e.g., several seconds) and then disappear. Alternatively, the second cell 521A can be displayed for so long as the user positions or hovers the device's cursor over the speech bubble or cell 520A with the translated transcribed content. Alternatively, the second speech bubble or cell 521A can be displayed until the listener (participant) takes some other explicit action. For example, the user can select one of the control buttons 510A-510D or the second speech bubble 521A itself displayed in the monitor or display device to make the second speech bubble 521A disappear.
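The speech-bubble behavior described above can be modeled as a cell holding both the untranslated and translated text, with a toggle triggered by the listener's click. This is a minimal sketch; the class, field names, and toggle semantics are illustrative assumptions rather than the disclosed implementation:

```python
from dataclasses import dataclass

@dataclass
class SpeechBubble:
    """One transcript cell holding both the original and translated text."""
    original_text: str      # transcription in the speaker's language
    translated_text: str    # translation in the listener's chosen language
    show_original: bool = False

    def toggle(self) -> None:
        """Switch between translated and untranslated views, as when a
        listener clicks the cell for clarification."""
        self.show_original = not self.show_original

    @property
    def display_text(self) -> str:
        return self.original_text if self.show_original else self.translated_text
```

A timed variant could instead revert `show_original` to False after a predetermined period, matching the briefly-appearing second cell 521A.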


Example Remote Conference

Referring now to FIG. 6, consider an example of a conference or a cloud-based meeting involving multiple participants where the host/presenter 601 speaks English (native language) in one room 602 that is broadcast through the internet cloud to a remotely located room 605 including a plurality of people (listeners) 612A-612N. One of the people (participant listener) in the room, such as participant N 612N, can couple his/her electronic device 106 to one or more room loudspeakers 608 and one or more room monitors M 610 to share with the other participants in the room 605. Accordingly, the transcribing and translating system 600 disclosed herein can be configured to audibly broadcast the presenter's spoken content translated into French (foreign language) into the room 605 over one or more loudspeakers 608 therein. The system 600 can be further configured to display the presenter's textual content translated into French text on a monitor M 610 in the remotely located room 605. The translated textual content is displayed in speech bubbles or cells on the monitor 610 in the remotely located room 605. One or more people 612A-612N in the remotely located room 605 can also access the system 600 via their own personal electronic devices. On monitors or display devices 153 of their personal electronic devices 106 shown in FIG. 1C, the people (listeners) 612A-612N can read the displayed content in French text (or other user selected language) while hearing the presenter's spoken content in French language over the loudspeakers 608.


While the presenter 601 is speaking the English language, the attendees (people) 612A-612N in the remote room are hearing the French language and seeing/reading French text. However, consider the case in which a portion of the presenter's broadcast spoken material translated into French does not sound quite right (e.g., a participant identifies the broadcast spoken material as being an inaccurate translation) or does not read quite correctly to one or more attendees (e.g., a participant identifies the written translation as being an inaccurate translation). Jargon and slang in English, in both American English and other variants of English, do not always translate directly into French or other languages. Languages around the world feature nuances and differences that can make translation difficult. This can be particularly true in business conversations wherein industry jargon, buzzwords, and internal organizational jargon simply do not translate well into other languages. The system can assist the attendee if something in the French text in the speech bubble does not read quite right or if something generated in the French language audio is not heard quite right from the loudspeakers. The listening attendee may want to see the English text transcribed from what the presenter/host spoke, particularly if he/she is bilingual or has some understanding of the language that the presenter is speaking.


The attendee can click on or otherwise activate the speech bubble (e.g., bubble 520A′ shown in FIG. 5C) displaying the French text associated with the translated sentence. The system can transform the speech bubble from displaying French text into displaying English text (e.g., bubble 520A shown in FIG. 5A) transcribed from what the presenter said in English. The attendee can read the English text instead of a confusing translation. The transcribed words and sentences the presenter spoke in English are displayed in the speech bubble 520A. In this manner, the attendee listener can momentarily read the untranslated transcribed text of the speaker/host for clarification.


Systems and methods provided herein therefore provide an attendee listening and reading in the attendee's chosen language to review text of a speaker's content in the speaker's own language. The attendee can gain clarity by taking discreet and private action without interrupting the speaker or otherwise disturbing the flow of a meeting or presentation.


Graphical User Interface for Switching Between Spoken Languages

Many persons are bilingual or multilingual with the capability of speaking two or more different languages. A speaker during a meeting or conference may desire to speak different languages while giving a presentation or speech or during a question and answer period. For example, the speaker may want to speak a different language during part of a session to be sure the proper message is given to the audience.


A speaker or presenter can wish to alternate (swap) between a first and second spoken language and have his/her spoken material seamlessly translated into a third language, both audibly and textually, for the benefit of the listening audience. The speaker may know beforehand of the need to quickly switch from speaking the first language to speaking the second language, while the listeners are solely interested in hearing the speaker's material in their desired language, the third language. In embodiments, the listeners may be remote from the speaker and may not be viewing the speaker via video link or may not be viewing the speaker at all.


The speaker, whose primary language can be the first language, may need to speak some content that he/she only knows how to say or is comfortable speaking about in the second language. The speaker may need to paraphrase or read printed or written material verbatim that is in the second language. Possibly there may be a subset of participants of the event with particular interest in a specific portion of the subject matter that is best spoken for their benefit in the second language but which the larger audience still needs to hear and read in the third language.


Such a need for switching by the speaker from the first language to the second language may be sudden, may be in mid-sentence, and may occur with little or no warning to the speaker, to the translation system, or to the audience. An attendee may ask a question that is best answered in the second language, or an interruption can occur where speaking the second language becomes necessary. For example, those who only speak the second language can suddenly join the conference and require answers to questions or subject matter be addressed in the second language. The systems and methods disclosed herein, once configured, allow the speaker to quickly switch between the first and second languages (and vice versa) and notify the system of such switching with a single mouse click or other minimal action.


Referring now to FIG. 1B, the transcribing and translating system 100 described herein can allow a speaker to switch or swap between spoken languages during an electronic conference or meeting session (e.g., audio-video conference) held between electronic devices 106A-106D. In one embodiment, the speaker from his electronic device 106A can notify the transcribing and translation system 100 when he or she desires to change between different spoken languages. A single click of an input device (e.g., mouse) in the graphical user interface can notify the transcribing and translation system 100 upon each change or swap of languages during the session. Receipt of such notification by the transcribing and translation system 100 allows it to seamlessly continue transcribing and translating into an audience's chosen language, without delay, despite a change in spoken language by the speaker, even though such change in language can be sudden and impromptu.


Referring now to FIG. 7A, a graphical user interface (GUI) 155 is shown to a bilingual speaker to configure the system 100 to allow him/her to choose two languages in which to speak during a session that will be translated for the benefit of listeners/readers. A control panel 501C of the GUI 155 includes a first spoken language selector menu 530A and a second spoken language selector menu 530B to select the different first and second languages which the speaker will use to speak in the meeting or conference session. Once configured, the speaker during the session can, with a single mouse click, keystroke, or other input action with an input device, swap languages he/she is speaking. The control panel 501C of the GUI 155 further includes one or more control buttons 510A-510B and/or a control icon 730 to receive the input action from the input device to notify the system the speaker intends to swap between the different first and second languages.


The servers of the system 100 initially expect to transcribe the first spoken language into text and translate the text from the first spoken language into the selected audience language of one or more users. The speaker may know in advance that a majority of his audience will desire a specific audience language. In some embodiments, the control panel 501C can include a translation language selector menu 732 to select the translation language which the majority of the audience may desire. The servers of the system 100 can be further prepared in advance to translate from the two different languages spoken by the speaker into at least the one selected translation language. Different selections for the translation language can be made at each user device 106B-106C by their own language selector menu 530 shown in FIGS. 5A-5C.
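The pre-session configuration captured by the selector menus 530A, 530B, and 732 might be represented as a small data structure from which the servers determine which engine pairs to prepare in advance. The class and method names below are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionLanguageConfig:
    """Pre-session choices made through selector menus 530A, 530B, and 732."""
    first_spoken_language: str    # e.g. "en", chosen via menu 530A
    second_spoken_language: str   # e.g. "fr", chosen via menu 530B
    translation_language: str     # majority audience choice via menu 732

    def engine_pairs_to_prepare(self) -> list[tuple[str, str]]:
        """Translation directions the servers can start up in advance:
        one per (spoken language, translation language) combination."""
        return [(self.first_spoken_language, self.translation_language),
                (self.second_spoken_language, self.translation_language)]
```

For example, an English/French speaker presenting to a Japanese-reading audience would yield the English-to-Japanese and French-to-Japanese directions, consistent with the advance preparation described above.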


During the spoken speech or discussion by the speaker, the servers 102A-102B of the system 100 expect to receive a swap language input signal from the speaker's electronic device. The servers 102A-102B, after receiving the single click or other action of the input device via the control input button or icon, are thereby notified that the speaker has changed languages, recognize the earlier configured alternate language now being spoken, and translate from the alternate language into the audience's chosen language. The audience may be unaware that the speaker has changed languages as the audience remains interested in hearing the content in its own chosen translation language. The selected translation language (audience language) typically differs from the two spoken languages that are chosen by the speaker and thus requires a translation engine.


Systems and methods described herein provide for a speaker to configure at the outset the two languages he/she can speak during a session. The speaker makes a selection into a graphical user interface that notifies the system of the two languages that will need translation services. As discussed previously, a control button 510A-510D can be used to notify the system 100 when the speaker swaps between languages being spoken. Alternatively, the speaker is provided a simple interface with a selectable icon 730 that reads “swap languages” or something similar. The speaker can easily activate the selectable icon 730 as needed to notify the system 100 that the speaker has changed to speak in a different language. The system 100 can readily change to translating and transcribing the speaker's content. The audience may not even notice that the speaker has changed to speaking a different language because the system 100 continues to transcribe and translate so the audience can read and hear the translated spoken content in the same language that the listener requests.


Referring now to FIG. 1B, the system 100 can provide various transcription and translation services such as simultaneously translating a speaker's content spoken in one language into another language and perhaps even yet another language at the request of audience members. The system can also concurrently transcribe and display the translated text on a display device so audience members can read the content in their chosen language in addition to hearing the content in their chosen language. The speaker and the listeners/readers can be remotely located from each other. Each can view a graphical user interface 155 generated by the client application on an electronic device 106A-106D, such as a mobile electronic device (e.g., smart phone, or tablet with a touch screen; or laptop) or a stationary electronic device (e.g., a desktop computer).


When preconfigured by the speaker at the speaker electronic device 106A through the execution of application 108A and generation of the graphical user interface 155 shown in FIG. 7A, the system 100 can assign multiple transcription engines 113A-113C to the speaker for the multiple languages in which he intends to speak. For example, consider the case where the speaker selects English as the first spoken language and French as the second spoken language. The speaker is assigned transcription engine 113A to transcribe spoken English into English text. The speaker is further assigned transcription engine 113B to transcribe spoken French into French text.


The audience of listener attendees, through one or more of their user interfaces 155 generated by execution of the applications 108B-108D by their electronic devices 106B-106D, select to read and listen using the Japanese language. Accordingly, when English is spoken by the speaker, a first translation engine 112A is assigned to client devices to translate English text into Japanese text. When French is spoken by the speaker, a second translation engine 112B assigned to client devices is used to translate French text into Japanese text. The system 100 utilizes the swap language notification from the speaker to change between using the first and second transcription engines 113A-113B and to swap between the first and second translation engines 112A-112B in order to select what translated transcribed text to display and what translated speech to generate for the listeners/readers at their electronic devices 106B-106D.
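The swap-driven selection between the transcription and translation engine pairs can be sketched as a small router keyed on the active spoken language. The engine identifiers are placeholders for engines 113A-113B and 112A-112B; the function names and closure-based design are illustrative assumptions:

```python
def build_engine_router(engines_by_language: dict[str, tuple[str, str]],
                        first_language: str):
    """Return (get_active_engines, on_swap) for a two-language session.

    `engines_by_language` maps a spoken-language code to a
    (transcription_engine_id, translation_engine_id) pair.
    """
    languages = list(engines_by_language)
    state = {"active": first_language}

    def get_active_engines() -> tuple[str, str]:
        # Engine pair currently routing the speaker's audio.
        return engines_by_language[state["active"]]

    def on_swap() -> str:
        # Single-click notification: flip to the other configured language.
        state["active"] = (languages[1] if state["active"] == languages[0]
                           else languages[0])
        return state["active"]

    return get_active_engines, on_swap
```

Each receipt of the swap language input signal would invoke `on_swap`, after which subsequent audio is transcribed and translated by the other engine pair while the audience's output language remains constant.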


Referring now back to FIG. 7A, at the speaker's electronic device 106A, the untranslated transcription of the speaker's spoken content is viewable in one or more untranslated speech bubbles 520D in a first panel 502B of the graphical user interface 155. The language of the untranslated text in the one or more untranslated speech bubbles 520D is the one or more languages spoken by the speaker. The first panel 502B can be a left-side panel or pane, for example. The servers 102A-102B of the system 100 translate the untranslated transcription of the speaker's spoken content into the one or more languages requested by the listener/reader attendees. In one embodiment, the servers 102A-102B of the system 100 also generate audio content from the translated transcription of the speaker's spoken content. In another embodiment, the user or client electronic devices 106A-106B can generate the audio content from the translated transcription of the speaker's spoken content.


At the electronic devices 106B-106D, the listener/readers (audience members) can read the translated transcribed text 520E in the listener's language in a second panel 502A that is displayed by the GUI 155. The translated transcribed text can be displayed in one or more translated transcribed speech bubbles (translated speech bubbles or translated text bubbles) 520E in a right-hand panel or right-hand pane 502A, for example. In some embodiments, the left-hand pane 502B is not viewable by listeners/readers who are not designated as hosts or speakers. The panel 502B can be reserved for when the listener/reader is designated to be a host speaker (or questioner during question and answer) and speaks during the meeting, in which case the panel 502B can display untranslated transcribed text in his/her spoken language.


In addition to being able to read the translated transcribed text, the listener/reader can hear translated spoken content generated by the synthesis service provided by the server system or the client electronic devices. That is, the listener/reader can listen to the translated transcribed text from the speaker's spoken content at the same time as reading the translated transcribed text displayed in the GUI 155.


Referring now momentarily to FIG. 4, as the speaker speaks, the system can segment the speaker's spoken content into logical portions, for example individual sentences, small groups of sentences, or even mere utterances that are not complete sentences. The successive portions of the spoken content can be displayed as translated transcribed text in the listener's panel 502A and in the listener's chosen language in a plurality of speech cells or bubbles 520A-520B, such as shown in FIG. 5A. The listener/reader can hear the translated content in his/her chosen language and can see and read the translated content in text form in his/her chosen languages in the successive bubbles.


The present disclosure provides for the speaker, based on the pre-configuration provided to the system through the user interface, to quickly change spoken languages from a first spoken language to a second spoken language and be confident that transcription and translation will not be interrupted. With one click or keystroke on the graphical user interface, the speaker can instruct the system that he/she is beginning to speak in the second language. The audience of listeners/readers need not be made aware that the speaker has changed his/her spoken language given the seamless transition that can be provided by the system.


As an example, consider a speaker that can speak both English and French, but the bulk of his/her global audience can only speak and read Japanese. The majority of the global audience wants to read and hear all spoken content in Japanese, regardless of whether the speaker is speaking in English or French languages.


The speaker can lead a meeting by speaking both English and French with the flexibility of switching back and forth, at will, between the spoken languages. The transcribing and translating system 100, prior to the meeting, is configured to translate either spoken language as needed. Before the meeting begins, the speaker enters into the system instructions that English and French will be spoken and will need to be translated into at least one other language. In accordance with one embodiment, the speaker can instruct the system that the English or French must be translated into Japanese by means of the translation language selector menu 732. In an alternate embodiment, a member of the audience can instruct the system using the translation language selector menu 732 on his/her graphical user interface that the spoken English or French language spoken by the speaker is to be translated into Japanese.


During the meeting, the speaker can switch at will back and forth between English and French, each time making a single mouse click on a control button 510A-510D or an icon 730 shown in FIG. 7A to notify the system of such change. The audience reading the translated transcribed text and listening to the spoken content in Japanese may be unaware of the speaker's changes from speaking the English language into speaking the French language and from speaking the French language into the English language.


Determining Speaker Language from Transcripts of Presentation


As discussed previously, the graphical user interface can be used by a speaker to set up in advance the use of a control button to switch between two different spoken languages during a meeting or conference session. However, it can be cumbersome for a speaker to notify the system of a switch in spoken languages. A speaker may forget to notify the system of a switch in spoken languages. Furthermore, the speaker may be more than just bilingual and want to speak in more than two languages during the session. Moreover, a swap or switch notification between two languages is not enough to support three or more languages. Accordingly, it is desirable to generate, from the spoken content, a plurality of transcripts of different preselected languages and select the transcription of one or more transcribed sentences that makes the most sense. The language spoken by the speaker is most likely the language associated with the selected transcription. In this manner, the transcribing and translating system 100 can automatically switch translations when the language being spoken changes, without the speaker having to notify the server of the transcribing and translation system 100 of the change.


Referring now to FIG. 7B, a multi-lingual speaker preparing to make a spoken presentation or speech at a cloud based video-conference or meeting session may choose to speak in a plurality of preselected languages. In a control panel 501D of the GUI 155, the speaker can preselect the multiple languages that he/she can speak during the session. The control panel 501D includes a plurality of speaker language selector menus 530A-530N. The speaker preselects each language that he intends to use in the session by means of the plurality of speaker language selector menus 530A-530N. The control panel 501D further includes one or more control buttons 510A-510D. In one embodiment, a speaker can use one of the one or more control buttons 510A-510D to indicate a beginning of his speech and an ending of his speech so that a question-and-answer session can begin, or a next speaker can begin to speak. In an alternate embodiment, a plurality of the control buttons 510A-510D are each assigned a different language of the plurality of preselected languages. The speaker can press one button to notify the system of the language currently being spoken and then press a different button to notify the system of a change to other languages being spoken.


Based on the preselected languages chosen through the use of the plurality of speaker language selector menus 530A-530N, the servers 102A-102B shown in FIG. 1B start up a plurality of transcription engines 113A-113D, one for each preselected language, to try to transcribe the spoken content into each different language. The plurality of transcription engines 113A-113D that are started up for the speaker operate in parallel to try to transcribe the spoken content into text of the plurality of different preselected languages, before translation.


The GUI 155 can include a transcription panel 502B that displays untranslated transcribed speech (untranslated transcribed text) of the speaker. One or more untranslated speech bubbles (untranslated text bubbles or transcribed speech bubbles) 520D can be displayed in the transcription panel 502B to display the untranslated textual speech of the speaker. The system may annotate the untranslated speech bubbles with a tag 550 indicating the spoken language that was detected being spoken. The tag 550 of spoken language can provide information to assist in translations as needed for other attendees/participants. As a speaker views the untranslated transcript of text in panel 502B, he may also annotate certain untranslated speech bubbles 520D with a tag 550 or other highlight for further review and possible correction at the end of his speech.


At will, the speaker can change from speaking one of the preselected languages to another one of the preselected languages during his/her presentation in the virtual conference or meeting held amongst the attendee electronic devices in the system. Systems and methods disclosed herein provide for detection of a speaker's spoken content leading to a correct text transcription of the content and a correct text translation of the text transcription representing the speaker's audio content. The language of the transcribed text in a plurality of untranslated speech bubbles 520D can change as the speaker continues to speak and the spoken language changes. The untranslated speech bubbles 520D displayed in the transcription panel 502B represent the best untranslated transcribed text generated by one or more of the plurality of transcription engines 113A-113D. The one or more untranslated speech bubbles 520D also represent the spoken language that is detected because the best untranslated transcribed text is most likely to be in the spoken language.


Referring now to FIG. 8A, the transcribing and translation system 100 disclosed herein receives audio content of a speaker and converts the audio into text using each of the languages that the speaker has preselected. This process yields a text transcription of the speaker's content in each of the different languages the speaker preselected. A plurality of preliminary transcripts 802A-802N are generated, one for each of the different preselected languages chosen by the speaker. A first preliminary transcript 802A is text transcribed using a first language (transcript language 1). A second preliminary transcript 802B is text transcribed using a second language (transcript language 2) differing from the first language. The Nth preliminary transcript 802N is text transcribed using an Nth different language (transcript language N). All of the transcribed text within a given preliminary transcript is in the same language. For example, English may be the first language such that all of the text in the first preliminary transcript is attempted to be transcribed into English text, whether or not the speaker speaks in English. Similarly, French may be the second language such that all of the text in the second preliminary transcript is attempted to be transcribed into French text even though the speaker can be speaking a different language at points in time. The plurality of preliminary transcripts 802A-802N are files temporarily stored in memory (e.g., DRAM, NVRAM) or a storage device (e.g., optical disk, magnetic disk, solid state storage device).
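The generation of one preliminary transcript per preselected language can be sketched as follows. The per-language transcriber callables stand in for transcription engines 113A-113D, and the function name and data shapes are illustrative assumptions; a real system would run the engines concurrently rather than in a loop:

```python
def build_preliminary_transcripts(audio_segments, transcribers):
    """Run every per-language transcriber over every audio segment,
    yielding one preliminary transcript (802A-802N) per language.

    `transcribers` maps a language code to a callable that attempts to
    transcribe an audio segment into text of that language.
    """
    transcripts = {lang: [] for lang in transcribers}
    for segment in audio_segments:
        for lang, transcribe in transcribers.items():
            # Every engine attempts the same segment, whether or not
            # the speaker is actually speaking that language.
            transcripts[lang].append(transcribe(segment))
    return transcripts
```

Each entry at the same index across the per-language lists corresponds to the same spoken sentence, which supports the synchronized comparison described below.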


Each of the transcription engines 113A-113D assigned to the given speaker tries to transcribe the spoken speech into untranslated text. For example, a plurality of first untranslated text can be generated by simultaneously transcribing a first spoken sentence into text of the different languages preselected by the speaker. The plurality of first untranslated text can be represented by a plurality of untranslated speech bubbles (SB 1A) 803A, 803B, 806A through 803N across the plurality of preliminary transcripts 802A-802N for each of the different preselected languages. A second spoken sentence is transcribed by each of the transcription engines 113A-113D assigned to the speaker to generate a plurality of second untranslated text in each of the plurality of preliminary transcripts 802A-802N represented by a plurality of second untranslated speech bubbles (SB 2A). The speaker can change spoken languages from one sentence to the next. For example, the first spoken sentence can be spoken in a first language while a second spoken sentence can be spoken in a second language. The system 100 continues simultaneously transcribing spoken audio content into text of different languages through to a last Nth spoken sentence. The transcription engines 113A-113D can generate a plurality of Nth untranslated text across each of the plurality of preliminary transcripts 802A-802N represented by a plurality of untranslated speech bubbles (SB nA) in the different languages.


After each transcription of text, the system synchronizes the plurality of different text transcriptions in the different preselected languages together so they can be compared and examined together. The system examines each of the text transcriptions and chooses the one of the plurality of text transcriptions that makes the most sense to be the official transcription. Each of the untranslated speech bubbles (SB 1A) 803A, 803B, 806A through 803N in each preliminary transcript 802A-802N are synchronized and examined together. For example, a server makes a determination that the untranslated speech bubble 806A in the preliminary transcript 802C makes the most sense for that sentence. Accordingly, the untranslated speech bubble 806A that is selected to be the best transcription represents the official transcription for that sentence. The system automatically makes a determination regarding which text transcription should be selected by using various techniques for judging the intelligibility of transcribed speech. Such techniques may include automatic grammar checkers or sentence vector techniques that attempt to measure coherence. Sentence vector techniques further involve machine learning models and artificial intelligence that create meaning vectors that can be used to judge the meaningfulness of the ordered words in a sentence. The transcription judged to have the highest meaningfulness is likely the official transcript and language, in comparison with others that may have transcribed gibberish or meaningless words.
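The selection of the candidate that "makes the most sense" can be sketched with a deliberately simple scoring heuristic: the fraction of a candidate's words found in a per-language word list. This toy scorer is only a stand-in for the grammar checkers and sentence-vector coherence techniques described above, and all names below are illustrative assumptions:

```python
def select_official_transcription(candidates: dict[str, str],
                                  word_lists: dict[str, set[str]]) -> tuple[str, str]:
    """Pick the (language, text) candidate judged most intelligible.

    `candidates` maps each preselected language to that engine's attempted
    transcription of the same sentence. Each candidate is scored by the
    fraction of its words found in that language's word list.
    """
    def score(lang: str, text: str) -> float:
        words = [w.strip(".,!?").lower() for w in text.split()]
        if not words:
            return 0.0
        known = sum(1 for w in words if w in word_lists.get(lang, set()))
        return known / len(words)

    best_lang = max(candidates, key=lambda lang: score(lang, candidates[lang]))
    return best_lang, candidates[best_lang]
```

A mismatched engine typically produces low-scoring gibberish, so the highest-scoring candidate identifies both the official transcription for the sentence and the language most likely being spoken.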


The system continuously makes transcriptions from the speaker's spoken content with each transcription engine 113A-113D that is started up for the speaker. The servers 102A-102B of the system continuously make comparisons and the selections 806A-806N of the best transcription from each preliminary transcript 802A-802N to be the official transcribed speech bubble (untranslated speech bubble) of each speaker. The one or more servers 102A-102B store the selections 806A-806N of the best transcription in an official speaker transcript (master transcript) 804A. The official speaker transcript (master transcript) is a file stored in memory of a memory device or storage of a storage device that can be subsequently reviewed by the speaker. The transcribed text in the official speaker transcript (master transcript) 804A can be text in different languages.


The language of the chosen transcript is deemed to be the language of the speaker for at least that particular passage (e.g., sentence) or portion of the presentation or session. The chosen transcript 806A-806N, representing the best transcription from each of the preliminary transcripts 802A-802N, can become the official speaker transcript (master transcript) 804A for the given speaker and may be used for translation into the various languages. The selected transcript portions 806A-806N that make the most sense are how the speaker's language, and any changes to it during a session, can be detected. The transcript language that generated the chosen transcript 806A-806N also determines each presenter's language being spoken during those specific untranslated speech bubbles. Accordingly, knowing the spoken language from the transcription process, the one or more translation engines 112A-112D shown in FIG. 1B started up by the servers can perform translations. Based on the detected spoken language, the translation engines 112A-112D can then translate the various selected transcribed text 806A-806N in the master speaker transcript 804A into the various languages that are chosen by the audience. As the presenter's (speaker's) language changes and is detected, the one or more translation engines 112A-112D can adapt to the change in language. The translation engines alter their translation based on the detected language being input, so that the language output for the audience (attendees) remains constant. As shown in FIG. 7A, the translated transcribed text can be shown to the reader/listener audience by one or more translated speech bubbles 520E in the panel 502A that is displayed in the GUI 155 on the display device 153.


Referring now to FIGS. 8A-8C, the systems and methods disclosed herein can support a plurality of presenters speaking different languages. Consider that each of FIGS. 8A-8C represents a different presenter (speaker) in the same session or meeting. For example, the preliminary transcripts 802A-802N and the master speaker transcript 804A shown in FIG. 8A can be generated for a first speaker that speaks his/her audio content during the session in a first plurality of different languages. The preliminary transcripts 812A-812N and master speaker transcript 804B shown in FIG. 8B can be generated for a second speaker that speaks his/her audio content during the session in a second plurality of different languages. The preliminary transcripts 822A-822N and the master speaker transcript 804C shown in FIG. 8C can be generated for a third speaker that speaks his/her audio content during the session in a third plurality of different languages.


The numerous preliminary transcripts 802A-802N, 812A-812N, 822A-822N for each presenter (speaker) can be formed based on the plurality of presenters, the multitude of different languages each speaks, and the transcription engines that are used. The master speaker transcripts 804A-804C can be formed by selecting the portions 806A-806N, 816A-816N, 826A-826N of each preliminary transcript 802A-802N, 812A-812N, 822A-822N that make the most sense. The selected transcript portions that make the most sense also determine each presenter's language being spoken during that specific speech bubble at that moment in time. The language that the transcription engine used to transcribe the selected transcript portion determines, with a high likelihood, the language used by the speaker for the spoken content.


Referring now to FIG. 9, while the master speaker transcripts 804A-804C are being formed for each speaker, a composite meeting transcript 904 for the entire session can also be formed. The composite meeting transcript 904 can be generated automatically by the system in the different languages spoken. The selected portions 806A-806N, 816A-816N, 826A-826N of the master speaker transcripts 804A-804C are used to form a composite meeting transcript 904 for the entire session. The transcribed text is serially added to the preliminary transcripts, the master speaker transcripts, and the composite meeting transcript, in order, in near real time. Optionally, the system can form the preliminary speaker transcripts, the master speaker transcripts, and the composite meeting transcript chronologically according to timestamps that can be associated with the transcribed text. Upon user request, the composite meeting transcript 904 can be translated uniformly into a user chosen language.
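The chronological assembly of the composite meeting transcript can be sketched as below. This is an illustrative sketch under assumed data shapes: each master-transcript entry is taken to be a `(timestamp, speaker_id, text)` tuple, which is not a format specified by the disclosure.

```python
import heapq

# Sketch: merge per-speaker master transcripts into one composite
# meeting transcript ordered by timestamp. Entry format is an
# assumption: (timestamp_seconds, speaker_id, transcribed_text).

def build_composite(master_transcripts):
    """master_transcripts: list of per-speaker entry lists, each already
    in chronological order. Returns one timestamp-ordered transcript."""
    return list(heapq.merge(*master_transcripts, key=lambda e: e[0]))

speaker_a = [(0.0, "A", "Bonjour a tous."), (9.5, "A", "Merci beaucoup.")]
speaker_b = [(4.2, "B", "Good morning, everyone.")]

composite = build_composite([speaker_a, speaker_b])
# composite interleaves entries from A and B in timestamp order
```

`heapq.merge` assumes each per-speaker list is already sorted, which matches the serial, near-real-time accumulation described above.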


The master speaker transcripts and the composite transcript can be saved and stored for a period of time. A speaker may review his/her master speaker transcript with any annotations and then suggest changes to it that can also be reflected into the composite transcript. Like the master speaker transcripts 804A-804C, the composite transcript 904 can be translated as needed into a desired uniform language by the one or more translation engines. A number of words, phrases and sentences can be saved from the various transcripts for future context/glossary, such as part of the context 134 and the glossary 135 stored in memory 172 or other storage device. Otherwise, after a period of time, the master speaker transcripts and the composite transcript can be discarded.


Cloud Based Speech Synthesis

In addition to being able to read the translated transcribed text, the listener/reader can hear translated spoken content by a speech synthesis service provided by the server system or the client electronic devices. That is, the listener/reader can listen to the translated transcribed text from the speaker's spoken content at the same time as reading the translated transcribed text displayed in the GUI 155.


In FIG. 1B, the transcription engines 113A-113D can be called up by the one or more servers 102A-102B to transcribe the respective speech from the client devices 106A-106D in the respective languages being spoken. The translation engines 112A-112D of the servers 102A-102B of the system 100 can translate the untranslated transcription of the speaker's spoken content into one or more languages requested by the listener/reader attendees. To listen to spoken content in their requested language, the user or client electronic devices 106A-106D can generate the audio content from the translated transcription of the speaker's spoken content.


Referring now to FIG. 1E, the system 100E further includes one or more synthesis engines 114A-114D to generate synthesized speech from translated transcriptions. The servers 102A-102B of the system 100E can start up the synthesis engines 114A-114D to generate different digital audio content in the cloud from the different translated transcriptions of the speaker's spoken content. That is, the language of the digital audio content for listening is the same as that selected by the user for reading and forming the translated transcript. The digital audio content is transmitted from the servers, along with the translated transcribed text, to the client electronic devices 106A-106D.


With the coordination provided by the servers in the cloud, the synthesized audio can be better synchronized with the translated transcribed text when sent to the user or client electronic devices 106A-106D. The client electronic devices 106A-106D, instead of synthesizing, need only convert the digital audio content into analog signals, amplify, and drive a loudspeaker of the client electronic device for the user to listen while the GUI 155 synchronously displays the translated transcribed speech.


Block Lists and Correction Lists

As shown in FIG. 1D, glossaries for users or host organizations can be stored in memory or other storage device for use by a server 102 and its one or more transcription engines 113 and its one or more translation engines 112. As mentioned previously, the glossaries 135 in typed form can incorporate preferred interpretations of some proprietary or unique terms, spoken phrases and passages that are likely to be used in meeting sessions. These glossaries can be created and relied upon in developing context, creating transcripts, and performing translations for various audiences. The words, terms, spoken phrases and passages that boost recognition results in a transcript can be put in typed form on a boost list for each language in the glossary. The boost list can bias the automatic speech transcription (recognition) to the words, terms, phrases and passages on the list over those that may sound similar. For example, people's names or product names, that are not part of ordinary speech patterns may be placed on the boost list.


Boost lists are a type of a positive list that can only bias artificial intelligence (AI) language models so much. Block lists (stop lists), a form of negative list, can be used to adjust (filter) output results from the artificial intelligence model. Correction lists, a form of a replacement list, can be further used to adjust the output results from an artificial intelligence model in transcribing and translating speech. Boost lists, block lists, and correction lists can be included in a glossary for each language that is spoken during a meeting. A private glossary may be used for boost lists, block lists, and correction lists that are to remain confidential and unavailable to the public.


Block (stop) lists consisting of typed offensive words, terms, phrases, and passages for each language in the glossary can be generated and used to prohibit those items from being displayed in a transcript and its speech bubbles. Subsequently, the block lists also prohibit such speech from being synthesized (text to speech synthesis) from the transcript and made audible (machine spoken) to users in their selected or native language.


Furthermore, correction lists of common recognition/transcription and/or translation mistakes can be generated for each language in the glossary. The typed error and replacement words, terms, phrases and passages on the correction list can then be used to correct a transcript and/or translated transcript in the respective languages in order to correct the common mistakes made in automated (machine) speech recognition/transcription and/or automated translation, thereby providing more accurate transcribed and transcribed-translated results to users.
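A correction list can be applied as a simple error-to-replacement substitution pass over the transcript. The sketch below is illustrative only; the error/replacement pairs are drawn from the examples discussed later in this disclosure, and the function name is hypothetical.

```python
import re

# Sketch: apply a correction list to a transcript. Each entry maps a
# common mis-transcription to its replacement.

CORRECTIONS_EN = {
    "Aboo Nazi": "Abunassi",  # mis-transcribed personal name
    "Zion": "Xeon",           # mis-transcribed microprocessor chip name
}

def apply_corrections(text, corrections):
    for error, replacement in corrections.items():
        # Whole-word match so "Zion" does not fire inside longer words.
        text = re.sub(r"\b%s\b" % re.escape(error), replacement, text)
    return text

fixed = apply_corrections("The Zion chip was presented by Aboo Nazi.",
                          CORRECTIONS_EN)
# fixed == "The Xeon chip was presented by Abunassi."
```

The same routine can be run on a translated transcript with a correction list in the reader's language.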


The boosting, filtering, and replacing with the use of the lists in the glossary for each selected and recognized language are performed in real time by the system. Boosting is performed by the transcription engines with the boost lists in recognizing speech in the speech to text process in the various languages. The filtering and replacing can be performed after the transcription by a transcription engine prior to display of a transcription and/or after translation by a translation engine prior to display of a translated transcription. The filtering and replacing are performed after transcription is performed by a transcription engine for the readers/listeners/participants that chose to read in the speaker's language, such that no translation is needed for them. For the readers/listeners/participants that chose to read in a language differing from the speaker's language, the filtering and replacing are performed after translation is performed by the respective one or more translation engines, with the improved results appearing in the translated transcription. This is so the filtering and replacing processes do not affect the raw transcribed input into each of the one or more translation engines and interfere with output translation results.
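The ordering described above can be sketched as a small dispatch function. This is a hypothetical sketch: `translate` and `apply_lists` are placeholder callables standing in for the translation engines and the block/correction list pass, and are not names from the disclosure.

```python
# Sketch: lists are applied to the reader's displayed text only,
# never to the raw transcription that feeds the translation engines.
# translate() and apply_lists() are hypothetical placeholders.

def text_for_reader(raw_transcript, speaker_lang, reader_lang,
                    translate, apply_lists):
    if reader_lang == speaker_lang:
        # Same language: filter/replace directly after transcription.
        return apply_lists(raw_transcript, reader_lang)
    # Different language: translate the *unfiltered* transcript first,
    # then filter/replace in the reader's language.
    translated = translate(raw_transcript, speaker_lang, reader_lang)
    return apply_lists(translated, reader_lang)
```

Keeping `apply_lists` out of the translation path preserves the raw transcription as input to the translation engines, as the passage above requires.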


Boost, block, and correction lists can be generated for each language that is spoken by speakers and selected to be read in a transcript by attendees/readers/listeners. Each user can have their own glossary with boost, block, and correction lists for their respective native language and others that they may be able to speak or read during a meeting session with the system. A glossary, including boost, block, and correction lists, can be associated with a meeting session setup by a host user of the system.


The block lists are used after the transcription/translation in the respective languages to block and replace the words, terms, and phrases with replacement marks, so they are not displayed to the users of the system. In other cases, a sentence with words on the list may be transcribed/translated into a sentence that does not include the word, term, or phrase on the list. The correction lists are used after the transcription/translation to correct common mistakes made in automated transcription and translation by the system to provide improved results. The boost lists, on the other hand, are used in advance of transcription/translation to assist in automated transcription and translation by the system to provide improved results.


In one embodiment, the users generate their own boost, block, and correction lists of typed words, terms or phrases in one or more languages. In another embodiment, the system provides initial boost, block, and correction lists of initial typed words, terms or phrases in a plurality (N, where N is a positive integer) of common languages (e.g., French, German, Spanish, Portuguese, Farsi, Persian, Japanese, Chinese (Mandarin, Wu), Hindi, Urdu, Punjabi, Javanese, Telugu, Turkish, Korean, Arabic, Bengali, Russian, Indonesian, Marathi, and English) upon which the users can build further to generate their own customized lists in the one or more languages that are used in their meeting sessions. For example, the initial block lists for each language can be words, phrases, and terms of profanity. A user can improve upon the initial block lists for each language by including words, phrases, and terms that are culturally sensitive. For example, the word GOD may be sensitive in some cultures and included on the block list by a user to avoid it being displayed during a meeting session.


Referring now to FIG. 10A, a glossary 1000A is shown as one example of the glossary 135. A glossary may include a set of languages (e.g., Language 1 and Language 2) expected to be spoken during a meeting session. Alternatively, a glossary may include N languages of the common languages that are spoken in the world, wherein N is a positive integer. The glossary 1000A has a plurality of lists, boost lists 1004 and block lists 1005, for each language 1002 of one or more languages 1002A-1002N. Lists for a first language, Language 1 1002A, include a boost list 1004A and a block list 1005A of words, terms or phrases typed in the first language. Lists for a second language, Language 2 1002B, include a boost list 1004B and a block list 1005B of words, terms or phrases typed in the second language. Lists for an Nth language, Language N 1002N, include a boost list 1004N and a block list 1005N of words, terms or phrases typed in the Nth language.


Referring now to FIG. 10B, a glossary 1000B is shown as another example of the glossary 135. A glossary may include a set of languages (e.g., Language 1 and Language 2) expected to be spoken during a meeting session. Alternatively, a glossary may include N languages of the common languages that are spoken in the world. The glossary 1000B has a plurality of lists, boost lists 1004, block lists 1005, and correction lists 1006 for each language 1002 of one or more languages 1012A-1012N. Lists for a first language, Language 1 1012A, include a boost list 1004A, a block list 1005A, and a correction list 1006A of words, terms or phrases typed in the first language. Lists for a second language, Language 2 1012B, include a boost list 1004B, a block list 1005B, and a correction list 1006B of words, terms or phrases typed in the second language. Lists for an Nth language, Language N 1012N, include a boost list 1004N, a block list 1005N, and a correction list 1006N of words, terms or phrases typed in the Nth language.
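One way the per-language glossary of FIG. 10B could be modeled in software is sketched below. The structure and field names are illustrative assumptions, not a format specified by the disclosure.

```python
from dataclasses import dataclass, field

# Sketch: per-language boost, block, and correction lists grouped
# under one glossary, mirroring FIG. 10B. Names are illustrative.

@dataclass
class LanguageLists:
    boost: list = field(default_factory=list)        # terms to bias recognition toward
    block: set = field(default_factory=set)          # terms to filter from display
    corrections: dict = field(default_factory=dict)  # error -> replacement

@dataclass
class Glossary:
    languages: dict = field(default_factory=dict)    # language code -> LanguageLists

    def lists_for(self, language):
        return self.languages.setdefault(language, LanguageLists())

glossary = Glossary()
glossary.lists_for("fr").block.add("merde")
glossary.lists_for("en").corrections["Zion"] = "Xeon"
```

A glossary instance like this could be stored per user login ID and per meeting type, as described below.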


In FIGS. 10A-10B, the plurality of lists are organized by languages into a single glossary. The plurality of lists can also be organized differently and stored differently in a storage device. In another embodiment, the plurality of lists can be organized under one or more glossaries 135, one glossary for each language of the one or more languages expected to be used in a meeting. In another embodiment, the plurality of lists can be organized under a plurality of glossaries 135, one glossary for each language of a plurality of commonly used languages. Initial lists of the plurality of lists of the plurality of commonly used languages can be generated and provided to users. Initial boost lists, initial block lists, and initial correction lists can be generated and provided to users with standard words, terms or phrases to improve speech recognition and transcription results. When a user logs into the system software, the user can edit his/her lists within one or more glossaries. The one or more glossaries and the plurality of lists are stored and maintained in the server associated with each user's login identification (ID). A user glossary can be associated with the type of meeting session, such as a medical or computer type of meeting session, so that the lists differ for each type.



FIG. 11A illustrates an example block list 1100A typed in the English language. The block list 1100A includes offensive words, terms or phrases typed in the English language that can be used to filter or block such speech from being displayed in the transcript windows displayed to users. The offensive words, terms or phrases can be English swear or cuss words, or English vulgarities that are commonly known. The offensive words, terms or phrases can also be English medical words, such as penis or vagina, that would be inappropriate in non-medical meetings. The offensive words, terms or phrases can also be English ethnic slurs, English religious slurs, or English racist words that are commonly known. While typed in English, block lists can be formed and typed for other languages that may be used in a meeting.



FIG. 11B illustrates an example block list 1100B typed in the French language. The block list 1100B includes offensive words, terms or phrases typed in the French language that can be used to filter or block such speech from being displayed in the transcript windows displayed to users. The offensive words, terms or phrases can be French swear or cuss words, or French vulgarities that are commonly known. The offensive words, terms or phrases can also be French medical words that would be inappropriate in non-medical meetings. The offensive words, terms or phrases can also be French ethnic slurs, French religious slurs, or French racist words that are commonly known. While typed in French, block lists can be formed and typed for other languages that may be used in a meeting.


Instead of displaying the offensive words, terms or phrases on a display device, the offensive words, terms or phrases that are recognized in a transcript can be replaced in real time with a sequence or series of a non-letter character (e.g., pound or number sign #), such as #######, for display on a display device. For example, if the word “hell” is on the block list, the sentence “What the hell are you doing?” may instead be transcribed/translated and displayed as “What the #### are you doing?”. In other cases, the sentence generated by transcription, or generated by translation of the transcription, can be reinterpreted to avoid the use of the offensive word, term or phrase in the sentence. For example, if the word “hell” is on the block list, the sentence “What the hell are you doing?” may be translated or transcribed instead to display “What is going on?”.
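The mask-style replacement described above can be sketched as follows. This is a minimal illustrative sketch; the block list contents and function name are assumptions for the example.

```python
import re

# Sketch: block-list filtering that replaces a blocked word with a run
# of non-letter replacement marks of the same length (e.g., "####").

BLOCK_LIST_EN = {"hell"}

def apply_block_list(text, block_list, mark="#"):
    def mask(match):
        return mark * len(match.group(0))
    for word in block_list:
        text = re.sub(r"\b%s\b" % re.escape(word), mask, text,
                      flags=re.IGNORECASE)
    return text

masked = apply_block_list("What the hell are you doing?", BLOCK_LIST_EN)
# masked == "What the #### are you doing?"
```

Passing a different `mark` (e.g., `*` or `@`) yields the alternative replacement indicators discussed below, and matching the mask length to the blocked word keeps the sentence layout intact.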



FIG. 14 illustrates a diagram of a graphical user interface (GUI) 155 displayed on a monitor or a display device 153. The GUI 155 supports the improved transcription and translation using the block lists and the correction lists in the glossary for the various languages. The GUI 155 displays English speech bubbles 1420A-1420C after a French transcription by a transcription engine recording and transcribing a French speaker and a translation from French to English by a translation engine. Alternatively, the GUI 155 can display French speech bubbles 1421A and 1421B or 1421B′ after French speaker recognition and transcription of the spoken words and sentences by a French speaker.


If the French speaker speaks an offensive word, term, or phrase, such as merde (shit) 1404, the transcription engine of the system can use a French block list to filter out or block that offensive word, term, or phrase if it is on the list. If the offensive word, term, or phrase is on the French block lists for a French speaker, then the blocking/replacement carries forward over to translation. However, if the offensive word, term, or phrase is not on the French block lists for a French speaker, other block lists for other languages for the listeners can be used to filter or block it if it is on the other language block list. For example, assuming there are English readers/listeners in the same meeting, if the French speaker speaks an offensive word, term, or phrase, such as merde (shit) 1404, the translation engine of the system can use an English block list to filter out or block the offensive word, term, or phrase if it is on the list. For example, the word merde 1404 was not on a French block list and is displayed in the speech bubble 1421 in the sentence “Vous êtes plein de merde.” However, if the translated word “shit” is on an English block list, it can be blocked/filtered out after the translation into English using the block list. Assuming so, the word “shit” is removed from the transcript and can be replaced with a replacement indicator 1402. Then, the sentence “You are full of ######” can be displayed in the speech bubble 1420A to avoid offending the reader/listener. The replacement indicator 1402 indicates the missing offensive word, term, or phrase from what was spoken.


In this example, the replacement indicator (marks) 1402 is a series of a plurality of pound symbols ######. However, other non-letter characters or symbols (e.g., !!!, ***, $$$, @@@, ˜˜˜, +++) can be used to highlight the replacement and missing speech that was blocked using the block list. In another embodiment, the replacement indicator (marks) 1402 is not used; the words or phrases are simply removed, leaving spaces, so nothing is shown to the reader but gaps between words or between words and punctuation.


Referring now to FIG. 12A, a correction list 1200A in the English language is shown. The correction list includes the expected transcription or translation error and the respective replacement that is to be made for the expected error. The expected error from the automated transcription is often misheard words, misspelled names, or homophones, for example. The expected error from automated translation is often a mistranslation of geographic destinations or names for example.


In FIG. 12A, the name Abunassi, for example, is often mis-transcribed as Aboo Nazi in English. The name may be that of a guest speaker, a host, or a company that participates in the given meeting session. In order to correct the mis-transcription error, the error is added to the correction list 1200A so that it can be corrected. As another example, the microprocessor chip name Xeon, when spoken, is often mis-transcribed in the English language as Zion. The error, Zion, and the replacement, Xeon, are placed on the correction list of the user or host of the meetings so that the error can be corrected in real time during meetings that discuss the microprocessor chip. Errors can also be made when speakers of other languages speak during a meeting session and their speech undergoes transcription or speech recognition in the foreign language and translation into another language.


Referring now to FIG. 12B, a correction list 1200B is shown for words transcribed into the French language and/or translated into the French language. For example, a speaker may say “merci beaucoup” (thank you) and it is mis-transcribed in error as “merci beau cul”. It is desirable to correct the transcription error before displaying the French transcript of the French speaker to users in the meeting. As another example, a speaker may say “poisson” (fish) and it is mis-transcribed in error as “poison”. As another example, a speaker may say “baisser” (to lower) and it is mis-transcribed in error as “baiser”. In each case, the error and its respective replacement are placed on the correction list associated with the user or host of the meetings so that it can improve the transcription and/or translation of speech in a meeting in real time.


Referring now to FIG. 14, as an example of correction using the correction list, the spoken word poisson is mis-transcribed as poison 1406 in the spoken sentence “Manger du poisson”. The mis-transcribed word is shown typed in the sentence “Manger du poison” in the speech bubble 1421B. With a correction list, the mis-transcribed word is corrected and instead shown properly to the user in the sentence in the speech bubble 1421B′. With the proper word, a translation engine can properly translate the sentence in the speech bubble 1421B′ into “Eat the fish” shown in the speech bubble 1420A. Accordingly, the correction lists in the various languages can improve transcription and translation of spoken words during a meeting.


Referring now to FIG. 13A, a boost list 1300A in the English language is shown. A boost list includes spoken words, spoken terms, spoken phrases, spoken passages, spoken abbreviations and spoken acronyms in typed form that can be spoken (used) in a meeting session. The goal of a boost list is to boost recognition results in the transcript during speech recognition. Unique terms, such as English medical terms used in a meeting, can be provided on the boost list to improve recognition. For example, osteochondrodysplasia can be provided in the boost list to help the transcription engines recognize the word when spoken. Another example boost word on the boost list is gastrology. Other examples of English boost words for a medical related meeting are hypoglycemia, hyperthermia, and hemorrhage.
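One way boost-list biasing can work is sketched below. This is an illustrative assumption, not the disclosed implementation: real transcription engines typically accept phrase hints directly, but here we assume the engine returns several alternative hypotheses with confidences (an n-best list) and rescore them with a bonus for each boost-list term they contain. The bonus value and example hypotheses are hypothetical.

```python
# Sketch: boost-list biasing via n-best rescoring. Assumes the engine
# returns (hypothesis, confidence) pairs; each boost-list term found
# in a hypothesis adds a fixed bonus to its score.

BOOST_LIST_MEDICAL = {"osteochondrodysplasia", "hypoglycemia", "hemorrhage"}

def rescore(nbest, boost_terms, bonus=0.1):
    def boosted(item):
        hypothesis, confidence = item
        hits = sum(1 for term in boost_terms if term in hypothesis.lower())
        return confidence + bonus * hits
    return max(nbest, key=boosted)

nbest = [
    ("the patient has osteo control displays yeah", 0.62),
    ("the patient has osteochondrodysplasia", 0.58),
]
best, _ = rescore(nbest, BOOST_LIST_MEDICAL)
# best == "the patient has osteochondrodysplasia"
```

The boost nudges the selection toward the hypothesis containing the rare medical term even though its raw confidence was lower.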


With abbreviations and acronyms, the expanded form can also be provided on the boost list that can be used for replacement in the transcript. In FIG. 13A, for example, the abbreviation DB for the phrase data base can be provided for replacement in the transcript. A boost list can be generated by a user for each language expected to be spoken during a meeting.
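The expansion replacement can be sketched as a whole-word substitution over the transcript. This is a minimal illustrative sketch using the DB example above; the function name is an assumption.

```python
import re

# Sketch: expand spoken abbreviations from the boost list into their
# long form in the transcript, e.g., "DB" -> "data base".

EXPANSIONS_EN = {"DB": "data base"}

def expand_abbreviations(text, expansions):
    for abbr, expansion in expansions.items():
        text = re.sub(r"\b%s\b" % re.escape(abbr), expansion, text)
    return text

expanded = expand_abbreviations("Load the records from the DB.", EXPANSIONS_EN)
# expanded == "Load the records from the data base."
```

A per-language expansion table of this kind can sit alongside the boost list for each language a user expects to speak.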


Referring now to FIG. 13B, a boost list 1300B in the French language is shown. An abbreviation spoken by a French speaker in a medical meeting can be ADN, for example, which refers to the French phrase acide désoxyribonucléique (DNA). The abbreviation and expanded French replacement words are provided on the boost list so they can be transcribed (speech recognition) more efficiently into a French transcription. Another abbreviation example is AOC and its expansion appellation d'origine contrôlée shown in FIG. 13B. Unique French medical terms, such as saigner, glycémie, asphyxier, cicatriser, and hyperventiler, can be placed on the boost list to better recognize spoken words in a medical related meeting.


The glossary generated by one user can be shared with another user by email or by the login ID that is used with the cloud based transcription/translation system. Alternatively, selected lists on the glossary that have been generated by one user can be shared with another user by email or by the login identification (ID). The cloud based transcription/translation system can provide initial lists and initial glossary(ies) of initial words, initial phrases, and initial terms to a new user from which to start editing and adding. The user selectively turns on blocking, correcting, and/or boosting using the respective block lists, correction lists, and boost lists using a control button/command and provides the respective lists in the glossary(ies).


In FIG. 14, a control panel 1401A is provided below the viewable window 1400A in the GUI 155. The control panel 1401A includes a plurality of control buttons 1410A-1410D and one or more pull down menus 1430. The one or more pull down menus 1430 provide a means for selecting the languages that a user may desire to be spoken, transcribed, and translated. The one or more pull down menus 1430 in the GUI can be used to select languages in which to generate or edit the boost, block, and correction lists. One or more control buttons 1410A-1410D in the GUI can be used to selectively turn on and off the boost/block/correction filtering rules that utilize the respective lists to improve the transcription and translation. The server in the cloud, with its transcription engines and translation engines, performs the desired boosting, filtering, and replacement with the respective lists in the respective language being used by each.


A user setting up a meeting as the host can select the lists in the glossary that are used for a meeting session with the automated transcription/translation system. A speaker can also set up his/her own lists in the glossary for their chosen one or more languages. If a speaker misspeaks an offensive word on the speaker's block list, it is avoided in the transcription by the transcription engine of the server. Any translation engine would not see it, and therefore any translation thereof would not include it either. A reader/listener/participant in a meeting can also set up his/her own lists in the glossary for their chosen one or more languages. If a speaker misspeaks an offensive word and it is not on the speaker's block list for a language, but it is on the reader/listener/participant block list for a different language, the transcription engine can transcribe the offensive word, term, or phrase, but the translation engine of the server can use the block list to filter it out from the translated transcription. If neither the speaker nor the reader/listener/participant has lists generated, which is often the case, the host's glossary and its lists are used by the transcription engines and translation engines of the server to filter out words on the block lists, replace words on the correction lists, and boost words on the boost lists in each respective language.


Advantages

There are a number of advantages to the disclosed transcribing and translating system. Unnecessarily long meetings and misunderstandings among participants can be made fewer using the systems and methods provided herein. Participants that are not fluent in other participants' languages are less likely to be stigmatized or penalized. Invited persons, who might otherwise be less inclined to participate because of language shortcomings, can participate in their own native language, enriching their experience. The value of their participation to the meeting is also enhanced because everyone, in the language(s) of their choice, can read the meeting transcript in real time while concurrently hearing and speaking in the language(s) of their choice. Furthermore, the systems and methods disclosed herein eliminate the need for special headsets, sound booths, and other equipment to perform translations for each meeting participant.


As a benefit, meetings can be shorter and less frequent through use of the systems and methods provided herein. Meetings can also have an improved overall tenor, as the flow of a meeting is interrupted less frequently by language problems and the need for clarifications and corrections. Misunderstandings among participants can be fewer and less serious.


Participants who are not fluent in other participants' languages are less likely to be stigmatized, penalized, or marginalized. Invited persons who might otherwise be reluctant to participate because of language differences can participate in their own native language, enriching their experience and enabling them to add greater value.


The value to others of participation by such previously hesitant participants is also enhanced, as they can read the meeting transcript in their chosen language in near real time while hearing and speaking in that language as well. The need for special headsets, sound booths, and other equipment for language translation by a human interpreter is eliminated.


The graphical user interfaces can be used to inform the system, from client to server, of two or more languages that multiple speakers may speak during a virtual conference or meeting. The server can then be ready in advance to transcribe multiple languages. A speaker can use an input device to select an icon or button in the graphical user interface to notify the system of a swap in the language being spoken. Alternatively, the system can automatically detect the language presently being spoken by each speaker and select the proper transcription engine. In this manner, the server can ready translation engines for the readers/listeners to translate from the appropriate transcribed text and language, so that the readers/listeners can continue to read or listen to the spoken content in a language different from that spoken by the speakers, with minimal lag.
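The language-swap notification described above can be sketched as a small server-side handler. This is an illustrative sketch only; the class, message fields, and return value are hypothetical and assume languages are declared in advance through the GUI.

```python
# Hypothetical sketch of the language-swap flow described above.

class MeetingSession:
    """Tracks which declared language is currently being spoken."""

    def __init__(self, declared_languages):
        # Languages declared in advance via the GUI, so the server can
        # ready transcription/translation engines before any swap.
        self.declared = set(declared_languages)
        self.active = next(iter(declared_languages))

    def on_language_swap(self, speaker_id, new_language):
        """Handle a swap notice sent when the speaker selects the
        language icon/button in the client GUI."""
        if new_language not in self.declared:
            raise ValueError(f"{new_language} was not declared for this meeting")
        self.active = new_language
        # Direct the transcription engine to the newly active language.
        return f"transcribe {speaker_id} in {new_language}"
```

Because the candidate languages are known before the meeting starts, the swap is a cheap lookup rather than an engine cold start, which is what keeps the lag for readers/listeners minimal.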


Closing

The embodiments are thus described. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that the embodiments are not limited to the specific constructions and arrangements shown and described, since various other modifications can occur to those ordinarily skilled in the art.


When implemented in software, the elements of the disclosed embodiments are essentially the code segments to perform the necessary tasks. The program or code segments can be stored in a processor readable medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication link. The “processor readable medium” can include any medium that can store information. Examples of the processor readable medium include an electronic circuit, a semiconductor memory device, a read only memory (ROM), a flash memory, an erasable programmable read only memory (EPROM), a floppy diskette, a CD-ROM, an optical disk, a hard disk, a fiber optic medium, a radio frequency (RF) link, etc. The computer data signal can include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments can be downloaded using a computer data signal via computer networks such as the Internet, Intranet, etc. and stored in a storage device (processor readable medium).


Some portions of the preceding detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the tools used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.


It should be kept in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. A computer “device” includes computer hardware, computer software, or a combination thereof.


While this specification includes many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations of the disclosure. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations, separately or in sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination can be directed to a sub-combination or variations of a sub-combination. Accordingly, while embodiments have been particularly described, they should not be construed as limited by such disclosed embodiments.

Claims
  • 1. A method carried out by one or more servers for managing a cloud-based meeting involving multiple languages, the method comprising: receiving, from a first client device, a preselection of a language for a first speaker; assigning one or more glossaries to the first speaker, wherein the one or more glossaries for the first speaker configure one or more servers to boost, filter, or replace transcribed terms transcribed for the first speaker; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into text in the language of the first speaker in a transcript; and generating first altered text by altering, according to the one or more glossaries, the text of the transcript in the language of the first speaker.
  • 2. The method of claim 1, wherein the one or more glossaries includes a set of languages expected to be spoken during the cloud-based meeting.
  • 3. The method of claim 1, wherein the one or more glossaries includes N languages of common languages that are spoken in the world, N being a positive integer.
  • 4. The method of claim 2, wherein the one or more glossaries includes a plurality of lists for each language in the set of languages, and each list of the plurality of lists for each language is at least one of: a boost list including one or more of spoken words, spoken terms, spoken phrases, spoken passages, spoken names, spoken abbreviations with expansions, and spoken acronyms with expansions expected to be spoken during a meeting that can be used for replacement in the transcript to improve recognition results; a block list including one or more of offensive spoken words, offensive spoken terms, offensive spoken phrases, and offensive spoken passages expected to be spoken during a meeting that can be used to filter out from the transcript to avoid offending readers and listeners of speech synthesis; or a correction list including one or more typed error words, terms, phrases and passages and their replacement for correcting common mistakes in a transcript or a translated transcript.
  • 5. The method of claim 4, wherein each list of the plurality of lists for each language is an initial list with standard words, terms, or phrases to improve speech recognition or transcription results.
  • 6. The method of claim 4, wherein at least one list of the plurality of lists for each language is a user edited list to improve speech recognition or transcription results, and wherein the one or more glossaries are stored in a server associated with at least one login identification.
  • 7. The method of claim 1, wherein the generating the altered text comprises: identifying one or more offensive words, terms, or phrases in text by using the one or more glossaries; and replacing the one or more offensive words, terms, or phrases with non-letter characters.
  • 8. The method of claim 1, further comprising: transmitting, to the first client device, the altered text for display in a speech bubble.
  • 9. The method of claim 1, further comprising: receiving, from a second client device, a preselection of a language for a second speaker; assigning one or more glossaries to the second speaker, wherein the one or more glossaries for the second speaker configure the one or more servers to boost, filter, or replace transcribed terms transcribed for the second speaker; receiving, from a microphone at the second client device, second audio content which originated from the second speaker; transcribing the second audio content for the second speaker into text in the language of the second speaker; and generating second altered text by altering, according to the one or more glossaries, the text in the language of the second speaker.
  • 10. A method carried out by one or more servers for managing a cloud-based meeting involving multiple languages, the method comprising: receiving, from a first client device, a preselection of a first language for a first speaker; receiving, from a second client device, a preselection of a second language for a first listener/reader, the second language differing from the first language; assigning a first glossary to the first speaker and a second glossary to the first listener/reader, wherein the first glossary and the second glossary configure one or more servers to boost, filter, or replace transcribed terms transcribed for the first speaker and translated terms translated for the first listener/reader; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into transcribed text in the first language of the first speaker; translating the transcribed text in the first language into translated transcribed text in the second language for the first listener/reader into a transcript; generating first altered text by altering, according to a first glossary, the transcribed text in the first language of the first speaker; and generating second altered text by altering, according to a second glossary, the translated transcribed text in the second language for the first listener/reader.
  • 11. The method of claim 10, wherein the first glossary and the second glossary includes a set of languages expected to be spoken during the cloud-based meeting.
  • 12. The method of claim 10, wherein each of the first glossary and the second glossary includes N languages of common languages that are spoken in the world, N being a positive integer.
  • 13. The method of claim 11, wherein each of the first glossary and the second glossary includes a plurality of lists for each language in the set of languages, and each list of the plurality of lists for each language is at least one of: a boost list including one or more of spoken words, spoken terms, spoken phrases, spoken passages, spoken names, spoken abbreviations with expansions, and spoken acronyms with expansions expected to be spoken during a meeting that can be used for replacement in the transcript to improve recognition results; a block list including one or more of offensive spoken words, offensive spoken terms, offensive spoken phrases, and offensive spoken passages expected to be spoken during a meeting that can be used to filter out from the transcript to avoid offending readers and listeners of speech synthesis; or a correction list including one or more typed error words, terms, phrases and passages and their replacement for correcting common mistakes in a transcript or a translated transcript.
  • 14. A system for managing a cloud-based meeting involving multiple languages, the system comprising: at least one server device including a processor device and a memory device coupled to the processor device, wherein the memory device stores a first glossary, a second glossary, and an application that configures the server device to perform: receiving, from a first client device, a preselection of a first language for a first speaker; assigning a first glossary to the first speaker, wherein the first glossary configures one or more servers to boost, filter, or replace transcribed terms transcribed for the first speaker; receiving, from a microphone at the first client device, first audio content which originated from the first speaker; transcribing the first audio content for the first speaker into transcribed text in the first language of the first speaker; and generating first altered text by altering, according to the first glossary, the transcribed text in the first language of the first speaker.
  • 15. The system of claim 14 for managing a cloud-based meeting, wherein: the application further configures the server device to perform: receiving, from a second client device, a preselection of a second language for a first listener/reader, the second language differing from the first language; assigning a second glossary to the first listener/reader, wherein the second glossary configures one or more servers to boost, filter, or replace translated terms translated for the first listener/reader; translating the transcribed text in the first language into translated transcribed text in the second language for the first listener/reader; and generating second altered text by altering, according to the second glossary, the translated transcribed text in the second language for the first listener/reader.
  • 16-21. (canceled)
CROSS REFERENCE TO RELATED APPLICATIONS

This United States (U.S.) patent application is a non-provisional application claiming the benefit of U.S. provisional patent application No. 63/424,906, titled SYSTEMS, METHODS, AND APPARATUS TO AVOID OFFENSIVE WORDS, TERMS AND PHRASES AND PROVIDE IMPROVED TRANSCRIBED AND TRANSLATED RESULTS filed on Nov. 12, 2022, by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety, for all intents and purposes. This United States (U.S.) patent application is also a continuation in part and claims the benefit of U.S. patent application Ser. No. 17/752,826, titled SYSTEMS, METHODS, AND APPARATUS FOR DETERMINING AN OFFICIAL TRANSCRIPTION AND SPEAKER LANGUAGE FROM A PLURALITY OF TRANSCRIPTS OF TEXT IN DIFFERENT LANGUAGES filed on May 24, 2022, by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety, for all intents and purposes. U.S. patent application Ser. No. 17/752,826 is a continuation in part and claims the benefit of U.S. patent application Ser. No. 17/750,345, titled SYSTEMS, METHODS, AND APPARATUS FOR NOTIFYING A TRANSCRIBING AND TRANSLATING SYSTEM OF SWITCHING BETWEEN SPOKEN LANGUAGES filed on May 21, 2022, by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety, for all intents and purposes. U.S. patent application Ser. No. 17/752,826 claims the benefit of U.S. provisional patent application No. 63/192,264 filed on May 24, 2021, titled DETERMINING SPEAKER LANGUAGE FROM TRANSCRIPTS OF PRESENTATION filed by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety, for all intents and purposes. U.S. patent application Ser. No. 17/750,345 is a continuation in part and claims the benefit of U.S. patent application Ser. No. 
17/736,941, titled SYSTEMS, METHODS, AND APPARATUS FOR SWITCHING BETWEEN AND DISPLAYING TRANSLATED TEXT AND TRANSCRIBED TEXT IN THE ORIGINAL SPOKEN LANGUAGE filed on May 4, 2022, by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety, for all intents and purposes. U.S. patent application Ser. No. 17/750,345 further claims the benefit of U.S. provisional patent application No. 63/163,981 filed on Mar. 22, 2021, titled SYSTEM AND METHOD OF NOTIFYING A TRANSLATION SYSTEM OF CHANGES IN SPOKEN LANGUAGE filed by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety, for all intents and purposes. U.S. patent application Ser. No. 17/736,941 is a continuation in part (CIP) claiming the benefit of U.S. patent application Ser. No. 16/992,489 filed on Aug. 13, 2020, titled SYSTEM AND METHOD USING CLOUD STRUCTURES IN REAL TIME SPEECH AND TRANSLATION INVOLVING MULTIPLE LANGUAGES, CONTEXT SETTING, AND TRANSCRIPTING FEATURES, by inventors Lakshman Rathnam et al, incorporated herein by reference for all intents and purposes. U.S. patent application Ser. No. 16/992,489 claims the benefit of U.S. Provisional Patent Application No. 62/877,013, titled SYSTEM AND METHOD USING CLOUD STRUCTURES IN REAL TIME SPEECH AND TRANSLATION INVOLVING MULTIPLE LANGUAGES, filed on Jul. 22, 2019 by inventors Lakshman Rathnam et al.; claims the benefit of U.S. Provisional Patent Application No. 62/885,892, titled SYSTEM AND METHOD USING CLOUD STRUCTURES IN REAL TIME SPEECH AND TRANSLATION INVOLVING MULTIPLE LANGUAGES AND QUALITY ENHANCEMENTS filed on Aug. 13, 2019 by inventors Lakshman Rathnam et al.; and further claims the benefit of U.S. Provisional Patent Application No. 62/897,936, titled SYSTEM AND METHOD USING CLOUD STRUCTURES IN REAL TIME SPEECH AND TRANSLATION INVOLVING MULTIPLE LANGUAGES AND TRANSCRIPTING FEATURES filed on Sep. 
9, 2019 by inventors Lakshman Rathnam et al., all of which are incorporated herein by reference in their entirety, for all intents and purposes. U.S. patent application Ser. No. 17/736,941 further claims the benefit of U.S. provisional patent application No. 63/157,595 filed on Mar. 5, 2021, titled SYSTEM AND METHOD OF TRANSFORMING TRANSLATED AND DISPLAYED TEXT INTO TEXT DISPLAYED IN THE ORIGINALLY SPOKEN LANGUAGE filed by inventors Lakshman Rathnam et al., incorporated herein by reference in its entirety for all intents and purposes.

Provisional Applications (7)
Number Date Country
63424906 Nov 2022 US
63192264 May 2021 US
62877013 Jul 2019 US
62885892 Aug 2019 US
63157595 Mar 2021 US
62897936 Sep 2019 US
63163981 Mar 2021 US
Continuation in Parts (4)
Number Date Country
Parent 17736941 May 2022 US
Child 18507074 US
Parent 17752826 May 2022 US
Child 18507074 US
Parent 17750345 May 2022 US
Child 17752826 US
Parent 16992489 Aug 2020 US
Child 17736941 US