The present disclosure relates to voice-to-text processing, and more specifically to scoring voice-to-text transcription engines based on factors such as speed and accuracy, then training the system to obtain the optimum transcription based on those scores.
The general idea behind using automated speech-to-text systems for transcribing documents is to reduce the need for human beings to do the transcribing. While in some cases the transcription produced by the speech-to-text system can be a “final” version, where no humans edit or confirm the correctness of the transcription, in other cases a human being can edit the transcription, with the goal of saving time by not having a human do the initial transcription (just the editing). However, in practice, editing an already drafted document is often more time intensive for human beings than simply doing the transcription from scratch. Part of the reason for the discrepancy between the projected time savings of using a speech-to-text system and the reality is that transcribing a document from a recording, and editing a document to ensure it aligns with the recording, are different skills.
To counter this problem, engineers have attempted to create context-specific speech-to-text systems, where the speech recognition is particularly tailored to a given context or topic. In such systems, tailoring the vocabulary and the combinations of phonemes (e.g., diphones and triphones) can result in improved speech recognition. However, determining which context-specific speech-to-text system is appropriate for a given scenario remains a problem.
Additional features and advantages of the disclosure will be set forth in the description that follows, and in part will be understood from the description, or can be learned by practice of the herein disclosed principles. The features and advantages of the disclosure can be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the disclosure will become more fully apparent from the following description and appended claims, or can be learned by the practice of the principles set forth herein.
Disclosed are systems, methods, and non-transitory computer-readable storage media which provide a technical solution to the technical problem described. A method for performing the concepts disclosed herein can include: receiving, at a computer system, a first digital audio recording; randomly assigning, via a processor of the computer system, speech-to-text engines to transcribe the first digital audio recording, resulting in transcriptions, each transcription within the transcriptions respectively associated with a speech-to-text engine within the speech-to-text engines; scoring, via the processor, the transcriptions based on transcription scoring factors, resulting in transcription scores; scoring, via the processor and based at least in part on the transcription scores and speech-to-text engine scoring factors, the speech-to-text engines, resulting in speech-to-text engine scores; generating, via the processor and based at least in part on the speech-to-text engine scores, a model for selecting a speech-to-text engine from within the speech-to-text engines; receiving, at the computer system, a second digital audio recording; and assigning, via the processor executing the model, at least one selected speech-to-text engine from the speech-to-text engines to transcribe the second digital audio recording.
A system configured to perform the concepts disclosed herein can include: a modeling repository; a score database; at least one processor; and a non-transitory computer-readable storage medium having instructions stored which, when executed by the at least one processor, cause the at least one processor to perform operations comprising: executing a task manager service; and executing a scoring service; wherein the system generates a speech-to-text engine assignment model by: receiving a first digital audio recording; randomly assigning, via the task manager service, speech-to-text engines to transcribe the first digital audio recording, resulting in transcriptions, each transcription within the transcriptions respectively associated with a speech-to-text engine within the speech-to-text engines; scoring the transcriptions based on transcription scoring factors, resulting in transcription scores; storing the transcription scores in the score database; scoring, based at least in part on the transcription scores stored in the score database and speech-to-text engine scoring factors, the speech-to-text engines, resulting in speech-to-text engine scores; storing the speech-to-text engine scores in the score database; generating, based at least in part on the speech-to-text engine scores stored in the score database, a model for selecting a speech-to-text engine from within the speech-to-text engines for a future transcription; and storing the model in the modeling repository; and wherein the system uses the model to make additional speech-to-text engine assignments by: receiving a second digital audio recording; retrieving the model from the modeling repository; and assigning, by executing the model, a particular speech-to-text engine from the speech-to-text engines to transcribe the second digital audio recording.
A non-transitory computer-readable storage medium configured as disclosed herein can have instructions stored which, when executed by a computing device, cause the computing device to perform operations which include: receiving a first digital audio recording; randomly assigning speech-to-text engines to transcribe the first digital audio recording, resulting in transcriptions, each transcription within the transcriptions respectively associated with a speech-to-text engine within the speech-to-text engines; scoring the transcriptions based on transcription scoring factors, resulting in transcription scores; scoring, based at least in part on the transcription scores and speech-to-text engine scoring factors, the speech-to-text engines, resulting in speech-to-text engine scores; generating, based at least in part on the speech-to-text engine scores, a model for selecting a speech-to-text engine from within the speech-to-text engines for a future transcription; receiving a second digital audio recording; and assigning, by executing the model, a particular speech-to-text engine from the speech-to-text engines to transcribe the second digital audio recording.
Various embodiments of the disclosure are described in detail below. While specific implementations are described, it should be understood that this is done for illustration purposes only. Other components and configurations may be used without departing from the spirit and scope of the disclosure.
The present disclosure addresses how to determine which engine is appropriate for a given scenario, and uses AI (Artificial Intelligence) and a feedback/machine learning process to do so. The result is a system which can assign multiple engines to perform a task, score how those engines perform, and then assign subsequent tasks to one or more engines based on how those engines were previously scored. Framed in this manner, there are two phases: (1) training the system, where the various engines are scored for their accuracy, speed, cost, computational factors, etc.; and (2) predicting, using machine learning models, which of the various engines is best suited for a task. The machine learning models may be trained in view of the results.
The training of the system can occur as the system is deployed, meaning that engines can be assigned to perform the task, and scores related to those tasks can be saved and used for future assignments. Of note, the training of the system can be more than a single iteration, and can be repeated as often as is necessary. In any additional training, the subsequent scores can replace the previous scores or can be used as additional scores which, together with the previous scores, the system can use to assign subsequent transcriptions.
Consider the following example in the context of a context-specific speech-to-text system, although the architecture is not limited to this type of system, and may be used with video processing engines, etc. The system has access to various speech recognition engines, “A,” “B,” and “C.” Engines A, B, and C are randomly assigned to perform a received transcription job, converting speech to text using a combination of phoneme recognition and natural language processing (NLP). The resulting text from the randomly assigned engines is then analyzed and scored either manually or, preferably, using a machine-learning classifier, and those scores are used to assign future jobs.
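By way of a non-limiting illustration of this training loop, the following sketch (in Python; the engine stubs, scoring metric, and function names are illustrative assumptions rather than required implementations) shows random assignment of engines to a job, scoring of the resulting transcriptions, and use of the accumulated scores to pick an engine for future jobs:

```python
import random

# Illustrative engine stubs; a deployed system would invoke real speech-to-text
# engines (cloud-based or local) and return their transcribed text.
ENGINES = {
    "A": lambda audio: "transcript from engine A",
    "B": lambda audio: "transcript from engine B",
    "C": lambda audio: "transcript from engine C",
}

def train_on_job(audio, reference_text, score_db):
    """Randomly assign engines to a transcription job, score the results,
    and record the scores for later use."""
    assigned = random.sample(list(ENGINES), k=2)  # random subset of engines
    for name in assigned:
        transcript = ENGINES[name](audio)
        # Toy accuracy metric (word overlap with a reference/final draft);
        # a real system could use word error rate or a learned classifier.
        ref, hyp = set(reference_text.split()), set(transcript.split())
        accuracy = len(ref & hyp) / max(len(ref), 1)
        score_db.setdefault(name, []).append(accuracy)
    return assigned

def best_engine(score_db):
    """Assign future jobs to the engine with the highest mean recorded score."""
    return max(score_db, key=lambda name: sum(score_db[name]) / len(score_db[name]))
```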
The system herein can include and/or utilize various hardware and software components. In some configurations all of these hardware and software components can be included within a single computer system, such as a server. In other configurations, some of the hardware and software components may be co-located within a server while others are located on the cloud, and the system can enable communications to those remote components. Among the components are:
The system can initially receive the jobs through the transcription communication module, then again use the transcription communication module to deliver the transcriptions to users/customers after the transcription has been completed. At that point, metadata about the transcription (such as the job identification) can be delivered to and/or used by the platform module for automated word error scoring and recording. This error scoring can be used to update the ratings associated with the engine(s) that performed the transcription and/or other machine learning models.
Job metadata can be retrieved via a callback to the platform module caller. For the transcription communication module, the platform module can call existing transcription communication module APIs to retrieve the metadata necessary for predictions regarding what engine(s) to use for the transcription process. Exemplary fields which may be used as part of the prediction process may include: a tenant identifier, a job identifier, a job type identifier, organizational identifiers (providing a relevant level of detail for the department, team, etc., associated with the job and/or parties associated with the job), critical time references (such as due dates and WIP (Work In Progress) deadlines), priority factors, author identifiers, job-specific metadata, template references, workflow parameters, etc.
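As a non-limiting sketch, such metadata could be carried in a structure along the following lines (the field names are assumptions mirroring the exemplary fields above, not a required schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JobMetadata:
    # Identifiers mirroring the exemplary prediction fields; names are illustrative.
    instance: str
    tenant_id: str
    job_id: str
    job_type_id: str
    organization_id: Optional[str] = None
    department_id: Optional[str] = None
    due_date: Optional[datetime] = None          # critical time references
    wip_deadline: Optional[datetime] = None
    priority: int = 0
    author_id: Optional[str] = None
    template_ref: Optional[str] = None
```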
In one example embodiment, the speech recognition configuration service can ignore most or all of this data, neither using it nor storing it. However, in other example embodiments, after a prediction service has been initiated and configured, this data can be used to predict what engines are to be used. Instead of relying on the metadata in the initial implementation, the speech recognition configuration service can use a series of rules that are matched against a subset of this data. An exemplary subset can include: Instance, TenantID, OrganizationID, and DepartmentID. In other words, different subsets of attributes within the metadata can be selected, and based on the values of that subset, different engines can be selected.
For example, the rules trigger on a subset including the instance, tenant, organization, and department for a specific piece of multimedia to specify how the configuration (i.e., which engines to use) decision is made. In this example, there are three valid kinds of rules: rules that match down to the department level, rules that match down to the organization level, and rules that match only to the tenant level.
There can also be a default rule. This rule specifies the action to take when none of the other rules match.
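A minimal sketch of this matching precedence follows, assuming (purely for illustration) that each rule is represented as a dictionary with a "match" part and an "action" part, and that more specific rules are preferred over less specific ones:

```python
def select_configuration(job, rules, default_action):
    """Return the engine configuration for a job by matching rules from most
    to least specific: department level, then organization level, then tenant
    level, then the default rule."""
    # Department-level rules constrain four fields, organization-level three,
    # tenant-level two (instance and tenant).
    for specificity in (4, 3, 2):
        for rule in rules:
            match = rule["match"]
            if len(match) == specificity and all(
                match[key] == getattr(job, key, None) for key in match
            ):
                return rule["action"]
    return default_action  # the default rule: applied when nothing else matches
```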
When the speech recognition configuration service receives a request for a configuration of speech-to-text engines, it evaluates the rules against the job metadata and selects a configuration accordingly, falling back to the default rule when no other rule matches.
After the initial implementation and training of the predictor system, an additional rule can be implemented, with the system being configured to call a prediction service (a combination of machine learning software and/or models) which can make a recommendation for a speech recognition configuration (i.e., which engines to use) based on aspects of the metadata described above.
For each job processed by the speech recognition configuration service, after the rules are evaluated and the configuration selected, the service can write a record to the score database identifying how the configuration was selected, the record containing, for example, the rule that matched, the triggering metadata, and the resulting configuration.
Rules can also have an identification which uniquely identifies each rule in the service configuration. The rule that matches for a particular job can be recorded, along with the triggering metadata and the result, for debugging and auditing purposes.
An example rule follows, represented as XML. It is meant as a non-limiting example and illustration only. The specific implementation should follow whatever naming and representation conventions are already used for service configuration, to achieve the same results. In this example, it is assumed that rules are part of static, initialization-time service configuration. To change the rules used by a service, it may be necessary to restart the service.
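The following is a hedged reconstruction of such a rule set, consistent with the description that follows; the element and attribute names are illustrative assumptions rather than a required schema:

```xml
<sr-configuration-rules>
  <!-- Target speech recognition configurations known to this instance -->
  <configurations>
    <configuration name="general-english"/>
    <configuration name="medical-english"/>
    <configuration name="random-trial"/>
  </configurations>

  <!-- Rule matching to the department level -->
  <rule id="1">
    <match instance="prod" tenant="tenant-01" organization="org-10" department="radiology"/>
    <action configuration="medical-english"/>
  </rule>

  <!-- Rule matching to the organization level -->
  <rule id="2">
    <match instance="prod" tenant="tenant-01" organization="org-20"/>
    <action configuration="general-english"/>
  </rule>

  <!-- Rule matching to the tenant level -->
  <rule id="3">
    <match instance="prod" tenant="tenant-02"/>
    <action configuration="random-trial"/>
  </rule>

  <!-- Default rule: only an action, used when no other rule matches -->
  <default>
    <action configuration="general-english"/>
  </default>
</sr-configuration-rules>
```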
At the top of the rules, the target speech recognition configurations are listed. These descriptors can be used to designate the different configurations known to this instance of the platform module. There are three rules in the example. Each has an ID, a <match> part, and an <action> part. The ID for each rule should be unique at load time. The <match> part of each rule should also be different. The <match> parts of the three rules in the example demonstrate each of the allowed kinds of rules: one that specifies matching to the department level, one to the organization level, and one to the tenant level. There is also a default rule. The default rule looks like an action because that is the only thing it needs to specify (i.e., the action to take when no rule matches).
Using the example rules, the following are example transcription jobs processed by those rules:
When rules are created and/or loaded, preferably the following aspects should be validated:
However, the service does not need to validate the values used for any of the <match> arguments. For example, when loading the rules in the example above, the speech recognition configuration service should not verify whether the values correspond to actual entities (e.g., system=?, tenant=?, etc.). How such validation is handled can vary between system configurations.
Configurations and rules could be created and persisted in a database. This could be done, for example, in two separate tables. For example, each table can contain the transcription communication module administrative fields for creation date and creating user, modification date and modifying user, a soft delete flag, soft delete date, and deleting user. A row of the configuration table can contain the name/identifier for a configuration, and the administrative fields. A row of the rule table can contain the rule identifier, the <match> values (e.g., instance, tenant, organization, and department), the configuration specified by the rule's <action>, and the administrative fields.
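As a non-limiting sketch, the two tables could be represented along the following lines (the field names are assumptions drawn from the description above, not a required schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AdminFields:
    # Administrative fields shared by both tables, as described above.
    created_at: datetime
    created_by: str
    modified_at: Optional[datetime] = None
    modified_by: Optional[str] = None
    deleted: bool = False                 # soft delete flag
    deleted_at: Optional[datetime] = None
    deleted_by: Optional[str] = None

@dataclass
class ConfigurationRow:
    name: str                             # name/identifier for a configuration
    admin: AdminFields

@dataclass
class RuleRow:
    rule_id: str                          # unique rule identification
    instance: Optional[str]               # <match> values; None = not constrained
    tenant_id: Optional[str]
    organization_id: Optional[str]
    department_id: Optional[str]
    configuration_name: str               # the rule's <action>: configuration to select
    admin: AdminFields
```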
When the rules specify a random selection, the system can assume that there is an equal probability of selecting different configurations of speech recognition engines. Alternatively, the system can have a bias, thereby ensuring that no engine is selected too often (i.e., that the selections are balanced within a predetermined threshold range), and/or ensuring that no engine is selected too infrequently.
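A minimal sketch of such a biased-but-balanced random selection follows; the inverse-count weighting is an illustrative assumption, and any scheme keeping selections within a predetermined threshold range could be substituted:

```python
import random

def pick_configuration(configurations, selection_counts):
    """Randomly pick a configuration, biased toward configurations that have
    been selected less often so far, so that selections stay roughly balanced."""
    # Weight each configuration inversely to how many times it has been picked.
    weights = [1.0 / (1 + selection_counts.get(c, 0)) for c in configurations]
    choice = random.choices(configurations, weights=weights, k=1)[0]
    selection_counts[choice] = selection_counts.get(choice, 0) + 1
    return choice
```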
The scoring of transcriptions and/or engines can provide data on finished transcription jobs. The platform module can then use this data to compute and track speech recognition accuracy, with the accuracy scores being an input to machine learning models which predict the best speech recognition engine(s) and configuration(s) for incoming transcription jobs.
Preferably, the platform module obtains access to the speech recognition draft, and the transcription communication module obtains access to the final draft. However, in some configurations only the transcription communication module knows when both drafts are available (i.e., when the job state reaches a “ready to deliver” state). Therefore, it can be necessary for the transcription communication module to call a platform module API, providing at least a notification regarding status. However, communications between the transcription communication module and the platform module can occur in any manner necessary for a given configuration.
Once the data is available, the platform module API service can post a message for a scoring request on the task manager queue. The scoring service can pick up and process each of these messages, scoring the speech recognition transcription against the final draft for one pair of documents. This vector of scores can then be stored in the score database, associated with the input instance/tenant/job ID.
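By way of a non-limiting illustration, one score in such a vector could be a word error rate computed between the speech recognition draft and the final draft, for example via a standard word-level edit distance (a conventional metric offered as a sketch, not necessarily the exact scoring used):

```python
def word_error_rate(final_draft, sr_draft):
    """Word error rate of the speech recognition draft against the final draft,
    computed with a dynamic-programming word-level edit distance."""
    ref, hyp = final_draft.split(), sr_draft.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```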
The scoring code can produce a vector of output scores and data that can be stored in the score database for the job. Exemplary scores/outputs can include, for example, word error metrics, diarization error metrics, and punctuation edit metrics.
In some configurations, in addition to scoring the transcriptions, the speech recognition engines can be scored. Like the transcriptions, the engines themselves can be scored based on accuracy; however, they can also be scored based on cost, power consumption, bandwidth usage, time required for a transcription process to occur, computing cycles/flops, etc. The system can use a combination of the transcription scores and the engine scores to form a model of which engines should be selected to produce the best transcriptions going forward.
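A hedged sketch of such a combination follows; the particular factors and weights are illustrative assumptions, and any weighting of accuracy against cost, speed, and resource usage could be used:

```python
def engine_score(transcription_scores, cost, latency_seconds, weights=None):
    """Combine transcription accuracy with engine-level factors (cost and speed
    here; power, bandwidth, or compute cycles could be added) into one score."""
    weights = weights or {"accuracy": 0.7, "cost": 0.15, "speed": 0.15}
    accuracy = sum(transcription_scores) / max(len(transcription_scores), 1)
    # Lower cost and latency are better, so map them into (0, 1] terms.
    cost_term = 1.0 / (1.0 + cost)
    speed_term = 1.0 / (1.0 + latency_seconds)
    return (weights["accuracy"] * accuracy
            + weights["cost"] * cost_term
            + weights["speed"] * speed_term)
```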
This selection based on scores is referred to as a “Predictor Service.” An exemplary sequence using the predictor service can be:
Platform Module Proceeds with SR Using the Selected Configuration
The inputs to the predictor service model can be, for example, all or a portion of the metadata associated with a transcription job, i.e., the data provided by the transcription communication module when the job is received. The inputs can also include previous scores for the respective engines, and any topic, context, or other information.
The predictor service can utilize machine learning, where the system self-corrects over time by using feedback and scores from new transcriptions to adjust which engines are assigned and under what circumstances those engines are assigned. This machine learning can, for example, include training a neural network to identify particular variables which make a difference in the overall quality of the final product/transcription, such as the contexts, topics, job creator, time, length of audio, etc. The predictor service model can incorporate the neural network, or (if done without a neural network) can include feedback mechanisms to modify weights associated with different engine assignments.
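As a minimal sketch of such a feedback mechanism (without a neural network), assignment weights could be nudged toward the newest engine scores, for example with an exponential moving average; the update rule here is an illustrative assumption:

```python
def update_engine_weights(weights, new_engine_scores, learning_rate=0.1):
    """Feedback step: move each engine's assignment weight toward its most
    recent score so the predictor self-corrects over time. A deployed system
    could instead retrain a neural network on accumulated scores and metadata."""
    for engine, score in new_engine_scores.items():
        previous = weights.get(engine, score)
        weights[engine] = (1 - learning_rate) * previous + learning_rate * score
    return weights
```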
This metadata can be specific to a particular research area or topic. Prior to executing speech recognition, the platform module 202 task manager 210 can call a new platform module service (“SR configuration service”) 212, passing in the new job metadata. In some configurations, this speech recognition configuration service 212 can implement a simple lookup table to determine the desired speech-to-text engine (also known as a “configuration”) to use for the job. The lookup table can be tied, for example, to attributes of the metadata associated with a piece of multimedia content. Exemplary engines could be, for example, tuned to specific accents or geography, such as American English, Australian English, Canadian English, Texas English, Scottish English, etc. The use of English is purely exemplary, as other languages, locations, geo-tags, GPS data, or other attributes of the metadata can also be used to determine what engine to use in a given circumstance.
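A non-limiting sketch of such a lookup table follows; the locale keys and engine names are illustrative assumptions:

```python
# Illustrative lookup table tying a metadata attribute (here, a locale tag)
# to a tuned speech-to-text engine configuration.
ENGINE_LOOKUP = {
    "en-US": "american-english-engine",
    "en-AU": "australian-english-engine",
    "en-CA": "canadian-english-engine",
    "en-GB-scotland": "scottish-english-engine",
}

def engine_for_job(metadata, default="general-english-engine"):
    """Return the configured engine for the job's locale, or a default."""
    return ENGINE_LOOKUP.get(metadata.get("locale"), default)
```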
Engines can also be tuned to specific contexts, themes, topics, etc., which can, in some configurations, be determined based on information within the metadata received in the API call 206. The context or topic can also be determined by sampling a portion of an audio file or audio stream and performing a keyword analysis.
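By way of illustration, such a keyword analysis could count topic-specific terms in a transcript of the sampled audio; the topics and keyword lists below are illustrative assumptions:

```python
# Illustrative per-topic keyword sets; real lists would be far larger, and the
# sampled audio would first be transcribed by a general-purpose engine.
TOPIC_KEYWORDS = {
    "medical": {"patient", "diagnosis", "dosage", "symptom"},
    "legal": {"plaintiff", "defendant", "counsel", "exhibit"},
}

def detect_topic(sample_transcript, default="general"):
    """Pick the topic whose keywords overlap most with the sampled transcript."""
    words = set(sample_transcript.lower().split())
    counts = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default
```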
The selected engine, along with unique identifiers for the job (such as “JobID” with respect to the job identification, and an instance number), can be recorded in a database 214. Throughout, this database is referred to as the “Score DB” 214. This database 214 can be used to record metadata associated with platform module 202 actions, as specified here. This database 214 can be combined with, or part of, an existing platform module 202 database, or a separate database might be used for this purpose in implementation.
When training the system, the system may ignore the draft quality prediction data. For example, the system may wait until a predetermined threshold of documents has been processed, or until a certain amount of training has been completed, before transmitting or otherwise relying on the predicted quality scores.
This enables updating prediction models automatically, an example of which is illustrated in
The system can also contain an audio channel and splitter service 912, which can split audio signals 914 and split channels 918 using step functions. The system can further contain a transcoder service 916, which can transcode media via a transcoder step function 920 and communicate media content using a media engine 922 (which can be in an elastic computing cloud format).
It is noted that in some configurations not all these components may be present. For example, in some configurations cloud transcription services 302 may not be present. In other configurations, non-cloud-based transcription services may not be present. In yet other configurations modules such as the modeling repository 902, the sampling service 904, and/or the research repository 906 may not be present.
In some configurations, the scoring of the speech-to-text engines is further based on metadata of the original audio.
In some configurations, the speech-to-text engines generate transcription metadata, and the scoring of the speech-to-text engines is further based on the transcription metadata.
In some configurations the speech-to-text engines are cloud based.
In some configurations the transcriptions are generated by the speech-to-text engines operating in parallel.
In some configurations the model is a neural network.
In some configurations, the illustrated method can be further augmented to include: receiving, from the at least one selected speech-to-text engines, second transcriptions, the second transcriptions being transcriptions of the second digital audio recording; scoring, via the processor, the second transcriptions based on the transcription scoring factors, resulting in second transcription scores; scoring, via the processor and based at least in part on the second transcription scores and the speech-to-text engine scoring factors, the at least one selected speech-to-text engines, resulting in second speech-to-text engine scores; and modifying the model based on the second speech-to-text engine scores. In such configurations, the modifying of the model can be further accomplished by: storing the model in a repository of models; periodically retrieving prediction data generated by the model prior to the assigning of the at least one selected speech-to-text engines, the prediction data stored in a database until retrieved; periodically retrieving workflow job data generated by the model, the workflow job data stored in the database until retrieved; retrieving the model from the repository of models; modifying, via the processor, the model based on at least the second speech-to-text engine scores, the prediction data, and the workflow job data, resulting in an updated model; and replacing, within the repository of models, the model with the updated model.
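A hedged sketch of that update cycle follows; the repository and database interfaces (get, put, fetch_*) and the retrain callable are hypothetical stand-ins for whatever storage and training mechanisms a given configuration uses:

```python
def refresh_model(model_repo, score_db, model_id, retrain):
    """Periodic model update: retrieve the stored model plus the accumulated
    prediction data, workflow job data, and engine scores; modify/retrain the
    model; and replace it in the repository with the updated model."""
    model = model_repo.get(model_id)
    prediction_data = score_db.fetch_prediction_data(model_id)
    workflow_job_data = score_db.fetch_workflow_job_data(model_id)
    engine_scores = score_db.fetch_engine_scores()
    updated_model = retrain(model, engine_scores, prediction_data, workflow_job_data)
    model_repo.put(model_id, updated_model)  # replace the previous model
    return updated_model
```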
In some configurations, the scoring of the transcriptions via the processor is done in combination with human based review of the transcriptions. In other configurations, the scoring is done using an automated scoring system which calculates at least one of word error, diarization error, and/or punctuation edit metrics within the transcriptions.
In some configurations, the transcription scoring factors include at least one of accuracy and context; and the speech-to-text engine scoring factors include at least one of speed (of the speech-to-text engines), computational requirements (of the speech-to-text engines), and/or bandwidth of communications (with the speech-to-text engines).
With reference to
The system bus 1110 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output system (BIOS), stored in ROM 1140 or the like, may provide the basic routine that helps to transfer information between elements within the computing device 1100, such as during start-up. The computing device 1100 further includes storage devices 1160 such as a hard disk drive, a magnetic disk drive, an optical disk drive, tape drive or the like. The storage device 1160 can include software modules 1162, 1164, 1166 for controlling the processor 1120. Other hardware or software modules are contemplated. The storage device 1160 is connected to the system bus 1110 by a drive interface. The drives and the associated computer-readable storage media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computing device 1100. In one aspect, a hardware module that performs a particular function includes the software component stored in a tangible computer-readable storage medium in connection with the necessary hardware components, such as the processor 1120, bus 1110, display 1170, and so forth, to carry out the function. In another aspect, the system can use a processor and computer-readable storage medium to store instructions which, when executed by the processor, cause the processor to perform a method or other specific actions. The basic components and appropriate variations are contemplated depending on the type of device, such as whether the device 1100 is a small, handheld computing device, a desktop computer, or a computer server.
Although the exemplary embodiment described herein employs the hard disk 1160, other types of computer-readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, digital versatile disks, cartridges, random access memories (RAMs) 1150, and read-only memory (ROM) 1140, may also be used in the exemplary operating environment. Tangible computer-readable storage media, computer-readable storage devices, or computer-readable memory devices, expressly exclude media such as transitory waves, energy, carrier signals, electromagnetic waves, and signals per se.
To enable user interaction with the computing device 1100, an input device 1190 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth. An output device 1170 can also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing device 1100. The communications interface 1180 generally governs and manages the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
Use of language such as “at least one of X, Y, and Z,” “at least one of X, Y, or Z,” “at least one or more of X, Y, and Z,” “at least one or more of X, Y, or Z,” “at least one or more of X, Y, and/or Z,” or “at least one of X, Y, and/or Z,” are intended to be inclusive of both a single item (e.g., just X, or just Y, or just Z) and multiple items (e.g., {X and Y}, {X and Z}, {Y and Z}, or {X, Y, and Z}). The phrase “at least one of” and similar phrases are not intended to convey a requirement that each possible item must be present, although each possible item may be present.
The various embodiments described above are provided by way of illustration only and should not be construed to limit the scope of the disclosure. Various modifications and changes may be made to the principles described herein without following the example embodiments and applications illustrated and described herein, and without departing from the spirit and scope of the disclosure.