Embodiments of the invention relate to speech-based systems, and in particular, to systems, methods, and program products for improving speech cognition in speech-directed or speech-assisted work environments that utilize synthesized speech.
Speech recognition has simplified many tasks in the workplace by permitting hands-free communication with a computer as a convenient alternative to communication via conventional peripheral input/output devices. A user may enter data and commands by voice using a device having a speech recognizer. Commands, instructions, or other information may also be communicated to the user by a speech synthesizer. Generally, the synthesized speech is provided by a text-to-speech (TTS) engine. Speech recognition finds particular application in mobile computing environments in which interaction with the computer by conventional peripheral input/output devices is restricted or otherwise inconvenient.
For example, wireless wearable, portable, or otherwise mobile computer devices can provide a user performing work-related tasks with desirable computing and data-processing functions while offering the user enhanced mobility within the workplace. One example of an area in which users rely heavily on such speech-based devices is inventory management. Inventory-driven industries rely on computerized inventory management systems for performing various diverse tasks, such as food and retail product distribution, manufacturing, and quality control. An overall integrated management system typically includes a combination of a central computer system for tracking and management, and the people who use and interface with the computer system in the form of order fillers and other users. In one scenario, the users handle the manual aspects of the integrated management system under the command and control of information transmitted from the central computer system to the wireless mobile device and to the user through a speech-driven interface.
As the users process their orders and complete their assigned tasks, a bi-directional communication stream of information is exchanged over a wireless network between users wearing wireless devices and the central computer system. The central computer system thereby directs multiple users and verifies completion of their tasks. To direct the user's actions, information received by each mobile device from the central computer system is translated into speech or voice instructions for the corresponding user. Typically, to receive the voice instructions, the user wears a headset coupled with the mobile device.
The headset includes a microphone for spoken data entry and an ear speaker for audio data feedback. Speech from the user is captured by the headset and converted using speech recognition into data used by the central computer system. Similarly, instructions from the central computer or mobile device in the form of text are delivered to the user as voice prompts generated by the TTS engine and played through the headset speaker. Using such mobile devices, users may perform assigned tasks virtually hands-free so that the tasks are performed more accurately and efficiently.
An illustrative example of a set of user tasks in a speech-directed work environment may involve filling an order, such as filling a load for a particular truck scheduled to depart from a warehouse. The user may be directed to different warehouse areas (e.g., a freezer) in which they will be working to fill the order. The system vocally directs the user to particular aisles, bins, or slots in the work area to pick particular quantities of various items using the TTS engine of the mobile device. The user may then vocally confirm each location and the number of picked items, which may cause the user to receive the next task or order to be picked.
The speech synthesizer or TTS engine operating in the system or on the device translates the system messages into speech, and typically provides the user with adjustable operational parameters or settings such as audio volume, speed, and pitch. Generally, the TTS engine operational settings are set when the user or worker logs into the system, such as at the beginning of a shift. The user may walk through a number of different menus or selections to control how the TTS engine will operate during their shift. In addition to speed, pitch, and volume, the user will also generally select the TTS engine for their native tongue, such as English or Spanish, for example.
As users become more experienced with the operation of the inventory management system, they will typically increase the speech rate and/or pitch of the TTS engine. The increased speech parameters, such as increased speed, allow the user to hear and perform tasks more quickly as they gain familiarity with the prompts spoken by the application. However, there are often situations that may be encountered by the worker that hinder the intelligibility of speech from the TTS engine at the user's selected settings.
For example, the user may receive an unfamiliar prompt or enter into an area of a voice or task application that they are not familiar with. Alternatively, the user may enter a work area with a high ambient noise level or other audible distractions. All these factors degrade the user's ability to understand the TTS engine generated speech. This degradation may result in the user being unable to understand the prompt, with a corresponding increase in work errors, in user frustration, and in the amount of time necessary to complete the task.
With existing systems, it is time consuming and frustrating to be constantly navigating through the necessary menus to change the TTS engine settings in order to address such factors and changes in the work environment. Moreover, since many such factors affecting speech intelligibility are temporary, it becomes particularly time consuming and frustrating to be constantly returning to and navigating through the necessary menus to change the TTS engine back to its previous settings once the temporary environmental condition has passed.
Accordingly, there is a need for systems and methods that improve user cognition of synthesized speech in speech-directed environments by adapting to the user environment. These issues and other needs in the prior art are met by the invention as described and claimed below.
In an embodiment of the invention, a communication system for a speech-based work environment is provided that includes a text-to-speech engine having one or more adjustable operational parameters. Processing circuitry monitors an environmental condition related to intelligibility of an output of the text-to-speech engine, and modifies the one or more adjustable operational parameters of the text-to-speech engine in response to the monitored environmental condition.
In another embodiment of the invention, a method of communicating in a speech-based environment using a text-to-speech engine is provided that includes monitoring an environmental condition related to intelligibility of an output of the text-to-speech engine. The method further includes modifying one or more adjustable operational parameters of the text-to-speech engine in response to the environmental condition.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the general description of the invention given above and the detailed description of the embodiments given below, serve to explain the principles of the invention.
It should be understood that the appended drawings are not necessarily to scale, presenting a somewhat simplified representation of various features illustrative of the basic principles of embodiments of the invention. The specific design features of embodiments of the invention as disclosed herein, including, for example, specific dimensions, orientations, locations, and shapes of various illustrated components, as well as specific sequences of operations (e.g., including concurrent and/or sequential operations), will be determined in part by the particular intended application and use environment. Certain features of the illustrated embodiments may have been enlarged or distorted relative to others to facilitate visualization and provide a clear understanding.
Embodiments of the invention are related to methods and systems for dynamically modifying adjustable operational parameters of a text-to-speech (TTS) engine running on a device in a speech-based system. To this end, the system monitors one or more environmental conditions associated with a user that are related to or otherwise affect the user intelligibility of the speech or audible output that is generated by the TTS engine. As used herein, environmental conditions are understood to include any operating/work environment conditions or variables which are associated with the user and may affect or provide an indication of the intelligibility of generated speech or audible outputs of the TTS engine for the user. Environmental conditions associated with a user thus include, but are not limited to, user environment conditions such as ambient noise level or temperature, user tasks and speech outputs or prompts or messages associated with the tasks, system events or status, and/or user input such as voice commands or instructions issued by the user. The system may thereby detect or otherwise determine that the operational environment of a device user has certain characteristics, as reflected by monitored environmental conditions. In response to monitoring the environmental conditions or sensing of other environmental characteristics that may reduce the ability of the user to understand TTS voice prompts or other TTS audio data, the system may modify one or more adjustable operational parameters of the TTS engine to improve intelligibility. Once the system operational environment or environmental variable has returned to its original or previous state, a predetermined amount of time has passed, or a particular sensed environmental characteristic ceases or ends, the adjusted or modified operational parameters of the TTS engine may be returned to their original or previous settings. 
The system may thereby improve the user experience by automatically increasing the user's ability to understand critical speech or spoken data in adverse operational environments and conditions while maintaining the user's preferred settings under normal conditions.
In one embodiment of the invention, device 12 may be carried or otherwise transported, such as on the user's waist or forearm, or on a lift truck, harness, or other manner of transportation. The user 13 and the device 12 communicate using speech through the headset 14, which may be coupled to the device 12 through a cable 17 or wirelessly using a suitable wireless interface. One such suitable wireless interface may be Bluetooth®. As noted above, if a wireless headset is used, the device 12 may be stationary, since the mobile worker can move around using just the mobile or wireless headset. The headset 14 includes one or more speakers 18 and one or more microphones 19. The speaker 18 is configured to play TTS audio or audible outputs (such as speech output associated with a speech dialog to instruct the user 13 to perform an action), while the microphone 19 is configured to capture speech input from the user 13 (such as a spoken user response for conversion to machine readable input). The user 13 may thereby interface with the device 12 hands-free through the headset 14 as they move through various work environments or work areas, such as a warehouse.
The device 12 includes suitable processing circuitry that may include a processor 22, a memory 24, a network interface 26, an input/output (I/O) interface 28, a headset interface 30, and a power supply 32 that includes a suitable power source, such as a battery, for example, and provides power to the electrical components comprising the device 12. As noted, device 12 may be a mobile device and various examples discussed herein refer to such a mobile device. One suitable device is a TALKMAN® terminal device available from Vocollect, Inc. of Pittsburgh, Pa. However, device 12 may be a stationary computer that the user interfaces with through a wireless headset, or may be integrated with the headset 14. The processor 22 may consist of one or more processors selected from microprocessors, micro-controllers, digital signal processors, microcomputers, central processing units, field programmable gate arrays, programmable logic devices, state machines, logic circuits, analog circuits, digital circuits, and/or any other devices that manipulate signals (analog and/or digital) based on operational instructions that are stored in memory 24.
Memory 24 may be a single memory device or a plurality of memory devices including but not limited to read-only memory (ROM), random access memory (RAM), volatile memory, non-volatile memory, static random access memory (SRAM), dynamic random access memory (DRAM), flash memory, cache memory, and/or any other device capable of storing information. Memory 24 may also include memory storage physically located elsewhere in the device 12, such as memory integrated with the processor 22.
The device 12 may be under the control and/or otherwise rely upon various software applications, components, programs, files, objects, modules, etc. (hereinafter, “program code”) residing in memory 24. This program code may include an operating system 34 as well as one or more software applications including one or more task applications 36, and a voice engine 37 that includes a TTS engine 38, and a speech recognition engine 40. The applications may be configured to run on top of the operating system 34 or directly on the processor 22 as “stand-alone” applications. The one or more task applications 36 may be configured to process messages or task instructions for the user 13 by converting the task messages or task instructions into speech output or some other audible output through the voice engine 37. To facilitate synthesizing the speech output, the task application 36 may employ speech synthesis functions provided by TTS engine 38, which converts normal language text into audible speech to play to a user. For the other half of the speech-based system, the device 12 uses speech recognition engine 40 to gather speech inputs from the user and convert the speech to text or other usable system data.
The processing circuitry and voice engine 37 provide a mechanism to dynamically modify one or more operational parameters of the TTS engine 38. The text-to-speech engine 38 has at least one, and usually more than one, adjustable operational parameter. To this end, the voice engine 37 may operate with task applications 36 to alter the speed, pitch, volume, language, and/or any other operational parameter of the TTS engine depending on speech dialog, conditions in the operating environment, or certain other conditions or variables. For example, the voice engine 37 may reduce the speed of the TTS engine 38 in response to the user 13 asking for help or entering into an unfamiliar area of the task application 36. Other potential uses of the voice engine 37 include altering the operational parameters of the TTS engine 38 based on one or more system events or one or more environmental conditions or variables in a work environment. As will be understood by a person of ordinary skill in the art, the invention may be implemented in a number of different ways, and the specific programs, objects, or other software components for doing so are not limited specifically to the implementations illustrated.
Referring now to
To that end, and in accordance with an embodiment of the invention, in block 54 the environmental condition of the speech prompt or message type is monitored and the speech prompt is checked to see if it is a system message or system message type. To allow this determination to be made, the message may be flagged as a system message type by the task application 36 of the device 12 or by the central computer system 21. Persons having ordinary skill in the art will understand that there are many ways by which the determination that the speech prompt is a certain type, such as a system message, may be made, and embodiments of the invention are not limited to any particular way of making this determination or of the other types of speech prompts or messages that might be monitored as part of the environmental conditions.
If the speech prompt is determined to not be a system message or some other message type (“No” branch of decision block 54), the task application 36 proceeds to block 62. In block 62, the message is played to the user 13 through the headset 14 in a normal manner according to operational parameter settings of the TTS engine 38 as set by the user. However, if the speech prompt is determined to be a system message or some other type of message (“Yes” branch of decision block 54), the task application 36 proceeds to block 56 and modifies an operational parameter for the TTS engine. In the embodiment of
Once the message has been played, the task application 36 proceeds to block 60, where the operational parameter (i.e., speed setting) is restored to its previous level or setting. The operational parameters of the text-to-speech engine 38 are thus returned to their normal user settings so the user can proceed as desired in the speech dialog. Usually, the speech dialog will then resume as normal. However, if further monitored conditions dictate, the modified settings might be maintained. Alternatively, the modified setting might be restored only after a certain amount of time has elapsed. Advantageously, embodiments of the invention thereby provide certain messages and message types with operational parameters modified to improve the intelligibility of the message automatically while maintaining the preferred settings of the user 13 under normal conditions for the various task applications 36.
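By way of a non-limiting illustration, the flow of blocks 54, 56, 60, and 62 described above might be sketched in Python as follows. The class `TtsEngine`, the helper `temporary_setting`, the function `play_prompt`, and all numeric values are hypothetical names and assumptions introduced only for this sketch; they are not part of any actual TTS engine interface.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class TtsEngine:
    """Hypothetical stand-in for TTS engine 38; it only records what it would speak."""
    speed: float = 1.5    # user's preferred speech rate (illustrative units)
    volume: float = 0.6   # user's preferred volume (illustrative units)
    spoken: list = field(default_factory=list)

    def play(self, text: str) -> None:
        # A real engine would synthesize audio; here the settings in effect are logged.
        self.spoken.append((text, self.speed, self.volume))


@contextmanager
def temporary_setting(engine, name, value):
    """Modify one parameter (block 56) and restore the previous value afterwards (block 60)."""
    previous = getattr(engine, name)
    setattr(engine, name, value)
    try:
        yield engine
    finally:
        setattr(engine, name, previous)


def play_prompt(engine, text, is_system_message, slow_speed=1.0):
    if is_system_message:                      # "Yes" branch of decision block 54
        with temporary_setting(engine, "speed", slow_speed):
            engine.play(text)                  # played at the reduced speed
    else:                                      # "No" branch of decision block 54
        engine.play(text)                      # block 62: user settings unchanged


engine = TtsEngine()
play_prompt(engine, "Go to aisle 5", is_system_message=False)
play_prompt(engine, "Battery low", is_system_message=True)
```

In this sketch the restore step runs even if playback raises an error, which is one possible way to ensure that the preferred settings of the user are not lost.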
Additional examples of environmental conditions, such as voice data or message types that may be flagged and monitored for improved intelligibility, include messages over a certain length or syllable count, messages that are in a language that is non-native to the TTS engine 38, and messages that are generated when the user 13 requests help, speaks a command, or enters an area of the task application 36 that is not commonly used, and where the user has little experience. While the environmental condition may be based on a message status, or the type of message, or language of the message, length of message, or commonality or frequency of the message, other environmental conditions are also monitored in accordance with embodiments of the invention, and may also be used to modify the operational parameters of the TTS engine 38.
Referring now to
If, on the other hand, the user 13 does not understand the speech prompt, the user 13 responds with a command type or phrase such as “Say Again”. That is, the speech prompt was not understood, and the user needs it repeated. In this event, the task application 36 proceeds to block 78 (“Yes” branch of decision block 74) where the processing circuitry and task application 36 use the mechanism provided by the processing circuitry and voice engine 37 to reduce the speed setting of the TTS engine 38. The task application 36 then proceeds to re-play the speech prompt (Block 80) before proceeding to block 82. In block 82, the modified operational parameter, such as speed setting for the TTS engine 38, may be restored to its previous pre-altered setting or original setting before returning to block 74.
As previously described, in block 74, the user 13 responds to the slower replayed speech prompt. If the user 13 understands the repeated and slowed speech prompt, the user response may be an affirmative response (e.g., “4 Cases Picked”) so that the task application proceeds to block 72 and issues the next speech prompt in the task list or dialog. If the user 13 still does not understand the speech prompt, the user may repeat the phrase “Say Again”, causing the task application 36 to again proceed back to block 78, where the process is repeated. Although speed is the operational parameter adjusted in the illustrated example, other operational parameters or combinations of such parameters (e.g., volume, pitch, etc.) may be modified as well.
In an alternative embodiment of the invention, the processing circuitry and task application 36 defers restoring the original setting of the modified operational parameter of the TTS engine 38 until an affirmative response is made by the user 13. For example, if the operational parameter is modified in block 78, the prompt is replayed (Block 80) at the modified setting, and the program flow proceeds by arrow 81 to await the user response (Block 74) without restoring the settings to previous levels. An alternative embodiment also incrementally reduces the speed of the TTS engine 38 each time the user 13 responds with a certain spoken command, such as “Say Again”. Each pass through blocks 76 and 78 thereby further reduces the speed of the TTS engine 38 incrementally until a minimum speed setting is reached or the prompt is understood. Once the prompt is sufficiently slowed so that the user 13 understands the prompt, the user 13 may respond in an affirmative manner (“No” branch of decision block 76). The affirmative response, indicating by the environmental condition a return to a previous state (e.g., user intelligibility), causes the speed setting or other modified operational parameter settings of the TTS engine 38 to be restored to their original or previous settings (Block 83) and the next speech prompt is issued.
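As a non-limiting sketch of the alternative embodiment described above, the incremental speed reduction in response to repeated “Say Again” commands might be expressed in Python as follows. The function name `replay_speeds`, the step size, and the minimum speed are assumptions chosen for illustration only.

```python
def replay_speeds(responses, user_speed=1.6, step=0.2, min_speed=0.8):
    """Return the speed used for each playback of a prompt, plus the restored speed.

    `responses` is a sequence of recognized user phrases (hypothetical input).
    Each "Say Again" slows the next replay by `step`, down to `min_speed`.
    """
    speed = user_speed
    played_at = [speed]                    # block 72: prompt issued at user settings
    for response in responses:             # block 74: await the user response
        if response.strip().lower() != "say again":
            break                          # affirmative response ends the loop
        speed = max(min_speed, round(speed - step, 2))  # block 78: reduce speed
        played_at.append(speed)            # block 80: replay at the modified setting
    restored = user_speed                  # block 83: restore the previous setting
    return played_at, restored


speeds, restored = replay_speeds(["Say Again", "Say Again", "4 cases picked"])
```

Under these assumed values, the prompt in the example is played three times at progressively slower speeds, and the speed setting preferred by the user is restored once the affirmative response is received.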
Advantageously, embodiments of the invention provide a dynamic modification of an operational parameter of the TTS engine 38 to improve the intelligibility of a TTS message, command, or prompt based on monitoring one or more environmental conditions associated with a user of the speech-based system. More advantageously, in one embodiment, the settings are returned to the previous preferred settings of the user 13 when the environmental condition indicates a return to a previous state, and once the message, command, or prompt has been understood without requiring any additional user action. The amount of time necessary to proceed through the various tasks may thereby be reduced as compared to systems lacking this dynamic modification feature.
While the dynamic modification may be instigated by a specific type of command from the user 13, an environmental condition based on an indication that the user 13 is entering a new or less-familiar area of a task application 36 may also be monitored and used to drive modification of an adjustable operational parameter. For example, if the task application 36 proceeds with dialog that the system has flagged as new or not commonly used by the user 13, the speed parameter of the TTS engine 38 may be reduced or some other operational parameter might be modified.
While several examples noted herein are directed to monitoring environmental conditions related to the intelligibility of the output of the TTS engine 38 that are based upon the specific speech dialog itself, or commands in a speech dialog, or spoken responses from the user 13 that are reflective of intelligibility, other embodiments of the invention are not limited to these monitored environmental conditions or variables. It is therefore understood that there are other environmental conditions directed to the physical operating or work environment of the user 13 that might be monitored rather than the actual dialog of the voice engine 37 and task applications 36. In accordance with another aspect of the invention, such external environmental conditions may also be monitored for the purposes of dynamically and temporarily modifying at least one operational parameter of the TTS engine 38.
The processing circuitry and software of the invention may also monitor one or more external environmental conditions to determine if the user 13 is likely being subjected to adverse working conditions that may affect the intelligibility of the speech from the TTS engine 38. If a determination that the user 13 is encountering such adverse working conditions is made, the voice engine 37 may dynamically override the user settings and modify those operational parameters accordingly. The processing circuitry and task application 36 and/or voice engine 37, may thereby automatically alter the operational parameters of the TTS engine 38 to increase intelligibility of the speech played to the user 13 as disclosed.
Referring now to
If the task application 36 makes a determination that the user 13 is in an adverse environment, such as a high ambient noise environment (“Yes” branch of decision block 94), the task application 36 proceeds to block 100. In block 100, the task application 36 and/or voice engine 37 causes the operational parameters of the text-to-speech engine 38 to be altered by, for example, increasing the volume. The task application 36 then proceeds to block 102 where the prompt is played with the modified operational parameter settings before proceeding to block 103. In block 103, a determination is again made, based on the monitored environmental condition, whether the environment is adverse or noisy. If not, and the environmental condition indicates a return to a previous state, i.e., normal noise level, the flow proceeds to block 104, and the operational parameter settings of the TTS engine 38 are restored to their previous pre-altered or original settings (e.g., the volume is reduced) before proceeding to block 98 where the task application 36 waits for a user response in the normal manner. If the monitored condition indicates that the environment is still adverse, the modified operational parameter settings remain.
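For illustration only, the volume adjustment described above might be sketched in Python as follows. The function `volumes_for_readings`, the 80 dB threshold, and the volume values are hypothetical assumptions; an actual embodiment may use any suitable detector and any suitable criterion for an adverse environment.

```python
def volumes_for_readings(noise_readings_db, user_volume=0.5,
                         threshold_db=80.0, boosted_volume=0.9):
    """Return the volume that would be used for a prompt at each noise reading.

    The volume is raised while the monitored noise level is adverse and is
    restored to the user setting once the noise returns to a normal level.
    """
    used = []
    for level in noise_readings_db:
        if level >= threshold_db:
            volume = boosted_volume   # block 100: modify the operational parameter
        else:
            volume = user_volume      # block 104: restore the previous setting
        used.append(volume)
    return used


volumes = volumes_for_readings([65.0, 88.0, 91.0, 70.0])
```

In this sketch, the boosted volume is kept for as long as consecutive readings remain at or above the assumed threshold, which corresponds to the modified settings remaining in place while the environment is still adverse.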
The adverse environment may be indicated by a number of different external factors within the work area of the user 13 and monitored environmental conditions. For example, the ambient noise in the environment may be particularly high due to the presence of noisy equipment, fans, or other factors. A user may also be working in a particularly noisy region of a warehouse. Therefore, in accordance with an embodiment of the invention, the noise level may be monitored with appropriate detectors. The noise level may relate to the intelligibility of the output of the TTS engine 38 because the user may have difficulty in hearing the output due to the ambient noise. To monitor for an adverse environment, certain sensors or detectors may be implemented in the system, such as on the headset or device 12, to monitor such an external environmental variable.
Alternatively, the system 10 and/or the mobile device 12 may provide an indication of a particular adverse environment to the processing circuitry. For example, based upon the actual tasks assigned to the user 13, the system 10 or mobile device 12 may know that the user 13 will be working in a particular environment, such as a freezer environment. Therefore, the monitored environmental condition is the location of a user for their assigned work. Fans in a freezer environment often make the environment noisier. Furthermore, mobile workers working in a freezer environment may be required to wear additional clothing, such as a hat. The user 13 may therefore be listening to the output from the TTS engine 38 through the additional clothing. As such, the system 10 may anticipate that for tasks associated with the freezer environment, an operational parameter of the TTS engine 38 may need to be temporarily modified. For example, the volume setting may need to be increased. Once the user is out of a freezer and returns to the previous state of the monitored environmental condition (i.e., ambient temperature), the operational parameter settings may be returned to a previous or unmodified setting. Other detectors might be used to monitor environmental conditions, such as a thermometer or temperature sensor to sense the temperature of the working environment to indicate the user is in a freezer.
By way of another example, system level data or a sensed condition by the mobile device 12 may indicate that multiple users are operating in the same area as the user 13, thereby adding to the overall noise level of that area. That is, the environmental condition monitored is the proximity of one user to another user. Accordingly, embodiments of the present invention contemplate monitoring one or more of these environmental conditions that relate to the intelligibility of the output of the TTS engine 38, and temporarily modifying the operational parameters of the TTS engine 38 to address the monitored condition or an adverse environment.
To make a determination that the user 13 is subject to an adverse environment, the task application 36 may look at incoming data in near real time. Based on this data, the task application 36 makes intelligent decisions on how to dynamically modify the operational parameters of the TTS engine 38. Environmental variables, or data, that may be used to determine when adverse conditions are likely to exist include high ambient or background noise levels detected at a detector, such as microphone 19. The device 12 may also determine that the user 13 is in close proximity to other users 13 (and thus subjected to higher levels of background noise or talking) by monitoring Bluetooth® signals to detect other nearby devices 12 of other users. The device 12 or headset 14 may also be configured with suitable devices or detectors to monitor an environmental condition associated with the temperature and detect a change in the ambient temperature that would indicate the user 13 has entered a freezer as noted. The processing circuitry and task application 36 may also determine that the user is executing a task that requires being in a freezer as noted. In a freezer environment, as noted, the user 13 may be exposed to higher ambient noise levels from fans and may also be wearing additional clothing that would muffle the audio output of the speakers 18 of headset 14. Thus, the task application 36 may be configured to increase the volume setting of the text-to-speech engine 38 in response to the monitored environmental conditions being associated with work in a freezer.
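One possible way of combining the monitored variables described above into a single determination is sketched below in Python. The `Conditions` record, the function `adverse_reasons`, and every threshold value are assumptions introduced for this sketch; the description does not prescribe particular thresholds or data structures.

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    """Hypothetical snapshot of monitored environmental conditions."""
    noise_db: float          # ambient noise, e.g., as measured at microphone 19
    nearby_devices: int      # other devices 12 detected, e.g., via Bluetooth signals
    temperature_c: float     # ambient temperature from a temperature sensor
    task_in_freezer: bool    # whether the assigned task is located in a freezer


def adverse_reasons(c, noise_limit_db=80.0, crowd_limit=3, freezer_temp_c=0.0):
    """Return the list of reasons the environment is treated as adverse (empty if none)."""
    reasons = []
    if c.noise_db >= noise_limit_db:
        reasons.append("noise")
    if c.nearby_devices >= crowd_limit:
        reasons.append("proximity")
    if c.task_in_freezer or c.temperature_c <= freezer_temp_c:
        reasons.append("freezer")
    return reasons


quiet = adverse_reasons(Conditions(60.0, 0, 20.0, False))
freezer = adverse_reasons(Conditions(85.0, 1, -18.0, True))
```

A non-empty result could then be used by the task application to decide which operational parameter to modify, for example increasing the volume when the freezer reason is present.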
Another monitored environmental condition might be time of day. The task application 36 may take into account the time of day in determining the likely noise levels. For example, third shift may be less noisy than first shift or certain periods of a shift.
In another embodiment of the invention, the experience level of a user might be the environmental condition that is monitored. For example, the total number of hours logged by a specific user 13 may determine the user's level of experience with a text-to-speech engine, with a specific task application, or with an area of a task application (e.g., a less experienced user may require a slower setting in the text-to-speech engine). As such, the environmental condition of user experience may be checked by system 10, and used to modify the operational parameters of the TTS engine 38 for certain times or task applications 36. For example, a monitored environmental condition might include monitoring the amount of time logged by a user with a task application, part of a task application, or some other experience metric. The system 10 tracks such experience as a user works.
In accordance with another embodiment of the invention, an environmental condition, such as the number of users in a particular work space or area, may affect the operational parameters of the TTS engine 38. System level data of system 10 indicating that multiple users 13 are being sent to the same location or area may also be utilized as a monitored environmental condition to provide an indication that the user 13 is in close proximity to other users 13. Accordingly, an operational parameter such as speed or volume may be adjusted. Likewise, system data indicating that the user 13 is in a location that is known to be noisy as noted (e.g., the user responds to a prompt indicating they are in aisle 5, which is a known noisy location) may be used as a monitored environmental condition to adjust the text-to-speech operational parameters. As noted above, other location or area based information, such as if the user is making a pick in a freezer where they may be wearing a hat or other protective equipment that muffles the output of the headset speakers 18 may be a monitored environmental condition, and may also trigger the task application 36 to increase the volume setting or reduce the speed and/or pitch settings of the text-to-speech engine 38, for example.
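As a non-limiting example, a speed setting derived from logged experience might be computed as in the following Python sketch. The linear mapping, the function name `speed_for_experience`, and the numeric values are assumptions made for illustration; the description above does not specify how experience is converted into a setting.

```python
def speed_for_experience(hours_logged, novice_speed=1.0, expert_speed=2.0,
                         hours_to_expert=200.0):
    """Interpolate a TTS speed setting from the hours a user has logged.

    Users with no logged hours get `novice_speed`; users at or beyond
    `hours_to_expert` get `expert_speed` (all values are illustrative).
    """
    fraction = min(max(hours_logged / hours_to_expert, 0.0), 1.0)
    return novice_speed + fraction * (expert_speed - novice_speed)


new_user = speed_for_experience(0)
mid_user = speed_for_experience(100)
veteran = speed_for_experience(1000)
```

The same kind of mapping could be keyed to the time logged with a particular task application or a part of a task application rather than to total hours.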
It should be further understood that there are many other monitored environmental conditions or variables or reasons why it may be desirable to alter the operational parameters of the text-to-speech engine 38 in response to a message, command, or prompt. In one embodiment, an environmental condition that is monitored is the length of the message or prompt being converted by the text-to-speech engine. Another is the language of the message or prompt. Still another environmental condition might be the frequency with which a message or prompt is used by a task application, which indicates how frequently a user has dealt with the message/prompt. Additional examples of speech prompts or messages that may be flagged for improved intelligibility include messages that are over a certain length or syllable count, messages that are in a language that is non-native to the text-to-speech engine 38 or user 13, important system messages, and commands that are generated when the user 13 requests help or enters an area of the task application 36 that is not commonly used by that user, so that the user may get messages that they have not heard with great frequency.
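The prompt-level conditions listed above could be checked with a small helper such as the one below. This is a sketch only: the `Prompt` fields, the word-count threshold (used here in place of a syllable count), and the familiarity threshold are hypothetical values, not ones given in the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prompt:
    """Hypothetical description of a prompt handed to the TTS engine."""
    text: str
    language: str = "en"
    important: bool = False   # e.g., an important system message
    is_help: bool = False     # generated because the user requested help
    times_heard: int = 0      # how often this user has heard the prompt


def intelligibility_flags(prompt: Prompt,
                          engine_language: str = "en",
                          max_words: int = 12,
                          min_times_heard: int = 5) -> List[str]:
    """Return the reasons, if any, to speak this prompt more clearly."""
    reasons = []
    if len(prompt.text.split()) > max_words:
        reasons.append("long")
    if prompt.language != engine_language:
        reasons.append("non_native_language")
    if prompt.important:
        reasons.append("important")
    if prompt.is_help:
        reasons.append("help")
    if prompt.times_heard < min_times_heard:
        reasons.append("unfamiliar")
    return reasons
```

An empty list means the prompt can be spoken with the user-selected settings; any entry could be used to trigger a temporary parameter change.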
Referring now to
If the task application 36 makes a determination that the prompt contains a non-native word or phrase (e.g., “Boeuf Bourguignon”) (“Yes” branch of decision block 114), the task application 36 proceeds to block 120. In block 120, the operational parameters of the text-to-speech engine 38 are modified to speak that section of the phrase by changing the language setting. The task application 36 then proceeds to block 122 where the prompt or section of the prompt is played using a text-to-speech engine library or database modified or optimized for the language of the non-native word or phrase. The task application 36 then proceeds to block 124. In block 124, the language setting of the text-to-speech engine 38 is restored to its previous or pre-altered setting (e.g., changed from French back to English) before proceeding to block 98 where the task application 36 waits for a user response in the normal manner.
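The change, play, and restore sequence of blocks 120, 122, and 124 resembles a scoped override. A minimal sketch follows, assuming a stand-in `FakeTtsEngine` class and a prompt that has already been split into (text, language) segments. Neither the class nor the segment format is taken from the specification or from any real TTS library.

```python
from contextlib import contextmanager


class FakeTtsEngine:
    """Stand-in engine that records what would be spoken and in which language."""

    def __init__(self, language="en"):
        self.language = language
        self.spoken = []

    def speak(self, text):
        self.spoken.append((self.language, text))


@contextmanager
def temporary_language(engine, language):
    """Switch the language setting, then restore it even if playback fails."""
    previous = engine.language
    engine.language = language          # block 120: change the setting
    try:
        yield engine                    # block 122: play the section
    finally:
        engine.language = previous      # block 124: restore the setting


def play_prompt(engine, segments):
    """Speak each (text, language) segment; a language of None means 'native'."""
    for text, language in segments:
        if language is None or language == engine.language:
            engine.speak(text)
        else:
            with temporary_language(engine, language):
                engine.speak(text)
```

For example, `play_prompt(FakeTtsEngine(), [("Pick two cases of", None), ("Boeuf Bourguignon", "fr")])` speaks only the second segment with the French setting and leaves the engine in English afterwards.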
In some cases, the monitored environmental condition may be a part or section of the speech prompt or utterance that may be unintelligible or difficult to understand with the user-selected TTS operational settings for some reason other than the language. A portion may also need to be emphasized because the portion is important. When this occurs, the operational settings of the TTS engine 38 may only require adjustment during playback of a single word or subset of the speech prompt. To this end, the task application 36 may check to see if a portion of the phrase is to be emphasized. So, as illustrated in
The present invention and voice engine 37 may thereby improve the user experience by allowing the processing circuitry and task applications 36 to dynamically adjust text-to-speech operational parameters in response to specific monitored environmental conditions or variables, including working conditions, system events, and user input. The intelligibility of critical spoken data may thereby be improved in the context in which it is given. The invention thus provides a powerful tool that allows task application developers to use system and context aware environmental conditions and variables within speech-based tasks to set or modify text-to-speech operational parameters and characteristics. These modified text-to-speech operational parameters and characteristics may dynamically optimize the user experience while still allowing the user to select their original or preferable TTS operational parameters.
A person having ordinary skill in the art will recognize that the environments and specific examples illustrated in
Furthermore, while specific operational parameters are noted with respect to the monitored environmental conditions and variables of the examples herein, other operational parameters may also be modified as necessary to increase intelligibility of the output of a TTS engine. For example, operational parameters, such as pitch or speed, may also be adjusted when volume is adjusted. Or, if the speed has slowed down, the volume may be raised. Accordingly, the present invention is not limited to the number of parameters that may be modified or the specific ways in which the operational parameters of the TTS engine may be modified temporarily based on monitored environmental conditions.
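The idea of adjusting one parameter together with another, such as raising the volume when the speed is slowed, might look like the sketch below. The 0 to 10 integer scale, the `TtsSettings` class, and the one-for-one coupling between speed and volume are assumptions made purely for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TtsSettings:
    """Hypothetical settings on an assumed 0..10 integer scale."""
    speed: int = 5
    volume: int = 5
    pitch: int = 5


def clamp(value: int, low: int = 0, high: int = 10) -> int:
    """Keep a setting within the assumed valid range."""
    return max(low, min(high, value))


def slow_down(settings: TtsSettings, steps: int = 1,
              raise_volume: bool = True) -> TtsSettings:
    """Slow the speech and, optionally, raise the volume by the same amount."""
    new_speed = clamp(settings.speed - steps)
    new_volume = clamp(settings.volume + steps) if raise_volume else settings.volume
    return replace(settings, speed=new_speed, volume=new_volume)
```

Returning a modified copy keeps the change temporary: the caller can go back to the original settings object once the monitored condition no longer applies.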
Thus, a person having skill in the art will recognize that other alternative hardware and/or software environments may be used without departing from the scope of the invention. For example, a person having ordinary skill in the art will appreciate that the device 12 may include more or fewer applications disposed therein. Furthermore, as noted, the device 12 could be a mobile device or stationary device as long as the user can be mobile and still interface with the device. As such, other alternative hardware and software environments may be used without departing from the scope of embodiments of the invention. Still further, the functions and steps described with respect to the task application 36 may be performed by or distributed among other applications, such as voice engine 37, text-to-speech engine 38, speech recognition engine 40, and/or other applications not shown. Moreover, a person having ordinary skill in the art will appreciate that the terminology used to describe various pieces of data, task messages, task instructions, voice dialogs, speech output, speech input, and machine readable input is merely used for purposes of differentiation and is not intended to be limiting.
The routines executed to implement the embodiments of the invention, whether implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions executed by one or more computing systems are referred to herein as a “sequence of operations”, a “program product”, or, more simply, “program code”. The program code typically comprises one or more instructions that are resident at various times in various memory and storage devices in a computing system (e.g., the device 12 and/or central computer 21), and that, when read and executed by one or more processors of the computing system, cause that computing system to perform the steps necessary to execute steps, elements, and/or blocks embodying the various aspects of embodiments of the invention.
While embodiments of the invention have been described in the context of fully functioning computing systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally regardless of the particular type of computer readable media or other form used to actually carry out the distribution. Examples of computer readable media include but are not limited to physical and tangible recordable type media such as volatile and nonvolatile memory devices, floppy and other removable disks, hard disk drives, and optical disks (e.g., CD-ROMs, DVDs, Blu-Ray disks, etc.), among others. Other forms might include remote hosted services, cloud-based offerings, software-as-a-service (SaaS), and other forms of distribution.
While the present invention has been illustrated by a description of the various embodiments and the examples, and while these embodiments have been described in considerable detail, it is not the intention of the applicants to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art.
As such, the invention in its broader aspects is not limited to the specific details, apparatuses, and methods shown and described herein. A person having ordinary skill in the art will appreciate that any of the blocks of the above flowcharts may be deleted, augmented, made to be simultaneous with another, combined, looped, or be otherwise altered in accordance with the principles of the embodiments of the invention. Accordingly, departures may be made from such details without departing from the scope of applicants' general inventive concept.
The present application claims the benefit of U.S. patent application Ser. No. 14/561,648 for Systems and Methods for Dynamically Improving User Intelligibility of Synthesized Speech in a Work Environment filed Dec. 5, 2014 (and published Mar. 26, 2015 as U.S. Patent Publication No. 2015/0088522), now U.S. Pat. No. 9,697,818, which claims the benefit of U.S. patent application Ser. No. 13/474,921 for Systems and Methods for Dynamically Improving User Intelligibility of Synthesized Speech in a Work Environment filed May 18, 2012 (and published Nov. 22, 2012 as U.S. Patent Application Publication No. 2012/0296654), now U.S. Pat. No. 8,914,290, which claims the benefit of U.S. Patent Application No. 61/488,587 for Systems and Methods for Dynamically Improving User Intelligibility of Synthesized Speech in a Work Environment filed May 20, 2011. Each of the foregoing patent applications, patent publications, and patents is hereby incorporated by reference in its entirety.
Number | Date | Country |
---|---|---|
20180018955 A1 | Jan 2018 | US |

Number | Date | Country |
---|---|---|
61488587 | May 2011 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 14561648 | Dec 2014 | US |
Child | 15635326 | | US |
Parent | 13474921 | May 2012 | US |
Child | 14561648 | | US |