The present invention relates generally to the field of natural language processing and machine learning, and more particularly to applying these technologies to provide, in an automated manner, pronunciation help for speakers of multiple languages.
Some text such as a word, a phrase, and/or a sentence may be unfamiliar to a reader who is reading the text, e.g., in a document. It may not be apparent to the reader how to pronounce some or all of the text. This challenge is especially heightened for those who are not expert speakers of a language, for example, for those who are learning a second language or who are non-native speakers of a language. A worker might be uncertain how to pronounce the name of a colleague when the worker meets that colleague in an online meeting. A reader may be participating in a technical discussion and come across a technical term which the reader has never before seen or heard. Travelers may especially experience these pronunciation challenges. Proper names often create these types of pronunciation challenges because of the wide variety of possible pronunciations for proper names. Some languages include sounds which are not used in other languages, which intensifies these pronunciation challenges. Pronunciation habits also differ, sometimes slightly and sometimes substantially, between languages.
Among the existing solutions, international phonetic alphabets and phonics are sometimes helpful to provide a standard pronunciation of a new word. Some people are, however, unfamiliar with international phonetic alphabets and phonics.
Text-to-audio technology also helps with pronunciation challenges in some instances. But a speaker inexperienced with a particular language might find a text-to-audio computer pronunciation of text strange and difficult to follow the first time. The speaker might not catch subtleties of the correct pronunciation. Sometimes, the setting is not appropriate for a reader to hear a computer audio pronunciation of text. In some instances it would be helpful for the reader to see the proper pronunciation using syllables from their primary language and/or from a language with which they have more experience. Many people have the most confidence with a primary language, e.g., with their native language, and prefer tips provided in their primary language over phonics or the international phonetic alphabet.
KR 10-1990021 B1 relates to a device and a method for displaying a foreign language and a mother tongue by using English phonetic symbols. The device includes a conversion server which separates foreign language words into phonemes by using predetermined phonetic symbols and converts the part of the separated foreign language phonemes corresponding to a syllable of the foreign language word into a mother tongue phoneme of mother tongue consonants and vowels in accordance with a predetermined foreign language pronunciation rule.
“Incorporating Pronunciation Variation into Different Strategies of Term Transliteration” by Kuo et al. discloses that term transliteration addresses the problem of converting terms in one language into their phonetic equivalents in the other language via spoken form. Kuo et al. proposed several models, which take pronunciation variation into consideration, for term transliteration. The models describe transliteration from various viewpoints and utilize the relationships trained from extracted transliterated-term pairs.
The prior art has the disadvantage of requiring system trainers to have specific bilingual proficiency to extract transliterated-term pairs and/or to establish predetermined foreign language pronunciation rules.
According to one exemplary embodiment, a method for syllable-based pronunciation help is provided. An input text in a first language may be received. A selection of a target language that is different from the first language may be received. From the target language, syllables with a pronunciation most closely matching a pronunciation of the input text in the first language are obtained. The obtaining is based on a comparison of one or more spectrograms for the input text with one or more spectrograms for text of the target language. The obtained syllables in the target language are presented. A computer system and computer program product corresponding to the above method are also disclosed herein.
With these embodiments, automated pronunciation help may be achieved which provides the help using a preferred language of the person who is requesting the help. An automated pronunciation help system generates pronunciation tips between multiple languages. Different languages may be evaluated and interpreted even on the fly to provide translation and pronunciation help. This help may be provided without needing to rely on finding individuals with bilingual proficiency for many different language pair combinations. This help may be provided without needing to generate huge numbers of inter-language pronunciation pairs through brute force listening, evaluation, and recording of those inter-language pronunciation pairs. Automated pronunciation help is provided with pronunciation tips on a syllable-based level.
In some additional embodiments, the input text may be separated into syllables in the first language. The syllables in the first language are used to generate the one or more spectrograms for the input text. A time to pronounce each of the syllables in the first language may be calculated. Points of zero amplitude in an audio waveform generated via pronouncing the input text may be identified. An input text spectrogram for the input text may be divided into a spectrogram per syllable of the input text. The dividing may be based on the calculated time and/or on the identified points of zero amplitude.
In this way, pronunciation help may be achieved in a more precise automated manner by allowing comparison, on a syllable-based level, of various texts and their spectrogram-producing pronunciations. Pronunciation help may be achieved in an automated manner without requiring a person involved in the training to have expert knowledge in any of the languages for which the program will be trained.
In some additional embodiments, an audio waveform of the pronunciation of the input text in the first language may be recorded. The one or more spectrograms for the input text may be generated based on the audio waveform.
In this way, input information may be produced and pre-processed for better usage in a machine learning model which may be a part of or used by the program for syllable-based pronunciation help. By preparing the data for use with a machine learning model, the breadth of possible pronunciation tips may be exponentially expanded as compared to tips that could be generated based on brute manual recognition of similar-sounding syllables between languages.
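As a non-limiting illustration of generating a spectrogram from a recorded audio waveform as described above, the following sketch uses the SciPy library; the file format, sampling parameters, and function name are illustrative choices rather than requirements of the embodiments.

import numpy as np
from scipy.io import wavfile          # reads an uncompressed WAV recording
from scipy.signal import spectrogram  # short-time Fourier analysis

def waveform_to_spectrogram(wav_path):
    """Load a recorded pronunciation and return its magnitude spectrogram."""
    sample_rate, samples = wavfile.read(wav_path)   # samples: 1-D array of amplitudes
    if samples.ndim > 1:                            # mix a stereo recording down to mono
        samples = samples.mean(axis=1)
    freqs, times, sxx = spectrogram(samples, fs=sample_rate,
                                    nperseg=512, noverlap=256)
    return freqs, times, sxx                        # sxx has shape (frequency bins, time frames)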
In some additional embodiments, embeddings from the spectrograms may be generated for the comparisons. Cosine similarity calculations may be performed to compare the similarity of sounds (via their spectrograms) between languages.
In this way, machine learning models may be utilized to enhance language comparison so that the breadth of possible pronunciation tips may be vastly expanded across multiple languages, e.g., up to two hundred languages or more.
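As a non-limiting illustration, a cosine similarity between two such spectrogram embeddings may be computed as in the following minimal NumPy sketch; the embedding vectors themselves would be produced by whichever model a given embodiment uses.

import numpy as np

def cosine_similarity(embedding_a, embedding_b):
    """Return the cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    a = np.asarray(embedding_a, dtype=float)
    b = np.asarray(embedding_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))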
In some additional embodiments, presenting of the obtained syllables in the target language may include displaying the obtained syllables in the target language on a screen of the computer along with other text in the first language and/or playing an audio recording of the syllables in the target language.
In this manner, the computer operating this program may become an enhanced teleprompter to help a user who is giving a presentation. The pronunciation help may be given in a multi-sensory manner or in a sensory manner that is varied depending on the preferences or needs of the user.
In some additional embodiments, an indication of the first language may be received via a selection of the first language and/or via machine learning analysis of text being displayed on the computer. The input text may be received via a selection of a portion of text that is displayed on a screen of the computer. The selection of the portion of the text may be made via click-and-drag of a text box over the input text on the screen.
In this manner, the program facilitates nimble implementation so that a user and the program may quickly provide and receive input and allow the automated abilities of the program to perform syllable-based translation for pronunciation help. This nimble implementation is helpful for a user who is requesting the syllable-based pronunciation help during a presentation when a quick computing determination and response will enhance the presentation.
In at least some additional embodiments, the obtaining may be performed via a first machine learning model that is trained via a second machine learning model. For the training, the second machine learning model may analyze embeddings representing the one or more spectrograms for the input text and the one or more spectrograms for text of the target language. Additionally and/or alternatively, an autoencoder may be used to train the first machine learning model. The autoencoder converts the one or more spectrograms for the input text and the one or more spectrograms for text of the target language into respective tokens. Additionally and/or alternatively, for the training, the second machine learning model may analyze a combination of tokens representing textual syllables from the input text and tokens representing the one or more spectrograms for the input text.
In this manner, the program implements high-powered machine learning models to sift through large amounts of textual and audio data to quickly determine an appropriate cross-language pronunciation suggestion for helping enhance the clarity of speech of a presenter.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale as the illustrations are for clarity in facilitating one skilled in the art in understanding the invention in conjunction with the detailed description. In the drawings:
Detailed embodiments of the claimed structures and methods are disclosed herein; however, it can be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods that may be embodied in various forms. This invention may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of this invention to those skilled in the art. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments.
The following described exemplary embodiments provide a system, a method, and a computer program product for providing automated pronunciation help, e.g., for a speaker who is reading a language with which the speaker is inexperienced. The present embodiments provide automated pronunciation help which is capable of generating pronunciation tips in a different language compared to the language being read by a user. The present embodiments provide an automated pronunciation help system which has the capacity to interpret and evaluate different languages on the fly in order to provide translation and pronunciation help. The present embodiments provide pronunciation help without needing to perform, by a person, brute force listening, evaluation, and recording of inter-language pronunciation pairs to find or generate huge numbers of inter-language pronunciation pairs. The present embodiments provide an automated pronunciation guide which provides pronunciation tips on a syllable-based level. The present embodiments implement principles of machine learning in order to enhance automated pronunciation help across different languages.
Various aspects of the present disclosure are described by narrative text, flowcharts, block diagrams of computer systems and/or block diagrams of the machine logic included in computer program product (CPP) embodiments. With respect to any flowcharts, depending upon the technology involved, the operations can be performed in a different order than what is shown in a given flowchart. For example, again depending upon the technology involved, two operations shown in successive flowchart blocks may be performed in reverse order, as a single integrated step, concurrently, or in a manner at least partially overlapping in time.
A computer program product embodiment (“CPP embodiment” or “CPP”) is a term used in the present disclosure to describe any set of one, or more, storage media (also called “mediums”) collectively included in a set of one, or more, storage devices that collectively include machine readable code corresponding to instructions and/or data for performing computer operations specified in a given CPP claim. A “storage device” is any tangible device that can retain and store instructions for use by a computer processor. Without limitation, the computer readable storage medium may be an electronic storage medium, a magnetic storage medium, an optical storage medium, an electromagnetic storage medium, a semiconductor storage medium, a mechanical storage medium, or any suitable combination of the foregoing. Some known types of storage devices that include these mediums include: diskette, hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or Flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disk (DVD), memory stick, floppy disk, mechanically encoded device (such as punch cards or pits/lands formed in a major surface of a disc) or any suitable combination of the foregoing. A computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide, light pulses passing through a fiber optic cable, electrical signals communicated through a wire, and/or other transmission media. As will be understood by those of skill in the art, data is typically moved at some occasional points in time during normal operations of a storage device, such as during access, de-fragmentation or garbage collection, but this does not render the storage device as transitory because the data is not transitory while it is stored.
Computing environment 100 contains an example of an environment for the execution of at least some of the computer code involved in performing the inventive methods, such as syllable-based text conversion 116. In addition to syllable-based text conversion 116, computing environment 100 includes, for example, computer 101, wide area network (WAN) 102, end user device (EUD) 103, remote server 104, public cloud 105, and private cloud 106. In this embodiment, computer 101 includes processor set 110 (including processing circuitry 120 and cache 121), communication fabric 111, volatile memory 112, persistent storage 113 (including operating system 122 and syllable-based text conversion 116, as identified above), peripheral device set 114 (including user interface (UI) device set 123, storage 124, and Internet of Things (IoT) sensor set 125), and network module 115. Remote server 104 includes remote database 130. Public cloud 105 includes gateway 140, cloud orchestration module 141, host physical machine set 142, virtual machine set 143, and container set 144.
COMPUTER 101 may take the form of a desktop computer, laptop computer, tablet computer, smart phone, smart watch or other wearable computer, mainframe computer, quantum computer or any other form of computer or mobile device now known or to be developed in the future that is capable of running a program, accessing a network or querying a database, such as remote database 130. As is well understood in the art of computer technology, and depending upon the technology, performance of a computer-implemented method may be distributed among multiple computers and/or between multiple locations. On the other hand, in this presentation of computing environment 100, detailed discussion is focused on a single computer, specifically computer 101, to keep the presentation as simple as possible. Computer 101 may be located in a cloud, even though it is not shown in a cloud in
PROCESSOR SET 110 includes one, or more, computer processors of any type now known or to be developed in the future. Processing circuitry 120 may be distributed over multiple packages, for example, multiple, coordinated integrated circuit chips. Processing circuitry 120 may implement multiple processor threads and/or multiple processor cores. Cache 121 is memory that is located in the processor chip package(s) and is typically used for data or code that should be available for rapid access by the threads or cores running on processor set 110. Cache memories are typically organized into multiple levels depending upon relative proximity to the processing circuitry. Alternatively, some, or all, of the cache for the processor set may be located “off chip.” In some computing environments, processor set 110 may be designed for working with qubits and performing quantum computing.
Computer readable program instructions are typically loaded onto computer 101 to cause a series of operational steps to be performed by processor set 110 of computer 101 and thereby effect a computer-implemented method, such that the instructions thus executed will instantiate the methods specified in flowcharts and/or narrative descriptions of computer-implemented methods included in this document (collectively referred to as “the inventive methods”). These computer readable program instructions are stored in various types of computer readable storage media, such as cache 121 and the other storage media discussed below. The program instructions, and associated data, are accessed by processor set 110 to control and direct performance of the inventive methods. In computing environment 100, at least some of the instructions for performing the inventive methods may be stored in syllable-based text conversion 116 in persistent storage 113.
COMMUNICATION FABRIC 111 is the signal conduction path that allows the various components of computer 101 to communicate with each other. Typically, this fabric is made of switches and electrically conductive paths, such as the switches and electrically conductive paths that make up busses, bridges, physical input/output ports and the like. Other types of signal communication paths may be used, such as fiber optic communication paths and/or wireless communication paths.
VOLATILE MEMORY 112 is any type of volatile memory now known or to be developed in the future. Examples include dynamic type random access memory (RAM) or static type RAM. Typically, volatile memory 112 is characterized by random access, but this is not required unless affirmatively indicated. In computer 101, the volatile memory 112 is located in a single package and is internal to computer 101, but, alternatively or additionally, the volatile memory may be distributed over multiple packages and/or located externally with respect to computer 101.
PERSISTENT STORAGE 113 is any form of non-volatile storage for computers that is now known or to be developed in the future. The non-volatility of this storage means that the stored data is maintained regardless of whether power is being supplied to computer 101 and/or directly to persistent storage 113. Persistent storage 113 may be a read only memory (ROM), but typically at least a portion of the persistent storage allows writing of data, deletion of data and re-writing of data. Some familiar forms of persistent storage include magnetic disks and solid state storage devices. Operating system 122 may take several forms, such as various known proprietary operating systems or open source Portable Operating System Interface-type operating systems that employ a kernel. The code included in syllable-based text conversion 116 typically includes at least some of the computer code involved in performing the inventive methods.
PERIPHERAL DEVICE SET 114 includes the set of peripheral devices of computer 101. Data communication connections between the peripheral devices and the other components of computer 101 may be implemented in various ways, such as Bluetooth connections, Near-Field Communication (NFC) connections, connections made by cables (such as universal serial bus (USB) type cables), insertion-type connections (for example, secure digital (SD) card), connections made through local area communication networks and even connections made through wide area networks such as the internet. In various embodiments, UI device set 123 may include components such as a display screen, speaker, microphone, wearable devices (such as goggles and smart watches), keyboard, mouse, printer, touchpad, game controllers, and haptic devices. Storage 124 is external storage, such as an external hard drive, or insertable storage, such as an SD card. Storage 124 may be persistent and/or volatile. In some embodiments, storage 124 may take the form of a quantum computing storage device for storing data in the form of qubits. In embodiments where computer 101 is required to have a large amount of storage (for example, where computer 101 locally stores and manages a large database) then this storage may be provided by peripheral storage devices designed for storing very large amounts of data, such as a storage area network (SAN) that is shared by multiple, geographically distributed computers. IoT sensor set 125 is made up of sensors that can be used in Internet of Things applications. For example, one sensor may be a thermometer and another sensor may be a motion detector.
NETWORK MODULE 115 is the collection of computer software, hardware, and firmware that allows computer 101 to communicate with other computers through WAN 102. Network module 115 may include hardware, such as modems or Wi-Fi signal transceivers, software for packetizing and/or de-packetizing data for communication network transmission, and/or web browser software for communicating data over the internet. In some embodiments, network control functions and network forwarding functions of network module 115 are performed on the same physical hardware device. In other embodiments (for example, embodiments that utilize software-defined networking (SDN)), the control functions and the forwarding functions of network module 115 are performed on physically separate devices, such that the control functions manage several different network hardware devices. Computer readable program instructions for performing the inventive methods can typically be downloaded to computer 101 from an external computer or external storage device through a network adapter card or network interface included in network module 115.
WAN 102 is any wide area network (for example, the internet) capable of communicating computer data over non-local distances by any technology for communicating computer data, now known or to be developed in the future. In some embodiments, the WAN 102 may be replaced and/or supplemented by local area networks (LANs) designed to communicate data between devices located in a local area, such as a Wi-Fi network. The WAN and/or LANs typically include computer hardware such as copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and edge servers.
END USER DEVICE (EUD) 103 is any computer system that is used and controlled by an end user (for example, a customer of an enterprise that operates computer 101) and may take any of the forms discussed above in connection with computer 101. EUD 103 typically receives helpful and useful data from the operations of computer 101. For example, in a hypothetical case where computer 101 is designed to provide a recommendation to an end user, this recommendation would typically be communicated from network module 115 of computer 101 through WAN 102 to EUD 103. In this way, EUD 103 can display, or otherwise present, the recommendation to an end user. In some embodiments, EUD 103 may be a client device, such as thin client, heavy client, mainframe computer, desktop computer and so on.
REMOTE SERVER 104 is any computer system that serves at least some data and/or functionality to computer 101. Remote server 104 may be controlled and used by the same entity that operates computer 101. Remote server 104 represents the machine(s) that collect and store helpful and useful data for use by other computers, such as computer 101. For example, in a hypothetical case where computer 101 is designed and programmed to provide a recommendation based on historical data, then this historical data may be provided to computer 101 from remote database 130 of remote server 104.
PUBLIC CLOUD 105 is any computer system available for use by multiple entities that provides on-demand availability of computer system resources and/or other computer capabilities, especially data storage (cloud storage) and computing power, without direct active management by the user. Cloud computing typically leverages sharing of resources to achieve coherence and economies of scale. The direct and active management of the computing resources of public cloud 105 is performed by the computer hardware and/or software of cloud orchestration module 141. The computing resources provided by public cloud 105 are typically implemented by virtual computing environments that run on various computers making up the computers of host physical machine set 142, which is the universe of physical computers in and/or available to public cloud 105. The virtual computing environments (VCEs) typically take the form of virtual machines from virtual machine set 143 and/or containers from container set 144. It is understood that these VCEs may be stored as images and may be transferred among and between the various physical machine hosts, either as images or after instantiation of the VCE. Cloud orchestration module 141 manages the transfer and storage of images, deploys new instantiations of VCEs and manages active instantiations of VCE deployments. Gateway 140 is the collection of computer software, hardware, and firmware that allows public cloud 105 to communicate through WAN 102.
Some further explanation of virtualized computing environments (VCEs) will now be provided. VCEs can be stored as “images.” A new active instance of the VCE can be instantiated from the image. Two familiar types of VCEs are virtual machines and containers. A container is a VCE that uses operating-system-level virtualization. This refers to an operating system feature in which the kernel allows the existence of multiple isolated user-space instances, called containers. These isolated user-space instances typically behave as real computers from the point of view of programs running in them. A computer program running on an ordinary operating system can utilize all resources of that computer, such as connected devices, files and folders, network shares, CPU power, and quantifiable hardware capabilities. However, programs running inside a container can only use the contents of the container and devices assigned to the container, a feature which is known as containerization.
PRIVATE CLOUD 106 is similar to public cloud 105, except that the computing resources are only available for use by a single enterprise. While private cloud 106 is depicted as being in communication with WAN 102, in other embodiments a private cloud may be disconnected from the internet entirely and only accessible through a local/private network. A hybrid cloud is a composition of multiple clouds of different types (for example, private, community or public cloud types), often respectively implemented by different vendors. Each of the multiple clouds remains a separate and discrete entity, but the larger hybrid cloud architecture is bound together by standardized or proprietary technology that enables orchestration, management, and/or data/application portability between the multiple constituent clouds. In this embodiment, public cloud 105 and private cloud 106 are both part of a larger hybrid cloud.
The user 202 may recognize, e.g., in advance, that one or more displayed words to be read may have a difficult pronunciation, e.g., due to the unusual nature of that displayed word and/or due to the inexperience of the user 202 in the particular language. For example, the user 202 may be from China and speak English as a second language. For many such speakers, unusual terms provide extra difficulty for the speaker to pronounce. This user 202 sees the word “Marylebone” and is uncertain how to pronounce this word. An incorrect or unclear pronunciation may reduce the clarity and/or understandability of the presentation that is being provided by the user 202 to one or more end users, e.g., to respective end users at the first, second, and third end user devices 103a, 103b, 103c.
The user 202 may seek pronunciation help by invoking the syllable-based text conversion 116 that is a computer program stored on the client computer 101. The user 202 may actuate a mouse cursor 208 to generate a first text box 210 to provide an input text to the syllable-based text conversion 116. The user may actuate a first button of the mouse or other input device to signal the desired start of a text box. The user may then move the mouse or other input device to enlarge the text box and shift the text box to cover one or more of the words that are being displayed on the display screen 204. This movement may occur as a click-and-drag performed via the user 202 actuating the mouse or other input device. In the present example, the user 202 has generated and moved the first text box 210 to encompass and/or surround the word “Marylebone” and no other word of the other text that is currently being displayed on the display screen 204. When the first text box 210 is in the desired position surrounding the text for which the user 202 seeks pronunciation help, the user 202 may actuate another input device or perform another actuation at the mouse in order to indicate a selection of the text.
The user 202 previously activated the program for syllable-based text conversion 116 on the client computer 101, e.g., via one or more actuations via the mouse, in order to enter and/or activate a stage in which mouse/input device actuation triggers generation of the text box, e.g., of the first text box 210, that may be placed around some displayed text of the display screen 204.
Besides providing the input of the selected input text, the user 202 also provides as input into the program for syllable-based text conversion 116 a target language for the text conversion. In the depicted embodiment, the user 202 selects Mandarin Chinese as the target language for the text conversion. Mandarin Chinese may be a native language of the user 202 and/or the user may speak and/or read Mandarin Chinese with native proficiency. The user 202 may use one or more input devices connected to the client computer 101 to input the target language. For example, the user 202 may actuate a mouse over a graphical user interface button of the program for syllable-based text conversion 116 to trigger selection of the target language. This actuation may trigger the generation of a target language text box into which the user 202 can type the name of the target language. Alternatively and/or additionally the actuation may trigger the display of a list of languages which the program for syllable-based text conversion 116 is capable of providing as output languages for the pronunciation help. The user 202 may use an input device such as the mouse to scroll through the list and to select one of the presented languages. The user may speak into a microphone that is connected to the client computer 101 in order to give verbal instructions for selecting the target language. The client computer 101 and/or the program may include speech-to-text transcription capabilities and other natural language processing capabilities to receive, understand, and carry out verbal instructions. The target language may also be selected by retrieving information from a user profile created by the user. Such a user profile may be created by the user and/or the program as the user registers for the program and/or downloads the program.
The program for syllable-based text conversion 116 may have various settings for selecting the target language. In one setting, the target language may be selected at the beginning of a session so that for every requested passage in the session a pronunciation help output is provided in the selected language. For example, in the depicted embodiment of
The program for syllable-based text conversion 116 also includes as input a source language. For the embodiments depicted in
In some embodiments where the user 202 is performing screen sharing as part of an online live presentation when the pronunciation help was requested, the pronunciation help converted text 212 may be redacted and/or removed from the screen view before screen content is transmitted to other computers. The screen content without the redacted/blocked portion may be transmitted from the client computer 101 of the user 202 and over the WAN 102 to be displayed on screens of the various users listening to the presentation and watching the screen sharing, e.g., on the first, second, and third end user devices 103a, 103b, 103c.
In addition to the pronunciation help converted text 212 being presented visibly with letters and/or characters, the program for syllable-based text conversion 116 may also generate the pronunciation help as audio sounds. For example, if the user 202 were wearing earphones the program for syllable-based text conversion 116 may also generate an audio presentation of the phrase “Marylebone” to assist the user 202 so that the user 202 can make a correct and/or improved pronunciation of a difficult phrase/term in his or her own voice. Such an audio presentation may be generated using a voice recording from a speaker speaking the target language or the source language. In some embodiments, the program for syllable-based text conversion 116 may allow the user 202 to select whether an audio pronunciation would be played by a source language speaker or by a target language speaker.
In step 302 of the converter training preparation process 300 shown in
In step 304 of the converter training preparation process 300 shown in
As a part of step 304 or later as a part of step 310, pre-processing of the text training corpus data may be performed. This pre-processing may include converting any non-word characters in the text into a corresponding textual form. For example, numbers, percentage data, ‘.com’, etc. may be converted in this pre-processing. The following three sentences contain examples of this pre-processing. The number “8” may be pre-processed to read “eight”. The text “37%” may be pre-processed to read “thirty-seven percent”. The text “website.com” may be pre-processed to read “web site dot com”.
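A minimal sketch of such pre-processing is shown below; the small lookup of number words and the literal handling of “website.com” are illustrative assumptions only, and a production system would rely on a full text-normalization component.

import re

# Illustrative lookup only; a real system would use a full number-to-words routine.
NUMBER_WORDS = {"8": "eight", "37": "thirty-seven"}

def normalize_text(text):
    """Spell out a few non-word tokens so that a text-to-speech module can pronounce them."""
    text = re.sub(r"(\d+)\s*%", lambda m: NUMBER_WORDS.get(m.group(1), m.group(1)) + " percent", text)
    text = re.sub(r"\b(\d+)\b", lambda m: NUMBER_WORDS.get(m.group(1), m.group(1)), text)
    text = text.replace("website.com", "web site dot com")   # '.com' and similar tokens are spelled out
    return text

# normalize_text("8") -> "eight"
# normalize_text("37%") -> "thirty-seven percent"
# normalize_text("website.com") -> "web site dot com"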
In some embodiments, some special words may not follow traditional pronunciation rules so that the text-to-speech program might not have a proper pronunciation stored for a word and may predict an incorrect pronunciation. As a supplement to the text-to-speech feature, step 304 may include accessing a special word mapping table which includes correctly-segmented syllables and the correct audio pronunciation for some special words. The mapping table may be a data structure that includes multiple data storage columns such as a first column for storing a (source) language of the special word, a second column for storing the special word itself, a third column for storing syllables of the word which were segmented by a native speaker or language expert, and a fourth column for storing an audio pronunciation clip of the special word spoken by a native speaker or a language expert.
The mapping table may in some embodiments include one or more additional columns which track common context words associated with the special word. Such context word tracking may be useful in instances when a single spelling of a word may have multiple pronunciations. For example, a first word or a first pair of words may have a first pronunciation when referring to a geographical location but have a second pronunciation when referring to a famous person. Some names may have a first pronunciation when referring to a first famous person and a second pronunciation when referring to a different second famous person. Other nouns and/or verbs in the vicinity of the particular special word within a text corpus may be identified and/or retrieved via the text/web crawling program and stored in the one or more context tracking columns. The program may query the mapping table in the first instance to check for the presence of a special word before turning to the text-to-speech module or to a word segmentation module.
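For illustration only, the special word mapping table might be represented by a structure along the following lines; the field names and the lookup helper are hypothetical, and an actual embodiment could equally use a relational database table.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpecialWordEntry:
    language: str                 # source language of the special word
    word: str                     # the special word itself
    syllables: List[str]          # syllables segmented by a native speaker or language expert
    audio_clip_path: str          # recorded pronunciation by a native speaker or language expert
    context_words: List[str] = field(default_factory=list)   # nearby nouns/verbs that disambiguate pronunciation

def lookup_special_word(table: List[SpecialWordEntry], language: str, word: str,
                        context: List[str]) -> Optional[SpecialWordEntry]:
    """Check the mapping table before falling back to text-to-speech or word segmentation."""
    candidates = [e for e in table if e.language == language and e.word.lower() == word.lower()]
    if not candidates:
        return None
    # Prefer the entry whose tracked context words overlap most with the surrounding text.
    return max(candidates, key=lambda e: len(set(e.context_words) & set(context)))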
Additionally, after initial training of the syllable-based text converter program for pronunciation help has been performed so that the program is ready, the program may regularly crawl the internet in multiple languages to look and/or listen for words such as proper nouns which are not part of the large text corpus from step 302. The crawling may especially be performed for certain languages such as English which include many words which defy usual pronunciation and syllable segmentation rules. When new words are identified in the crawling, a new entry may be generated in the above-described mapping table and a notification may be generated and sent to a language technician requesting that the technician provide an audio recording of pronunciation of the new special word and perform syllable textual segmentation of the special word or phrase.
In step 306 of the converter training preparation process 300 shown in
In step 308 of the converter training preparation process 300 shown in
In step 310 of the converter training preparation process 300 shown in
The segmentation of step 310 may also include the program analyzing the audio waveforms for each word that were recorded in step 306. This aspect may include calculating an approximate time that is required to pronounce each textual syllable. The program for performing this converter training preparation process 300 may include a single-syllable pronouncing time calculator that is a module. This calculation may be based on the known length of time for each of the vowel and consonant sounds that are within a particular syllable. For example if the word “avenger” were being analyzed and segmented in step 310, the single-syllable pronouncing time calculator may determine that the first syllable “a” takes 0.13 seconds to pronounce, that the second syllable “ven” takes 0.17 seconds to pronounce, and that the third syllable “ger” takes 0.3 seconds to pronounce. With this time calculation, the program may implement some aspects of natural language processing.
This calculation may include performing a search on the audio waveform, starting from a first textual syllable, and looking within a search range of k seconds before and after a most likely segmentation point on the time axis of the audio waveform of the current word being analyzed. The most likely segmentation point is determined by a calculated pronouncing time of the current and previous textual syllables. Within the range, the first time point where the wave line, e.g., the amplitude line, is zero is considered as the final segmentation point of the current textual syllable. This first time point may be recorded as a part of step 310.
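The per-syllable time estimate and the zero-amplitude search might be sketched as follows; the per-sound duration values (chosen only so that the “avenger” example above works out), the search window k, and the function names are illustrative assumptions rather than requirements of the embodiments.

# Illustrative per-sound durations in seconds; a real table would cover every vowel and consonant sound.
SOUND_DURATIONS = {"a": 0.13, "v": 0.05, "e": 0.07, "n": 0.05, "g": 0.08, "r": 0.15}

def syllable_pronouncing_time(syllable):
    """Estimate how long a textual syllable takes to pronounce from its component sounds."""
    return sum(SOUND_DURATIONS.get(sound, 0.08) for sound in syllable)

def segmentation_point(waveform, sample_rate, expected_time, k=0.05):
    """Find the first zero-amplitude sample within +/- k seconds of the expected boundary."""
    center = int(expected_time * sample_rate)
    lo = max(0, center - int(k * sample_rate))
    hi = min(len(waveform), center + int(k * sample_rate))
    for i in range(lo, hi):
        if waveform[i] == 0:            # the amplitude line is zero here
            return i / sample_rate      # boundary expressed in seconds
    return expected_time                # fall back to the calculated boundary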
For any language in which each word consists of one syllable, step 310 may be omitted. For these languages, each word inherently provides a syllable.
In step 312 of the converter training preparation process 300 shown in
For any language in which each word consists of one syllable, step 312 may be omitted. For these languages, each word spectrogram inherently provides a spectrogram per syllable.
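Dividing a word-level spectrogram into one spectrogram per syllable could then be done, for example, by converting the recorded boundary times into time-frame indices; this is a sketch under the assumption that the spectrogram and its time axis come from an earlier spectrogram-generation step, and the function name is illustrative.

import numpy as np

def split_spectrogram_by_syllables(sxx, times, boundaries_sec):
    """Slice a word spectrogram (frequency bins x time frames) at the recorded syllable boundaries."""
    pieces, start = [], 0
    for boundary in boundaries_sec:                   # interior boundaries between consecutive syllables, in seconds
        end = int(np.searchsorted(times, boundary))   # nearest time-frame index for this boundary
        pieces.append(sxx[:, start:end])
        start = end
    pieces.append(sxx[:, start:])                     # the remainder belongs to the final syllable
    return pieces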
In step 314 of the converter training preparation process 300 shown in
In step 316 of the converter training preparation process 300 shown in
In step 318 of the converter training preparation process 300 shown in
A sequence of textual syllable embeddings 503, the source language 504, and the target language 506 are received as inputs in order to perform the training with the group 502. Because
The training pipeline 500 includes inputting the sequence 503 of textual syllable embeddings into the encoder 508. The encoder 508 may be a machine learning model. The machine learning model may in at least some embodiments implement Long Short-Term Memory (LSTM). An LSTM is a type of recurrent neural network capable of learning order dependence in sequence prediction. The encoder 508 evaluates the sequence 503 of textual syllable embeddings to understand the order dependence of the textual syllables and their embeddings. Recurrent networks have an internal state that can represent context information. The encoder 508 maintains information about past inputs for an amount of time that is not fixed a priori, but rather depends on its weights and on the input data. The encoder 508 may include a recurrent neural network whose inputs are not fixed but rather constitute an input sequence. In this manner, the encoder 508 may be used to transform an input sequence into an output sequence while taking into account contextual information in a flexible way. As output, the encoder 508 may produce a last hidden state, and this last hidden state may be input into the syllable-based text converter 510.
Before being implemented in the training pipeline 500, in at least some embodiments the encoder 508 is trained to better understand the effect of sequential order in input sequences.
Thus, the encoder training pipeline 600 includes sending multiple input sequences of textual syllable embeddings 603 into the encoder 508. The decoder 606 may be produced by replicating the encoder 508. The encoder 508 and the decoder 606 together may constitute a deep learning model 608.
The deep learning model 608 may be trained via sequence prediction and via auto-regressive language modeling, e.g., left-to-right sequencing. An encoder-decoder architecture, e.g., an autoencoder, in at least some embodiments is an example of the deep learning model 608. The training text corpus may be broken down into numerous segments and input sequences of textual syllable embeddings 603 for such auto-regressive language modeling. The last hidden state of the encoder 508 is transmitted in an encoder transmission 604 to the decoder 606 as input to the decoder 606. The training of the deep learning model 608 may include cross entropy loss and backpropagation to update and refine the parameters of the encoder 508. The decoder 606 produces, as its output and as output of the deep learning model 608, a predicted sequence 610 of textual syllable embeddings. This predicted sequence 610 is used for the cross entropy loss and backpropagation. Backpropagation is an algorithm for training neural networks and may be used to help fit a neural network. Backpropagation may compute the gradient of the loss function with respect to the weights of the network and may be used to help train multilayer networks, including by updating weights to minimize loss. The gradient of the loss function may be determined with respect to each weight by the chain rule. The gradient may be computed one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule. Cross-entropy may be used as a loss function when optimizing neural networks that are performing classification.
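A compact PyTorch sketch of this general kind of encoder-decoder training loop is given below. The vocabulary size, dimensions, and data handling are placeholders, and the sketch shows only the overall pattern of next-syllable prediction with cross entropy loss and backpropagation rather than the exact architecture of deep learning model 608.

import torch
import torch.nn as nn

VOCAB, DIM = 10000, 256          # placeholder counts of syllable tokens and embedding size

class SyllableAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.encoder = nn.LSTM(DIM, DIM, batch_first=True)
        self.decoder = nn.LSTM(DIM, DIM, batch_first=True)   # decoder replicated from the encoder
        self.project = nn.Linear(DIM, VOCAB)                  # predicts the next syllable token

    def forward(self, tokens):
        x = self.embed(tokens)
        _, (h, c) = self.encoder(x)          # the last hidden state summarizes the input sequence
        out, _ = self.decoder(x, (h, c))     # left-to-right, auto-regressive style prediction
        return self.project(out)

model = SyllableAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def training_step(tokens):                   # tokens: (batch, sequence length) of syllable ids
    logits = model(tokens[:, :-1])           # predict each next token from the previous ones
    loss = criterion(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()                          # backpropagation updates encoder and decoder weights
    optimizer.step()
    return loss.item()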
The decoder 606 helps train the encoder 508 for later use. The decoder 606 is not needed for implementation in the training pipeline 500 and is also subsequently not needed for a prediction phase in which the trained syllable-based text converter 510 performs text conversion for pronunciation help.
The training pipeline 500 includes the last hidden state of the encoder 508 being input into the syllable-based text converter 510. The syllable-based text converter 510 also includes a machine learning model. The machine learning model in at least some embodiments implements Long Short-Term Memory (LSTM). The syllable-based text converter 510 also receives as input the target language 506. The target language 506 is input into the syllable-based text converter 510 as a one-hot encoding vector. For this inputting, the one-hot encoding vector of the target language 506 and the last hidden state of the encoder 508 may be concatenated together to form an initial hidden state for the syllable-based text converter 510. The syllable-based text converter 510 samples existing textual syllable embeddings and generates a sequence of textual syllable embeddings in the target language 506. The textual syllable embeddings in the target language 506 that are generated by the syllable-based text converter 510 may be one-hot encoding embeddings.
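Forming such an initial hidden state by concatenating the encoder's last hidden state with a one-hot vector for the target language might look like the following sketch; the sizes are illustrative, and in practice a projection layer may be needed so that the concatenated vector matches the hidden size expected by the converter LSTM.

import torch
import torch.nn.functional as F

NUM_LANGUAGES, HIDDEN = 200, 256                          # illustrative sizes only

def initial_hidden_state(encoder_last_hidden, target_language_index):
    """Concatenate the encoder's last hidden state with a one-hot target-language vector."""
    one_hot = F.one_hot(torch.tensor(target_language_index), num_classes=NUM_LANGUAGES).float()
    return torch.cat([encoder_last_hidden, one_hot], dim=-1)   # length: HIDDEN + NUM_LANGUAGES

# A linear layer such as torch.nn.Linear(HIDDEN + NUM_LANGUAGES, HIDDEN) could then map this
# concatenated vector to the hidden size of the converter LSTM.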
Before being implemented in the training pipeline 500, in at least some embodiments the syllable-based text converter 510 is pre-trained to enhance and/or teach an ability to convert the text syllables. This pre-training may include using maximum likelihood estimation (MLE) which involves defining a likelihood function for calculating the conditional probability of observing a data sample given a probability distribution and distribution parameters. The MLE approach may be used to search a space of possible distributions and parameters. The MLE may be applied for the syllable-based text converter 510 to all sequences of textual syllable embeddings in every language from all text training datasets used for training the program. Cross entropy loss and backpropagation may be used to update parameters of the syllable-based text converter 510. During this pre-training, the syllable-based text converter 510 may not take in the last hidden state from an encoder such as the encoder 508 and instead has an initial hidden state that is initialized to zero values.
Before being implemented in the training pipeline 500, in at least some embodiments the syllable-based text converter 510 is also trained to further enhance and/or teach an ability to convert the text syllables. This training may occur after the pre-training of the syllable-based text converter 510 that was described above. This training may include randomly selecting a sequence of textual syllable embeddings in a random source language from the training dataset and feeding that sequence into the encoder 508 that has been trained. Then, the syllable-based text converter 510 takes in the last hidden state of the encoder 508 and a random target language, and outputs a sequence of textual syllable embeddings in this random target language. A policy gradient such as the policy gradient 620 shown in
Below is an example of an action-value function of a sequence that may in at least some embodiments be implemented for training the syllable-based text converter 510. For the action-value function below, Q_{G_\theta}(s, a) denotes, for the converter policy G_\theta, the expected reward of generating the next target-language textual syllable embedding a = y_t from the state s = Y_{1:t-1} of syllable embeddings generated so far:

Q_{G_\theta}(s = Y_{1:t-1}, a = y_t) = \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), \quad Y_{1:T}^{n} \in MC^{G_\beta}(Y_{1:t}; N), \quad \text{for } t < T; \qquad Q_{G_\theta}(s = Y_{1:t-1}, a = y_t) = R(Y_{1:t}), \quad \text{for } t = T

where T is a predetermined maximum sequence length; Y_{1:t} is a sequence of target-language textual syllable embeddings with a size of t which is generated by the syllable-based text converter 510; MC^{G_\beta}(Y_{1:t}; N) denotes the N complete sequences sampled by the Monte Carlo search under a roll-out policy G_\beta starting from Y_{1:t}; N is the number of times the Monte Carlo search is performed; and R(\cdot) is the reward of a complete sequence, calculated from the cosine similarity between the sequence of audio syllable embeddings produced from the generated target-language syllables and the sequence of audio syllable embeddings of the source-language input.
When the syllable-based text converter 510 is trained, the maximum length of a sequence generated by the syllable-based text converter 510 is equal to the length of the input sequence of textual syllable embeddings. If there are textual syllable embeddings in a non-target language and/or if there are audio syllable embeddings, then such embeddings are removed, at the same sequence indices, from both the input and generated sequences. If all embeddings are removed from the generated sequence, then the current generated sequence may be excluded from the N samples of the Monte Carlo search. If the length of a sequence generated by the syllable-based text converter 510 is less than the length of the input sequence of textual syllable embeddings, all of those generated embeddings may be used to calculate cosine similarity. Even with this approach, the sum is still divided by the length of the input sequence of audio syllable embeddings in the source language.
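As a schematic illustration only, a policy-gradient update of this general kind (a REINFORCE-style step in which the Monte Carlo action values Q serve as per-syllable rewards) could be written as follows; the interfaces and shapes are assumptions and this is not the literal training code of the embodiments.

import torch

def policy_gradient_step(log_probs, q_values, optimizer):
    """One REINFORCE-style update: log_probs and q_values each have one entry per generated syllable."""
    # Maximize expected reward by minimizing the negative of (log probability x action value).
    loss = -(torch.stack(log_probs) * torch.tensor(q_values)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()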
When prediction is subsequently performed with the syllable-based text converter 510, the syllable-based text converter 510 may generate M different outputs and choose the generated sequence which sounds most like the source text.
The training pipeline 500 shows that the output of the syllable-based text converter 510 is a generated sequence 512 of target-language textual syllable embeddings. Thus, the output of the syllable-based text converter 510 shows the transformation performed with at least some of the present embodiments—namely that the source-language textual syllable embeddings become target-language textual syllable embeddings.
The training pipeline 500 shows that the generated sequence 512 of target-language textual syllable embeddings is input into a classifier 514. The classifier 514 may also be/include a machine learning model. The classifier 514 may implement Long Short-Term Memory (LSTM) and may classify the language category of the generated sequence 512 of target-language textual syllable embeddings. The classifier 514 may include an output layer that assigns decimal probabilities to each class in a multi-class problem, e.g., a SoftMax output layer. The classes may include each language for which the syllable-based text conversion 116 will be capable of providing pronunciation help. Therefore, if the syllable-based text conversion 116 is trained for pronunciation conversion help between two hundred different languages, then the classifier output layer produces a vector which indicates which of the two hundred different languages is the target language. The total number of dimensions of the one-hot encoding vector may be the total number of different languages for which the system is trained plus one. The plus-one addition is caused by a special language category called “Mixed language”. This vector indicating the target language constitutes the language category that is predicted via the classifier 514.
In one example, the classifier 514 may generate a vector such as (1, 0, 0, . . . , 0). The dimension with a one instead of a zero indicates which language is predicted. In some instances, a one in the first position may refer to the “Mixed language” category. A one in the second position, e.g., (0, 1, 0, . . . , 0), may refer to another language such as English. A one in the third position, e.g., (0, 0, 1, . . . , 0), may refer to another language such as Chinese. The dimension index which has the maximum value in such an output vector indicates the corresponding language (category). For example, in some embodiments the output vector from the classifier may appear as (0.001, 0.98, 0.001, . . . , 0), which may be interpreted as a one for the second position. Because the second dimension has the maximum value (0.98) in this example, the identified language category is English.
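Interpreting such an output vector amounts to taking the index of its maximum value, for example as in the following sketch; the ordering of the language categories is illustrative only.

import numpy as np

LANGUAGE_CATEGORIES = ["Mixed language", "English", "Chinese"]   # illustrative ordering only

def predicted_language(output_vector):
    """Map a classifier output vector of per-language probabilities to its language category."""
    return LANGUAGE_CATEGORIES[int(np.argmax(output_vector))]

# predicted_language([0.001, 0.98, 0.001]) -> "English"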
The classifier 514 does not receive the target language 506 directly as an input. Rather, the classifier 514 receives embeddings—namely the generated sequence 512 of target-language textual syllable embeddings. The classifier 514 then determines the target language based on the generated sequence 512 of target-language textual syllable embeddings.
Before being implemented in the training pipeline 500, in at least some embodiments the classifier 514 is trained to teach language category prediction. For this training, a sequence of textual syllable embeddings in a certain language may be randomly selected from the textual training dataset. For this sequence, the corresponding training label is the language of the selected sequence. In addition, a few textual syllable embeddings in other languages or audio syllable embeddings may be randomly inserted into a randomly-selected sequence of textual syllable embeddings in one language as an input; the corresponding training label is then the special language category called ‘Mixed language’. Cross entropy loss and backpropagation may be implemented in this training to update the parameters of the classifier 514.
The training pipeline 500 shows that multiple inputs are input from different sources into a syllable-based text-to-audio generator 518. These multiple inputs include as a first input 531 the sequence 503 of source-language textual syllable embeddings, as a second input 532 the generated sequence 512 of target-language textual syllable embeddings, as a third input 533 the source language 504, and as a fourth input 534 the target language 506. This reference to first, second, third, and fourth inputs does not refer to a sequence but instead is used for clarity and organization in this document.
The syllable-based text-to-audio generator 518 uses the first input 531 and the third input 533 to produce a corresponding sequence of audio syllable embeddings in the source language. The syllable-based text-to-audio generator 518 uses the second input 532 and the fourth input 534 to produce a corresponding sequence of audio syllable embeddings in the target language. These audio syllable embeddings for both the source and target languages may be one-hot encoding embeddings.
Before being implemented in the training pipeline 500, in at least some embodiments the syllable-based text-to-audio generator 518 is trained to receive text-to-audio generation capabilities.
As depicted in the text-to-audio generator training pipeline 700 in
In at least some embodiments, the ConvNet encoder 808, the DeConvNet decoder 822, and the codebook 814 are trained as shown in the audio encoder training pipeline 800. This training occurs via a variational autoencoder that uses vector quantization to obtain a discrete latent representation. The ConvNet encoder 808 is trained to output discrete, rather than continuous, codes. The prior is learnt rather than static. Therefore, this training is comparable to VQ-VAE training. The training is performed with all of the syllable-based spectrograms from the training dataset, namely for those syllable-based spectrograms that were produced in step 312 of the converter training preparation process 300 that was depicted in
In the audio encoder training pipeline 800, a syllable-based spectrogram 802 with the size of 257×T×3 (whereby T indicates the seconds to pronounce a certain syllable multiplied by 10) is first preprocessed via the preprocessing module 804 to fit the size of 256×256×3. This preprocessing results in a preprocessed spectrogram 806 being output from the preprocessing module 804. Then, the preprocessed spectrogram 806 is input into the ConvNet encoder 808 and is encoded via the ConvNet encoder 808 to produce a vector 810 which may in some embodiments be a 1024-dimensional vector. After vector 810 is produced, stage 812 is performed. In stages 812 and 818, nearest-neighbor mapping is performed by accessing other vectors in the codebook 814 to find a codebook vector 820 from the codebook 814 that is most similar to the vector 810. After the codebook vector 820 is identified, an index value (ranging from N to N+K) of the identified codebook vector 820 in the codebook 814 is obtained. The index value from the codebook 814 may be converted to a one-hot encoding embedding as the output audio syllable token 816. When subsequently implemented in the text-to-audio generator training pipeline 700 shown in
K is a hyper-parameter and is equal to the total number of unduplicated audio syllables. N is also a hyper-parameter and is equal to the total number of all textual syllable embeddings (including several special embeddings, such as [BOT]). Pronouncing a syllable takes a time which may be measured in seconds. The length of time depends on the particular syllable that is being pronounced. Suppose a syllable needs (0.1×X) seconds. Then, the size of the spectrogram for this syllable is 257×(X multiplied by 10)×3. A duration of 2.5 seconds (that is, 0.1×X, where X=25) may be taken as the maximum possible time (T) for pronouncing a single syllable. A single-syllable spectrogram with a smaller width may be placed in the center of an imaginary spectrogram with a size of 256×256×3, and the blanks on both sides are filled in with black color. Audio-syllable tokens are later fed into the transformer 730 shown in
In one example, the index is exactly N, so the corresponding audio-syllable token has a value of 1 at its N-th dimension and a value of 0 at every other dimension. The stage 818 represents fetching from the codebook 814 a codebook vector 820 that corresponds to the audio syllable token 816. Continuing the example, because the N-th dimension of the audio syllable token 816 has a value of "1" and the values of the other dimensions of this audio syllable token 816 are all "0", this audio syllable token 816 corresponds to the index "N" in the codebook 814. This index is for the first embedding stored in the codebook 814, because the indices of the codebook 814 range from N to N+K and do not start at the value 0. Thus, the first embedding is fetched from the storage of the codebook 814 as the codebook vector 820 and is fed to the DeConvNet decoder 822 to produce the reconstructed spectrogram 824.
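As a minimal sketch of the index-to-token correspondence just described, the following Python fragment converts a codebook index into a one-hot audio-syllable token and fetches the stored vector back from the codebook. The sizes N, K, and the vector dimension are illustrative assumptions, and the boundary conventions for the index range follow the description above only loosely.

```python
import numpy as np

N, K, DIM = 3000, 500, 1024        # illustrative sizes, not the actual hyper-parameters
codebook = np.random.randn(K, DIM).astype(np.float32)   # rows stand for indices N, N+1, ...

def index_to_token(index):
    """One-hot audio-syllable token whose single 1 sits at the codebook index."""
    token = np.zeros(N + K, dtype=np.float32)
    token[index] = 1.0
    return token

def fetch_codebook_vector(token):
    """Fetch the stored codebook vector corresponding to a one-hot audio-syllable token."""
    index = int(np.argmax(token))   # position of the single "1"
    return codebook[index - N]      # the underlying storage itself starts at 0

token = index_to_token(N)           # index exactly N -> the first embedding in the codebook
vector = fetch_codebook_vector(token)
print(token.shape, vector.shape)    # (3500,), (1024,)
```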
During forward computation, the codebook vector 820 is passed to the DeConvNet decoder 822. During the backwards pass of the training, the gradient ∇zL is passed unchanged from stage 818 to stage 812 and reaches the ConvNet encoder 808. Because the output representation of the ConvNet encoder 808 and the input to the DeConvNet decoder 822 share the same multi-dimensional space, the gradients contain useful information about how the ConvNet encoder 808 must change its output to lower the reconstruction loss.
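The unchanged gradient copy from stage 818 to stage 812 is commonly realized with a straight-through estimator; the following PyTorch-style sketch illustrates that technique under the assumption of one flat latent vector per spectrogram. The function and tensor names are illustrative and are not the numbered elements themselves.

```python
import torch

def quantize_straight_through(z_e, codebook):
    """Nearest-neighbor quantization whose backward pass copies the gradient
    from the decoder input back to the encoder output unchanged."""
    distances = torch.cdist(z_e, codebook)   # (batch, num_entries) pairwise distances
    indices = distances.argmin(dim=1)        # nearest codebook entry per example
    z_q = codebook[indices]                  # vectors handed to the decoder
    # Straight-through: forward uses z_q, backward passes the gradient to z_e unchanged.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

# Illustrative usage with a 1024-dimensional encoder output and a small codebook.
z_e = torch.randn(8, 1024, requires_grad=True)
codebook = torch.randn(500, 1024)
z_q, idx = quantize_straight_through(z_e, codebook)
z_q.sum().backward()                         # the gradient reaches z_e unchanged
print(z_e.grad.shape)                        # torch.Size([8, 1024])
```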
Therefore, in at least some embodiments a syllable-based audio encoder 706 is implemented which is enhanced compared to a standard VQ-VAE model. The syllable-based audio encoder 706 is used to encode a single-syllable spectrogram to an audio-syllable token with the size of 1×1×(N+K), instead of to a feature map with the size of 32×32×256. The ConvNet encoder 808 may be an eight-layer convolutional neural network with 1024 hidden units and a ReLU activation for each layer. Each layer may have a receptive field of four and a stride of two in order to halve the width and height of its input. The DeConvNet decoder 822 may have the same architecture as the ConvNet encoder 808, except that the DeConvNet decoder 822 performs deconvolution instead of convolution. The range of the indices of the vectors in the trainable codebook 814 is changed to [N, N+K].
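A minimal PyTorch sketch of an encoder matching this description is given below: eight convolutional layers, each with a receptive field of four and a stride of two, halve a 256×256×3 input eight times down to a single 1024-dimensional vector. Reading "1024 hidden units" as the channel count is an assumption, and a matching decoder would mirror these layers with transposed convolutions.

```python
import torch
import torch.nn as nn

def build_convnet_encoder(hidden=1024, layers=8):
    """Eight convolution layers, kernel size 4 and stride 2 each, so a
    256x256x3 spectrogram is halved eight times down to a 1x1 feature map,
    i.e., one 1024-dimensional vector per single-syllable spectrogram."""
    blocks, in_channels = [], 3
    for _ in range(layers):
        blocks += [nn.Conv2d(in_channels, hidden, kernel_size=4, stride=2, padding=1),
                   nn.ReLU()]
        in_channels = hidden
    return nn.Sequential(*blocks)

encoder = build_convnet_encoder()
spectrogram = torch.randn(1, 3, 256, 256)   # a preprocessed single-syllable spectrogram
vector = encoder(spectrogram).flatten(1)    # shape (1, 1024)
print(vector.shape)
```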
Referring again to the text-to-audio generator training pipeline 700 shown in
In at least some embodiments, this transformer 730 is unidirectional and includes multiple layers and multiple attention heads. The training in the text-to-audio generator training pipeline 700 shown in
As depicted in
In response to receiving the token group as input, the token embedding layer 720 generates embeddings for each of the tokens of the input token group. These generated embeddings include token embeddings 722, modal-type embeddings 728, position embeddings 726, 729, and language-category embeddings 724. The final input embeddings taken by the transformer 730 are the sum of the token embeddings 722, modal-type embeddings 728, position embeddings 726, 729, and language-category embeddings 724. All of the input sequences may be clipped or padded to a length of 1024. The position embeddings 726 are those produced from the textual syllable tokens, and the position embeddings 729 are those produced from the audio syllable tokens.
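The summation of the four embedding types might be sketched as follows in PyTorch. The embedding width, the vocabulary size, and the collapsing of the two separate position tables 726 and 729 into a single table are simplifying assumptions made only for this illustration, and padding of short sequences is omitted.

```python
import torch
import torch.nn as nn

MAX_LEN, D = 1024, 768                     # illustrative sequence limit and embedding width

class InputEmbedding(nn.Module):
    """Final input embeddings as the sum of token, modal-type, position,
    and language-category embeddings, with sequences clipped to MAX_LEN."""

    def __init__(self, vocab_size, num_modalities=2, num_languages=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, D)
        self.modal = nn.Embedding(num_modalities, D)    # e.g., textual vs. audio tokens
        self.position = nn.Embedding(MAX_LEN, D)
        self.language = nn.Embedding(num_languages, D)  # e.g., source vs. target language

    def forward(self, token_ids, modal_ids, language_ids):
        token_ids = token_ids[:, :MAX_LEN]              # clip to the maximum length
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.modal(modal_ids[:, :MAX_LEN])
                + self.position(positions)
                + self.language(language_ids[:, :MAX_LEN]))

embedder = InputEmbedding(vocab_size=3500)
ids = torch.randint(0, 3500, (1, 10))
zeros = torch.zeros(1, 10, dtype=torch.long)
print(embedder(ids, zeros, zeros).shape)                # torch.Size([1, 10, 768])
```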
The training pipeline 500 shows that multiple outputs from the syllable-based text-to-audio generator 518 are input into an audio-sequence similarity calculation module 520. The audio-sequence similarity calculation module 520 compares the sequence of audio syllable embeddings in the source language that was produced via the syllable-based text-to-audio generator 518 with the sequence of audio syllable embeddings in the target language that was also produced in the syllable-based text-to-audio generator 518.
In at least some embodiments, the comparison by the audio-sequence similarity calculation module 520 of the sequence of audio syllable embeddings in the source language to the sequence of audio syllable embeddings in the target language produces and/or calculates an audio-sequence similarity value, e.g., a cosine similarity value. The comparison may include replacing each audio-syllable embedding in the two input sequences with a vector from the trained codebook 814, based on the mapping between the one-hot encoding embedding of the respective input sequence and the indices of the trained codebook 814. The cosine similarity between each pair of codebook vectors occupying the same position in the two replaced sequences may then be calculated, starting from the beginning of the two replaced sequences. The final audio-sequence similarity value may be equal to the sum of all of the cosine similarities divided by the length of the input sequence of audio syllable embeddings in the source language.
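The similarity calculation described in this paragraph might be sketched as follows, with illustrative (deliberately small) sizes for N, K, and the codebook vectors; the function name and the zero-based codebook storage are assumptions.

```python
import numpy as np

def audio_sequence_similarity(source_tokens, target_tokens, codebook, N):
    """Replace each one-hot token with its codebook vector, take the cosine
    similarity of the vectors at each aligned position starting from the
    beginning, and divide the sum by the source-sequence length."""
    def to_vectors(tokens):
        indices = np.argmax(tokens, axis=1) - N        # codebook storage starts at 0
        return codebook[indices]

    src, tgt = to_vectors(source_tokens), to_vectors(target_tokens)
    length = min(len(src), len(tgt))
    total = 0.0
    for a, b in zip(src[:length], tgt[:length]):
        total += float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return total / len(source_tokens)

# Illustrative usage with deliberately small sizes and a random codebook.
N, K, DIM = 30, 50, 64
codebook = np.random.randn(K, DIM)
one_hot = lambda i: np.eye(N + K)[i]
source = np.stack([one_hot(N), one_hot(N + 1)])
target = np.stack([one_hot(N + 1), one_hot(N + 2)])
print(audio_sequence_similarity(source, target, codebook, N))
```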
Because this input language 904 is part of the prediction phase instead of a training phase, the input language 904 depends on the language of the input text that was input by the user in the present prediction session and not on the source language 504 that was input for the training pipeline 500 depicted in
The input text to convert 902 and the input language 904 are input into the text segmentation module 906, which is part of the program for syllable-based text conversion 116. The text segmentation module 906 divides the input text into textual segments. The text segmentation module 906 may include a copy of the special word-mapping table that was described above with respect to the converter training preparation process 300 shown in
The input text may be scanned against the mapping table to check for any hits on stored special words before proceeding to the main portion of the text segmentation module 906. If the input text is identified in the special word mapping table, then the segmentation may be retrieved from the table, so that the main portion of the text segmentation module 906, which performs rule-based segmentation and/or stored syllable segmentation for known words, may be skipped.
In instances when the input text includes a word which has multiple entries in the mapping table, in at least some embodiments one of the multiple entries may be selected based on contextual words. Specifically, words such as nouns and/or verbs from text in the vicinity of the input text may be compared with contextual words in the mapping table that were saved in association with this special word. Such contextual words may help determine which of several different pronunciations should be implemented for a particular special word spelling. For example, the context may indicate that the special word refers to a geographical location with a first pronunciation rather than to a famous person whose name has the same spelling but a second, different pronunciation.
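A minimal sketch of such a mapping-table lookup with contextual disambiguation is shown below. The table contents, the word "Georgia," its two candidate syllable segmentations, and the contextual words are purely hypothetical illustrations and are not taken from the embodiments above.

```python
# Hypothetical special-word mapping table: one spelling may have several entries,
# each pairing contextual words with a stored syllable segmentation.
SPECIAL_WORDS = {
    "georgia": [
        {"context": {"country", "travel", "border"}, "syllables": ["geor", "gia"]},
        {"context": {"singer", "album", "person"},   "syllables": ["geor", "gi", "a"]},
    ],
}

def segment_special_word(word, surrounding_words):
    """Return a stored segmentation for a special word, choosing among multiple
    entries by overlap with contextual words; return None if the word is not
    special, so that rule-based segmentation can proceed instead."""
    entries = SPECIAL_WORDS.get(word.lower())
    if not entries:
        return None
    nearby = {w.lower() for w in surrounding_words}
    best = max(entries, key=lambda entry: len(entry["context"] & nearby))
    return best["syllables"]

print(segment_special_word("Georgia", ["crossed", "the", "border", "of", "the", "country"]))
# ['geor', 'gia']
```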
These textual segments are then fed to a preprocessing module 907 which converts the syllable segments from the segmentation module 906 into one-hot encoded embeddings. Similar to the natural language processing conversion of text words to tokens, each syllable segment is converted in this preprocessing module 907 to a one-hot encoded embedding according to its assigned token index, e.g., ranging from 0 to N−1. For example, when the index for the segmented syllable "ven" is 0, then the one-hot encoded embedding for "ven" is (1, 0, 0, . . . , 0), meaning that the first dimension value is 1 and the other dimension values are 0. The one-hot encoded embeddings are thereafter input into the encoder 508. This encoder 508 is implemented here having already been trained in the training pipeline 500 depicted in
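Following the "ven" example above, the conversion of syllable segments to one-hot encoded embeddings might be sketched as follows; the small vocabulary and the index assignments other than "ven" are illustrative assumptions.

```python
import numpy as np

# Hypothetical syllable-to-index vocabulary; "ven" having index 0 follows the example above.
SYLLABLE_INDEX = {"ven": 0, "ture": 1, "cap": 2}
N = len(SYLLABLE_INDEX)          # in practice, the total number of textual syllables

def one_hot_syllables(segments):
    """Convert syllable segments into one-hot encoded embeddings of dimension N."""
    embeddings = np.zeros((len(segments), N), dtype=np.float32)
    for row, syllable in enumerate(segments):
        embeddings[row, SYLLABLE_INDEX[syllable]] = 1.0
    return embeddings

print(one_hot_syllables(["ven", "ture"]))
# [[1. 0. 0.]
#  [0. 1. 0.]]
```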
Because this selected target language 908 is part of the prediction phase instead of a training phase, the selected target language 908 depends on the choice selected by the user in the present prediction session and not on the target language 506 that was input for the training pipeline 500 depicted in
Having now been trained, the syllable-based text converter 510 uses the output of the encoder 508 and the selected target language 908 to generate, as output, target language syllables 910 that correspond to the input text. These target language syllables 910 may be presented to the user in a variety of ways in order to help this user pronounce the input text in a way more readily understood by the user.
Because the syllable-based text converter 510 has been trained, a new audio waveform and a new spectrogram do not need to be generated each time during the prediction phase pipeline 900. Nevertheless, the inner workings of the syllable-based text converter 510 are still based on a comparison of spectrograms, because the training of the syllable-based text converter 510 depended on the waveform generation, spectrogram generation, etc. that were performed on the words of the text corpuses of the various languages in the training pipeline 500 and/or in the training of the elements used in the training pipeline 500 as depicted in
The computer program for syllable-based text conversion 116 may in some embodiments produce a spectrogram on the fly and use the spectrogram for comparison to other saved spectrograms in order to find syllables that have a pronunciation that most closely matches a pronunciation of input text in the first language for which pronunciation help was requested. Thus, different languages may be evaluated and interpreted even on the fly to provide translation and pronunciation help.
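One way such an on-the-fly comparison might be sketched is shown below: a spectrogram is computed from a waveform and matched against saved syllable spectrograms by cosine similarity of their flattened, zero-padded values. The use of scipy for the spectrogram, the data layout of the saved spectrograms, and the matching criterion are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

def closest_syllable(waveform, sample_rate, saved_spectrograms):
    """Compute a spectrogram on the fly and return the stored syllable whose
    saved spectrogram is most similar by cosine similarity."""
    _, _, live = spectrogram(waveform, fs=sample_rate)
    live = live.flatten()

    def cosine(a, b):
        size = max(a.size, b.size)                 # zero-pad the shorter vector
        a = np.pad(a, (0, size - a.size))
        b = np.pad(b, (0, size - b.size))
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    return max(saved_spectrograms,
               key=lambda syllable: cosine(live, saved_spectrograms[syllable].flatten()))

# Illustrative usage with random stand-ins for real audio and saved spectrograms.
rng = np.random.default_rng(0)
saved = {"ni": rng.random((129, 40)), "hao": rng.random((129, 40))}
print(closest_syllable(rng.random(16000), 16000, saved))
```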
In some embodiments, the program for syllable-based text conversion 116 may ask for feedback from the user after pronunciation help has been given. This feedback may then be applied in one or more of the machine learning models involved in the program for syllable-based text conversion 116 in order to improve the machine learning and the future pronunciation help.
Although the figures and embodiments show pronunciation help being generated and presented in a single target language, in some embodiments the user may find the presentation of pronunciation help in multiple languages to be helpful. For example, the user may be familiar with multiple languages or may be attempting to learn a third language. Thus, in these other embodiments, the converting may be performed multiple times to present pronunciation help in multiple languages, e.g., pronunciation-helping characters not only in the Chinese language but also in Spanish and/or Hindi, etc. The user may select multiple target languages in a graphical user interface of the program to request this advanced help of determining and presenting the pronunciation help in multiple target languages. The process may be repeated for each selected language in order to generate the text conversion for the multiple languages. In a default setting of the program, the program generates pronunciation help in a single target language in order to reduce the computing requirements of the program.
It may be appreciated that
For example, preferred machine learning models were described for implementing the syllable-based text conversion for pronunciation help and for training the text conversion system. In other embodiments, other machine learning models may be implemented for one or more of the training and/or prediction steps described above in the various embodiments. Such alternative machine learning models may include naive Bayes models, random decision tree models, linear statistical query models, logistic regression models, neural network models (e.g., convolutional neural networks, multi-layer perceptrons, residual networks, and long short-term memory architectures), deep learning models, deep learning generative models, and other models and/or algorithms. Training data should include targets or target attributes which include a correct answer. The learning algorithm finds patterns in the input data in order to map the input data attributes to the target. The machine learning models contain these patterns so that the answer can be predicted for similar future inputs. A machine learning model may be used to obtain predictions on new input text. The machine learning model uses the patterns that were identified to determine the appropriate text conversion for pronunciation help. Training may include supervised and/or unsupervised learning.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” “including,” “has,” “have,” “having,” “with,” and the like, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart, pipeline, and/or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).