Voice-enabled computing devices are becoming more prevalent. An individual speaks a command to activate such a device. In response to a voice command, the device performs various functions, such as outputting audio.
Voice computing will soon be used by billions of people. Retrieving content with voice commands is a very different user experience from typing keywords into a search engine that has indexed billions of pages of content and uses advanced algorithms to surface the most relevant or highest-value content. With voice, listening to every possible hit would be extremely frustrating and time consuming, whereas with screen computing one can quickly scan a results page and immediately click on the relevant link.
In one exemplary embodiment, a method comprises creating a library of items, each item having one or more excerpts. The method further comprises enabling a user to add new items to the library and to assign a unique voice tag to each new item as it is added to the library. The method further comprises adding metadata about each new item to an index as the new item is added to the library, together with the unique voice tag assigned to the new item by the user. The method further comprises monitoring for potential duplicate voice tags in the library. When such duplicates are detected, one or more alternative voice tags are recommended to the user.
In another exemplary embodiment, a system comprises a library of items, each item having one or more excerpts. The system further comprises an index comprising metadata about each item in the library and a plurality of unique voice tags. Each voice tag corresponds to one item in the library. The system further comprises a voice tag assignment module configured to enable a user to assign new voice tags to new items as they are added to the library. The voice tag assignment module is configured to prevent a user from assigning a duplicate voice tag to a new item as it is added to the library.
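The assignment flow described above can be illustrated with a minimal sketch. The class name `VoiceTagIndex`, the normalization rule, and the alternative-suggestion scheme below are all illustrative assumptions, not details taken from the specification:

```python
# Minimal sketch of a voice tag assignment module with duplicate
# detection, using a simple in-memory index. All names here are
# illustrative, not from the specification.

class VoiceTagIndex:
    def __init__(self):
        self._items = {}  # normalized voice tag -> item metadata

    @staticmethod
    def _normalize(tag):
        # Assume voice tags match case-insensitively on whole words.
        return " ".join(tag.lower().split())

    def assign(self, tag, metadata):
        """Assign a unique voice tag; refuse duplicates."""
        key = self._normalize(tag)
        if key in self._items:
            raise ValueError(f"duplicate voice tag: {tag!r}")
        self._items[key] = metadata

    def suggest_alternative(self, tag):
        """Recommend a non-conflicting variant of a duplicate tag."""
        key = self._normalize(tag)
        n = 2
        while f"{key} {n}" in self._items:
            n += 1
        return f"{key} {n}"

index = VoiceTagIndex()
index.assign("King Dream", {"item": "I Have a Dream speech"})
print(index.suggest_alternative("King Dream"))  # a non-conflicting variant
```

A production module would presumably also apply the "easy to remember" and "works well in voice computing" heuristics described below when generating alternatives.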
In another exemplary embodiment, a method comprises receiving a voice instruction from a user. The voice instruction comprises a command portion and a voice tag portion. The voice tag portion comprises a voice tag corresponding to an item in a library, the voice tag having been assigned to the corresponding item by a user when the corresponding item was added to the library. The method further comprises parsing the voice instruction to identify the command portion and the voice tag portion, and processing the voice tag portion to identify the item in the library corresponding to the voice tag. The method further comprises accessing the item in the library corresponding to the voice tag, and processing the command portion to carry out a desired function on the accessed item, in accordance with the voice instruction.
Understanding that the drawings depict only exemplary embodiments and are not therefore to be considered limiting in scope, the exemplary embodiments will be described with additional specificity and detail through the use of the accompanying drawings, in which:
In accordance with common practice, the various described features are not drawn to scale but are drawn to emphasize specific features relevant to the exemplary embodiments.
When an item 105 is ingested into a personal library 110, it may be transcribed and parsed into one or more excerpts 130, or clips. For example, as shown in
In some cases, items 105 (e.g., Item 2 and Item 3 in
In operation, the user 115 may assign a voice tag 125 to an item 105 using a voice tag assignment module 150, accessible via a website 155 or mobile application 160, for example. In some embodiments, the voice tag assignment module 150 may recommend voice tags 125 that are unique, easy to remember (e.g., short words related to the text), and comprise words that work well in voice computing.
For example, if the user 115 selects a voice tag 125 that has been used previously, the voice tag assignment module 150 may detect the conflict and recommend an alternative, unique voice tag 125. It is known that a set of 40,000 unique words in English can be combined to form approximately 64 trillion unique three-word combinations, i.e., 40,000³ = 64 trillion. Thus, even if each voice tag 125 is relatively short (e.g., three words or less), the index 135 may comprise a vast namespace with trillions of unique items 105 or excerpts 130, each having a corresponding unique voice tag 125.
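The namespace arithmetic above is easy to verify directly (ordered three-word tags drawn from a 40,000-word vocabulary, repetition allowed):

```python
# Quick check of the namespace arithmetic: three-word tags drawn from
# a 40,000-word vocabulary, ordered, with repetition allowed.
vocabulary_size = 40_000
combinations = vocabulary_size ** 3
print(combinations)         # 64000000000000
print(combinations / 1e12)  # 64.0 (trillion)
```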
It is also known that voice computing systems perform well on some words, but poorly on other words, such as names or words that sound like other words. Thus, the voice tag assignment module 150 may suggest that the user 115 avoid such problematic words when assigning voice tags 125.
To provide a specific example, Item 1 shown in
As another example, Item 2 shown in
As another example, Item 3 shown in
In some embodiments, individual users 115 or organizational users 215 may be compensated for their contributions to the universal library 210. For example, usage of the universal library 210 by paying subscribers may be tracked, and a percentage of revenues paid to the individual users 115 or organizational users 215 that contributed the items 105 or excerpts 130 to the universal library 210, based on usage.
The universal library 210 comprises an index 135 with a “Type” field indicating who can access the corresponding items 105 or excerpts 130. For example, as shown in
In operation, individual users 115 and organizational users 215 may assign unique voice tags 125 to items 105 using the voice tag assignment module 150, as described above. When analyzing potential namespace conflicts in the universal library 210, additional metadata beyond the voice tag 125 can be considered to uniquely identify items 105 in the index 135. For example, if two users 115 assigned the same voice tag 125 (e.g., “Acme Merger Call”) to two different items 105 marked private in their personal libraries, no conflict would arise because the items do not exist in the same public namespace. On the other hand, if both items 105 were marked public, a namespace conflict would arise, and the second user 115 would be prompted to select a different, unique voice tag 125. In such cases, the second user 115 may choose a related voice tag 125 (e.g., “Acme Merger Discussion”) or an unrelated voice tag 125 (e.g., “Apple Banana Orange”).
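The scoped conflict check described above can be sketched as follows. The entry structure, the `has_conflict` helper, and the public/private field values are illustrative assumptions; the specification only requires that private items in different personal libraries never collide, while public items share one global namespace:

```python
# Sketch of duplicate detection scoped by namespace. Each index entry
# carries a "type" (public/private) and an owner; private tags must be
# unique only within one user's library, public tags globally.
# All names and structures here are illustrative.

def has_conflict(index, tag, owner, visibility):
    tag = tag.lower()
    for entry in index:
        if entry["tag"].lower() != tag:
            continue
        if visibility == "public" and entry["type"] == "public":
            return True  # collision in the shared public namespace
        if entry["type"] == "private" and entry["owner"] == owner:
            return True  # collision within this user's own library
    return False

index = [
    {"tag": "Acme Merger Call", "owner": "alice", "type": "private"},
]
# Same tag in bob's private library: no shared namespace, no conflict.
print(has_conflict(index, "Acme Merger Call", "bob", "private"))  # False

index.append({"tag": "Acme Merger Call", "owner": "alice", "type": "public"})
# Same tag proposed as public: collides in the global namespace.
print(has_conflict(index, "Acme Merger Call", "bob", "public"))   # True
```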
In some embodiments, a universal library 210 can be replicated to create a new namespace for a given organizational user 215. For example, an organization name (e.g., CSPAN) can be added to the keyword field of the index 135 to create an organizational library with a unique namespace. The organization name or other keywords can be combined with the unique voice tags 125 to identify and access particular versions of items 105 in the universal library 210. As an example, “CSPAN King Dream” may correspond to the CSPAN version of the King “I Have a Dream” speech, or billions of other audio clips. In some cases, individual users 115 and organizational users 215 may have access to different libraries, and could be directed to a particular library based on a voice tag 125 combined with one or more keywords.
In addition to voice tags 125 and keywords, the index 135 can also include “meaningful phrases” added by users 115 as additional metadata corresponding to items 105 in the universal library 210. In some embodiments, the index 135 may also comprise a full transcript with every word of every item 105 in the universal library 210, accessible via a full-text voice search engine. Meaningful phrases and full transcripts can be searched and multiple possible “hits” can be presented to the user 115, whereas with voice tags 125, the system 200 looks for a substantially exact match and retrieves the single item that best matches the voice tag 125.
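The distinction between the two retrieval modes can be illustrated with a toy index. The field names and helper functions below are illustrative assumptions, not the system's actual interface:

```python
# Contrast between voice-tag retrieval (at most one exact match) and
# full-text search (possibly many ranked hits), using a toy index.
# All names here are illustrative.

index = [
    {"tag": "king dream",  "transcript": "i have a dream that one day"},
    {"tag": "happy pain",  "transcript": "joy and pain in a single day"},
    {"tag": "acme merger", "transcript": "the merger will close this quarter"},
]

def retrieve_by_tag(query):
    """Voice tags resolve to exactly one item, or nothing."""
    matches = [e for e in index if e["tag"] == query.lower()]
    return matches[0] if matches else None

def full_text_search(query):
    """Transcript search may return many possible hits."""
    q = query.lower()
    return [e for e in index if q in e["transcript"]]

print(retrieve_by_tag("King Dream")["tag"])  # king dream
print(len(full_text_search("day")))          # 2  (multiple hits)
```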
For example, in the embodiment illustrated in
In other embodiments, users may see or hear voice tags 125 through a wide variety of possible user interfaces, such as websites, printed publications, broadcasts, posts, etc. Users can access such interfaces through a wide variety of suitable devices or media, such as computing devices (e.g., notebook computers, ultrabooks, tablet computers, mobile phones, smart phones, personal data assistants, video gaming consoles, televisions, set top boxes, smart televisions, portable media players, and wearable computers (e.g., smart watches, smart glasses, bracelets, etc.), display screens, displayless devices (e.g., Amazon Echo), other types of display-based devices, smart furniture, smart household devices, smart vehicles, smart transportation devices, and/or smart accessories, among others), static displays (e.g., billboards, signs, etc.), and publications (e.g., books, magazines, pamphlets, flyers, mailers, etc.).
In some embodiments, when a voice tag 125 is displayed visually, it can be preceded by a selected designator, such as the ∞ character (extended ASCII code 236) or the ˜ character (ASCII code 126). For example, the voice tag 125 “King Dream” may be displayed or printed as ∞KingDream or ˜KingDream. Seeing the selected designator will let users 115 know that the text that follows is a voice tag 125, and they can access the corresponding item 105 or excerpt 130 by saying the voice tag 125 near a suitable voice-enabled computing device.
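Recognizing a displayed voice tag by its designator can be sketched as below. The helper name and the camel-case word-splitting rule are illustrative assumptions; the designator convention is the one described above:

```python
# Sketch of recognizing a displayed voice tag by its designator
# ("∞" or "~") and splitting the camel-case body back into words.
# The helper name and splitting rule are illustrative.
import re

def parse_displayed_tag(text):
    if not text or text[0] not in ("~", "∞"):
        return None  # no designator: not a voice tag
    body = text[1:]
    # "KingDream" -> ["King", "Dream"]
    words = re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", body)
    return " ".join(words)

print(parse_displayed_tag("~KingDream"))  # King Dream
print(parse_displayed_tag("∞KingDream"))  # King Dream
print(parse_displayed_tag("KingDream"))   # None
```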
In some embodiments, a voice tag 125 can also function as a hypertext link to a unique URL following a predictable naming convention, such as: https://play.soar.com/Voice-Tag. For example, the voice tag 125 ˜KingDream may correspond to the following URL: https://play.soar.com/King-Dream. In some embodiments, when such a voice tag 125 is displayed on a computing device, a user 115 can select the hyperlink to navigate directly to the corresponding URL. In some embodiments, when the user's web browser retrieves the selected URL and the corresponding item 105 or excerpt 130 is a media file, playback may begin automatically.
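The naming convention above (words capitalized and joined with hyphens under a fixed base URL) can be sketched as a one-line mapping; the helper name is ours, and the base URL is the one given in the example:

```python
# Sketch of the voice-tag-to-URL convention described above:
# https://play.soar.com/Voice-Tag, words joined with hyphens.
# The helper name is illustrative.

def voice_tag_to_url(tag, base="https://play.soar.com"):
    slug = "-".join(word.capitalize() for word in tag.split())
    return f"{base}/{slug}"

print(voice_tag_to_url("King Dream"))  # https://play.soar.com/King-Dream
```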
In operation, when the voice-enabled computing device 475 receives the voice instruction 470, the device 475 activates the voice tag retrieval module 480 to access a selected item 105 or excerpt 130 and deliver it via output 485, in accordance with the voice instruction 470. In some embodiments, before the user 115 speaks the voice instruction 470, the user may say a “wakeword” (e.g., “Alexa,” “OK Google,” etc.) and another voice command (e.g., “Open Soar Audio,” etc.) to launch the voice tag retrieval module 480. In some embodiments, the voice instruction 470 may comprise a command portion 470A (e.g., “GET,” “SHARE,” etc.), an optional first context portion 470B (e.g., “from the web,” etc.), an optional keyword portion 470C (e.g., “Soar,” “CSPAN,” etc.), a voice tag portion 470D (e.g., “Happy Pain,” “King Dream,” etc.), an optional second context portion 470E (e.g., “from 1963,” etc.), and an optional delivery portion 470F (e.g., “on my phone,” “to my family,” etc.).
The voice instruction 470 may be audio data analyzed to identify and convert the words represented in the audio data into tokenized text. This can include, for example, processing the audio data using an automatic speech recognition (ASR) module (not shown) that is able to recognize human speech in the audio data and then separate the words of the speech into individual tokens that can be sent to a natural language understanding (NLU) module (not shown), or other such system or service. The tokens can be processed by the NLU module to attempt to determine a slot or purpose for each of the words in the audio data. For example, the NLU module can attempt to identify the individual words, determine context for the words based at least in part upon their relative placement and context, and then determine various purposes for portions of the audio data.
For example, the NLU module can process the words “GET King Dream on my phone” together to identify this phrase as a voice instruction 470. There can be variations to such an intent, but words such as “GET” or “SHARE” can function as a primary trigger word, for example, which can cause the NLU module to look for related words that are proximate the trigger word in the audio data. Other variations such as “I want to SHARE” may also utilize the same trigger word, such that the NLU may need to utilize context, machine learning, or other approaches to properly identify the intent. In this particular example, the voice tag retrieval module 480 will parse the voice instruction 470 and will identify the word “GET” as the command portion 470A, the words “King Dream” as the voice tag portion 470D, and the words “on my phone” as the optional delivery portion 470F. Accordingly, the voice tag retrieval module 480 will retrieve Item 3 from the universal library 210, and deliver it to the user 115 via output 485, which in this case will be a device previously identified as the user's phone. In some embodiments, Item 3 will begin playing automatically on the user's phone.
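A toy version of this parse can be sketched as follows. A real system would use the ASR and NLU modules described above; the keyword-driven split, marker list, and dictionary layout below are illustrative assumptions only:

```python
# Toy parse of a voice instruction into the portions described above
# (command 470A, voice tag 470D, delivery 470F). A real system would
# use ASR + NLU; this keyword-driven split is only a sketch, and the
# marker list is an illustrative assumption.

COMMANDS = {"GET", "SHARE"}
DELIVERY_MARKERS = ("on my", "to my", "with my", "on ")

def parse_instruction(text):
    tokens = text.split()
    command = tokens[0].upper() if tokens and tokens[0].upper() in COMMANDS else None
    rest = " ".join(tokens[1:])
    delivery = None
    for marker in DELIVERY_MARKERS:
        idx = rest.lower().find(marker)
        if idx != -1:
            delivery = rest[idx:]
            rest = rest[:idx].strip()
            break
    return {"command": command, "voice_tag": rest, "delivery": delivery}

print(parse_instruction("GET King Dream on my phone"))
# {'command': 'GET', 'voice_tag': 'King Dream', 'delivery': 'on my phone'}
```

As the surrounding text notes, a trigger word alone is not enough (“I want to SHARE” uses the same word), which is why the actual system would rely on NLU context or machine learning rather than this kind of fixed-marker split.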
As another example, the voice instruction 470 may comprise the phrase “GET King Dream,” without any additional context modifiers or keywords. In this example, the voice tag retrieval module 480 will retrieve Item 3 from the universal library 210, and deliver it to the user 115 via output 485, which in this case will be the voice-enabled computing device 475, because the voice instruction 470 did not include the optional delivery portion 470F.
As another example, the voice instruction 470 may comprise the SHARE command, which advantageously enables users 115 to designate any number of individuals or groups with whom they will be able to immediately share selected items 105 or excerpts 130. For example, the voice instruction 470 may comprise the phrase “SHARE King Dream with my family.” In this example, the voice tag retrieval module 480 will parse the voice instruction 470 and will identify the word “SHARE” as the command portion 470A, the words “King Dream” as the voice tag portion 470D, and the words “with my family” as the optional delivery portion 470F. Accordingly, the voice tag retrieval module 480 will retrieve Item 3 from the universal library 210, and deliver it via output 485, which in this case will be a group of individuals previously designated as the user's family. In some embodiments, the selected item 105 or excerpt 130 will be delivered to each family member through their preferred delivery method, as described below.
In some embodiments, the voice tag retrieval module 480 may reference an account of the user 115 to identify individuals designated as members of the user's family. In another example, if the user 115 desired to share an item 105 or excerpt 130 with another identifiable group of individuals (e.g., coworkers, clients, club members, etc.), the voice tag retrieval module 480 may reference the user's account to find the individuals designated as members of the desired group. In some embodiments, the voice tag retrieval module 480 may check user preferences to determine how to share the selected item 105 or excerpt 130 with each individual. For example, a user 115 may create a profile and indicate a preferred delivery method, such as a voice assistant (e.g., Amazon Echo, Google Home, etc.), email, SMS, WhatsApp, Facebook Messenger, etc. In some embodiments, a voice assistant can send “notifications” to individual users, to let them know that new content is available. For example, an indicator light may illuminate to indicate that new notifications or messages have been received.
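The account lookup described above can be sketched with a toy profile. The account structure, group names, and delivery-method strings below are all illustrative assumptions:

```python
# Sketch of resolving a SHARE target group against a user account,
# assuming per-contact preferred delivery methods stored in a profile.
# All names and structures here are illustrative.

account = {
    "groups": {
        "family": ["mom", "brother"],
        "coworkers": ["dana", "lee"],
    },
    "preferences": {
        "mom": "email",
        "brother": "sms",
        "dana": "whatsapp",
        "lee": "voice_assistant",
    },
}

def resolve_share(account, group, default="email"):
    """Map each member of the named group to a preferred delivery method."""
    members = account["groups"].get(group, [])
    return {m: account["preferences"].get(m, default) for m in members}

print(resolve_share(account, "family"))
# {'mom': 'email', 'brother': 'sms'}
```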
In other examples, the voice instruction 470 may comprise a phrase such as “SHARE King Dream on Facebook” or “SHARE King Dream on Twitter.” In these examples, the voice tag retrieval module 480 will parse the voice instruction 470 and will identify the word “SHARE” as the command portion 470A, the words “King Dream” as the voice tag portion 470D, and the words “on Facebook” or “on Twitter” as the optional delivery portion 470F. Accordingly, the voice tag retrieval module 480 will retrieve Item 3 from the universal library 210, and deliver it via output 485, which in this case will be a social media account previously designated by the user 115.
The specification and drawings are to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/957,738 (Attorney Docket 333.001USPR) filed on Jan. 6, 2020, entitled “PRECISION RECALL IN VOICE COMPUTING”, the entirety of which is incorporated herein by reference.