Embodiments of the present invention are generally directed to techniques for presenting audio/video responses based on intent derived from features of audio/video interactions. These techniques could be implemented as part of a lead management system, and therefore, an overview of leads is provided below. However, these techniques could also be implemented as part of or on behalf of any system that interfaces with an individual via an audio/video stream.
A lead can be considered a contact, such as an individual or an organization, that has expressed interest in a product or service that a business offers. A lead could merely be contact information such as an email address or phone number, but may also include an individual's name, address or other personal/organization information, an identification of how an individual expressed interest (e.g., providing contact/personal information via a web-based form, signing up to receive periodic emails, calling a sales number, attending an event, etc.), communications the business may have had with the individual, etc. A business may generate leads itself (e.g., as it interacts with potential customers) or may obtain leads from other sources.
A business may use leads as part of a marketing or sales campaign to create new business. For example, sales representatives may use leads to contact individuals to see if the individuals are interested in purchasing any product or service that the business offers. These sales representatives may consider whatever information a lead includes to develop a strategy that may convince the individual to purchase the business's products or services. When such efforts are unproductive, a lead may be considered dead. Businesses typically accumulate a large number of dead leads over time.
Recently, efforts have been made to employ artificial intelligence to identify leads that are most likely to produce successful results. For example, some solutions may consider the information contained in leads to identify which leads exhibit characteristics of the ideal candidate for purchasing a business's products or services. In other words, such solutions would inform sales representatives which leads to prioritize, and then the sales representatives would use their own strategies to attempt to communicate with the respective individuals.
The present invention extends to systems, methods, and computer program products for presenting audio/video responses based on intent derived from features of audio/video interactions. By providing such audio/video responses, a consumer interaction agent can cause a consumer to experience an interactive conversation that is as good as, or better than, communicating with a human.
In some embodiments, the present invention may be implemented as a method for providing audio/video responses to consumers based on intent derived from features of the consumers' audio/video interactions. An audio/video interaction can be received from a consumer. Text can be extracted from the audio/video interaction. One or more features in the text can be identified. An intent of the audio/video interaction can be derived based on the one or more features in the text. An audio/video response can be selected based on the intent. The audio/video response can be presented to the consumer.
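By way of illustration only, the following minimal sketch outlines this flow in Python. Every helper here is a hypothetical placeholder for a component described in the embodiments; the embodiments do not prescribe any particular implementation.

```python
# A minimal sketch of the claimed flow. Every helper is a hypothetical
# placeholder for a component described in the embodiments.

def extract_text(interaction: dict) -> str:
    # Placeholder: a real system would run speech-to-text on the audio track.
    return interaction.get("transcript", "")

def identify_features(text: str) -> list[str]:
    # Placeholder: a real system might tokenize via natural language processing.
    return text.lower().split()

def derive_intent(features: list[str]) -> str:
    # Placeholder: a real system might apply a trained machine learning model.
    return "affirmative_answer" if "yes" in features else "negative_answer"

def select_response(intent: str) -> str:
    # Placeholder: a real system would query an audio/video response database.
    clips = {"affirmative_answer": "thats_great.mp4",
             "negative_answer": "no_problem.mp4"}
    return clips[intent]

def handle_interaction(interaction: dict) -> str:
    text = extract_text(interaction)
    features = identify_features(text)
    intent = derive_intent(features)
    return select_response(intent)  # the clip is then presented to the consumer

print(handle_interaction({"transcript": "Yes that works for me"}))
```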
In some embodiments, one or more features in the audio/video content of the audio/video interaction can also be identified, and the intent may be derived based also on the one or more features in the audio/video content.
In some embodiments, the audio/video response may be an audio/video clip that includes a human speaking or a rendering of an avatar speaking. The avatar may or may not resemble a human.
In some embodiments, the one or more features in the text can be identified by performing natural language processing to determine one or more tokens that appear in the text.
In some embodiments, the one or more features in the text can be identified by generating a tokenized version of the text.
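By way of example only, a tokenized version of the text could be produced with simple normalization; the regular expression below is merely one possible tokenization scheme, and a production system might use a full natural language processing toolkit.

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase the text and split on runs of letters, digits, or apostrophes;
    # this is one of many possible tokenization schemes.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("I'm busy right now, can you call back later?"))
# ["i'm", 'busy', 'right', 'now', 'can', 'you', 'call', 'back', 'later']
```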
In some embodiments, the one or more features in the audio/video content of the audio/video interaction can be identified by detecting one or more of a tone, body language, or facial expression of the consumer.
In some embodiments, the tone, body language, or facial expression of the consumer can be detected by detecting when voice content of the audio/video content represents excitement, reluctance, or uncertainty.
In some embodiments, the tone, body language, or facial expression of the consumer can be detected by detecting particular facial expressions or hand gestures.
In some embodiments, the tone, body language, or facial expression of the consumer may be detected using artificial intelligence.
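By way of illustration only, the toy heuristic below stands in for the trained models such embodiments contemplate: it reads signal energy and a zero-crossing pitch proxy as rough indicators of tone. The thresholds and labels are invented for this sketch.

```python
import numpy as np

def classify_tone(samples: np.ndarray, rate: int = 16000) -> str:
    # Toy heuristic standing in for a trained model: a loud, fast-varying
    # signal is read as excitement and a soft signal as reluctance.
    energy = float(np.mean(samples ** 2))
    crossings = int(np.sum(np.abs(np.diff(np.sign(samples)))) // 2)
    pitch_proxy = crossings / (len(samples) / rate)  # crossings per second
    if energy > 0.1 and pitch_proxy > 300:
        return "excitement"
    if energy < 0.01:
        return "reluctance"
    return "uncertainty"

# Synthetic one-second clips standing in for real voice content.
t = np.linspace(0, 1, 16000, endpoint=False)
excited = 0.8 * np.sin(2 * np.pi * 260 * t)  # loud, higher-pitched tone
quiet = 0.05 * np.sin(2 * np.pi * 120 * t)   # soft, lower-pitched tone
print(classify_tone(excited), classify_tone(quiet))
```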
In some embodiments, one or more timestamps can be associated with the text and the one or more timestamps can be used to link at least one of the one or more features in the text with at least one corresponding feature of the one or more features in the audio/video content.
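As one possible illustration, timestamped text features could be linked to audio/video features by time overlap; the tuple layouts below are assumptions made for this sketch.

```python
def link_features(text_features, av_features):
    # Associate each timestamped text feature (token, start, end) with the
    # audio/video features (label, time) that occur while it is spoken.
    linked = []
    for token, start, end in text_features:
        overlapping = [label for label, t in av_features if start <= t <= end]
        linked.append((token, overlapping))
    return linked

text_features = [("yes", 2.0, 2.4), ("definitely", 2.5, 3.1)]
av_features = [("smile", 2.6), ("excited_tone", 2.8)]
print(link_features(text_features, av_features))
# [('yes', []), ('definitely', ['smile', 'excited_tone'])]
```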
In some embodiments, the audio/video response can be selected based on the intent by selecting the audio/video response from among multiple audio/video responses that match the intent that is derived based on the one or more features in the text.
In some embodiments, the audio/video response can be selected from among the multiple audio/video responses that match the intent that is derived based on the one or more features in the text by selecting the audio/video response based on the intent that is derived based also on the one or more features in the audio/video content.
In some embodiments, the intent may be one of busy, busy and anxious, affirmative answer, affirmative answer and sad, affirmative answer and excited, or negative answer.
In some embodiments, the audio/video response can be presented to the consumer by dynamically generating data for rendering an avatar by which the audio/video response is presented.
In some embodiments, the intent may be derived based also on previous interactions with the consumer or information known about the consumer.
In some embodiments, the consumer can be connected with a human after presenting the audio/video response to the consumer.
In some embodiments, the text extracted from the audio/video interaction and text of the audio/video response can be presented to the human to thereby provide context to the human.
In some embodiments, the present invention can be implemented as computer storage media storing computer executable instructions which when executed implement a method for providing audio/video responses to consumers based on intent derived from features of the consumers' audio/video interactions. An audio/video interaction can be received from a consumer. Text can be extracted from the audio/video interaction. One or more features in the text can be identified. One or more features in the audio/video content of the audio/video interaction can also be identified. An intent of the audio/video interaction can be derived based on the one or more features in the text and the one or more features in the audio/video content. An audio/video response can be selected based on the intent. The audio/video response can be presented to the consumer.
In some embodiments, the present invention may be implemented as a method for providing audio/video responses to consumers based on intent derived from features of the consumers' audio/video interactions. Audio/video interactions can be received from a consumer. Text can be extracted from the audio/video interactions. Features in the text can be identified using artificial intelligence to detect one or more of a tone, body language, or facial expression of the consumer during the audio/video interactions. Intents of the audio/video interactions can be derived based on the features in the text. Audio/video responses can be selected based on the intents. The audio/video responses can be presented to the consumer.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter.
Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings.
In the specification and the claims, the term “consumer” should be construed as an individual. A consumer may or may not be associated with an organization. The term “lead” should be construed as information about, or that is associated with, a particular consumer. The term “consumer computing device” can represent any computing device that a consumer may use and by which an audio/video communication system may communicate with the consumer. In a typical example, a consumer computing device may be a consumer's phone.
As an overview, embodiments of the present invention can be used to have audio/video interactions (e.g., video calls) with a consumer without the involvement of a human. In other words, embodiments of the present invention enable consumers to have video calls or other forms of audio/video interactions with what may appear to be a human when in fact a consumer interaction agent is on the other end of the video calls. Embodiments of the present invention employ techniques for selecting and presenting audio/video responses based on intent derived from features of the consumer's audio/video interactions. Such techniques enable the consumer interaction agent to present a representation of a human that responds to and interacts with the consumer as a human would.
Embodiments of the present invention are primarily described in the context of a lead management system that is designed to assist a business in contacting and communicating with its leads. However, embodiments of the present invention could be implemented whenever it may be desired or useful to have audio/video interactions with individuals. For example, an individual could initiate an audio/video interaction without the prior involvement of any business and/or before the system has any knowledge of, or information about, the individual.
Lead management system 100 can perform a variety of functions on the leads to enable lead management system 100 to have AI-driven interactions, including audio/video interactions, with consumers 170. For example, these AI-driven interactions can be audio/video calls that are intended to convince consumers 170 to have a video call, phone call, or other interaction with a sales representative (or agent) of business 160. Once the AI-driven interactions with a particular consumer 170 are successful (e.g., when the particular consumer 170 agrees to a video call with business 160), lead management system 100 may initiate/connect a video call between the particular consumer 170 and a sales representative of business 160. Accordingly, simply by providing its leads, including its dead leads, to lead management system 100, business 160 can obtain video calls or other interactions with consumers 170.
Intent extractor 110 can represent one or more components of lead management system 100 that are configured to extract/derive intent from features of audio/video interactions received by consumer interaction agents 150. Audio/video response database 120 can represent one or more data storage mechanisms that store audio/video clips (e.g., pre-recorded audio/video of a human) that consumer interaction agents 150 can use as responses to audio/video interactions they have with consumers. Alternatively or additionally, audio/video response database 120 could include data for rendering audio/video responses for consumer interaction agents 150 to use (e.g., audio/video of a rendered avatar of a human, an animal, a cartoon, or other non-human thing). Lead database 130 can represent one or more data storage mechanisms for storing leads or data structures defining leads. Consumer interaction database 140 can represent one or more data storage mechanisms for storing consumer interactions or data structures defining consumer interactions.
Consumer interaction agents 150 can be configured to interact with consumers 170 via consumer computing devices. For example, consumer interaction agents 150 can communicate with consumers 170 via text messages, emails, another text-based mechanism, or, of primary relevance to embodiments of the present invention, audio and video such as video calls. These interactions can be stored in consumer interaction database 140 and associated with the respective consumer 170 (e.g., via associations with the corresponding lead defined in lead database 130).
In some embodiments, lead management system 100 could include a business appointment initiator that is configured to initiate an appointment (e.g., a video call, phone call, or similar communication) between a consumer 170 and a representative of business 160. For example, a business appointment initiator could establish a call with a consumer and then connect the business representative to the call. As described in U.S. patent application Ser. No. 17/346,032 (the “'032 Application”), which is incorporated herein by reference, the business appointment initiator can intelligently select the timing of such appointments by applying a scheduling language and model to the consumer interactions, including AI-driven audio/video interactions, that consumer interaction agents 150 have with consumers 170.
In some embodiments, lead management system 100 could include a dynamic lead outreach engine that can be used to determine the timing, content, and the like of a next interaction with a consumer as described in U.S. patent application Ser. No. 17/347,207, which is incorporated herein by reference. In other words, a dynamic lead outreach engine could be used when lead management system 100 initiates the AI-driven interactions with a consumer (e.g., based on previous interactions with the consumer or other information known about the consumer). However, in embodiments of the present invention, the consumer may initiate the AI-driven interactions (e.g., by initiating a video call) regardless of whether lead management system 100 has any prior knowledge of the consumer.
In some embodiments, lead management system 100 could include a lead data processor for processing lead data to facilitate the AI-driven interactions with consumers as described in U.S. patent application Ser. No. 17/346,055, which is incorporated herein by reference.
When consumer interaction agent 150 receives an audio/video interaction from the consumer (e.g., a captured portion of the audio/video content that the consumer computing device sends during an audio/video call), audio/video interface 151 (or another suitable component) may extract the audio from the audio/video interaction and provide it to voice-to-text module 152. Voice-to-text module 152 can generate text of the audio and provide the text to feature extractor 153. In some embodiments, the audio/video can also be input to feature extractor 153. Although not shown, in some embodiments, the text of the audio can include one or more timestamps or some other information for linking the text to the corresponding portions of the audio/video.
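By way of example only, the sketch below performs this step with the open-source ffmpeg tool and Whisper speech-recognition model, neither of which the embodiments require; Whisper's segments conveniently carry the timestamps mentioned above.

```python
import subprocess
import whisper  # open-source speech-recognition model; one possible choice

def extract_audio(video_path: str, audio_path: str = "interaction.wav") -> str:
    # Strip the video track and save 16 kHz mono audio for transcription.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ar", "16000", "-ac", "1",
         audio_path],
        check=True,
    )
    return audio_path

def transcribe(audio_path: str) -> list[dict]:
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    # Each segment carries start/end timestamps usable for feature linking.
    return [{"text": s["text"], "start": s["start"], "end": s["end"]}
            for s in result["segments"]]

# "interaction.mp4" stands in for a captured portion of the video call.
segments = transcribe(extract_audio("interaction.mp4"))
```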
Feature extractor 153 can employ natural language processing or other suitable techniques to extract features from the text of the audio and, in some embodiments, can employ audio/video processing techniques to extract features from the video. Examples of how features may be extracted from text are provided in the '032 Application. For example, feature extractor 153 may employ natural language processing to determine which tokens (which may be considered one type of feature) appear in the text for the audio. Feature extractor 153 could then output text features which may be in the form of a tokenized version of the text for the audio.
To generate audio/video features from the audio/video, feature extractor 153 may process the audio/video to detect the consumer's tone, body language, facial expression, etc. For example, feature extractor 153 could use artificial intelligence (e.g., a machine learning algorithm) to detect when voice content represents excitement, reluctance, uncertainty, or any other emotion that may be conveyed via tone. Similarly, feature extractor 153 could use artificial intelligence (e.g., a machine learning algorithm) to detect particular facial expressions, hand gestures, or other body language that convey a particular emotion that may be present in the video content. Accordingly, each audio/video feature that feature extractor 153 generates can represent an occurrence of an emotion or other audible/visual expression that is detected in the audio/video interaction. As stated above, timestamps or any other suitable information can be used to associate these audio/video features with the corresponding text features (e.g., so that intent extractor 110 can know which emotion/expression the consumer had when speaking a particular word, phrase, sentence, etc.). As one example, one or more timestamps could be used to link the text of a consumer's answer to a question to each audio/video feature that was detected during the consumer's answer.
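As a rudimentary illustration of detecting one such visual feature, the sketch below uses OpenCV's stock Haar cascades to timestamp frames in which the consumer is smiling; the embodiments contemplate more capable machine learning models.

```python
import cv2  # OpenCV; stock Haar cascades stand in for a trained model

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")

def detect_smiles(video_path: str) -> list[tuple[float, str]]:
    # Return (timestamp, feature) pairs for frames where a smile appears.
    features = []
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
            roi = gray[y:y + h, x:x + w]
            if len(smile_cascade.detectMultiScale(roi, 1.7, 20)) > 0:
                features.append((frame_index / fps, "smile"))
        frame_index += 1
    capture.release()
    return features
```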
Feature extractor 153 can then provide the text features and the audio/video features to intent extractor 110. In some embodiments, intent extractor 110 could be configured to employ artificial intelligence (e.g., a machine learning algorithm) to derive an intent from the features. Examples of deriving an intent from text features are provided in the '032 Application. For example, intent extractor 110 could employ a machine learning algorithm that is trained on a model that includes the tokens and patterns to determine which pattern the tokenized version of the text for the audio matches. In such cases, the matched pattern could define the intent. The '032 Application uses examples where the intent relates to the scheduling of a future communication. However, the same or similar techniques could be used to derive any intent such as a particular question that the consumer is asking.
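By way of illustration only, such token-and-pattern matching could be approximated with a bag-of-words classifier; the training phrases and intent labels below are invented for this sketch, and a production model would be trained on a corpus of labeled consumer interactions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training phrases; a production model would be trained on a
# corpus of labeled consumer interactions.
phrases = [
    "yes that sounds great", "sure I'd love to", "absolutely",
    "no thanks", "not interested", "please stop calling",
    "I'm busy right now", "can you call back later", "now is a bad time",
]
intents = [
    "affirmative_answer", "affirmative_answer", "affirmative_answer",
    "negative_answer", "negative_answer", "negative_answer",
    "busy", "busy", "busy",
]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(phrases, intents)
print(model.predict(["yes let's do it", "this is a really bad time"]))
```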
In some embodiments, intent extractor 110 may use the text features alone to derive the intent and therefore to select the matching audio/video response. For example, based only on the text features, intent extractor 110 could determine that the intent of the audio/video interaction is an interest in a particular product or service being discussed. In such a case, intent extractor 110 could query audio/video response database 120 to retrieve an audio/video clip (or response) associated with interest in general or with interest in the particular product or service. As one example, a matching audio/video clip could be a recording of an individual saying “That's great. I'm glad you're interested in this offering.” As another example, a matching audio/video response could be a rendering of an avatar speaking this same content.
In some embodiments, intent extractor 110 may use both the text features and the audio/video features to derive the intent and therefore to select the matching audio/video response. Using the same example as above, if the audio/video features indicate that the consumer was excited during the audio/video interaction (e.g., if the audio/video features indicate that the consumer was smiling, using a higher pitched tone, moving his or her hands in an excited manner, etc.), intent extractor 110 could determine that the intent of the audio/video interaction was excited interest in a particular product or service being discussed. In such a case, intent extractor 110 could select an audio/video clip where the individual says “That's great. I'm glad you're interested in this offering” in an excited tone, with an excited facial expression, and/or with an excited hand gesture. Accordingly, there may be multiple audio/video responses that match an intent derived from text features alone, and the audio/video response that is selected in a particular scenario can depend on the corresponding audio/video features.
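As one possible illustration, the response database could be keyed on an (intent, emotion) pair, falling back to a neutral variant when no emotion-specific clip exists; the table and clip names below are invented for this sketch.

```python
# Illustrative response table keyed by (intent, detected emotion).
RESPONSES = {
    ("interest", "excited"): "glad_interested_excited.mp4",
    ("interest", "neutral"): "glad_interested_neutral.mp4",
    ("busy", "anxious"): "call_back_reassuring.mp4",
    ("busy", "neutral"): "call_back_neutral.mp4",
}

def select_response(intent: str, emotion: str = "neutral") -> str:
    # Prefer the clip matching both intent and emotion; fall back to the
    # neutral variant for the same intent.
    return RESPONSES.get((intent, emotion), RESPONSES[(intent, "neutral")])

print(select_response("interest", "excited"))  # glad_interested_excited.mp4
print(select_response("busy"))                 # call_back_neutral.mp4
```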
In some embodiments, rather than employing pre-created audio/video clips or pre-defined content for rendering an audio/video response, intent extractor 110 (or another component) could dynamically generate an audio/video response. For example, intent extractor 110 could dynamically generate data for rendering an avatar that speaks a response that is tailored specifically to the features of the audio/video interaction and/or the derived intent. As one example, a dynamically generated audio/video response could include an avatar using the consumer's name or other information obtained from the audio/video interaction or previous audio/video interactions.
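By way of example only, dynamic generation could be as simple as filling a response template with lead data before handing the script to avatar-rendering and text-to-speech stages, both of which are outside this sketch.

```python
def build_dynamic_script(template: str, lead: dict) -> str:
    # Fill a response template with information from the lead; the rendered
    # avatar and synthesized speech stages are outside this sketch.
    return template.format(**lead)

template = "That's great, {first_name}. I'm glad you're interested in {product}."
lead = {"first_name": "Alex", "product": "our premium plan"}
print(build_dynamic_script(template, lead))
```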
As mentioned above, in some embodiments, the derivation of an intent based on the features of an audio/video interaction can also be based on previous interactions with the consumer and/or information known about the consumer. For example, intent extractor 110 could interface with lead database 130 and/or consumer interaction database 140 to obtain additional context for deriving an intent for a particular audio/video interaction and/or for selecting a particular audio/video response based on a derived intent. In this way, the audio/video response that is presented to the consumer can be further customized based on what is known about the consumer and his or her previous interactions.
In some embodiments, consumer interaction agent 150, intent extractor 110, or another component can be configured to transfer a video call (or other audio/video interaction) to an actual human. For example, in some embodiments, when a particular intent is derived from the features of an audio/video interaction, intent extractor 110 may be configured to cause consumer interaction agent 150 to transfer the video call to a sales representative or other agent of business 160 (or otherwise initiate a video call between the consumer and the human). In so doing, consumer interaction agent 150 could provide, to the human, the text that has been generated from the consumer's audio/video interactions along with the text of the audio/video responses that have been presented, thereby giving the human context for the hand-off. As one example, when intent extractor 110 determines that the intent of an audio/video interaction is to purchase a product or service, the video call could be transferred to a human to close the deal. As another example, when intent extractor 110 determines that an appropriate audio/video response cannot be provided (e.g., when the consumer's question or concern cannot be adequately addressed with any available audio/video clip or rendering), the video call could be transferred to a human to respond appropriately.
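As one possible illustration, the context provided to the human could be a speaker-labeled transcript assembled from the stored interactions; the structure below is an assumption made for this sketch.

```python
def build_handoff_context(turns: list[tuple[str, str]]) -> str:
    # Format (speaker, text) turns into a transcript for the human agent.
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)

turns = [
    ("Consumer", "How much does the premium plan cost?"),
    ("Agent", "It starts at $30 per month."),
    ("Consumer", "Okay, I'd like to sign up."),
]
print(build_handoff_context(turns))
```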
As can be seen, embodiments of the present invention enable a consumer interaction agent to appear to the consumer as if it were an actual human, both visually and verbally. Alternatively, the consumer interaction agent can appear as an avatar which may or may not resemble a human. In either case, by presenting audio/video responses that are selected based on the intent of the consumer's audio/video interactions, the consumer can receive faster service, responses that exhibit greater emotional intelligence, more comprehensive expertise, and an overall better experience. Embodiments of the present invention can therefore enhance the consumer experience in a wide variety of interaction scenarios.
Embodiments of the present invention may comprise or utilize special purpose or general-purpose computers including computer hardware, such as, for example, one or more processors and system memory. Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
Computer-readable media are categorized into two disjoint categories: computer storage media and transmission media. Computer storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (“SSDs”) (e.g., based on RAM), Flash memory, phase-change memory (“PCM”), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other similar storage medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Transmission media include signals and carrier waves. Because computer storage media and transmission media are disjoint categories, computer storage media does not include signals or carrier waves.
Computer-executable instructions comprise, for example, instructions and data which, when executed by a processor, cause a general-purpose computer, special-purpose computer, or special-purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language or P-Code, or even source code.
Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, smart watches, pagers, routers, switches, and the like.
The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices. An example of a distributed system environment is a cloud of networked servers or server resources. Accordingly, the present invention can be hosted in a cloud environment.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description.
This application claims the benefit of U.S. Provisional Application No. 63/304,959 which was filed on Jan. 31, 2022.