The present disclosure is related to multimodal interactions, and more particularly to a method and a system for personalized multimodal response generation through a virtual agent using Artificial Intelligence (AI) models.
In today's digital age, virtual agents and AI-powered systems have become increasingly integrated into our daily lives. These virtual agents are designed to understand and respond to user inputs, making them valuable tools for a wide range of applications, from customer service chatbots to virtual assistants in smart devices.
Traditionally, virtual agents and chatbots have relied heavily on text-based inputs and rule-based systems. These systems use predefined decision trees and scripts to generate responses to user queries. While they have been effective for simple tasks like information retrieval or basic customer support, they have significant limitations when it comes to handling more complex, natural language interactions.
With the advent of AI and Natural Language Processing (NLP), some advancements have been made in virtual agent technology. Large Language Models (LLMs), such as GPT-3 and GPT-4, have demonstrated impressive capabilities in understanding and generating human-like text. These models have been integrated into virtual agents, allowing them to provide more contextually relevant responses to text-based queries.
However, these existing techniques primarily focus on text-based interactions, and their ability to handle other modalities like speech, sensor data, or visual inputs is limited. Furthermore, they often lack the ability to understand user emotions and adapt their responses accordingly. These limitations result in less engaging and less effective user-agent interactions.
Therefore, in order to overcome the aforementioned problems, there exists a need for techniques that effectively utilize LLMs and multimodality to create personalized virtual agents across various modalities, including text, speech, and vision. Such virtual agents not only understand user inputs across these modalities but also engage in reasoning and respond with a high degree of personalization. These techniques enable virtual agents to be more adaptive, empathetic, and proficient in delivering responses that cater to users' unique needs and preferences.
It is within this context that the present embodiments arise.
The following embodiments present a simplified summary in order to provide a basic understanding of some aspects of the disclosed invention. This summary is not an extensive overview, and it is not intended to identify key/critical elements or to delineate the scope thereof. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
Some example embodiments disclosed herein provide a method for multimodal response generation through a virtual agent, the method comprising retrieving information related to an input received by the virtual agent. The virtual agent employs an Artificial Intelligence (AI) model. The method may further include generating a response corresponding to the input based on the retrieved information. The method may further include generating a plurality of prompts based on user characteristics and the input. The method may also include modifying the response based on the plurality of prompts to generate a multimodal response.
According to some example embodiments, the AI model is a Generative AI model.
According to some example embodiments, the method further comprises transmitting the multimodal response to the user, wherein the multimodal response is transmitted to the user in one or more combinations of modalities comprising text, speech, visual elements, and gesture.
According to some example embodiments, the method further comprises determining one or more modalities for generating the multimodal response based on the user's engagement and comprehension levels.
According to some example embodiments, the AI model continuously learns from historical interactions to upgrade reasoning and response framing, and to dynamically adapt to the user's accustomed communication style.
According to some example embodiments, the AI model employs a role-based approach following user-provided instructions and the plurality of prompts to generate the multimodal response.
According to some example embodiments, the AI model is trained to understand user emotions, enabling generation of the multimodal response adaptive to the user's emotional state.
According to some example embodiments, the plurality of prompts facilitates personalization of the response in real-time.
According to some example embodiments, the method further comprises storing a record of the input, the plurality of prompts, and the generated multimodal response for future reference and analysis.
According to some example embodiments, the method further comprises monitoring a user feedback on the multimodal response; adjusting subsequent prompts based on the user feedback; and modifying a subsequent response based on the subsequent prompts.
Some example embodiments disclosed herein provide a computer system for multimodal response generation through a virtual agent, the computer system comprises one or more computer processors, one or more computer readable memories, one or more computer readable storage devices, and program instructions stored on the one or more computer readable storage devices for execution by the one or more computer processors via the one or more computer readable memories, the program instructions comprising retrieving information related to an input received by the virtual agent. The virtual agent employs an Artificial Intelligence (AI) model. The one or more processors are further configured for generating a response corresponding to the input based on the retrieved information. The one or more processors are further configured for generating a plurality of prompts based on user characteristics and the input. The one or more processors are further configured for modifying the response based on the plurality of prompts to generate a multimodal response.
Some example embodiments disclosed herein provide a non-transitory computer readable medium having stored thereon computer-executable instructions which, when executed by one or more processors, cause the one or more processors to carry out operations for multimodal response generation through a virtual agent. The operations comprising retrieving information related to an input received by the virtual agent. The virtual agent employs an Artificial Intelligence (AI) model. The operations further comprising generating a response corresponding to the input based on the retrieved information. The operations further comprising generating a plurality of prompts based on user characteristics and the input. The operations further comprising modifying the response based on the plurality of prompts to generate a multimodal response.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
The above and still further example embodiments of the present disclosure will become apparent upon consideration of the following detailed description of embodiments thereof, especially when taken in conjunction with the accompanying drawings, and wherein:
The figures illustrate embodiments of the invention for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details. In other instances, systems, apparatuses, and methods are shown in block diagram form only in order to avoid obscuring the present invention.
Reference in this specification to “one embodiment” or “an embodiment” or “example embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearance of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Further, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
Some embodiments of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout.
The terms “comprise”, “comprising”, “includes”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device, or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus preceded by “comprises . . . a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present invention. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., are non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
The embodiments are described herein for illustrative purposes and are subject to many variations. It is understood that various omissions and substitutions of equivalents are contemplated as circumstances may suggest or render expedient but are intended to cover the application or implementation without departing from the spirit or the scope of the present invention. Further, it is to be understood that the phraseology and terminology employed herein are for the purpose of the description and should not be regarded as limiting. Any heading utilized within this description is for convenience only and has no legal or limiting effect.
The term “module” used herein may refer to a hardware processor including a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), an Application-Specific Instruction-Set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a Controller, a Microcontroller unit, a Processor, a Microprocessor, an ARM, or the like, or any combination thereof.
The term “machine learning model” may be used to refer to a computational, statistical, or mathematical model that is trained using classical ML modelling techniques, with or without classical image processing. The “machine learning model” is trained over a set of data using an algorithm from which it learns patterns in the dataset.
The term “artificial intelligence” may be used to refer to a model built using simple or complex Neural Networks using deep learning techniques and computer vision algorithms. An artificial intelligence model learns from the data and applies that learning to achieve specific pre-defined objectives.
The term “virtual agent” may be used to refer to a virtual assistant that is a computer program or AI system designed to simulate human-like conversations with users. Virtual agents are typically powered by artificial intelligence and natural language processing technologies. The virtual agent can understand user inputs, generate appropriate responses, and perform specific tasks or provide information. Virtual agents are often used in customer support, information retrieval, and other applications to provide automated and efficient conversational experiences.
Embodiments of the present disclosure may provide a method, a system, and a computer program product for multimodal input processing for a virtual agent. The method, the system, and the computer program product for multimodal response generation through a virtual agent are described with reference to
In some embodiments, the medium 106 supports multimodal communication, allowing the user 102 to combine various forms, such as text, speech, and visual elements. This multimodal capability enables a more natural and interactive communication between the user 102 and the virtual agent 104.
Furthermore, the medium 106 may involve sensor data, such as data from accelerometers, gyroscopes, GPS sensors, or other environmental sensors (e.g., temperature and humidity sensors). This environmental data plays an important role in understanding user context and emotions, which is essential for generating personalized multimodal responses.
The user input, in this context, represents a query, request, or command from the user 102. This input serves as the starting point for the virtual agent 104 to understand the user's needs and provide appropriate assistance or information. Importantly, this input may span across various modalities, reflecting the diverse communication styles of users.
To further elaborate, here are some detailed examples of what these inputs may include:
Textual Input: The user may provide text-based queries or requests through written messages or chat. For instance: “Can you recommend a good Italian restaurant nearby?”, or “Tell me today's weather forecast.”
Speech Input: Users may interact with the virtual agent 104 using spoken language. For instance: “Call John Smith.”, or “Play some relaxing music.”
Visual Input: Users may convey their feelings through facial expressions and gestures, and the virtual agent 104 may interpret and respond to these inputs accordingly. For instance: A user might smile to indicate happiness or agreement.
Sensor Data: In scenarios involving IoT devices, wearable gadgets, or environmental sensors, the user's inputs may include data generated by these sensors. For example: health data from a fitness tracker, such as heart rate and steps taken, or home automation commands, like adjusting the thermostat temperature.
Commands and Requests: Users may issue direct commands or requests for specific actions. For example: “Set an alarm for 7 AM.”, or “Send a text message to Mom.”
Questions and Inquiries: Users often seek information or answers to questions. For example: “What's the capital of France?”, or “How do I bake a chocolate cake?”
Personal Preferences: Users may provide input related to their personal preferences or choices. For example: “Recommend a movie similar to the one I watched last week.”, or “Suggest a restaurant that serves vegetarian cuisine.”
Location-Based Queries: Input may involve location-based requests or queries. For example: “Find the nearest gas station.”, or “Give me directions to the nearest bus stop.”
In some embodiments, while communicating with the user 102, the virtual agent 104 may combine the multimodal inputs, such as text, speech, and visuals of the user, to understand the user's query or request and provide a more personalized response in visual form.
By way of an example, while communicating with the user 102, the virtual agent 104 may analyze the following:
Facial Expressions: Users may use their facial expressions to convey emotions or reactions. For instance: frowning or a furrowed brow may signify confusion or dissatisfaction, or raising an eyebrow may signal curiosity or scepticism.
Emotional Cues: Visual input as well as audio-based input and other forms of input help understand the emotional cues of the user. For example: if the user appears sad or teary-eyed, the agent can respond with empathy and offer comforting words, or the user 102 with an excited expression can prompt the agent to respond with enthusiasm.
Gestures: Users may use hand gestures or body language to communicate non-verbally. For example: pointing at an object in the environment, indicating interest or a question about that object, or waving a hand to get the agent's attention or to say goodbye.
Visual Cues: Users may provide visual cues by showing specific objects or scenes through their device's camera. For instance: displaying a broken appliance and asking for help in identifying the issue or sharing a photo of a product they want to purchase and asking for reviews or price information.
Facial Features: Detailed analysis of facial features, such as eye movements or the position of the mouth, may help gauge the user's emotional state or level of engagement. For example: dilated pupils may indicate excitement or interest, or avoiding eye contact might suggest shyness or discomfort.
The virtual agent 104, equipped with computer vision algorithms and advanced multimodal response generation techniques, may combine and analyze these visual inputs to understand the user's intent or emotions. This enables the virtual agent 104 to respond in a visual, personalized, and contextually relevant manner, enhancing the overall user experience during interactions.
The ability to understand and respond to users across various modalities while considering individual user characteristics and emotions is a key differentiator and advantage of the present disclosure in the field of personalized multimodal response generation. This is further explained in greater detail in conjunction with
Example Machine Architecture and Machine-Readable Medium
While only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example of the machine 200 includes at least one processor 202 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), advanced processing unit (APU), or combinations thereof), one or more memories such as a main memory 204, a static memory 206, or other types of memory, which communicate with each other via link 208. Link 208 may be a bus or other type of connection channel. The machine 200 may include further optional aspects such as a graphics display unit 210 comprising any type of display. The machine 200 may also include other optional aspects such as an alphanumeric input device 212 (e.g., a keyboard, touch screen, and so forth), a user interface (UI) navigation device 214 (e.g., a mouse, trackball, touch device, and so forth), a storage unit 216 (e.g., disk drive or other storage device(s)), a signal generation device 218 (e.g., a speaker), sensor(s) 221 (e.g., global positioning sensor, accelerometer(s), microphone(s), camera(s), and so forth), output controller 228 (e.g., wired or wireless connection to connect and/or communicate with one or more other devices such as a universal serial bus (USB), near field communication (NFC), infrared (IR), serial/parallel bus, etc.), and a network interface device 220 (e.g., wired and/or wireless) to connect to and/or communicate over one or more networks 226.
Executable Instructions and Machine-Storage Medium: The various memories (i.e., 204, 206, and/or memory of the processor(s) 202) and/or storage unit 216 may store one or more sets of instructions and data structures (e.g., software) 224 embodying or utilized by any one or more of the methodologies or functions described herein. These instructions, when executed by processor(s) 202, cause various operations to implement the disclosed embodiments.
As used herein, the terms “machine-storage medium,” “device-storage medium,” “computer-storage medium” mean the same thing and may be used interchangeably in this disclosure. The terms refer to a single or multiple storage devices and/or media (e.g., a centralized or distributed database, and/or associated caches and servers) that store executable instructions and/or data. The terms shall accordingly be taken to include storage devices such as solid-state memories, and optical and magnetic media, including memory internal or external to processors. Specific examples of machine-storage media, computer-storage media and/or device-storage media include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGA, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms machine-storage media, computer-storage media, and device-storage media specifically and unequivocally exclude carrier waves, modulated data signals, and other such transitory media, at least some of which are covered under the term “signal medium” discussed below.
Signal Medium: The term “signal medium” shall be taken to include any form of modulated data signal, carrier wave, and so forth. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
Computer Readable Medium: The terms “machine-readable medium,” “computer-readable medium” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure. The terms are defined to include both machine-storage media and signal media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.
As used herein, the term “network” may refer to a long-range cellular network (such as a GSM (Global System for Mobile Communication) network, an LTE (Long-Term Evolution) network, or a CDMA (Code Division Multiple Access) network) or a short-range network (such as a Bluetooth network, a Wi-Fi network, an NFC (near-field communication) network, LoRaWAN, ZigBee, or a wired network such as a LAN, and the like).
As used herein, the term “computing device” may refer to a mobile phone, a personal digital assistant (PDA), a tablet, a laptop, a computer, a VR headset, smart glasses, a projector, or any such capable device.
As used herein, the term ‘electronic circuitry’ may refer to (a) hardware-only circuit implementations (for example, implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of ‘circuitry’ applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term ‘circuitry’ also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term ‘circuitry’ as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
Accordingly, blocks of the flow diagram support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flow diagram, and combinations of blocks in the flow diagram, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions. The method 300 illustrated by the flowchart diagram of
The method 300 starts at step 302, where the virtual agent is ready to retrieve the user input. At step 304, the virtual agent may retrieve information related to an input received by the virtual agent. This input may come in various forms, including text, speech, and visual cues. The virtual agent may employ an Artificial Intelligence (AI) model, which may, in particular embodiments, be a Generative AI (GenAI) model. The GenAI model represents a cutting-edge approach to artificial intelligence and is capable of multifaceted operations.
In specific implementations, the GenAI model may include, but is not limited to, a Large Language Model (LLM) for text, a Vision Language Model (VLM) for vision-text, a speech model for speech, and other relevant modules. This comprehensive GenAI model is designed to process and respond to multimodal inputs effectively, making it exceptionally versatile in understanding and interacting with users across different modalities such as text, vision, and speech. In some embodiments, the GenAI model may take the form of an ensemble model, allowing for even greater adaptability and proficiency in handling diverse inputs and user interactions.
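By way of a non-limiting illustration, the following Python sketch shows one way such an ensemble might route multimodal inputs to modality-specific sub-models. The class names and the stand-in models (EnsembleGenAI, ModalInput, and the lambda placeholders) are assumptions made for this sketch, not part of any particular embodiment.

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class ModalInput:
        modality: str      # "text", "speech", or "vision"
        payload: Any       # raw text, audio bytes, or image bytes

    class EnsembleGenAI:
        """Illustrative ensemble: routes each input to a modality-specific model."""
        def __init__(self, models: dict[str, Callable[[Any], str]]):
            self.models = models   # e.g. {"text": llm, "vision": vlm, "speech": asr}

        def understand(self, inputs: list[ModalInput]) -> dict[str, str]:
            # Each sub-model converts its modality into a textual observation
            # that downstream response generation can consume.
            return {i.modality: self.models[i.modality](i.payload) for i in inputs}

    # Hypothetical stand-ins for an LLM, a VLM, and a speech model.
    ensemble = EnsembleGenAI({
        "text":   lambda t: f"user said: {t}",
        "vision": lambda img: "user appears to be smiling",
        "speech": lambda audio: "transcript: play some relaxing music",
    })
    print(ensemble.understand([ModalInput("text", "Hi there")]))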
At step 306, the method 300 may further include generation of a response corresponding to the input based on the retrieved information.
To further elaborate, before generating the response, the virtual agent thoroughly analyzes the user's input. This analysis includes understanding the content, context, intent, and sentiment conveyed by the user across various modalities, such as text, speech, and visual cues.
The virtual agent may refer to its knowledge base, which can be an internal database or an external data source, to gather additional information relevant to the user's query or input. This information retrieval process helps ensure that the response is factually accurate and contextually rich. Based on the analysis of the user's input and the retrieved information, the virtual agent generates an initial response. This response may take the form of text, speech, visual elements, or a combination of these modalities, depending on the nature of the user's input and the design of the virtual agent.
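As a minimal, hypothetical sketch of this retrieval-then-respond flow (steps 304 and 306), the snippet below uses a toy in-memory knowledge base and a naive keyword match; the KNOWLEDGE_BASE contents and the retrieve and generate_response helpers are illustrative assumptions rather than the disclosed implementation.

    # Sketch of steps 304-306: retrieve related information, then draft an
    # initial response grounded in it.
    KNOWLEDGE_BASE = {
        "weather": "Today's forecast is sunny with a high of 24 C.",
        "restaurant": "Trattoria Roma is a well-reviewed Italian restaurant nearby.",
    }

    def retrieve(query: str) -> list[str]:
        # Naive keyword match; a real system might use embeddings or a search index.
        return [fact for key, fact in KNOWLEDGE_BASE.items() if key in query.lower()]

    def generate_response(query: str) -> str:
        facts = retrieve(query)
        if not facts:
            return "I could not find anything relevant, could you tell me more?"
        return " ".join(facts)

    print(generate_response("Can you recommend a good Italian restaurant nearby?"))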
The virtual agent takes into account the user's characteristics, preferences, and historical interactions to tailor the response accordingly. For example, if the user has previously expressed a preference for a formal tone, the initial response may be composed in a formal style.
In scenarios where the user's input conveys emotional cues, such as sadness or frustration, the virtual agent may incorporate emotional intelligence. This means that the response may be designed to acknowledge the user's emotions and provide empathetic or supportive language.
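The following short Python sketch hints at how user characteristics and emotional cues might condition the draft response; the profile fields and the tone rules are assumptions made purely for illustration.

    # Illustrative personalization of an initial response using an assumed
    # user profile and a detected emotion label.
    def personalize(response: str, profile: dict, emotion: str) -> str:
        if emotion in ("sad", "frustrated"):
            response = "I'm sorry you're going through this. " + response
        if profile.get("tone") == "formal":
            response = response.replace("you're", "you are")
        return response

    profile = {"tone": "formal", "preferred_modalities": ["text", "speech"]}
    print(personalize("Here's what you're looking for.", profile, "sad"))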
At step 308, a plurality of prompts may be generated based on user characteristics and the input. The plurality of prompts facilitates personalization of the response in real-time. These prompts are created based on user characteristics and the content of the input. The prompts play a crucial role in guiding the subsequent response generation process.
In a more elaborate way, these prompts are designed to enhance personalization and ensure that the final multimodal response aligns with the user's individual characteristics, preferences, and the specific input provided. This may be achieved by the following steps:
Further, at step 310, the response may be rephrased based on the plurality of prompts to generate the multimodal response. In this step, the virtual agent selects an appropriate prompt from the plurality of prompts generated. The choice of prompt depends on several factors, including the user's characteristics, emotional state, and the specific context of the conversation.
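A minimal sketch of steps 308 and 310 is given below, assuming a simple dictionary of candidate prompts keyed by emotional state; the helper names (build_prompts, select_prompt, rephrase) are hypothetical and stand in for the LLM-driven generation described herein.

    # Sketch of steps 308-310: build candidate prompts from user characteristics
    # and the input, pick one matching the user's emotional state, and use it to
    # rephrase the initial response.
    def build_prompts(user: dict, user_input: str) -> dict:
        # Candidate prompts keyed by emotional state; user characteristics could
        # add further variants (tone, preferred style, etc.).
        name = user.get("name", "there")
        return {
            "stressed": f"{name}, I'm here to help you manage your stress. "
                        "Let's start by discussing some relaxation techniques.",
            "neutral":  f"{name}, let's look at your question together: {user_input}",
        }

    def select_prompt(prompts: dict, emotion: str) -> str:
        return prompts.get(emotion, prompts["neutral"])

    def rephrase(initial_response: str, prompt: str) -> str:
        # A real system would hand the selected prompt and the draft response to
        # the LLM for rewriting; here they are merely merged to show the data flow.
        return f"{prompt} {initial_response}"

    prompts = build_prompts({"name": "Alex"}, "Can you help me manage my stress?")
    print(rephrase("Deep breathing for a few minutes can help.",
                   select_prompt(prompts, "stressed")))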
By way of an example, consider a scenario where a user initiates a conversation with a virtual agent by saying, “I've been feeling really stressed lately due to work pressure. Can you help me manage my stress?”. Let's say the virtual agent selects the following prompt from the set: “I'm here to help you manage your stress. Let's start by discussing some relaxation techniques.”
The virtual agent uses this selected prompt as a foundation for its response to the user. However, it does not deliver the prompt verbatim. Instead, it rephrases it into a more comprehensive and empathetic response:
In some example embodiments, a computer programmable product may be provided. The computer programmable product may comprise at least one non-transitory computer-readable storage medium having stored thereon computer-executable program code instructions that when executed by a computer, cause the computer to execute the method 300.
In an example embodiment, an apparatus for performing the method 300 of
Alternatively, the apparatus may comprise means for performing each of the operations described above. In this regard, according to an example embodiment, examples of means for performing operations (302-312) may comprise, for example, the processor 202 which may be implemented in the system 200 and/or a device or circuit for executing instructions or executing an algorithm for processing information as described above.
The user query may include a wide range of topics, including but not limited to:
The multimodal nature of the user query ensures that the virtual agent 104 may accommodate a wide array of communication styles and user preferences, making the interaction not only efficient but also personalized and engaging. This flexibility enables the virtual agent 104 to address diverse user needs and advance meaningful interactions across various domains and scenarios.
In the backend, while answering the user's query, the virtual agent 104 may include various components that may perform various functionalities:
Based on the user's voice, emotions, and character, the virtual agent may customize its speech tone, face pose, and character. To achieve this, the multimodal mixer module 414 may include components such as a speech tone modifier 416, a factual information retriever 418 that retrieves factual information and delivers it as natural language, and a face pose modifier 420.
The speech tone modifier 416 may be capable of modifying the speech tone of the virtual agent 104. This allows the virtual agent to convey the response with the desired emotional tone or style.
The factual information retriever 418 component retrieves factual information from the LLM 402 to ensure that the response is grounded in accurate data and facts. The retrieved information is presented as natural language in order to respond back to the user.
The face pose modifier 420 component focuses on adjusting the virtual agent's facial expression, which may be included in the visual aspect of the response.
The components for modifying speech tone, retrieving factual responses, and adjusting facial expressions work together to generate a modified character of the virtual agent 104 via a character generator 422. This character represents the virtual agent's response and is designed to convey the information effectively.
Finally, the modified character, which encapsulates the multimodal response, is transmitted to the user 102 through the virtual agent 104. The response is tailored to include text, visual elements, and speech, enhancing the overall user-agent interaction.
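One possible, simplified composition of the mixer components described above (speech tone modifier 416, factual information retriever 418, face pose modifier 420, and character generator 422) is sketched below in Python; the AgentCharacter structure and the tone and pose rules are assumptions for illustration only.

    # Illustrative composition of the mixer components into a single
    # "character" that carries the multimodal response.
    from dataclasses import dataclass

    @dataclass
    class AgentCharacter:
        text: str          # natural-language answer (factual information retriever)
        speech_tone: str   # e.g. "empathetic", "neutral" (speech tone modifier)
        face_pose: str     # e.g. "smiling", "concerned" (face pose modifier)

    def character_generator(answer: str, user_emotion: str) -> AgentCharacter:
        tone = "empathetic" if user_emotion in ("sad", "anxious") else "neutral"
        pose = "concerned" if tone == "empathetic" else "smiling"
        return AgentCharacter(text=answer, speech_tone=tone, face_pose=pose)

    print(character_generator("Let's work through this together.", "anxious"))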
For the sake of explanation of multimodal response generation process, consider an exemplary scenario 500 of user and virtual agent interaction, as illustrated in
In a more elaborate way, the virtual agent, in conjunction with the LLM, may make the following observations by looking at the user: the user feels a bit low and anxious, the user's eyes are not looking straight, and the user's speech is very feeble. The LLM collects the following information related to the user's feelings: the user is feeling low and anxious, the user's eyes are not looking straight, and the user's speech is very feeble.
Further, the user may ask a multimodal query, “Will things get better for me in life?”. The multimodal input includes text (typing), voice (spoken words), and visual cues (facial expression and gestures) to convey the user's emotional state.
Based on the user's query and the collected emotional information, the designer comes into play. The LLM collaborates with the designer to devise a prompt aimed at exploring the user's current emotional state and identifying the aspects of their life that are causing concern. This prompt is designed to improve the user's mood and emotional well-being, aligning with the observed emotional journey information. For example, as shown in present
Subsequently, the responder uses text to prompt the user to share their feelings. It expresses empathy and understanding, acknowledging the user's emotions. Furthermore, the responder adjusts its speech tone to be more empathetic, matching the user's emotional state, and lowering it accordingly. Simultaneously, in the visual aspect, the virtual agent displays an empathetic facial expression, reinforcing the supportive and understanding nature of the response.
In light of the user's expressed feelings of being low and anxious, along with their eyes and speech as indicators of emotional intensity, the virtual agent responds with a compassionate message. It states, “I'm sorry to hear you're feeling low and anxious. Acknowledging your emotions and taking care of yourself is crucial. Your eyes and speech may reflect the intensity of your feelings. To give a better answer, I need more context. Life has its ups and downs; seeking support from friends, family, or a counselor can help. You're not alone; don't hesitate to ask for help. Take small steps in self-care and find joy in activities. While I can't predict the future, I'm here to support you. Feel free to share more, and remember, talking about your feelings is the first step to positive change.”
Following this response, the virtual agent continues to monitor the user's emotional state and designs prompts accordingly. For example, in the depicted scenario, the user transitions from initial frustration and anxiety to a blank and reflective state and eventually to a state of calm. The LLM guides the designer to provide prompts that encourage positive thinking and outlook in life, facilitating a personalized and empathetic interaction.
In a more generalized way, the interaction between the user and the virtual agent may be explained with the help of another example where the virtual agent persona is created beforehand. The details of the created virtual agent are as follows:
The user begins conversation with the virtual agent named “Jason” as follows:
This example showcases the effectiveness of a virtual agent like Jason in understanding and responding to the user's changing emotional states. Jason's persona, empathetic responses, and adaptability help to transform the user's negative emotions into a more positive and hopeful outlook. The engagement level is consistently maintained throughout the conversation, demonstrating a personalized and emotionally intelligent interaction.
Accordingly, blocks of the flow diagram support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flow diagram, and combinations of blocks in the flow diagram, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
The method 600 illustrated by the flow diagram of
In specific implementations, the GenAI model may include, but is not limited to, a Large Language Model (LLM) for text, a Vision Language Model (VLM) for vision-text, a speech model for speech, and other relevant modules. This comprehensive GenAI model is designed to process and respond to multimodal inputs effectively, making it exceptionally versatile in understanding and interacting with users across different modalities such as text, vision, and speech. In some embodiments, the GenAI model may take the form of an ensemble model, allowing for even greater adaptability and proficiency in handling diverse inputs and user interactions. It should be noted that the multimodal input processing may be enhanced by utilizing either a single model or an ensemble of models designed for specific modalities, such as text, voice, and visual inputs. These models play an important role in converting the multimodal inputs into a standardized format before presenting them to the collector for entity recognition and extraction.
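To make the normalization step concrete, the following hypothetical sketch converts each modality into a standardized record before a toy collector extracts candidate entities; the record layout and the capitalized-token heuristic are assumptions, not the disclosed entity-recognition technique.

    # Sketch: normalize each modality into a standardized record, then let a
    # simple "collector" gather candidate entities from the records.
    def normalize(modality: str, raw: object) -> dict:
        converters = {
            "text":   lambda r: str(r),
            "speech": lambda r: f"(transcribed) {r}",
            "vision": lambda r: f"(described) {r}",
        }
        return {"modality": modality, "text": converters[modality](raw)}

    def collector(records: list[dict]) -> dict:
        # Toy entity extraction: collect capitalized tokens as candidate entities.
        entities = [tok.strip(".,?") for rec in records
                    for tok in rec["text"].split() if tok[:1].isupper()]
        return {"entities": entities, "records": records}

    records = [normalize("text", "Call John Smith"), normalize("vision", "user is smiling")]
    print(collector(records))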
Once the information is retrieved, method 600 may further include, at step 606, generating a response corresponding to the input based on the retrieved information. Further, the method 600, at step 608, may include generating a plurality of prompts based on user characteristics and the input. Subsequently, at step 610, the method 600 includes modifying the response based on the plurality of prompts to generate a multimodal response. This is already explained in conjunction with
Further, at step 612 the method 600 may include determining one or more modalities for generating the multimodal response based on the user's engagement and comprehension levels. In this step, the virtual agent assesses the user's engagement and comprehension levels to make informed decisions about the modalities to be used when generating the multimodal response. The choice of modalities is adapted to the user's preferences and needs, aiming to provide the most effective and engaging interaction. The one or more modalities may be determined by the following steps:
After generating the multimodal response, at step 614 the method 600 may include transmitting the multimodal response to the user. The multimodal response is transmitted to the user in one or more combinations of modalities comprising text, speech, visual elements, and gesture. This ensures that the user receives the response in a format that suits their preferences and needs. The method 600 may be terminated, at step 616.
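A minimal sketch of how steps 612 and 614 might fit together is shown below, assuming numeric engagement and comprehension scores in the range 0 to 1; the thresholds and the simple per-modality rendering are illustrative assumptions.

    # Sketch of steps 612-614: choose output modalities from estimated
    # engagement and comprehension scores, then package the response.
    def select_modalities(engagement: float, comprehension: float) -> list[str]:
        modalities = ["text"]                      # text is always available
        if engagement < 0.5:
            modalities.append("visual")            # add visuals to re-engage the user
        if comprehension < 0.5:
            modalities.append("speech")            # a spoken explanation may help
        return modalities

    def transmit(response: str, modalities: list[str]) -> dict:
        return {m: response for m in modalities}   # one rendering per chosen modality

    print(transmit("Here is a simpler explanation.", select_modalities(0.4, 0.3)))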
In some example embodiments, a computer programmable product may be provided. The computer programmable product may comprise at least one non-transitory computer-readable storage medium having stored thereon computer-executable program code instructions that when executed by a computer, cause the computer to execute the method 600.
In an example embodiment, an apparatus for performing the method 600 of
Accordingly, blocks of the flow diagram support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flow diagram, and combinations of blocks in the flow diagram, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions. The method 600 illustrated by the flowchart diagram of
The method 700 starts at 702 and commences with retrieving information related to an input received by the virtual agent, at step 704. The virtual agent may employ an Artificial Intelligence (AI) model. In some embodiments, the AI model may be a Generative AI (GenAI) model. Examples of the GenAI model may include, but are not limited to, an LLM, a VLM, and the like. In some embodiments, the GenAI model may be an ensemble model.
Once the information is retrieved, method 700 may further include, at step 706, generating a response corresponding to the input based on the retrieved information. Further, the method 700, at step 708, may include generating a plurality of prompts based on user characteristics and the input. Subsequently, at step 710, the method 700 includes modifying the response based on the plurality of prompts to generate a multimodal response.
Further, at step 712 the method 700 may include determining one or more modalities for generating the multimodal response based on the user's engagement and comprehension levels. Further, at step 714 the method 700 may include transmitting the multimodal response to the user. The multimodal response is transmitted to the user in one or more combinations of modalities comprising text, speech, visual elements, and gesture.
Furthermore, at step 716 the method 700 may include storing a record of the input, the plurality of prompts, and the generated multimodal response for future reference and analysis. At this stage, the virtual agent ensures that a comprehensive record of the conversation is maintained. This record includes the user's original input, the prompts that were generated based on the user's characteristics and input, and the final multimodal response. This data is valuable for several reasons:
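By way of a non-limiting example, step 716 could be realized with a simple append-only log such as the JSON-lines sketch below; the file name, record fields, and format are assumptions made for illustration.

    # Illustrative record-keeping for step 716: the input, the generated prompts,
    # and the final multimodal response are appended to a persistent log.
    import json, time

    def store_record(path: str, user_input: str, prompts: list, response: dict) -> None:
        record = {
            "timestamp": time.time(),
            "input": user_input,
            "prompts": prompts,
            "response": response,
        }
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    store_record("interactions.jsonl",
                 "Will things get better for me in life?",
                 ["explore current emotional state"],
                 {"text": "I'm here to support you.", "speech_tone": "empathetic"})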
In some example embodiments, a computer programmable product may be provided. The computer programmable product may comprise at least one non-transitory computer-readable storage medium having stored thereon computer-executable program code instructions that when executed by a computer, cause the computer to execute the method 700.
In an example embodiment, an apparatus for performing the method 700 of
Accordingly, blocks of the flow diagram support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flow diagram, and combinations of blocks in the flow diagram, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
The method 800 illustrated by the flow diagram of
Further, the method 800 may include, at step 806, adjusting subsequent prompts based on the user feedback. The primary objective is to adapt the prompts and instructions used in generating responses to better align with the user's preferences and emotional states. This may be achieved by the following steps:
By adjusting subsequent prompts based on user feedback, the virtual agent aims to create a more user-centric and emotionally aware interaction environment. This adaptive approach helps improve user engagement and satisfaction by addressing specific concerns and preferences expressed by the user during interactions.
Further, the method 800, at step 808, may include modifying a subsequent response based on the subsequent prompts. By modifying responses to align with the user's emotional state and preferences, the virtual agent tries to create a more meaningful and satisfying interaction. This iterative process contributes to the virtual agent's ability to provide empathetic and contextually relevant responses. The method 800 terminates at step 810.
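The sketch below illustrates, under assumed feedback categories and adjustment rules, how the feedback loop of method 800 might thread user feedback into the next prompt and the subsequent response; none of the helper names are part of the disclosure itself.

    # Sketch of method 800: monitor feedback on a response, adjust the next
    # prompt accordingly, and modify the subsequent response.
    def adjust_prompt(base_prompt: str, feedback: str) -> str:
        if feedback == "too_formal":
            return base_prompt + " Keep the wording casual and friendly."
        if feedback == "not_empathetic":
            return base_prompt + " Acknowledge the user's feelings first."
        return base_prompt

    def next_response(draft: str, adjusted_prompt: str) -> str:
        # In practice the adjusted prompt would steer the LLM; here we simply
        # show that it flows into the next generation step.
        return f"[guided by: {adjusted_prompt}] {draft}"

    prompt = adjust_prompt("Help the user plan their week.", "not_empathetic")
    print(next_response("Let's map out a lighter schedule.", prompt))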
In some embodiments, user emotions and information may be captured at frequent intervals in a sliding window (e.g., over each conversation), and the virtual agent may then track whether the user's behavior has improved or changed, positively or negatively, over time.
In some embodiments, in order to improve the generated response, a temporal observation and user analysis may be performed by the LLM that involves continuously capturing user emotions and information at frequent intervals, typically after each conversation or interaction, and then tracking how the user's behavior, emotions, or overall state changes over time. This process aims to understand how the language model's responses, along with various data inputs, influence the user's experience and well-being. The temporal observation and user analysis facilitates:
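A hypothetical sliding-window tracker along these lines is sketched below; the emotion-to-score mapping and the window size are assumptions chosen only to illustrate the trend computation.

    # Illustrative sliding-window tracking of the user's emotional trajectory:
    # one score per conversation is kept for the last N interactions, and the
    # trend indicates whether the user's state appears to be improving.
    from collections import deque

    EMOTION_SCORE = {"anxious": -2, "sad": -1, "neutral": 0, "calm": 1, "happy": 2}

    class EmotionTracker:
        def __init__(self, window: int = 5):
            self.scores = deque(maxlen=window)

        def observe(self, emotion: str) -> None:
            self.scores.append(EMOTION_SCORE.get(emotion, 0))

        def trend(self) -> str:
            if len(self.scores) < 2:
                return "insufficient data"
            return "improving" if self.scores[-1] > self.scores[0] else "not improving"

    tracker = EmotionTracker()
    for emotion in ["anxious", "neutral", "calm"]:
        tracker.observe(emotion)
    print(tracker.trend())   # "improving" for this sequence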
In some example embodiments, a computer programmable product may be provided. The computer programmable product may comprise at least one non-transitory computer-readable storage medium having stored thereon computer-executable program code instructions that when executed by a computer, cause the computer to execute the method 800.
In an example embodiment, an apparatus for performing the method 800 of
As will be appreciated by those skilled in the art, the techniques described in the various embodiments discussed above are not routine, or conventional, or well understood in the art. The present disclosure addresses the limitations of existing techniques in multimodal response generation through a virtual agent. Unlike conventional approaches that often rely solely on text-based input, the disclosed techniques enable virtual agents to generate multimodal responses. Users may communicate through text, speech, and visual cues, making interactions more natural and accommodating diverse user preferences. By utilizing an LLM in conjunction with a role-based approach and continuous monitoring of user engagement and satisfaction, the disclosed techniques represent a novel and highly innovative approach to enhancing the capabilities of virtual agents.
The techniques discussed above provide various advantages that may significantly enhance both personal and professional aspects of life. By introducing a personalized, skill-focused virtual agent that leverages Large Language Models (LLMs) and multimodality, these advancements open doors to an entirely new level of user interaction. This virtual agent may be a game-changer in various domains, providing a host of benefits.
First and foremost, these techniques have the potential to revolutionize the customer experience by enabling live interactions and personalization. Users can expect a level of engagement and assistance that goes beyond traditional AI systems. This personalized touch may lead to improved satisfaction, making interactions more meaningful and productive.
One of the standout advantages is the boost in productivity across different tasks and domains. Whether it's at work or in daily life, the virtual agent's capabilities translate to faster task completion, saving valuable time and resources. This efficiency gain may have a significant impact on overall work productivity and life management.
Moreover, these techniques enable the creation of a virtual agent with diverse personas, providing a multi-dimensional view and understanding of user needs and preferences. This versatility ensures that the virtual agent may adapt to various roles and scenarios, catering to a wide range of user requirements.
The incorporation of factual evidence through retrieval augmented generation is another noteworthy advantage. This means that the virtual agent may access and utilize relevant information from internal and external sources, enhancing its ability to provide accurate and informed responses.
Furthermore, these techniques promote internal information sharing across different functionalities, fostering collaboration and knowledge exchange within the virtual agent system. This collective intelligence can result in more comprehensive and contextually relevant responses.
Lastly, these advancements enable interactive and responsible conversations with users, taking into account their previous history and interactions. This level of continuity and context-awareness creates a more engaging and meaningful user-agent relationship.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
It is to be understood that the above description is intended to be illustrative, and not restrictive. For example, the above-discussed embodiments may be used in combination with each other. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description.
With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
The benefits and advantages which may be provided by the present invention have been described above with regard to specific embodiments. These benefits and advantages, and any elements or limitations that may cause them to occur or to become more pronounced are not to be construed as critical, required, or essential features of any or all of the embodiments.
While the present invention has been described with reference to particular embodiments, it should be understood that the embodiments are illustrative and that the scope of the invention is not limited to these embodiments. Many variations, modifications, additions, and improvements to the embodiments described above are possible. It is contemplated that these variations, modifications, additions, and improvements fall within the scope of the invention.