The present disclosure relates to systems and methods for establishing or generating multi-turn communications between a robot device and an individual, consumer or user, where the systems or methods utilize a SocialX cloud-based conversation module to assist in communication generation.
Since the dawn of artificial intelligence (AI), there has been a strong desire to create autonomous agents capable of natural communication with human users. While conversational agents (e.g., Alexa, Google Home, or Siri) have made their way into our daily lives, their conversational capabilities are still very limited. Specifically, conversational interactions only function in a single-transaction fashion, also called command-response interaction (i.e., the human user has an explicit request and the agent provides a single response). Multi-turn conversational interactions are rare if not non-existent and do not go beyond direct requests to gather information and/or reduce ambiguity. For example, a sample conversation may look like: User: Alexa, I want to make a reservation; Alexa/Machine: Ok, which restaurant?; User: Tar and Roses in Santa Monica; and Alexa makes the reservation. Modern machine learning technologies (i.e., transformer models such as GPT-2 or GPT-3) have opened up possibilities that go beyond those of current intent-based transactional conversational agents. These models are able to generate seemingly human-sounding stories, conversations, and news articles (e.g., OpenAI even (in a publicity stunt) called these technologies too dangerous to be made publicly available).
However, these modern machine learning models come with a number of significant drawbacks. First, these models are massive and cannot run on lean IoT devices (e.g., robot computing devices) that have limited computational power and memory. Second, even when run on a GPU-accelerated machine, these models take several seconds to generate an output, which is prohibitive for real-time conversational agents. As a general rule, the sense-act loop for such conversational agents needs to be below 400-500 ms to maintain engagement with the human or consumer. Third, these massive machine learning models are trained on enormous amounts of data (essentially the entirety of the internet) and are therefore tainted by the following drawbacks: (1) lewd language; (2) false and unverified information (e.g., the model might claim that Michael Crichton was the director of the movie Jurassic Park, when he was only the author of the book); (3) a generic point of view rather than a specific point of view (e.g., in one instance the model could be Democrat and in the next Republican; in one instance its favorite food could be steak and in the next it could be a strict vegan; etc.); (4) training takes an enormous amount of time and energy, so a model represents a single moment in time (e.g., the vast majority of state-of-the-art models have been trained on data collected in 2019 and have therefore never heard of Covid-19); and (5) again, because this data originates from everyone writing on the internet, the language used is generic and does not represent the voice of a single persona (e.g., in one instance the model might generate sentences that are believably expressed by a child, such as “Toy Story is my favorite movie,” and in the next it could generate “I have three children and work as an accountant”).
Fourth, the models taken by themselves still only have short-term memory that washes out over a few conversational turns and are not capable of building a long-term relationship with a human user or consumer.
One aspect of the present disclosure relates to a system configured for establishing or generating multi-turn communications between a robot device and an individual. The system may include one or more hardware processors configured by machine-readable instructions. The processor(s) may be configured to receive, from a computing device performing speech-to-text recognition, one or more input text files associated with the individual's speech. The processor(s) may be configured to filter, via a prohibited speech filter, the one or more input text files to verify the one or more input text files are not associated with prohibited subjects. The processor(s) may be configured to analyze the one or more input text files to determine an intention of the individual's speech. The processor(s) may be configured to perform actions on the one or more input text files based at least in part on the analyzed intention. The processor(s) may be configured to generate one or more output text files based on the performed actions. The processor(s) may be configured to communicate the created one or more output text files to a markup module. The processor(s) may be configured to analyze, by the markup module, the received one or more output text files for sentiment. The processor(s) may be configured to, based at least in part on the sentiment analysis, associate an emotion indicator and/or multimodal output actions for the robot device with the one or more output text files. The processor(s) may be configured to verify, by the prohibited speech filter, that the one or more output text files do not include prohibited subjects. The processor(s) may be configured to analyze the one or more output text files, the associated emotion indicator and/or the multimodal output actions to verify conformance with robot device persona parameters.
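The processing sequence above can be sketched in miniature. The following is an illustrative, hypothetical sketch only; the function names, the keyword lists, and the intent and sentiment heuristics are stand-ins for the actual filter, intent-recognizer, and markup modules of the disclosed system.

```python
# Hypothetical sketch of the input/output pipeline described above:
# filter input -> classify intent -> generate output -> filter output -> markup.
PROHIBITED = {"violence", "drugs"}  # illustrative prohibited subjects

def prohibited_filter(text):
    """Return True when the text mentions no prohibited subject."""
    return not any(word in text.lower() for word in PROHIBITED)

def classify_intent(text):
    """Toy intent recognizer: question vs. statement."""
    return "question" if text.strip().endswith("?") else "statement"

def markup(text):
    """Attach a coarse emotion indicator based on simple sentiment cues."""
    emotion = "happy" if any(w in text.lower() for w in ("great", "fun")) else "neutral"
    return {"text": text, "emotion": emotion}

def respond(input_text):
    if not prohibited_filter(input_text):
        return None                      # reject prohibited input
    intent = classify_intent(input_text)
    output = ("Good question! Let me think." if intent == "question"
              else "That sounds great, tell me more!")
    if not prohibited_filter(output):    # verify the output as well
        return None
    return markup(output)

print(respond("What is your favorite game?"))
```

In the real system each stage would be a trained model or authored rule set rather than a keyword match, but the ordering of the stages (input filter, intent analysis, generation, output filter, markup) mirrors the claim language.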
These and other features, and characteristics of the present technology, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of ‘a’, ‘an’, and ‘the’ include plural referents unless the context clearly dictates otherwise.
The subject matter in this document represents a composition of novel algorithms and systems enabling safe, persona-based, multimodal natural conversational agents with long-term memory and access to correct, current, and factual information. This is because, in order for conversational agents to work, the conversation model and/or module needs to keep track of context and past conversations. A conversation module or agent needs to keep track of multi-user context, in which the system remembers the conversations with each member of a group and remembers the composition and roles of the members of the group. A conversation module or agent also needs to generate multimodal communication, which is composed not only of language outputs but also of appropriate facial expressions, gestures, and voice inflections. In addition, depending on the human user and/or their choices, the conversation agent should also be able to impersonate various personas with various limitations or access to certain modules (e.g., child content vs. adult content). These personas may be maintained by the conversation agent or module leveraging a knowledge base or database of existing information regarding the persona. The subject matter described herein allows interactive conversation agents, modules or machines to naturally and efficiently communicate in a broad range of social situations. The invention differs from current state-of-the-art conversational agent, module or machine systems in the following ways: First, the present conversation agent, module or machine leverages multimodal input, comprising a microphone array, camera, radar, lidar, and infrared camera, to track the environment and maintain a persistent view of the world around it. See MULTIMODAL BEAMFORMING AND ATTENTION FILTERING FOR MULTIPARTY INTERACTIONS, Application Ser. No. 62/983,595, filed Feb. 29, 2020.
Second, the present conversation agent, module or machine system tracks the engagement of the users around it leveraging the methods and systems described in the SYSTEMS AND METHODS TO MANAGE CONVERSATION INTERACTIONS BETWEEN A USER AND A ROBOT COMPUTING DEVICE OR CONVERSATION AGENT patent application Ser. No. 62/983,590, filed Feb. 29, 2020. Third, once a user is engaged, the conversation agent, module or machine analyzes the user's behavior and assesses linguistic context, facial expression, posture, gestures, voice inflection, etc., to better understand the intent and meaning of the user's comments, questions, and/or affect. Fourth, the conversation agent, module or machine analyzes the user's multimodal natural behavior to identify when it is the conversation agent's, module's or machine's turn to take the floor (e.g., to respond to the consumer or user or to initiate a conversation turn with the user).
Fifth, the conversation agent, module or machine responds to the user by utilizing and/or leveraging multimodal output and signals when it is time for the conversation agent, module or machine to respond. See SYSTEMS AND METHODS TO MANAGE CONVERSATION INTERACTIONS BETWEEN A USER AND A ROBOT COMPUTING DEVICE OR CONVERSATION AGENT, Application Ser. No. 62/983,592, filed Feb. 29, 2020, and SYSTEMS AND METHODS FOR SHORT- AND LONG-TERM DIALOG MANAGEMENT BETWEEN A ROBOT COMPUTING DEVICE/DIGITAL COMPANION AND A USER, Application Ser. No. 62/983,592, filed Feb. 29, 2020. Sixth, the conversation agent, module or machine system identifies when to engage the cloud-based NLP modules based on special commands (e.g., “Moxie, let's chat”), planned scheduling, special markup (e.g., an open question), a lack of or mismatched authored patterns on the robot (i.e., fallback handling), and/or the complexity of the ideas or context of the one or more text files received from the speech-to-text converting module. Seventh, the conversation agent, module or machine system may engage in masking techniques (or utilize multimodal outputs to display thinking behavior) to hide the fact that there is likely to be a time delay between a request in the received one or more input text files and receipt of a response from the SocialX cloud-based module (e.g., by speaking “hmm, let me think about that” and also utilizing facial expressions to simulate a thinking behavior). The conversation agent, module or machine system utilizes this behavior and these actions because they are essential to maintain user engagement and tighten the sense-act loop of the agent.
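The masking technique can be sketched as a timeout on the cloud request: if the reply is not back within a short budget, a filler utterance is emitted while the request completes in the background. This is an illustrative sketch only; the latency values and the `cloud_chat_request` stand-in are assumptions, not the actual SocialX interface.

```python
import threading, time, queue

def cloud_chat_request(prompt, out_q):
    """Stand-in for the SocialX cloud call (simulated 300 ms latency)."""
    time.sleep(0.3)
    out_q.put(f"Response to: {prompt}")

def respond_with_masking(prompt, filler="Hmm, let me think about that..."):
    """Emit a filler utterance whenever the cloud reply exceeds ~150 ms,
    keeping the sense-act loop tight while the slow request completes."""
    out_q = queue.Queue()
    worker = threading.Thread(target=cloud_chat_request, args=(prompt, out_q))
    worker.start()
    utterances = []
    try:
        reply = out_q.get(timeout=0.15)   # fast path: no masking needed
    except queue.Empty:
        utterances.append(filler)         # mask the latency with "thinking"
        reply = out_q.get()               # block until the cloud answers
    utterances.append(reply)
    worker.join()
    return utterances

print(respond_with_masking("Tell me a story"))
```

On the actual device the filler would also drive facial expressions and gestures, not just speech, as the paragraph above notes.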
Eighth, in some embodiments, all input and output from the conversation agent, module or machine system may get filtered by an ensemble of intent recognizer model modules to identify taboo topics, taboo language, persona-violating phrases, and other out-of-scope responses. Ninth, once a taboo topic or the like is identified by the conversation agent, module or machine system, the conversation agent, module or machine may signal a redirect request and may initiate and/or invoke a redirect algorithm to immediately change (or quickly change) the topic of the conversation into a safe space. Tenth, in some embodiments, the conversation agent, module or machine may include an additional input filter that identifies special topics (e.g., social justice, self-harm, mental health, etc.) that trigger manually authored and specialized responses (that are stored in one or more memory modules and/or a knowledge database) that are carefully vetted interaction sequences to protect the user and the image of the automated agent. Eleventh, in some embodiments, the conversation agent, module and/or machine may include an output filter. In some embodiments, the output filter may identify a persona violation (e.g., Embodied's Moxie robot claims that it has children or was at a rock concert when it was younger) or taboo topic violation (e.g., violence, drugs, etc.); the conversation agent, module and/or machine is then informed of this violation, and an algorithm of the conversation agent, module and/or machine may immediately or quickly search for one or more next best solutions (e.g., other groups of one or more text files). In some embodiments, the search may be a beam search or k-top search or similar and may retrieve and/or find an acceptable group of one or more text files that are utilized to respond to and/or replace the persona-violating output files. The replacement one or more output text files do not contain a persona violation (or any other violation).
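A minimal sketch of the output-filter-and-replace step, under assumed names: `violates_persona` stands in for the trained persona filter, and the scan over candidates approximates a k-top search over the generator's alternative outputs.

```python
def violates_persona(text, persona_facts):
    """Toy persona check: flag outputs contradicting known persona facts."""
    return any(bad in text.lower() for bad in persona_facts["forbidden_claims"])

def select_safe_response(candidates, persona_facts, k=3):
    """k-top style search: scan the top-k candidate outputs and return the
    first one that passes the persona filter, else None (caller redirects)."""
    for text in candidates[:k]:
        if not violates_persona(text, persona_facts):
            return text
    return None

# Illustrative persona constraints and generator candidates.
persona = {"forbidden_claims": ["my children", "when i was younger"]}
candidates = [
    "I remember when I was younger at a rock concert.",  # persona violation
    "I love talking about music!",                        # acceptable
    "Music is wonderful.",
]
print(select_safe_response(candidates, persona))
```

The `None` return corresponds to the fallback described next, where a pre-authored redirect phrase is selected instead.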
If no such response (e.g., acceptable one or more output text files) is found after the search within a brief period of time (i.e., the robot needs to respond in close to real time, e.g., within two to five seconds), a pre-authored redirect phrase and topic reset (in the form of output text files) may be selected and may be provided as a response and/or replacement for the persona-violating prior output text files. These redirect phrases may be related to a certain topic to maintain consistency with the current topic (e.g., talking about space travel: “What do you think the earth would look like from space?”, “Do you think humans will ever live on Mars?”, etc.), introduce a new topic (e.g., “Would you like to talk about something else? I really wanted to learn more about animals. What is the largest animal?”), or be derived from the memory module or knowledge base or database directly (e.g., “Last week we talked about ice cream. Did you have any since we talked?”). Twelfth, if a vocabulary violation is detected (e.g., the conversation agent, module or machine produces or generates a word that is outside the vocabulary of the user population), the conversation agent, module or machine may select a synonymous word or expression that is within the vocabulary (e.g., instead of using the biologically correct term Ailuropoda melanoleuca, the agent would select Panda bear), leveraging word similarity algorithms, a third-party thesaurus or similar, and replace the word that created the vocabulary violation with the selected word in the output or input text files. Thirteenth, a context module may continuously monitor one or more input text files, may collect and follow the conversation to keep track of exchanged facts (e.g., the user states their name or intention to take a vacation next week, etc.) and may store these facts (in the form of text files) in one or more memory modules.
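The vocabulary-violation repair can be sketched as a synonym substitution followed by an in-vocabulary check. The thesaurus and vocabulary below are tiny illustrative assumptions; the actual system would use word-similarity algorithms or a third-party thesaurus as described above.

```python
# Hypothetical vocabulary-violation repair: replace out-of-vocabulary terms
# with in-vocabulary synonyms from a small, illustrative thesaurus.
USER_VOCABULARY = {"panda", "bear", "movie", "the", "is", "a", "my",
                   "favorite", "animal"}
SYNONYMS = {"ailuropoda melanoleuca": "panda bear"}

def repair_vocabulary(text):
    lowered = text.lower()
    for term, synonym in SYNONYMS.items():
        if term in lowered:
            lowered = lowered.replace(term, synonym)
    # Any word still outside the vocabulary would trigger a further
    # similarity search in a real system; here we just report it.
    unknown = [w for w in lowered.replace(".", "").split()
               if w not in USER_VOCABULARY]
    return lowered, unknown

text, unknown = repair_vocabulary("My favorite animal is the Ailuropoda melanoleuca.")
print(text)     # my favorite animal is the panda bear.
print(unknown)  # []
```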
In some embodiments, the conversation agent, module or machine may identify opportune moments to retrieve a memory fact from the one or more memory modules and may utilize these facts to insert either a probing question in the form of a text file (e.g., “how was your vacation last week?”) or may leverage a fact (e.g., “Hi, John, good to see you”) to generate a text file response. In some embodiments, the conversation agent, module or machine may create abstractions of the current conversation to reduce the amount of context to be processed and stored in the one or more memory modules. In some embodiments, the conversation agent, module or machine may analyze the input one or more text files and may, for example, eliminate redundant information as well as overly detailed information (e.g., the one or more input text files representing “We went to Santa Monica from downtown on the 10 to go to the beach” may be reduced to one or more input text files representing “We went to the beach.”)
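As a toy illustration of the abstraction pass, detail clauses can be stripped before an utterance is stored. The detail markers below are invented for the example; the actual system would presumably use a learned summarization or abstraction model rather than string matching.

```python
# Toy context-abstraction pass: drop clauses judged "too detailed" so that
# less context needs to be stored in the memory modules.
DETAIL_MARKERS = ("from downtown", "on the 10", "on the freeway")

def abstract_utterance(text):
    for marker in DETAIL_MARKERS:
        text = text.replace(" " + marker, "")
    return text

print(abstract_utterance(
    "We went to Santa Monica from downtown on the 10 to go to the beach"))
```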
Fourteenth, the conversation agent, module or machine may include an input filter that identifies factual questions or information retrieval questions that seek to request a certain datum (e.g., who was the fourteenth president of the United States). In some embodiments, once such a factual question has been identified, the input filter may communicate with a question-and-answer module to retrieve the information from a third-party computing device (including but not limited to Encyclopedia Britannica or Wikipedia) through a third-party application programming interface. In another embodiment, a question-and-answer module may identify an appropriate context that matches the requested information (e.g., a story from the GRL that Moxie told a child earlier) and uses a question-answering algorithm (in a question/answer module) to pull or retrieve the information directly from the provided context that is stored in the memory module and/or the knowledge database. In some embodiments, the chat module may then utilize this information to generate output text files in response, and the output text files including the retrieved answers are communicated to the human user after the markup module has also associated emotion indicators or parameters and/or multimodal output actions with the one or more output text files, before going through the multimodal behavior generation of the agent. Fifteenth, the markup module may receive the one or more output text files, and a sentiment filter may identify the mood and/or sentiment of the output text files, relevant conversational and/or metaphorical aspects of the output text files, and/or contextual information or aspects of the one or more output text files (e.g., a character from the G.R.L. is named, or another named entity such as a Panda bear).
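The stored-context branch of the question-and-answer module can be sketched as a lookup against previously seen facts, falling through to a third-party API when nothing matches. The dictionary-based "knowledge base" and the matching rule are illustrative assumptions, not the actual retrieval algorithm.

```python
# Minimal sketch of answering a factual question from stored context rather
# than a third-party API. The "knowledge base" here is a toy dictionary.
KNOWLEDGE = {
    "fourteenth president of the united states": "Franklin Pierce",
    "largest animal": "the blue whale",
}

def answer_factual_question(question):
    q = question.lower().rstrip("?")
    for key, answer in KNOWLEDGE.items():
        if key in q:
            return answer
    return None  # fall through to a third-party API lookup in a real system

print(answer_factual_question(
    "Who was the fourteenth president of the United States?"))
```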
In some embodiments, the markup module of the conversation agent, module or machine may create multimodal output actions (e.g., a behavioral markup that controls the facial expression, gestures (pointing etc.), voice (tonal inflections), as well as heads-up display (e.g., an image of a Panda bear)) to produce these actions on the robot computing device.
In some implementations, the child may also have one or more electronic devices 110. In some implementations, the one or more electronic devices 110 may allow a child to login to a website on a server computing device in order to access a learning laboratory and/or to engage in interactive games that are housed on the website. In some implementations, the child's one or more computing devices 110 may communicate with cloud computing devices 115 in order to access the website 120. In some implementations, the website 120 may be housed on server computing devices. In some implementations, the website 120 may include the learning laboratory (which may be referred to as a global robotics laboratory (GRL)) where a child can interact with digital characters or personas that are associated with the robot computing device 105. In some implementations, the website 120 may include interactive games where the child can engage in competitions or goal setting exercises. In some implementations, other users may be able to interface with an e-commerce website or program, where the other users (e.g., parents or guardians) may purchase items that are associated with the robot (e.g., comic books, toys, badges or other affiliate items).
In some implementations, the robot computing device or digital companion 105 may include one or more imaging devices, one or more microphones, one or more touch sensors, one or more IMU sensors, one or more motors and/or motor controllers, one or more display devices or monitors and/or one or more speakers. In some implementations, the robot computing devices may include one or more processors, one or more memory devices, and/or one or more wireless communication transceivers. In some implementations, computer-readable instructions may be stored in the one or more memory devices and may be executable to perform numerous actions, features and/or functions. In some implementations, the robot computing device may perform analytics processing on data, parameters and/or measurements, audio files and/or image files captured and/or obtained from the components of the robot computing device listed above.
In some implementations, the one or more touch sensors may measure if a user (child, parent or guardian) touches the robot computing device or if another object or individual comes into contact with the robot computing device. In some implementations, the one or more touch sensors may measure a force of the touch and/or dimensions of the touch to determine, for example, if it is an exploratory touch, a push away, a hug or another type of action. In some implementations, for example, the touch sensors may be located or positioned on a front and back of an appendage or a hand of the robot computing device or on a stomach area of the robot computing device. Thus, the software and/or the touch sensors may determine if a child is shaking a hand or grabbing a hand of the robot computing device or if they are rubbing the stomach of the robot computing device. In some implementations, other touch sensors may determine if the child is hugging the robot computing device. In some implementations, the touch sensors may be utilized in conjunction with other robot computing device software where the robot computing device could tell a child to hold their left hand if they want to follow one path of a story or hold a right hand if they want to follow the other path of a story.
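A classification of this kind can be sketched as simple thresholds over touch force and contact area. The threshold values below are illustrative only and not calibrated values from the actual device.

```python
# Toy classifier for touch events from force and contact area, in the spirit
# of distinguishing an exploratory touch, a push away, and a hug.
def classify_touch(force_n, contact_area_cm2):
    if force_n < 2 and contact_area_cm2 < 10:
        return "exploratory_touch"       # light, localized contact
    if force_n >= 8 and contact_area_cm2 < 20:
        return "push_away"               # strong, localized contact
    if contact_area_cm2 >= 50:
        return "hug"                     # broad contact across the body
    return "unknown"

print(classify_touch(1.0, 5.0))    # exploratory_touch
print(classify_touch(60.0, 80.0))  # hug
```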
In some implementations, the one or more imaging devices may capture images and/or video of a child, parent or guardian interacting with the robot computing device. In some implementations, the one or more imaging devices may capture images and/or video of the area around the child, parent or guardian. In some implementations, the one or more microphones may capture sound or verbal commands spoken by the child, parent or guardian. In some implementations, computer-readable instructions executable by the processor or an audio processing device may convert the captured sounds or utterances into audio files for processing. In some implementations, the one or more IMU sensors may measure velocity, acceleration, orientation and/or location of different parts of the robot computing device. In some implementations, for example, the IMU sensors may determine a speed of movement of an appendage or a neck. In some implementations, for example, the IMU sensors may determine an orientation of a section of the robot computing device, for example of a neck, a head, a body or an appendage in order to identify if the hand is waving or in a rest position. In some implementations, the use of the IMU sensors may allow the robot computing device to orient its different sections in order to appear more friendly or engaging to the user.
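The waving-versus-rest determination from IMU orientation can be sketched as follows: a waving hand shows repeated large swings in its orientation angle over a short window, while a resting hand does not. The threshold and window are illustrative assumptions.

```python
# Sketch of using IMU orientation samples to decide whether an appendage is
# waving or at rest. Threshold values are illustrative, not calibrated.
def is_waving(yaw_samples_deg, swing_threshold=15.0):
    """Count large sample-to-sample orientation swings; two or more within
    the window suggests a waving motion rather than a rest position."""
    swings = [abs(b - a) for a, b in zip(yaw_samples_deg, yaw_samples_deg[1:])]
    return sum(1 for s in swings if s > swing_threshold) >= 2

print(is_waving([0, 20, -15, 25, -10]))  # repeated large swings: waving
print(is_waving([0, 1, 0, 1, 0]))        # tiny jitter: at rest
```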
In some implementations, the robot computing device may have one or more motors and/or motor controllers. In some implementations, the computer-readable instructions may be executable by the one or more processors and commands or instructions may be communicated to the one or more motor controllers to send signals or commands to the motors to cause the motors to move sections of the robot computing device. In some implementations, the sections may include appendages or arms of the robot computing device and/or a neck or a head of the robot computing device.
In some implementations, the robot computing device may include a display or monitor. In some implementations, the monitor may allow the robot computing device to display facial expressions (e.g., eyes, nose, mouth expressions) as well as to display video or messages to the child, parent or guardian.
In some implementations, the robot computing device may include one or more speakers, which may be referred to as an output modality. In some implementations, the one or more speakers may enable or allow the robot computing device to communicate words, phrases and/or sentences and thus engage in conversations with the user. In addition, the one or more speakers may emit audio sounds of music for the child, parent or guardian when they are performing actions and/or engaging with the robot computing device.
In some implementations, the system may include a parent computing device 125. In some implementations, the parent computing device 125 may include one or more processors and/or one or more memory devices. In some implementations, computer-readable instructions may be executable by the one or more processors to cause the parent computing device 125 to perform a number of features and/or functions. In some implementations, these features and functions may include generating and running a parent interface for the system. In some implementations, the software executable by the parent computing device 125 may also alter user (e.g., child, parent or guardian) settings. In some implementations, the software executable by the parent computing device 125 may also allow the parent or guardian to manage their own account or their child's account in the system. In some implementations, the software executable by the parent computing device 125 may allow the parent or guardian to initiate or complete parental consent to allow certain features of the robot computing device to be utilized. In some implementations, the software executable by the parent computing device 125 may allow a parent or guardian to set goals or thresholds or settings regarding what is captured from the robot computing device and what is analyzed and/or utilized by the system. In some implementations, the software executable by the one or more processors of the parent computing device 125 may allow the parent or guardian to view the different analytics generated by the system in order to see how the robot computing device is operating, how their child is progressing against established goals, and/or how the child is interacting with the robot computing device.
In some implementations, the system may include a cloud server computing device 115. In some implementations, the cloud server computing device 115 may include one or more processors and one or more memory devices. In some implementations, computer-readable instructions may be retrieved from the one or more memory devices and executable by the one or more processors to cause the cloud server computing device 115 to perform calculations and/or additional functions. In some implementations, the software (e.g., the computer-readable instructions executable by the one or more processors) may manage accounts for all the users (e.g., the child, the parent and/or the guardian). In some implementations, the software may also manage the storage of personally identifiable information in the one or more memory devices of the cloud server computing device 115. In some implementations, the software may also execute the audio processing (e.g., speech recognition and/or context recognition) of sound files that are captured from the child, parent or guardian, as well as generating speech and related audio files that may be spoken by the robot computing device 105. In some implementations, the software in the cloud server computing device 115 may perform and/or manage the video processing of images that are received from the robot computing devices.
In some implementations, the software of the cloud server computing device 115 may analyze received inputs from the various sensors and/or other input modalities as well as gather information from other software applications as to the child's progress towards achieving set goals. In some implementations, the cloud server computing device software may be executable by the one or more processors in order perform analytics processing. In some implementations, analytics processing may be behavior analysis on how well the child is doing with respect to established goals.
In some implementations, the software of the cloud server computing device may receive input regarding how the user or child is responding to content, for example, does the child like the story, the augmented content, and/or the output being generated by the one or more output modalities of the robot computing device. In some implementations, the cloud server computing device may receive the input regarding the child's response to the content and may perform analytics on how well the content is working and whether or not certain portions of the content may not be working (e.g., perceived as boring or potentially malfunctioning or not working).
In some implementations, the software of the cloud server computing device may receive inputs such as parameters or measurements from hardware components of the robot computing device such as the sensors, the batteries, the motors, the display and/or other components. In some implementations, the software of the cloud server computing device may receive the parameters and/or measurements from the hardware components and may perform IoT analytics processing on the received parameters, measurements or data to determine if the robot computing device is malfunctioning and/or not operating in an optimal manner.
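One simple form such IoT analytics could take is comparing reported telemetry against nominal operating ranges and flagging components that fall outside them. The parameter names and ranges below are made up for illustration and are not the device's actual telemetry schema.

```python
# Illustrative IoT-analytics check: compare reported hardware telemetry
# against nominal ranges to flag possible malfunctions. Ranges are invented.
NOMINAL = {
    "battery_voltage": (3.4, 4.2),     # volts
    "motor_temp_c": (10.0, 60.0),      # degrees Celsius
    "display_brightness": (0.2, 1.0),  # normalized
}

def flag_malfunctions(telemetry):
    flags = []
    for key, value in telemetry.items():
        low, high = NOMINAL.get(key, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            flags.append(key)
    return flags

print(flag_malfunctions({"battery_voltage": 3.1, "motor_temp_c": 72.0,
                         "display_brightness": 0.8}))
```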
In some implementations, the cloud server computing device 115 may include one or more memory devices. In some implementations, portions of the one or more memory devices may store user data for the various account holders. In some implementations, the user data may be user address, user goals, user details and/or preferences. In some implementations, the user data may be encrypted and/or the storage may be a secure storage.
In some implementations, a bus 201 may interface with the multi-modal perceptual system 123 (which may be referred to as a multi-modal input system or multi-modal input modalities). In some implementations, the multi-modal perceptual system 123 may include one or more audio input processors. In some implementations, the multi-modal perceptual system 123 may include a human reaction detection sub-system. In some implementations, the multi-modal perceptual system 123 may include one or more microphones. In some implementations, the multi-modal perceptual system 123 may include one or more camera(s) or imaging devices.
In some implementations, the one or more processors 226A-226N may include one or more of an ARM processor, an X86 processor, a GPU (Graphics Processing Unit), and the like. In some implementations, at least one of the processors may include at least one arithmetic logic unit (ALU) that supports a SIMD (Single Instruction Multiple Data) system that provides native support for multiply and accumulate operations.
In some implementations, at least one of a central processing unit (processor), a GPU, and a multi-processor unit (MPU) may be included. In some implementations, the processors and the main memory form a processing unit 225. In some implementations, the processing unit 225 includes one or more processors communicatively coupled to one or more of a RAM, ROM, and machine-readable storage medium; the one or more processors of the processing unit receive instructions stored by the one or more of a RAM, ROM, and machine-readable storage medium via a bus; and the one or more processors execute the received instructions. In some implementations, the processing unit is an ASIC (Application-Specific Integrated Circuit).
In some implementations, the processing unit may be a SoC (System-on-Chip). In some implementations, the processing unit may include at least one arithmetic logic unit (ALU) that supports a SIMD (Single Instruction Multiple Data) system that provides native support for multiply and accumulate operations. In some implementations the processing unit is a Central Processing Unit such as an Intel Xeon processor. In other implementations, the processing unit includes a Graphical Processing Unit such as NVIDIA Tesla.
In some implementations, the one or more network adapter devices or network interface devices 205 may provide one or more wired or wireless interfaces for exchanging data and commands. Such wired and wireless interfaces include, for example, a universal serial bus (USB) interface, Bluetooth interface, Wi-Fi interface, Ethernet interface, near field communication (NFC) interface, and the like. In some implementations, the one or more network adapter devices or network interface devices 205 may be wireless communication devices. In some implementations, the one or more network adapter devices or network interface devices 205 may include personal area network (PAN) transceivers, wide area network communication transceivers and/or cellular communication transceivers.
In some implementations, the one or more network devices 205 may be communicatively coupled to another robot computing device (e.g., a robot computing device similar to the robot computing device 105 of
In some implementations, the processor-readable storage medium 210 may be one of (or a combination of two or more of) a hard drive, a flash drive, a DVD, a CD, an optical disk, a floppy disk, a flash storage, a solid state drive, a ROM, an EEPROM, an electronic circuit, a semiconductor memory device, and the like. In some implementations, the processor-readable storage medium 210 may include machine-executable instructions (and related data) for an operating system 211, software programs or application software 212, device drivers 213, and machine-executable instructions for one or more of the processors 226A-226N of
In some implementations, the processor-readable storage medium 210 may include a machine control system module 214 that includes machine-executable instructions for controlling the robot computing device to perform processes performed by the machine control system, such as moving the head assembly of robot computing device.
In some implementations, the processor-readable storage medium 210 may include an evaluation system module 215 that includes machine-executable instructions for controlling the robot computing device to perform processes performed by the evaluation system 215. In some implementations, the processor-readable storage medium 210 may include a conversation system module 216 that may include machine-executable instructions for controlling the robot computing device 105 to perform processes performed by the conversation system 216. In some implementations, the processor-readable storage medium 210 may include machine-executable instructions for controlling the robot computing device 105 to perform processes performed by the testing system 350. In some implementations, the processor-readable storage medium 210 may include machine-executable instructions for controlling the robot computing device 105 to perform processes performed by the conversation authoring system 141.
In some implementations, the processor-readable storage medium 210 may include machine-executable instructions for controlling the robot computing device 105 to perform processes performed by the goal authoring system 140. In some implementations, the processor-readable storage medium 210 may include machine-executable instructions for controlling the robot computing device 105 to perform processes performed by the evaluation module generator 142.
In some implementations, the processor-readable storage medium 210 may include the content repository 220. In some implementations, the processor-readable storage medium 210 may include the goal repository 180. In some implementations, the processor-readable storage medium 210 may include machine-executable instructions for an emotion detection module. In some implementations, the emotion detection module may be constructed to detect an emotion based on captured image data (e.g., image data captured by the perceptual system 123 and/or one of the imaging devices). In some implementations, the emotion detection module may be constructed to detect an emotion based on captured audio data (e.g., audio data captured by the perceptual system 123 and/or one of the microphones). In some implementations, the emotion detection module may be constructed to detect an emotion based on both captured image data and captured audio data. In some implementations, emotions detectable by the emotion detection module include anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. In some implementations, emotions detectable by the emotion detection module include happy, sad, angry, confused, disgusted, surprised, calm, and unknown. In some implementations, the emotion detection module is constructed to classify detected emotions as either positive, negative, or neutral. In some implementations, the robot computing device 105 may utilize the emotion detection module to obtain, calculate or generate a determined emotion classification (e.g., positive, neutral, negative) after performance of an action by the machine, and store the determined emotion classification in association with the performed action (e.g., in the storage medium 210).
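The coarse positive/negative/neutral classification described above could be sketched as follows. This is a minimal illustration, not the module's actual implementation; the label sets and function name are assumptions, and the grouping of individual emotions into classes is one possible choice.

```python
# Illustrative sketch of the emotion detection module's classification
# step: a detected emotion label is collapsed into a coarse
# positive/negative/neutral class. The label groupings are assumptions.

POSITIVE = {"happy", "happiness", "surprise", "surprised", "calm"}
NEGATIVE = {"anger", "angry", "contempt", "disgust", "disgusted",
            "fear", "sad", "sadness", "confused"}

def classify_emotion(detected: str) -> str:
    """Collapse a detected emotion label into positive/negative/neutral."""
    label = detected.lower()
    if label in POSITIVE:
        return "positive"
    if label in NEGATIVE:
        return "negative"
    return "neutral"  # e.g., "neutral" or "unknown" fall through here
```

The returned class could then be stored in association with the action the machine just performed, as described above.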
In some implementations, the testing system 350 may be a hardware device or computing device separate from the robot computing device, and the testing system 350 includes at least one processor, a memory, a ROM, a network device, and a storage medium (constructed in accordance with a system architecture similar to a system architecture described herein for the machine 120), wherein the storage medium stores machine-executable instructions for controlling the testing system 350 to perform processes performed by the testing system 350, as described herein.
In some implementations, the conversation authoring system 141 may be a hardware device separate from the robot computing device 105, and the conversation authoring system 141 may include at least one processor, a memory, a ROM, a network device, and a storage medium (constructed in accordance with a system architecture similar to a system architecture described herein for the robot computing device 105), wherein the storage medium stores machine-executable instructions for controlling the conversation authoring system 141 to perform processes performed by the conversation authoring system.
In some implementations, the evaluation module generator 142 may be a hardware device separate from the robot computing device 105, and the evaluation module generator 142 may include at least one processor, a memory, a ROM, a network device, and a storage medium (constructed in accordance with a system architecture similar to a system architecture described herein for the robot computing device), wherein the storage medium stores machine-executable instructions for controlling the evaluation module generator 142 to perform processes performed by the evaluation module generator, as described herein.
In some implementations, the goal authoring system 140 may be a hardware device separate from the robot computing device, and the goal authoring system 140 may include at least one processor, a memory, a ROM, a network device, and a storage medium (constructed in accordance with a system architecture similar to a system architecture described herein for the robot computing device), wherein the storage medium stores machine-executable instructions for controlling the goal authoring system to perform processes performed by the goal authoring system 140. In some implementations, the storage medium of the goal authoring system may include data, settings and/or parameters of the goal definition user interface described herein. In some implementations, the storage medium of the goal authoring system 140 may include machine-executable instructions of the goal definition user interface described herein (e.g., the user interface). In some implementations, the storage medium of the goal authoring system may include data of the goal definition information described herein (e.g., the goal definition information). In some implementations, the storage medium of the goal authoring system may include machine-executable instructions to control the goal authoring system to generate the goal definition information described herein (e.g., the goal definition information).
In some embodiments, the SocialX cloud-based module 301 may include one or more memory devices or memory modules 366, a conversation summary module 354 (e.g., SocialX summary module), a chat module 362 (e.g., a SocialX chat module), a conversation markup module 365 (e.g., SocialX markup module), a question and answer module 368 (e.g., a SocialX Q&A module), a knowledge base or database 360, a third-party API or software program 361, and/or an intention or filtering module 308 (e.g., SocialX intention module). In some embodiments, the intention filtering module 308 may analyze, in one and/or multiple ways, the received input text from the automatic speech recognition module 341 in order to generate specific measurements and/or parameters. In some embodiments, the intention or filtering module 308 may include an input filtering module 351, an output filtering module 355, an intent recognition module 353, a sentiment analysis module 357, a message brokering module 359, a persona protection module 356, an intention fusion module 352, and/or an environmental cues fusion module 354. In some embodiments, the input filtering module 351 may include a prohibited speech filter and/or a special topics filter. In some embodiments, the third-party application software or API 361 may be located on the same cloud computing device or server as the conversation module; however, in alternative embodiments, the third-party application software or API may be located on another cloud computing device or server. Interactions between the various hardware and/or software modules are discussed in detail with respect to
In some embodiments, the chat module 362 may generate output text files (associated with step 410) and may communicate the one or more output text files to the conversation markup module 365 (associated with step 412). In some embodiments, the chat module 362 may communicate with the one or more memory devices 366 to retrieve potential output text files to add to and/or replace the generated output text files (if for example, the received and analyzed input text files include a prohibited topic). In some embodiments, a markup module 365 may utilize a sentiment analysis module 357 to analyze the sentiment and/or emotion of the output text files (associated with step 414). In some embodiments, the markup module 365 may generate and/or assign or associate an emotion indicator or parameter and/or multimodal output actions (e.g., facial expressions, arm movements, additional sounds, etc.) to the output text files (step 416). In some embodiments, the output filter module 355 may utilize a prohibited speech filter to analyze whether or not the one or more output text files include prohibited subjects (or verify that the one or more output text files do not include prohibited subjects) (associated with step 420). In other words, the input text files and the output text files may both be analyzed by a prohibited speech filter to make sure that these prohibited subjects are not spoken to the robot computing device and/or spoken by the robot computing device (e.g., both input and/or output). In some embodiments, a persona protection module 356 may analyze the one or more output text files, the associated emotion indicator or parameter(s), and/or the associated multi-modal output action(s) to verify that these files, parameter(s), and/or action(s) conform with established and/or predetermined robot device persona parameters. 
In some embodiments, if the guidelines are met (e.g., there are no prohibited speech topics and the output text files are aligned with the robot computing device's persona), the intention module 308 of the SocialX cloud-based module 301 may communicate the one or more output text files, the associated emotion indicator or parameter(s), and/or the associated multi-modal output action(s) to the robot computing device (associated with step 423).
In some embodiments, if the generated output text files include prohibited speech topics and/or if the generated output text files do not match the robot computing device's persona, the chat module 362 may search for and/or locate acceptable output text files, emotion indicators or parameters, and/or multimodal output actions that include acceptable topics (associated with step 424). In some embodiments, if the chat module 362 locates acceptable output text files, emotion indicators or parameters, and/or multimodal output actions, the chat module 362 and/or intention module 308 may communicate the acceptable output text files, emotion indicators or parameters, and/or multimodal output actions to the robot computing device (associated with step 426). In some embodiments, if the chat module 362 cannot find or locate acceptable output text files, the chat module may retrieve redirect text files from the one or more memory modules 366 and/or the knowledge database 360 and communicate the redirect text files to the markup module for processing (associated with step 428).
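The output-side flow of steps 410 through 428 can be sketched as a single decision function. This is a hedged illustration only: topic detection is reduced to set membership, and the field names, prohibited-topic list, and persona representation are all assumptions rather than the disclosed implementation.

```python
# Sketch of steps 410-428: a generated reply is screened for prohibited
# topics (step 420), checked against the robot's persona (step 422), and
# replaced by an acceptable alternative (steps 424/426) or a redirect
# text (step 428) when a check fails. All names here are illustrative.

PROHIBITED = {"violence", "self-harm"}

def finalize_reply(reply: dict, persona_topics: set,
                   fallback_replies: list, redirect: dict) -> dict:
    """Return a reply that passes the prohibited-speech and persona checks."""
    def acceptable(r):
        # No prohibited topic present, and every topic fits the persona.
        return (not PROHIBITED & set(r["topics"])
                and set(r["topics"]) <= persona_topics)

    if acceptable(reply):
        return reply                      # step 423: send as-is
    for candidate in fallback_replies:    # step 424: search alternatives
        if acceptable(candidate):
            return candidate              # step 426: send the match
    return redirect                       # step 428: redirect text instead
```

In the disclosed system the fallback search would consult the memory modules 366 and/or knowledge database 360 rather than an in-memory list.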
Computing platform(s) 302 may be configured by machine-readable instructions 306. Machine-readable instructions 306 may include one or more instruction modules. The instruction modules may include computer program modules. The instruction modules may include a SocialX cloud-based conversation module 301.
SocialX cloud-based conversation module 301 may be configured to receive, from a computing device performing speech-to-text recognition, one or more input text files associated with the individual's speech, may analyze the one or more input text files to determine further actions to be taken, may generate one or more output text files, and may associate emotion parameter(s) and/or multimodal action files with the one or more output text files and may communicate the one or more output text files, the associated emotion parameter(s), and/or the multi-modal action files to the robot computing device.
In some implementations, an open question may be present. In some implementations, there may be a lack of a match with existing conversation patterns on the robot device; these conditions may be used to determine whether or not to utilize the cloud-based social chat modules. In some implementations, the social chat module may search for acceptable output text files, associated emotion indicators, and/or multimodal output actions in a knowledge database 360 and/or the one or more memory modules 366.
In some implementations, computing platform(s) 302, remote platform(s) 304, and/or external resources 340 may be operatively linked via one or more electronic communication links. For example, such electronic communication links may be established, at least in part, via a network such as the Internet and/or other networks. It will be appreciated that this is not intended to be limiting, and that the scope of this disclosure includes implementations in which computing platform(s) 302, remote platform(s) 304, and/or external resources 340 may be operatively linked via some other communication media.
A given remote platform 304 may include one or more processors configured to execute computer program modules. The computer program modules may be configured to enable an expert or user associated with the given remote platform 304 to interface with system 300 and/or external resources 340, and/or provide other functionality attributed herein to remote platform(s) 304. By way of non-limiting example, a given remote platform 304 and/or a given computing platform 302 may include one or more of a server, a desktop computer, a laptop computer, a handheld computer, a tablet computing platform, a NetBook, a Smartphone, a gaming console, and/or other computing platforms.
External resources 340 may include sources of information outside of system 300, external entities participating with system 300, and/or other resources. In some implementations, some or all of the functionality attributed herein to external resources 340 may be provided by resources included in system 300.
Computing platform(s) 302 may include electronic storage 342, one or more processors 344, and/or other components. Computing platform(s) 302 may include communication lines, or ports to enable the exchange of information with a network and/or other computing platforms. Illustration of computing platform(s) 302 in
Electronic storage 342 may comprise non-transitory storage media that electronically stores information. The electronic storage media of electronic storage 342 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with computing platform(s) 302 and/or removable storage that is removably connectable to computing platform(s) 302 via, for example, a port (e.g., a USB port, a firewire port, etc.) or a drive (e.g., a disk drive, etc.). Electronic storage 342 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. Electronic storage 342 may include one or more virtual storage resources (e.g., cloud storage, a virtual private network, and/or other virtual storage resources). Electronic storage 342 may store software algorithms, information determined by processor(s) 344, information received from computing platform(s) 302, information received from remote platform(s) 304, and/or other information that enables computing platform(s) 302 to function as described herein.
Processor(s) 344 may be configured to provide information processing capabilities in computing platform(s) 302. As such, processor(s) 344 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor(s) 344 is shown in
It should be appreciated that although modules 301 are illustrated in
In some implementations, method 400 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of method 400 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 400.
In some embodiments, an operation 402 may include receiving, from a computing device performing speech-to-text recognition 341, one or more input text files associated with the individual's speech. Operation 402 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to SocialX cloud-based conversation module 301, in accordance with one or more implementations. In an alternative embodiment, an automatic speech recognition module 341 may not utilize the SocialX cloud-based conversation module 301 and instead the text may be sent to the dialog manager module 335 for processing. As discussed previously, utilizing the SocialX cloud-based conversation module may be triggered by special commands, lack of matching with known patterns, if an open question is present or if a communication between participating devices and/or individuals is too complex.
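The triggering conditions listed above (special commands, no matching known pattern, an open question, or an overly complex exchange) can be sketched as a routing predicate. This is an illustrative assumption of how such a check might look; the predicate names, the keyword-based pattern matching, and the complexity heuristic are not taken from the disclosure.

```python
# Illustrative routing check for operation 402: decide whether an
# utterance should be handled by the cloud-based conversation module
# or by the on-device dialog manager. All heuristics are assumptions.

def should_use_cloud_module(text: str, known_patterns: list,
                            special_commands: set,
                            complexity_threshold: int = 12) -> bool:
    words = text.lower().split()
    if set(words) & special_commands:
        return True                        # special command trigger
    if not any(p in text.lower() for p in known_patterns):
        return True                        # no matching known pattern
    if text.strip().endswith("?"):
        return True                        # open question present
    return len(words) > complexity_threshold  # exchange too complex
```

When the predicate is false, the text would instead be sent to the dialog manager module 335 for on-device processing.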
In some embodiments, an operation 404 may include filtering, via a prohibited speech filter module (which may also be referred to as the input filtering module) 351, the one or more input text files to verify the one or more input text files are not associated with prohibited subjects or subject matter. Operation 404 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the prohibited speech filter module/input filtering module 351 in the intention module 308, in accordance with one or more implementations. In some embodiments, prohibited subjects and/or subject matter may include topics such as violence, sex and/or self-harm. In some embodiments, if the prohibited speech filter module determines that the one or more input text files are associated with prohibited subject matter, the intention module 308 and the prohibited speech filter module/input filtering module 351 may communicate with the knowledge database 360 in order to retrieve one or more safe output text files. In some embodiments, the intention module 308 and/or the message brokering module 359 may communicate the one or more retrieved safe output text files to the chat module 362 for processing. In some embodiments, the one or more safe text files may provide instructions for the robot computing device to speak phrases such as "Please, talk to a trusted adult about this," "That is a topic I don't know much about," or "Would you like to talk about something else." In some embodiments, in operation 444, the chat module 362 may communicate the one or more specialized redirect text files to the markup module 365 for processing.
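The input-side filtering of operation 404 could be sketched as follows. Topic detection is reduced to keyword matching purely for illustration; the keyword list and function name are assumptions, while the safe phrases are the examples quoted above.

```python
# Sketch of the input-side prohibited-speech filter of operation 404:
# an unsafe input is answered with a safe redirect phrase instead of
# being passed on for intent analysis. Keyword matching is a stand-in
# for whatever topic detection the real filter uses.

PROHIBITED_KEYWORDS = {"violence", "sex", "self-harm"}

SAFE_REPLIES = [
    "Please, talk to a trusted adult about this",
    "That is a topic I don't know much about",
    "Would you like to talk about something else",
]

def filter_input(text: str):
    """Return (is_safe, reply): reply is a safe redirect when unsafe."""
    if PROHIBITED_KEYWORDS & set(text.lower().split()):
        return False, SAFE_REPLIES[0]   # retrieve a safe output text
    return True, None                   # pass through to intent analysis
```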
In some embodiments, an operation 406 may include analyzing the one or more input text files to determine an intention of the individual's speech as identified in the input text files. In some embodiments, intention parameters and/or classifications may be associated and/or assigned to the one or more input text files based, at least in part, on the analysis. In some embodiments, the one or more text files and/or the intention parameters and/or classifications may be communicated to the message brokering module 359. Operation 406 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the intention recognition module 353, in accordance with one or more implementations. Intention fusion module—In some embodiments, an operation 408 may include receiving multimodal user parameters, measurements and/or files from the multimodal abstraction module 389 (in addition to the one or more text files) to assist in determining an intention of the user and/or a conversation topic in which the user may be interested. In these embodiments, the intention fusion module 352 may analyze the multimodal user parameters, measurements and/or files in order to generate intention parameters and/or classifications or potential conversation topics. In some embodiments, the intention fusion module 352 may communicate the one or more input text files and the intention parameters and/or classifications or potential conversation topics to the message brokering module 359, which in turn communicates them to the chat module 362.
As an example, the multimodal abstraction module 389 may communicate multimodal intention parameters or files (such as an image showing that the user is smiling and nodding their head, or parameters representing the same) to the intention fusion module 352, which may indicate the user is happy. In this example, the intention fusion module 352 may generate intention parameters or measurements identifying that the user is happy and engaged. In an alternative embodiment, the multimodal abstraction module 389 may communicate multimodal intention parameters or files (such as an image showing the user's hands up in the air and/or the user looking confused, or parameters representing the same), and the intention fusion module 352 may receive these multimodal intention parameters or files and determine that the user is confused. In these embodiments, the intention fusion module may generate intention parameters or classifications identifying that the user is confused.
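The two examples above can be condensed into a small fusion sketch. The cue vocabulary and the returned classification labels are illustrative assumptions; the real module would consume richer parameters, measurements and/or files.

```python
# Hedged sketch of the intention fusion of operation 408: multimodal
# cues (reduced here to string labels) are fused into an intention
# classification. Cue names and labels are assumptions.

def fuse_intention(multimodal_cues: set) -> str:
    """Map observed multimodal cues to an intention classification."""
    if {"smiling", "nodding"} <= multimodal_cues:
        return "happy_and_engaged"      # first example above
    if {"hands_up", "confused_face"} & multimodal_cues:
        return "confused"               # alternative example above
    return "unknown"
```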
Environmental cues fusion module—In some embodiments, an operation 409 may include receiving multimodal environmental parameters, measurements and/or files from the multimodal abstraction module 389 and/or the world tracking module 388 (in addition to the one or more text files) to assist in determining an intention of the user and/or conversation topics the user may be interested in. In these embodiments, the environmental cues fusion module 354 may analyze the received environmental parameters, measurements and/or files to generate intention parameters or classifications of potential interest in conversation topics. In these embodiments, the environmental cues fusion module 354 may communicate the one or more text files and/or the generated intention parameters, classifications or potential interests in conversation topics to the message brokering module 359, which in turn may communicate this information to the correct module (e.g., the chat module 362 or the question & answer module 368). As an example, the user may be walking toward a pet, such as his or her dog, saying "Come here, Spot," and the multimodal abstraction module 389 may communicate the environmental parameters, measurements and/or files representing these images and sounds to the environmental cues fusion module 354. In this example, the environmental cues fusion module 354 may analyze the environmental parameters and/or images and the user's statement and identify that the user may be receptive to talking about their dog. In this example, the environmental cues fusion module 354 may generate intention parameters, classifications or conversation topics indicating the dog topic and may communicate these intention parameters, classifications or conversation topics to the message brokering module 359.
As another example, the user may be in a crowded, noisy area where everyone is wearing a football jersey, and the multimodal abstraction module 389 and/or world tracking module 388 may generate environmental parameters, measurements and/or files that are transmitted to the conversation cloud module 301 and specifically the environmental cues fusion module 354. In this example, the environmental cues fusion module 354 may analyze the received environmental parameters, measurements and/or files and identify that the user may be receptive to talking about football and may also need to move to another area with fewer people due to the noise, and therefore may generate intention parameters, classifications and/or topics associated with football topics and/or moving to a quieter place. In some embodiments, the environmental cues fusion module 354 may communicate the generated intention parameters, classifications and/or topics to the message brokering module.
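Following the dog and football examples above, the environmental cues fusion of operation 409 could be sketched like this. The cue names, noise threshold, and topic strings are invented for illustration only.

```python
# Sketch of operation 409: environmental cues are fused into candidate
# conversation topics and actions. A visible dog suggests a pet topic;
# a noisy crowd in football jerseys suggests football plus a move to a
# quieter place. All cue and topic names are assumptions.

def fuse_environment(cues: set, noise_level: float) -> list:
    """Return candidate conversation topics/actions from environmental cues."""
    topics = []
    if "dog" in cues:
        topics.append("talk_about_dog")
    if "football_jerseys" in cues:
        topics.append("talk_about_football")
    if noise_level > 0.8:
        topics.append("suggest_quieter_place")
    return topics
```

The resulting topics would then be handed to the message brokering module for routing to the chat or question & answer module.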
In some embodiments, an operation 410 may include performing actions on the one or more input text files based at least in part on the analysis and/or understanding of the one or more input text files and/or the received intention parameters, classifications and/or topics. Operation 410 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the intention module 308 and/or the message brokering module 359, in accordance with one or more implementations.
In some embodiments, an operation 411 may include generating one or more output text files based on the performed actions. Operation 411 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the chat module 362, in accordance with one or more implementations.
In some embodiments, an operation 412 may include communicating the created one or more output text files to the markup module 365. Operation 412 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the chat module 362, in accordance with one or more implementations.
In some embodiments, an operation 414 may include analyzing, by the sentiment analysis module 357 and/or the markup module 365, the received one or more output text files for sentiment and determining a sentiment parameter of the received one or more output text files. Operation 414 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the sentiment analysis module 357, in accordance with one or more implementations.
In some embodiments, an operation 416 may include, based at least in part on the sentiment parameter determined by the sentiment analysis, associating an emotion indicator and/or multimodal output actions for the robot device with the one or more output text files. Operation 416 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the markup module 365, in accordance with one or more implementations.
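Operations 414 and 416 together can be sketched as a markup step that turns a sentiment score into an emotion indicator and a matching multimodal action. The score range, thresholds, and action names are assumptions for illustration; the disclosure does not specify how the sentiment analysis module 357 represents sentiment.

```python
# Illustrative markup step for operations 414-416: a sentiment score in
# [-1, 1] is mapped to an emotion indicator and a multimodal output
# action attached to the output text. Thresholds and action names are
# assumptions, not the disclosed implementation.

def markup_output(text: str, sentiment: float) -> dict:
    """Attach an emotion indicator and multimodal action to an output text."""
    if sentiment > 0.3:
        emotion, action = "happy", "smile_and_raise_arms"
    elif sentiment < -0.3:
        emotion, action = "sad", "lower_head"
    else:
        emotion, action = "neutral", "idle_gesture"
    return {"text": text, "emotion": emotion, "multimodal_action": action}
```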
In some embodiments, an operation 420 may include verifying, by the prohibited speech filter, the one or more output text files do not include prohibited subjects or subject matters. Operation 420 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to an output filtering module 355, in accordance with one or more implementations. In some embodiments, prohibited speech may include violence-related topics and/or sexual related topics.
In some embodiments, an operation 422 may analyze the one or more output text files, the associated emotion indicator parameter or measurement, and/or multimodal output actions to verify conformance with robot device persona parameters and measurements. Operation 422 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to a persona protection module 356, in accordance with one or more implementations. In some embodiments, in operation 424, if the persona protection module 356 determines and/or identifies that the one or more output text files, the associated emotion indicator and the multimodal output actions are not in conformance with the robot's persona, the SocialX chat module 362 or the SocialX intention module 308 may search for acceptable output text files, associated emotion indicators and/or multimodal output actions that match the robot device's persona parameters and/or measurements. In some embodiments, the SocialX chat module 362 or the SocialX intention module 308 may search the one or more memory modules 366 and/or the knowledge database 360 for the acceptable one or more output text files, the associated emotion indicator and the multimodal output actions. In some embodiments, in operation 426, if the acceptable one or more output text files, the associated emotion indicator and the multimodal output actions are located after the search process, the SocialX intention module 308 may communicate the one or more output text files, the emotion indicator and/or the multimodal output actions to the robot computing device.
In some embodiments, in operation 428, if no acceptable output text files, associated emotion indicators or multimodal output actions are located after the search, the SocialX chat module 362 or the SocialX intention module 308 may retrieve redirect text files from the knowledge database 360 and/or the one or more memory modules 366 and may communicate the one or more redirect text files to the markup module 365.
In some embodiments, the factual information may be retrieved from another source, which may be located in the cloud-based computing device. In some embodiments, in operation 433, the factual information may be retrieved from the knowledge database 360 and/or the one or more memory modules 366. Operation 433 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to the SocialX Q&A module 368 and/or the knowledge database 360, in accordance with one or more implementations. After gathering the factual information, in operation 434, the question/answer module 368 and/or the chat module 362 may add the retrieved or obtained factual information to the one or more output text files communicated to the markup module 365.
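Operations 433 and 434 could be sketched as a lookup-and-append step. The knowledge-base contents below are invented for illustration (the Jurassic Park fact echoes the authorship example given earlier in this disclosure), and the function name and normalization are assumptions.

```python
# Hedged sketch of operations 433-434: a factual answer is looked up in
# a small knowledge base and appended to the output text files that go
# on to the markup module. Contents and names are illustrative only.

KNOWLEDGE_BASE = {
    "who wrote jurassic park": "Michael Crichton wrote the book Jurassic Park.",
}

def answer_question(question: str, output_texts: list) -> list:
    """Retrieve factual information and add it to the outgoing text files."""
    fact = KNOWLEDGE_BASE.get(question.lower().strip("?! "))
    if fact is not None:
        output_texts = output_texts + [fact]   # step 434: append the fact
    return output_texts
```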
In some embodiments, a method of establishing or generating multi-turn communications between a robot device and an individual may include: accessing instructions from one or more physical memory devices for execution by one or more processors; executing instructions accessed from the one or more physical memory devices by the one or more processors; storing, in at least one of the physical memory devices, signal values resulting from having executed the instructions on the one or more processors; wherein the accessed instructions are to enhance conversation interaction between the robot device and the individual; and wherein executing the conversation interaction instructions further comprises: receiving, from a speech-to-text recognition computing device, one or more input text files associated with the individual's speech; filtering, via a prohibited speech filter, the one or more input text files to verify the one or more input text files are not associated with prohibited subjects; analyzing the one or more input text files to determine an intention of the individual's speech; and performing actions on the one or more input text files based at least in part on the analyzed intention.
In some embodiments, the method may include generating one or more output text files based on the performed actions; communicating the generated one or more output text files to the markup module; analyzing, by the markup module, the received one or more output text files for sentiment; associating, based at least in part on the sentiment analysis, an emotion indicator and/or multimodal output actions for the robot device with the one or more output text files; verifying, by the prohibited speech filter, that the one or more output text files do not include prohibited subjects; analyzing the one or more output text files, the associated emotion indicator and the multimodal output actions to verify conformance with the robot device persona parameters; and communicating the one or more output text files, the associated emotion indicator and the multimodal output actions to the robot device.
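The end-to-end method recited above (filter input, determine intent, generate output, mark up sentiment into an emotion indicator and multimodal actions, and filter the output again before sending) can be summarized in a toy pipeline. Every function name, the keyword-based prohibited filter, and the punctuation-based intent and sentiment heuristics are hypothetical simplifications for illustration only; the disclosed system would use trained models and the modules described earlier.

```python
# Toy sketch of the conversation-interaction pipeline in the method above.
PROHIBITED = {"violence"}  # assumed example of a prohibited subject list

def prohibited_filter(text: str) -> bool:
    """True when the text file content contains no prohibited subjects."""
    return not any(word in text.lower() for word in PROHIBITED)

def determine_intent(text: str) -> str:
    # Toy intent detection; a real system would use an NLU model
    return "question" if text.rstrip().endswith("?") else "statement"

def markup(text: str) -> dict:
    # Toy sentiment analysis mapping to an emotion indicator and actions
    emotion = "happy" if "!" in text else "neutral"
    actions = ["smile"] if emotion == "happy" else ["idle"]
    return {"text": text, "emotion": emotion, "actions": actions}

def converse(input_text: str):
    """Run one turn: filter input, act on intent, mark up and filter output."""
    if not prohibited_filter(input_text):
        return None  # input rejected by the prohibited speech filter
    intent = determine_intent(input_text)
    reply = "Great question! Let me think." if intent == "question" else "Tell me more."
    if not prohibited_filter(reply):
        return None  # output is also verified before reaching the robot device
    return markup(reply)
```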
Although the present technology has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the technology is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present technology contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.
This Patent Cooperation Treaty (PCT) application claims priority to U.S. provisional patent application Ser. No. 63/303,860, filed Jan. 27, 2022 and entitled “Methods and systems enabling natural language processing, understanding, and generation” and U.S. provisional patent application Ser. No. 63/143,000, filed Jan. 28, 2021 and entitled “SocialX Chat—Methods and systems enabling natural language processing, understanding, and generation on the edge,” the disclosures of which are both hereby incorporated by reference in their entirety. This application is related to SYSTEMS AND METHODS TO MANAGE CONVERSATION INTERACTIONS BETWEEN A USER AND A ROBOT COMPUTING DEVICE OR CONVERSATION AGENT, Application Ser. No. 62/983,592, filed Feb. 29, 2020, and SYSTEMS AND METHODS FOR SHORT- AND LONG-TERM DIALOG MANAGEMENT BETWEEN A ROBOT COMPUTING DEVICE/DIGITAL COMPANION AND A USER, application Ser. No. 62/983,592, filed Feb. 29, 2020, the contents of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US22/14213 | 1/28/2022 | WO |
Number | Date | Country
---|---|---
63303860 | Jan 2022 | US
63143000 | Jan 2021 | US