Currently claimed embodiments of the invention relate to an artificial intelligence system, and more specifically, to a system for predicting and preventing customer churn.
Customer churn is the phenomenon in which customers of a business no longer purchase from or interact with the business. The ability to prevent customer churn is a key factor for success in many types of business. The problem becomes even harder when the service is offered as a mobile application that customers are free to sign up for and cancel at any time. This is the case for products such as human capital management (HCM) mobile solutions that are designed as an intelligent virtual assistant and require no human interaction with staff.
Customer churn detection is very important for different types of business for customer retention and understanding of engagement levels. Also, acquiring new clients almost always costs more than retaining existing ones, so every leaving client is an investment loss and also a potential degradation in the net promoter score (NPS). Monitoring churn is the first step in understanding how good the business is at retaining customers and identifying what actions might result in a higher retention rate.
This becomes crucial with online automated services offered to a large number of remote customers. In this scenario, identifying customers at risk of cancelling the service is an even harder task, and taking marketing actions or applying other classic customer relationship management (CRM) approaches is not feasible or applicable.
Common automated churn detection methods apply classic statistics and data-mining techniques that provide good results but usually are limited to the outcome of informing how likely the customer is to churn. More advanced proposals use techniques to guide specific marketing strategies, which is not an option given the dynamics of a service to many customers that can easily subscribe and cancel without any interaction with staff.
According to an embodiment of the invention, a non-transitory computer-readable medium stores a set of instructions for predicting customer churn, which when executed by a computer, configure the computer to receive a graph data structure storing data associated with activity of a user, the graph data structure including multiple nodes that include a user input node associated with the user. The instructions further configure the computer to update at least the user input node with a vector representation of a received user input, and, using historical user input, train a sentiment model to classify user input according to one of multiple sentiments. The instructions further configure the computer to use the trained sentiment model to classify the received user input as a particular sentiment from multiple sentiments, add to the graph data structure a sentiment node that is associated with the particular sentiment and that is connected to the user input node, and, using the graph data structure, train a churn model to estimate user churn probability. The instructions further configure the computer to use the trained churn model to estimate a particular churn probability for the user.
According to an embodiment of the invention, a non-transitory computer-readable medium stores a set of instructions for preventing customer churn, which when executed by a computer, configure the computer to, for a particular customer, receive a probability that the particular customer is likely to churn, and use multiple user retentions to update a reinforcement model for selecting retention actions for customers. The instructions further configure the computer to, based on a determination that the particular customer is likely to churn, use the updated reinforcement model to select a particular retention action from multiple retention actions, and implement the particular retention action for the particular customer.
According to an embodiment of the invention, a method for predicting customer churn includes receiving a graph data structure storing data associated with activity of a user, the graph data structure including multiple nodes that include a user input node associated with the user. The method further includes updating at least the user input node with a vector representation of a received user input, and, using historical user input, training a sentiment model to classify user input according to one of multiple sentiments. The method further includes using the trained sentiment model to classify the received user input as a particular sentiment from the plurality of sentiments, adding to the graph data structure a sentiment node that is associated with the particular sentiment and that is connected to the user input node, and, using the graph data structure, training a churn model to estimate customer churn probability. The method further includes using the trained churn model to estimate a particular churn probability for the user.
According to an embodiment of the invention, a method for preventing customer churn includes, for a particular customer, receiving a probability that the particular customer is likely to churn, and, using multiple user retentions, updating a reinforcement model for selecting retention actions for customers. The method further includes, based on a determination that the particular customer is likely to churn, using the updated reinforcement model to select a particular retention action from multiple retention actions, and implementing the particular retention action for the particular customer.
Further objectives and advantages will become apparent from a consideration of the description, drawings, and examples.
Some embodiments of the current invention are discussed in detail below. In describing embodiments, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected. A person skilled in the relevant art will recognize that other equivalent components can be employed, and other methods developed, without departing from the broad concepts of the current invention. All references cited anywhere in this specification, including the Background and Detailed Description sections, are incorporated by reference as if each had been individually incorporated.
Some embodiments describe a system and method to address the customer churn prediction problem by monitoring customers, predicting potential churns with accuracy, and automatically acting to prevent these churns. Some embodiments apply Artificial Intelligence (AI) or machine-learning (ML) techniques to automatically and dynamically monitor the activity of users of a business's application, detect customers at risk of canceling their subscription to the product, and proactively explore and learn the best actions to prevent losing those customers. These actions may be automatically applied or suggested as insights for the business.
Some embodiments provide a fully automated solution that predicts churns using graph neural networks, and also learns the best prevention strategy that is automatically applied and evaluated via reinforcement learning techniques.
In some embodiments, the churn prediction uses application data that has been transformed into graph data structures, so that multiple techniques of Machine Learning for Graphs may be applied. These techniques are capable of learning multiple aspects of the relationships revealed by the graph structures, and are able to propagate this information to perform the predictions. The learned aspects of the graph structures are also captured to provide a diversity of insights, such as usage by application domain and the functionalities most associated with the activity of unsatisfied customers.
Some embodiments also explore and learn the best actions to be taken in order to prevent the predicted churns (referred to as customer retention). Some embodiments model the problem as a Multi-Armed Bandit (MAB) problem and treat it using reinforcement learning techniques.
Some embodiments provide a human capital management (HCM) system as a mobile solution, designed as an intelligent virtual assistant interface 100 where customers can perform tasks such as running payroll, hiring employees, and filing taxes as easily as sending a chat message, as shown in
The HCM system of some embodiments offers hundreds of intents, each of which is mapped to a conversation. Each of these conversations can have multiple flows and end up in different actions depending on the answers. Some embodiments use Graph Neural Networks (GNN) or Graph Convolution Networks (GCN) to analyze the diverse ways in which different users interact with the assistant, predict which customers are at high risk of churn, and, given a set of possible actions to prevent the churns, explore the actions, learn the most effective one, and then exploit it.
The neural network of some embodiments is a multi-layer machine-trained network (e.g., a feed-forward neural network). Neural networks, also referred to as machine-trained networks, will be herein described. One class of machine-trained networks is deep neural networks with multiple layers of nodes. Different types of such networks include feed-forward networks, convolutional networks, recurrent networks, regulatory feedback networks, radial basis function networks, long short-term memory (LSTM) networks, and Neural Turing Machines (NTM). Multi-layer networks are trained to execute a specific purpose, including face recognition or other image analysis, voice recognition or other audio analysis, large-scale data analysis (e.g., for climate data), etc. In some embodiments, such a multi-layer network is designed to execute on a mobile device (e.g., a smartphone or tablet), an IoT device, a web browser window, etc.
A typical neural network operates in layers, each layer having multiple nodes. In convolutional neural networks (a type of feed-forward network), a majority of the layers include computation nodes with a (typically) nonlinear activation function, applied to the dot product of the input values (either the initial inputs based on the input data for the first layer, or outputs of the previous layer for subsequent layers) and predetermined (i.e., trained) weight values, along with bias (addition) and scale (multiplication) terms, which may also be predetermined based on training. Other types of neural network computation nodes and/or layers do not use dot products, such as pooling layers that are used to reduce the dimensions of the data for computational efficiency and speed.
For convolutional neural networks that are often used to process electronic image and/or video data, the input activation values for each layer (or at least each convolutional layer) are conceptually represented as a three-dimensional array. This three-dimensional array is structured as numerous two-dimensional grids. For instance, the initial input for an image is a set of three two-dimensional pixel grids (e.g., a 1280×720 RGB image will have three 1280×720 input grids, one for each of the red, green, and blue channels). The number of input grids for each subsequent layer after the input layer is determined by the number of subsets of weights, called filters, used in the previous layer (assuming standard convolutional layers). The size of the grids for the subsequent layer depends on the number of computation nodes in the previous layer, which is based on the size of the filters, and how those filters are convolved over the previous layer input activations. For a typical convolutional layer, each filter is a small kernel of weights (often 3×3 or 5×5) with a depth equal to the number of grids of the layer's input activations. The dot product for each computation node of the layer multiplies the weights of a filter by a subset of the coordinates of the input activation values. For example, the input activations for a 3×3×Z filter are the activation values located at the same 3×3 square of all Z input activation grids for a layer.
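As a minimal illustration of the dot product described above, the following Python sketch evaluates one computation node of a convolutional layer for a 3×3×Z filter and computes the side length of the resulting grid. The function names `conv_node` and `output_size`, the unpadded stride-1 convolution, and the toy values are assumptions made for illustration only and are not part of any described embodiment.

```python
def conv_node(activations, kernel, row, col):
    """Dot product of a K x K x Z filter with the K x K window whose
    top-left corner is (row, col) in each of the Z input grids."""
    total = 0.0
    for z, grid in enumerate(activations):          # one grid per channel
        for i, kernel_row in enumerate(kernel[z]):
            for j, weight in enumerate(kernel_row):
                total += weight * grid[row + i][col + j]
    return total


def output_size(input_size, kernel_size, stride=1):
    """Grid side length after an unpadded convolution (an assumption here)."""
    return (input_size - kernel_size) // stride + 1


# Toy input: Z = 2 grids of size 4 x 4 (values are illustrative only).
grids = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]],
]
# A 3 x 3 x 2 filter: all ones on the first grid, a centre tap on the second.
filt = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
]

value = conv_node(grids, filt, 0, 0)   # 54 from the first grid + 1 from the second
side = output_size(4, 3)               # a 4 x 4 grid yields a 2 x 2 output grid
```

The filter depth (two here) equals the number of input grids, so a single node reads the same 3×3 square from every grid, as described above.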
In this example, the neural network 200 only has one output node 230 that provides a single output 220. Other neural networks of other embodiments have multiple output nodes in the output layer LM that provide more than one output value. In different embodiments, the output 220 of the network is a scalar in a range of values (e.g., 0 to 1), a vector representing a point in an N-dimensional space (e.g., a 128-dimensional vector), or a value representing one of a predefined set of categories (e.g., for a network that classifies each input into one of eight possible outputs, the output could be a three-bit value).
Portions of the illustrated neural network 200 are fully-connected in which each node in a particular layer receives as inputs all of the outputs from the previous layer. For example, all the outputs of layer L0 are shown to be an input to every node in layer L1. The neural networks of some embodiments are convolutional feed-forward neural networks, where the intermediate layers (referred to as “hidden” layers) may include other types of layers than fully-connected layers, including convolutional layers, pooling layers, and normalization layers.
The convolutional layers of some embodiments use a small kernel (e.g., 3×3×3) to process each tile of pixels in an image with the same set of parameters. The kernels (also referred to as filters) are three-dimensional, and multiple kernels are used to process each group of input values in a layer (resulting in a three-dimensional output). Pooling layers combine the outputs of clusters of nodes from one layer into a single node at the next layer, as part of the process of reducing an image (which may have a large number of pixels) or other input item down to a single output (e.g., a vector output). In some embodiments, pooling layers can use max pooling (in which the maximum value among the clusters of node outputs is selected) or average pooling (in which the clusters of node outputs are averaged).
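The two pooling variants mentioned above can be sketched as follows. This is a simplified illustration that assumes non-overlapping square clusters on a single two-dimensional grid; the function name `pool2d` and the sample values are hypothetical.

```python
def pool2d(grid, size, mode="max"):
    """Non-overlapping size x size pooling over a 2-D grid."""
    pooled = []
    for r in range(0, len(grid) - size + 1, size):
        out_row = []
        for c in range(0, len(grid[0]) - size + 1, size):
            cluster = [grid[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(cluster))                 # max pooling
            else:
                out_row.append(sum(cluster) / len(cluster))  # average pooling
        pooled.append(out_row)
    return pooled


grid = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(grid, 2, "max")   # [[7, 8], [9, 6]]
avg_pooled = pool2d(grid, 2, "avg")   # [[4.0, 5.0], [4.5, 3.0]]
```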
Each node computes a dot product of a vector of weight coefficients and a vector of output values of prior nodes (or the inputs, if the node is in the input layer), plus an offset. In other words, a hidden or output node computes a weighted sum of its inputs (which are outputs of the previous layer of nodes) plus an offset (also referred to as a bias). Each node then computes an output value using a function, with the weighted sum as the input to that function. This function is commonly referred to as the activation function, and the outputs of the node (which are then used as inputs to the next layer of nodes) are referred to as activations.
Consider a neural network with one or more hidden layers 240 (i.e., layers that are not the input layer or the output layer). The index variable l can be any of the hidden layers of the network (i.e., l∈{1, . . . M−1}, with l=0 representing the input layer and l=M representing the output layer).
The output y_{l+1} of a node in hidden layer l+1 can be expressed as:
$$y_{l+1} = f\big((w_{l+1} \cdot y_l) * c + b_{l+1}\big) \qquad (1)$$
This equation describes a function whose input is the dot product of a vector of weight values w_{l+1} and a vector of outputs y_l from layer l, which is then multiplied by a constant value c and offset by a bias value b_{l+1}. The constant value c is a value to which all the weight values are normalized. In some embodiments, the constant value c is 1. The symbol * is an element-wise product, while the symbol · is the dot product. The weight coefficients and bias are parameters that are adjusted during the network's training in order to configure the network to solve a particular problem (e.g., object or face recognition in images, voice analysis in audio, depth analysis in images, etc.).
In equation (1), the function ƒ is the activation function for the node. Examples of such activation functions include a sigmoid function (ƒ(x)=1/(1+e^(−x))), a tanh function, or a ReLU (rectified linear unit) function (ƒ(x)=max(0,x)). See Nair, Vinod and Hinton, Geoffrey E., “Rectified linear units improve restricted Boltzmann machines,” ICML, pp. 807-814, 2010, incorporated herein by reference in its entirety. In addition, the “leaky” ReLU function (ƒ(x)=max(0.01*x, x)) has also been proposed, which replaces the flat section (i.e., x<0) of the ReLU function with a section that has a slight slope, usually 0.01, though the actual slope is trainable in some embodiments. See He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” arXiv preprint arXiv:1502.01852, 2015, incorporated herein by reference in its entirety. In some embodiments, the activation functions can be other types of functions, including gaussian functions and periodic functions.
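For illustration, equation (1) and the activation functions named above can be written as a short Python sketch. The weights, inputs, and bias are made-up values, the constant c is treated as a scalar defaulting to 1, and the helper name `node_output` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # Valid for slopes below 1: negative inputs keep a small gradient.
    return max(slope * x, x)


def node_output(weights, inputs, bias, activation, c=1.0):
    """Equation (1): f((w . y) * c + b) for a single node."""
    dot = sum(w * y for w, y in zip(weights, inputs))
    return activation(dot * c + bias)


w = [0.5, -1.0, 2.0]      # illustrative weights
y_prev = [1.0, 2.0, 0.5]  # illustrative outputs of the previous layer
b = -1.0                  # illustrative bias

# The weighted sum is -0.5, so the activation input is -1.5 in every case.
out_relu = node_output(w, y_prev, b, relu)          # 0.0
out_leaky = node_output(w, y_prev, b, leaky_relu)   # about -0.015
out_sigmoid = node_output(w, y_prev, b, sigmoid)
out_tanh = node_output(w, y_prev, b, math.tanh)
```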
Before a multi-layer network can be used to solve a particular problem, the network is put through a supervised training process that adjusts the network's configurable parameters (e.g., the weight coefficients, and additionally in some cases the bias factor). The training process iteratively selects different input value sets with known output value sets. For each selected input value set, the training process typically (1) forward propagates the input value set through the network's nodes to produce a computed output value set and then (2) back-propagates a gradient (rate of change) of a loss function (output error) that quantifies the difference between the input set's known output value set and the input set's computed output value set, in order to adjust the network's configurable parameters (e.g., the weight values).
In some embodiments, training the neural network involves defining a loss function (also called a cost function) for the network that measures the error (i.e., loss) of the actual output of the network for a particular input compared to a pre-defined expected (or ground truth) output for that particular input. During one training iteration (also referred to as a training epoch), a training dataset is first forward-propagated through the network nodes to compute the actual network output for each input in the data set. Then, the gradient of the loss function is back-propagated through the network to adjust the weight values in order to minimize the error (e.g., using first-order partial derivatives of the loss function with respect to the weights and biases, referred to as the gradients of the loss function). The accuracy of these trained values is then tested using a validation dataset (which is distinct from the training dataset) that is forward propagated through the modified network, to see how well the training performed. If the trained network does not perform well (e.g., does not have an error less than a predetermined threshold), then the network is trained again using the training dataset. This cyclical optimization method for minimizing the output loss function, iteratively repeated over multiple epochs, is referred to as stochastic gradient descent (SGD).
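The training cycle described above (forward propagation, gradient-based adjustment, validation, and repetition until the error falls below a threshold) can be sketched on a deliberately tiny model. The example below fits a single weight and bias rather than a multi-layer network; the learning rate, threshold, data, and the name `train_sgd` are all assumptions chosen for illustration.

```python
import random


def train_sgd(train, valid, lr=0.05, threshold=1e-4, max_epochs=500, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(train)
    val_loss = float("inf")
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(samples)
        for x, target in samples:
            pred = w * x + b              # forward propagation
            error = pred - target
            w -= lr * 2.0 * error * x     # gradient of error**2 w.r.t. w
            b -= lr * 2.0 * error         # gradient of error**2 w.r.t. b
        # Validation pass on data that was not used for the updates.
        val_loss = sum((w * x + b - t) ** 2 for x, t in valid) / len(valid)
        if val_loss < threshold:          # good enough: stop training
            break
    return w, b, val_loss, epoch


# Toy data drawn from y = 2x + 1 (illustrative only).
train_set = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(-10, 11)]
valid_set = [(0.25, 1.5), (-0.75, -0.5), (0.55, 2.1)]

w, b, val_loss, epochs_run = train_sgd(train_set, valid_set)
```

In a multi-layer network the two update lines are replaced by back-propagation of the gradient through every layer, but the outer loop has the same shape.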
In some embodiments the neural network is a deep aggregation network, which is a stateless network that uses spatial residual connections to propagate information across different spatial feature scales. Information from different feature scales can branch-off and re-merge into the network in sophisticated patterns, so that computational capacity is better balanced across different feature scales. Also, the network can learn an aggregation function to merge (or bypass) the information instead of using a non-learnable (or sometimes a shallow learnable) operation found in current networks.
Deep aggregation networks include aggregation nodes, which in some embodiments are groups of trainable layers that combine information from different feature maps and pass it forward through the network, skipping over backbone nodes. Aggregation node designs include, but are not limited to, channel-wise concatenation followed by convolution (e.g., DispNet), and element-wise addition followed by convolution (e.g., ResNet). See Mayer, Nikolaus, Ilg, Eddy, Häusser, Philip, Fischer, Philipp, Cremers, Daniel, Dosovitskiy, Alexey, and Brox, Thomas, “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” arXiv preprint arXiv:1512.02134, 2015, incorporated herein by reference in its entirety. See He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian, “Deep Residual Learning for Image Recognition,” arXiv preprint arXiv:1512.03385, 2015, incorporated herein by reference in its entirety.
In some embodiments, the graph data structure stores data for one or more users who are associated with a customer, and the graph also includes a customer node that is associated with (e.g., connected to) the corresponding user input nodes for each user associated with the customer. Each user may also have a user node connected to the corresponding user input node and the associated customer node.
In some embodiments, the graph data structure stores activity data for multiple customers, each with multiple associated users, and has corresponding nodes for the customers, their users, and the users' data input. Some customers may have churned, and the graph data structure may assign a label indicating the churned status (e.g., a “canceled” label) to such customers, or have at least one additional node to indicate the churned status (e.g., a “canceled” node) that is connected to the churned customers.
In some embodiments, the graph data structure is generated by an extract, transform, load process, by extracting data from a relational database that stores a system of records associated with the activity of the user, transforming the extracted data to a database format that natively supports graph data structures, and loading the transformed data into the graph data structure.
An ETL process 410 periodically reads the SOR 405 data, transforms it into the target graph data model, and loads it into a database that supports a graph model 415. Neo4j (Neo4j, Inc., San Mateo CA) and Amazon Neptune (Amazon.com, Inc., Seattle WA) are examples of graph database technologies that can be applied to persist the data natively in graph structures.
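A minimal sketch of the transform step of such an ETL process is given below, using in-memory Python structures in place of a relational source and a graph database. The column names, the `BELONGS_TO` relationship, the customer name, and the function name `etl_to_graph` are hypothetical; only the WRITES relationship is taken from the examples in this description.

```python
def etl_to_graph(user_rows, message_rows):
    """Transform relational rows into node and edge collections."""
    nodes, edges = {}, []
    for row in user_rows:                       # extract + transform users
        customer_id = "customer:" + str(row["customer_id"])
        user_id = "user:" + str(row["user_id"])
        nodes[customer_id] = {"type": "Customer", "name": row["customer_name"]}
        nodes[user_id] = {"type": "User", "name": row["user_name"]}
        edges.append((user_id, "BELONGS_TO", customer_id))
    for row in message_rows:                    # extract + transform messages
        message_id = "message:" + str(row["message_id"])
        nodes[message_id] = {"type": "ChatMessage", "text": row["text"]}
        edges.append(("user:" + str(row["user_id"]), "WRITES", message_id))
    return nodes, edges                         # ready to load into a graph store


# Rows as they might be read from relational tables (hypothetical schema).
user_rows = [
    {"customer_id": 1, "customer_name": "Example Co", "user_id": 7, "user_name": "Joe"},
]
message_rows = [
    {"message_id": 42, "user_id": 7, "text": "Add my new employee"},
]

nodes, edges = etl_to_graph(user_rows, message_rows)
```

The load step would then write `nodes` and `edges` to the graph database through that database's own client, which is omitted here.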
The graph data structure resulting from an ETL process such as the example in
The visualization 500 of the graph data structure may be displayed as a graphical user interface (GUI) on a display in some embodiments. In this case, the GUI allows the user to directly create, read, update, and delete the nodes and edges in the graph data structure, by interacting (e.g., with a mouse, keyboard, and/or touchscreen) with different GUI elements and menus. For example, the GUI may include a menu 501 that lists the different types of nodes as well as listing the current total number of nodes and the current number of nodes of each type. The GUI may also include an edge menu 502 that lists the different types of edges as well as the current total number of edges and the current number of edges of each type. The GUI may also include a graph 503 that graphically depicts the nodes and edges for ease of visualization and interaction. In some embodiments, the GUI may also use coding to indicate different types and properties of the nodes and/or edges, such as (but not limited to) node shape, color, edge type (e.g., solid, dashed, dotted, etc.), edge weight (thickness), and node size.
In the example of
Joe wants to hire a new employee, so initiates this action by writing an “Add my new employee” message, e.g., by speaking or typing into the interface 100 of the intelligent virtual assistant. In the graph data structure, the chat message is represented by a chat message node 520, which is connected to the user node 505 by an edge 522 to indicate a WRITES relationship, since Joe wrote the chat message.
The chat message is processed by an ensemble of Natural Language Processing Machine Learning models that identify and resolve the chat message to a HIRE intent. The HIRE intent is represented by an intent node 525, which is connected to the chat message node 520 by an edge 527 to indicate a RESOLVED_TO relationship.
The HIRE intent starts a “Worker Hire” chat conversation, which is represented in the graph data structure by a conversation node 530. The conversation node 530 is connected to the intent node 525 by an edge 532 that indicates a STARTS relationship. Through the conversation, the intelligent virtual assistant asks Joe questions (through the interface 100) in order to get all the information necessary to complete the hiring process (e.g., new employee name, tax information, contact information, salary, role, etc.). The user node 505 is connected to the conversation node 530 by an edge 534 that indicates an ANSWERS relationship. At the end of the conversation, the user is asked to provide feedback about how satisfied they were with the process. The feedback is represented by a feedback node 535, which is connected to the user node 505 by an edge 537 that indicates a GIVES relationship.
In some embodiments, the feedback is applied to reinforcement learning techniques that will be discussed in more detail below, and may also be used to evaluate the conversation process, which can be improved based on user suggestions. For example, in the graph data structure, the feedback node 535 is connected to the conversation node 530 by an edge 539 that indicates an EVALUATES relationship.
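The relationships walked through in this example can be written down as a small in-memory graph. The sketch below covers only the nodes and edges named in the preceding paragraphs, so its counts are not meant to match the totals shown in the visualization; the node identifiers, the edge directions, and the helper `targets` are assumptions made for illustration.

```python
# Nodes keyed by a hypothetical identifier; "type" mirrors the node kinds above.
nodes = {
    "joe": {"type": "User"},
    "msg": {"type": "ChatMessage", "text": "Add my new employee"},
    "hire": {"type": "Intent", "name": "HIRE"},
    "conv": {"type": "Conversation", "name": "Worker Hire"},
    "fb": {"type": "Feedback"},
}

# Directed edges as (source, relationship, target); directions are assumed.
edges = [
    ("joe", "WRITES", "msg"),
    ("msg", "RESOLVED_TO", "hire"),
    ("hire", "STARTS", "conv"),
    ("joe", "ANSWERS", "conv"),
    ("joe", "GIVES", "fb"),
    ("fb", "EVALUATES", "conv"),
]


def targets(source, relationship=None):
    """Nodes reachable from `source`, optionally filtered by relationship."""
    return [t for s, r, t in edges if s == source and relationship in (None, r)]


# Per-type node counts, similar in spirit to what a node menu might list.
node_counts = {}
for node in nodes.values():
    node_counts[node["type"]] = node_counts.get(node["type"], 0) + 1

joe_activity = targets("joe")              # ["msg", "conv", "fb"]
joe_feedback = targets("joe", "GIVES")     # ["fb"]
```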
In the example visualization 500 of
Returning to
The generated semantic vector representations 615 (also known as embeddings) are added as features to those nodes in the graph model 415 that represent the text messages 610. This information can be used in many NLP tasks like text classification and semantic analysis, which are required for the sentiment classification operation that is performed in some embodiments as described in further detail with reference below to operation 330 of process 300.
Returning to
The sentiment model is trained in some embodiments using historical user input. For example, the sentiment model may be multiple convolution neural networks, arranged in parallel and trained in parallel using the historical user input.
For example, in some embodiments, an ensemble of bootstrapped deep learning models is adopted to perform this classification task and to classify the sentiment along with an uncertainty estimate. The uncertainty estimate is based on the entropy of the predictions of each member of the ensemble and is important to prevent model misbehavior when facing chat messages outside the training distribution. As an example, if the model is trained on a distribution of words from the HCM domain, its outcomes for text messages about physics are potentially unpredictable. The ensemble approach returns high uncertainty for these cases, so special actions can be taken, like discarding the prediction.
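One way such an entropy-based uncertainty check could work is sketched below. It assumes that each ensemble member outputs a probability for each of three sentiments, that the members' probabilities are averaged, and that the entropy of the averaged distribution is compared against a cut-off; the cut-off value, the sample probabilities, and the function names are hypothetical, and other ways of combining per-member entropies are possible.

```python
import math

SENTIMENTS = ["positive", "neutral", "negative"]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_with_uncertainty(member_probs, max_entropy=0.8):
    """Average the members' class probabilities and reject uncertain inputs."""
    n = len(member_probs)
    mean = [sum(m[k] for m in member_probs) / n for k in range(len(SENTIMENTS))]
    uncertainty = entropy(mean)
    if uncertainty > max_entropy:
        return None, uncertainty          # prediction discarded
    best = max(range(len(mean)), key=mean.__getitem__)
    return SENTIMENTS[best], uncertainty


# Members agree: low entropy, so the prediction is kept.
in_domain = [[0.90, 0.08, 0.02], [0.85, 0.10, 0.05], [0.92, 0.05, 0.03]]
# Members disagree (e.g., an out-of-distribution message): high entropy.
out_of_domain = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

label_in, unc_in = classify_with_uncertainty(in_domain)
label_out, unc_out = classify_with_uncertainty(out_of_domain)
```

With three classes the entropy is bounded by ln 3 (about 1.10 nats), which the disagreeing ensemble reaches, so its prediction is discarded.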
This process is applied to text messages entered by the users and the information of resulting classifications is added to the graph as new nodes representing the sentiment.
After updating the graph data structure with sentiment classification, the user feedback from Joe is associated with the “positive” sentiment. The positive sentiment is represented in the graph data structure by a positive sentiment node 905, that is connected to feedback node 535 by an edge 907 that indicates a FEELS relationship. Also, the text message is associated with the “neutral” sentiment. The neutral sentiment is represented by a neutral sentiment node 910, that is connected to the chat message node 520 by an edge 912 that indicates a FEELS relationship. In addition, the graph data structure may also have a negative sentiment node 915. In this example, none of the inputs have been classified by the ensemble model 800 as having a negative sentiment, so there are no edges connecting the negative sentiment node 915 to any other node.
In some embodiments, the indicator 551 of the menu 501 is also updated to indicate that there are now ten total nodes in the graph data structure. This is because three new sentiment nodes (nodes 905, 910, 915) were added after sentiment classification of the chat message and the user feedback. Likewise, the indicator 553 of the edge menu 502 also has been updated to indicate that there are now ten total edges in the graph data structure. This is to indicate the two new FEELS relationships (edges 907, 912), between the positive sentiment node 905 and the user feedback node 535, and between the neutral sentiment node 910 and the chat message node 520. In other embodiments, the visualization is not updated.
In some embodiments, only a single sentiment node for each possible sentiment is present in the graph data structure, with edge connections to all relevant nodes that have been classified according to that sentiment (if any). In that case, the total number of sentiment nodes would be constant and equal to the number of possible sentiments (e.g., three, when the sentiments are positive, neutral, and negative). In other embodiments, each node that has been classified according to a sentiment is connected to a separate sentiment node of the appropriate type. In that case, the total number of sentiment nodes would be equal to the total number of nodes that have been classified by sentiment. In still other embodiments, no sentiment nodes may be added to the graph data structure, and instead the sentiment to which the user input was classified is added to the corresponding node as a metadata update, analogous to updating the node with the vector representation in operation 320 of process 300.
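The first variant above, in which one shared node exists per sentiment, can be sketched as follows. For brevity this sketch creates a sentiment node only when it is first needed, whereas an embodiment may create all sentiment nodes up front; the identifiers, the edge direction, and the function name `add_sentiment` are hypothetical.

```python
def add_sentiment(nodes, edges, target_id, sentiment):
    """Connect target_id to the single shared node for `sentiment`."""
    sentiment_id = "sentiment:" + sentiment
    # Reuse the shared node if it already exists, otherwise create it.
    nodes.setdefault(sentiment_id, {"type": "Sentiment", "name": sentiment})
    edge = (target_id, "FEELS", sentiment_id)
    if edge not in edges:                 # avoid duplicate FEELS edges
        edges.append(edge)
    return sentiment_id


nodes = {
    "fb": {"type": "Feedback"},
    "msg": {"type": "ChatMessage"},
    "msg2": {"type": "ChatMessage"},
}
edges = []

add_sentiment(nodes, edges, "fb", "positive")
add_sentiment(nodes, edges, "msg", "neutral")
add_sentiment(nodes, edges, "msg2", "neutral")   # reuses the neutral node

sentiment_nodes = sorted(n for n in nodes if n.startswith("sentiment:"))
```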
Returning to
In some embodiments, the nodes and edges of the graph data structure are mapped to dense vector representations, analogous to the semantic vector representations generated in operation 320 of process 300. The difference is that instead of learning language representations of text messages, here the learned vector representations describe many aspects of the resulting graph structures, such as community structures and roles of nodes. This latent information carried by these learned vector representations can be used in many downstream graph analytics tasks, such as graph visualization, node classification, link prediction, and graph clustering. In the context of churn prediction, node classification and link prediction are tasks that help solve the problem.
In some embodiments, a second-order random walk technique named Node2Vec [4] is adopted because of its capability to preserve structural equivalence and explore nodes' neighborhoods with higher orders of proximity. In other embodiments, other representation learning techniques can be adopted, like Structural Deep Network Embedding (SDNE) [5] and Higher-Order Proximity preserved Embedding (HOPE) [6], which are also being experimented with.
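To make the notion of a second-order random walk concrete, the sketch below generates one biased walk in the style of Node2Vec, where the next step depends on both the current and the previous node. It covers only walk generation; learning vectors from the walks (typically with a skip-gram model) is omitted. The toy graph, the parameter values, and the function name `node2vec_walk` are assumptions for illustration.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One second-order biased random walk over an undirected graph.

    adj maps each node to a list of neighbours; p is the return parameter
    and q is the in-out parameter.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        candidates = adj[current]
        if not candidates:
            break
        if len(walk) == 1:
            walk.append(rng.choice(candidates))   # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for nxt in candidates:
            if nxt == previous:
                weights.append(1.0 / p)           # return to the previous node
            elif nxt in adj[previous]:
                weights.append(1.0)               # stay close to the previous node
            else:
                weights.append(1.0 / q)           # move further away
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


# Small undirected toy graph (hypothetical identifiers).
adj = {
    "joe": ["msg", "conv", "fb"],
    "msg": ["joe", "hire"],
    "hire": ["msg", "conv"],
    "conv": ["hire", "joe", "fb"],
    "fb": ["joe", "conv"],
}

walk = node2vec_walk(adj, "joe", 8, p=0.5, q=2.0, rng=random.Random(7))
```

A small p makes the walk more likely to backtrack and a large q keeps it near its starting neighborhood, which is how the technique trades off local and structural views of the graph.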
This representation learning task can be applied synchronously or asynchronously after the graph updates, and the resulting graph updated with the learned representations is used to predict potential churns, as described with reference to operation 350 of process 300, described in further detail below.
Returning to
In some embodiments, the churn model is trained using the graph data structure, updated with the corresponding sentiment nodes, as well as the learned vector representations of the graph. In some embodiments, the churn model has multiple convolution neural networks that are arranged in series. The churn model may be trained in some embodiments whenever a predetermined amount or type of data (e.g., new customer nodes, new user input, etc.) has been added to the graph model 415, after a predetermined period of time, or according to other criteria so that the churn model remains up to date.
In some embodiments, the process 300 performs the churn probability estimation using the graph structures and vector representations learned in previous steps as input features to multiple graph convolution networks (GCNs). The GCNs perform the classification by aggregating the features of each node and its neighbors and passing them through multiple neural network layers to reduce the dimension of the representations, a process called feature smoothing.
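The aggregation step described above can be illustrated with a single simplified graph-convolution layer. This sketch averages the features of a node and its neighbors, projects the result through a weight matrix to a lower dimension, and applies a ReLU; it uses plain mean aggregation rather than any particular normalization, and the graph, features, weights, and the name `gcn_layer` are hypothetical.

```python
def gcn_layer(adj, features, weight):
    """One simplified graph-convolution layer: mean-aggregate, project, ReLU."""
    out = {}
    for node, neigh in adj.items():
        group = [node] + list(neigh)                    # node plus its neighbours
        dim = len(features[node])
        mean = [sum(features[g][d] for g in group) / len(group) for d in range(dim)]
        projected = [sum(mean[d] * weight[d][k] for d in range(dim))
                     for k in range(len(weight[0]))]
        out[node] = [max(0.0, v) for v in projected]    # ReLU activation
    return out


adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
weight = [[1.0], [0.5]]     # projects 2-dimensional features down to 1 dimension

hidden = gcn_layer(adj, features, weight)
```

Stacking several such layers lets information from nodes several hops away (for example, a canceled status) influence a node's final representation.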
Being a semi-supervised process, feature smoothing needs some examples of real churns. This information is added to the graph in some embodiments by labeling clients or customers that previously churned (e.g., cancelled the service) with a cancelled label.
In this example, Nakatomi Corporation has previously churned. This is represented in the graph data structure by a canceled node 1220, which is connected to the customer node 1210 by an edge 1222 to indicate a STATUS relationship. The menu 501, edge menu 502, and their associated indicators 551, 553 may also be updated in some embodiments to reflect the additional nodes and edges associated with the second customer (Nakatomi Corporation), its employees, and user inputs. In other embodiments, the customer's status (e.g., “canceled,” “renewed,” “new,” etc.) may be represented as a label that is applied to the customer node as metadata, instead of being represented by a separate node.
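One possible way to record a churned customer with a separate status node, and to derive training labels from it, is sketched below. The dictionary layout, the node identifiers, and the helper names mark_canceled and churn_labels are hypothetical and serve only to illustrate the STATUS relationship; a customer without a status edge is treated as unlabeled, consistent with a semi-supervised setting.

```python
def mark_canceled(graph, customer_id):
    """Connect a customer node to a shared 'canceled' node by a STATUS edge."""
    graph["nodes"].setdefault("status:canceled", {"type": "Status"})
    edge = (customer_id, "STATUS", "status:canceled")
    if edge not in graph["edges"]:
        graph["edges"].append(edge)

def churn_labels(graph):
    """Return 1 for customers known to have churned and None for unlabeled ones."""
    canceled = {
        src for src, rel, dst in graph["edges"]
        if rel == "STATUS" and dst == "status:canceled"
    }
    return {
        node_id: (1 if node_id in canceled else None)
        for node_id, attrs in graph["nodes"].items()
        if attrs["type"] == "Customer"
    }

# Hypothetical graph with two customers; only one has churned.
graph = {
    "nodes": {
        "customer:nakatomi": {"type": "Customer", "name": "Nakatomi Corporation"},
        "customer:other": {"type": "Customer", "name": "Hypothetical Other Co."},
    },
    "edges": [],
}
mark_canceled(graph, "customer:nakatomi")
labels = churn_labels(graph)
```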
The visualizations 500, 900, 1200 of the graph data structure are optional in some embodiments. The underlying graph data structure may be manipulated, updated, etc. without visually reflecting every change.
The GCNs are applied to the graph and nodes classified with high potential of discontinuing the service are identified. In some embodiments, these identified nodes are targeted for strategic churn prevention and customer retention actions.
At 1320, the process 1300 uses the probability to make a determination whether the customer is likely to churn. If the process 1300 determines that the customer is not likely to churn, then the process 1300 ends. If the process 1300 determines that the customer is likely to churn, the process 1300 continues to 1330, which is described below.
At 1330, the process 1300 selects a retention action for the customer, using a reinforcement model that explores a plurality of retention actions, learns the optimal retention action for each context by reinforcement learning based on received rewards from successes, and exploits the optimal retention action by implementing it for target customers for the context.
In some embodiments, the reinforcement model learns, from a pre-defined set of retention actions, an optimal retention action for each context, by exploring the options (i.e., implementing the actions for customers with high probability of churn) and learning from rewards received for the cases of success. In some embodiments, the success is inferred by checking whether the customer with high probability of churn keeps using the application after being offered a retention action, e.g., after a pre-defined period of time. Selecting the retention action is also based on other customer characteristics, including the customer type (e.g., type of business, including but not limited to retail, manufacturing, food service, professional, etc.) and the customer size (e.g., number of users, employees, etc.), and this is modeled as the context. Once the model has learned the optimal retention action (the one with the highest probability of receiving a reward) for each context, it begins to exploit that action by selecting it more frequently.
In some embodiments, the reinforcement model deliberately (e.g., randomly) selects other retention actions that are not optimal. This is because there is a trade-off between exploration (random selection of retention actions, made without relying on information from prior results) and exploitation (observing results and using those results to determine the optimal retention actions). Once result feedback from selected retention actions is received, the reinforcement model can incorporate that knowledge (e.g., by creating or modifying policies). That knowledge is exploited, but in order to anticipate temporal changes in context, some fraction of retention actions is still chosen in an exploratory manner. The ratio of exploration to exploitation retention actions is defined by the policies that define the reinforcement model. For example, in some embodiments, the exploration to exploitation ratio is 10%. The results of exploration-based retention are also used to update the policies in the model.
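A fixed exploration fraction of this kind can be illustrated with a simple epsilon-greedy selection rule. The sketch below assumes that the 10% figure denotes the fraction of selections made at random; the function name, the action names, and the reward estimates are hypothetical.

```python
import random

def select_action(estimates, explore_rate=0.10, rng=None):
    """Pick a retention action, exploring at random a fixed fraction of the time."""
    rng = rng or random.Random()
    if rng.random() < explore_rate:
        # Exploration: ignore the estimates and pick any available action.
        return rng.choice(sorted(estimates)), "explore"
    # Exploitation: pick the action with the highest estimated reward.
    return max(estimates, key=estimates.get), "exploit"

# Hypothetical reward estimates learned from earlier feedback.
estimates = {"discount": 0.6, "help_info": 0.2, "free_trial": 0.4}
action, mode = select_action(estimates, rng=random.Random(7))
```

Because the exploratory selections keep producing feedback for every action, the estimates can follow changes in the environment, such as a newly added retention action.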
In some embodiments, the reinforcement model observes the results of selecting retention actions. Based on these observations, the reinforcement model learns to select the optimal action based on the context for that customer. The contextual inputs to the reinforcement model may include one or more of the probability of churn, the available retention actions, observations of the results of previous retention actions, and other environmental details. The results may include an assessment of success or failure in retaining the customer after a defined period of time (e.g., two weeks). The output of the reinforcement model is the selection of the retention action.
In some embodiments, the reinforcement model may be considered a “black box” with the above input and output. Inside the model, rules and policies define how to process the input to determine the output. These rules and policies are updated on a regular basis, and that update process is also defined by the rules of the model itself.
At 1340, the process 1300 implements the selected retention action for the customer. The process 1300 then ends. The customers (nodes of the graph) identified with high probability of discontinuing the service need to receive special care. Strategic actions that can be offered in an attempt to retain these customers include detailed help information, free product package trials, live chat support redirection, or even discounts.
Given a set of such strategic customer retention actions to prevent losing these customers,
Also, in some embodiments, the unknown reward probabilities can change depending on the context. As an analogy, a given move of a chess piece (the bandit) can have different results depending on the state of the game (context). In the context of customer retention, a churn prevention action (bandit) can be more or less effective depending on the customer's industry type, market environment, years of service usage, geographic location, or size of the business in terms of revenue, number of employees, etc. (context). This is known as Contextual Bandits [8], a variation of the MAB problem.
The purpose of this step is to automatically select and offer the available options, and, based on feedback in the form of rewards and the context, learn in an efficient way which action is the most successful one. For this task, any reinforcement learning technique can be adopted that solves the MAB problem. In some embodiments, the proposed solution adopts a technique called Thompson Sampling [9]. This technique is based on the idea of probability matching [10], where the actions are selected based on reward estimations. These reward estimations are sampled from distributions that are updated based on the history of rewards.
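A minimal sketch of Thompson Sampling for this setting is shown below. It assumes binary rewards (the customer was retained or not), a Beta distribution per context and action with a uniform prior, and one independent set of distributions for each discrete context; these modeling choices, the class name, the action names, and the simulated retention rates are assumptions for illustration rather than details of the described embodiments.

```python
import random
from collections import defaultdict

class ContextualThompsonSampler:
    """Keeps one Beta distribution per (context, action) pair."""

    def __init__(self, actions, rng=None):
        self.actions = list(actions)
        self.rng = rng or random.Random()
        # [alpha, beta] starts at [1, 1], i.e. a uniform prior on the reward rate.
        self.params = defaultdict(lambda: [1.0, 1.0])

    def select(self, context):
        # Sample a reward estimate for each action and pick the largest sample.
        samples = {
            a: self.rng.betavariate(*self.params[(context, a)])
            for a in self.actions
        }
        return max(samples, key=samples.get)

    def update(self, context, action, retained):
        # A success increments alpha; a failure increments beta.
        self.params[(context, action)][0 if retained else 1] += 1.0

# Simulated feedback with hypothetical true retention rates for one context.
rng = random.Random(42)
sampler = ContextualThompsonSampler(["discount", "help_info", "live_chat"], rng=rng)
true_rates = {"discount": 0.7, "help_info": 0.3, "live_chat": 0.4}
context = "retail/small"
history = []
for _ in range(2000):
    action = sampler.select(context)
    retained = rng.random() < true_rates[action]
    sampler.update(context, action, retained)
    history.append(action)
```

As feedback accumulates, the distribution of the most successful action concentrates around a higher value, so that action is sampled as the winner more often, while the remaining actions are still selected occasionally.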
In some embodiments, multiple models may be applied for different contexts, and the output of the multiple models compared to find a consensus or majority opinion on the selected action. As an example, one model may be implemented to learn the best retention action based on customer size, another model to learn the best action based on geographic location, and a third model to learn the best option based on the revenue of the customers. Each customer may be assigned to different classifications based on their profile according to these different contexts, to provide one or more proposed actions, from which a final action can be selected.
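The comparison of proposals from several context-specific models can be sketched as a majority vote. The function name, the model names, and the alphabetical tie-breaking rule are assumptions for this example; other ways of reaching a consensus are equally possible.

```python
from collections import Counter

def consensus_action(proposals):
    """Return the action proposed by the most models (ties broken alphabetically)."""
    counts = Counter(proposals.values())
    top = max(counts.values())
    tied = sorted(action for action, count in counts.items() if count == top)
    return tied[0]

# Hypothetical proposals from three models, one per context dimension.
proposals = {
    "customer_size": "discount",
    "geographic_location": "live_chat",
    "revenue": "discount",
}
final_action = consensus_action(proposals)
```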
Once the action with the best distribution is identified, it is then exploited (selected as the one to be offered more frequently than the others), but the other options continue to be explored at times. The tradeoff between exploration and exploitation is important to keep the ability to identify changes in the environment. For instance, new retention actions added as options, or changes in customer profiles can be identified and learned automatically.
It is important to note that this ability to learn the best offer to the customer based on reward and penalty feedback can be applied not only to select customer churn prevention actions, but also for general advertisement purposes.
The peripherals interface 1515 is coupled to various sensors and subsystems, including a camera subsystem 1520, an audio subsystem 1530, an I/O subsystem 1535, and other sensors 1545 (e.g., motion/acceleration sensors), etc. The peripherals interface 1515 enables communication between the processing units 1510 and various peripherals. For example, an orientation sensor (e.g., a gyroscope) and an acceleration sensor (e.g., an accelerometer) can be coupled to the peripherals interface 1515 to facilitate orientation and acceleration functions. The camera subsystem 1520 is coupled to one or more optical sensors (e.g., charge-coupled device (CCD) optical sensors, complementary metal-oxide-semiconductor (CMOS) optical sensors, etc.). The camera subsystem 1520 and the optical sensors facilitate camera functions, such as image and/or video data capturing.
The audio subsystem 1530 couples with a speaker to output audio (e.g., to output voice navigation instructions). Additionally, the audio subsystem 1530 is coupled to a microphone to facilitate voice-enabled functions, such as voice recognition, digital recording, etc. The I/O subsystem 1535 involves the transfer between input/output peripheral devices, such as a display, a touch screen, etc., and the data bus of the processing units 1510 through the peripherals interface 1515. The I/O subsystem 1535 includes various input controllers 1560 to facilitate the transfer between input/output peripheral devices and the data bus of the processing units 1510. These input controllers 1560 couple to various input/control devices, such as one or more buttons, a touch-screen, etc. The input/control devices couple to various dedicated or general controllers, such as a touch-screen controller 1565.
In some embodiments, the device includes a wireless communication subsystem (not shown in
As illustrated in
The memory 1570 may represent multiple different storages available on the device 1500. In some embodiments, the memory 1570 includes volatile memory (e.g., high-speed random access memory), non-volatile memory (e.g., flash memory), a combination of volatile and non-volatile memory, and/or any other type of memory.
The instructions described above are merely examples and the memory 1570 includes additional and/or other instructions in some embodiments. For instance, the memory for a smartphone may include phone instructions to facilitate phone-related processes and functions. An IoT device, for instance, might have fewer types of stored instructions (and fewer subsystems) to perform its specific purpose, and might have the ability to receive a single type of input that is evaluated with its neural network.
The above-identified instructions need not be implemented as separate software programs or modules. Various other functions of the device can be implemented in hardware and/or in software, including in one or more signal processing and/or application specific integrated circuits. For example, a neural network parameter memory stores the weight values, bias parameters, etc. for implementing one or more machine-trained networks by the integrated circuit 1505. Different clusters of cores can implement different machine-trained networks in parallel in some embodiments. In different embodiments, these neural network parameters are stored on-chip (i.e., in memory that is part of the integrated circuit 1505) or loaded onto the integrated circuit 1505 from the memory 1570 via the processing unit(s) 1510.
While the components illustrated in
The bus 1605 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 1600. For instance, the bus 1605 communicatively connects the processing unit(s) 1610 with the read-only memory 1630, the system memory 1625, and the permanent storage device 1635.
From these various memory units, the processing unit(s) 1610 retrieves instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.
The read-only-memory 1630 stores static data and instructions that are needed by the processing unit(s) 1610 and other modules of the electronic system. The permanent storage device 1635, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 1600 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 1635.
Other embodiments use a removable storage device (such as a floppy disk, flash drive, etc.) as the permanent storage device. Like the permanent storage device 1635, the system memory 1625 is a read-and-write memory device. However, unlike storage device 1635, the system memory is a volatile read-and-write memory, such as a random-access memory. The system memory stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 1625, the permanent storage device 1635, and/or the read-only memory 1630. From these various memory units, the processing unit(s) 1610 retrieves instructions to execute and data to process in order to execute the processes of some embodiments.
The bus 1605 also connects to the input devices 1640 and output devices 1645. The input devices enable the user to communicate information and select commands to the electronic system. The input devices 1640 include alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output devices 1645 display images generated by the electronic system. The output devices include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments include devices such as a touchscreen that function as both input and output devices.
Finally, as shown in
As used in this specification, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. As used in this specification, the terms “computer readable medium,” “computer readable media,” and “machine readable medium,” etc. are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.
The term “computer” is intended to have a broad meaning that may be used in computing devices such as, e.g., but not limited to, standalone or client or server devices. The computer may be, e.g., (but not limited to) a personal computer (PC) system running an operating system such as, e.g., (but not limited to) MICROSOFT® WINDOWS® available from MICROSOFT® Corporation of Redmond, Wash., U.S.A. or an Apple computer executing MAC® OS from Apple® of Cupertino, Calif., U.S.A. However, the invention is not limited to these platforms. Instead, the invention may be implemented on any appropriate computer system running any appropriate operating system. In one illustrative embodiment, the present invention may be implemented on a computer system operating as discussed herein. The computer system may include, e.g., but is not limited to, a main memory, random access memory (RAM), and a secondary memory, etc. Main memory, random access memory (RAM), and a secondary memory, etc., may be a computer-readable medium that may be configured to store instructions configured to implement one or more embodiments and may comprise a random-access memory (RAM) that may include RAM devices, such as Dynamic RAM (DRAM) devices, flash memory devices, Static RAM (SRAM) devices, etc.
The secondary memory may include, for example, (but not limited to) a hard disk drive and/or a removable storage drive, representing a floppy diskette drive, a magnetic tape drive, an optical disk drive, a read-only compact disk (CD-ROM), digital versatile discs (DVDs), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), read-only and recordable Blu-Ray® discs, etc. The removable storage drive may, e.g., but is not limited to, read from and/or write to a removable storage unit in a well-known manner. The removable storage unit, also called a program storage device or a computer program product, may represent, e.g., but is not limited to, a floppy disk, magnetic tape, optical disk, compact disk, etc. which may be read from and written to the removable storage drive. As will be appreciated, the removable storage unit may include a computer usable storage medium having stored therein computer software and/or data.
In alternative illustrative embodiments, the secondary memory may include other similar devices for allowing computer programs or other instructions to be loaded into the computer system. Such devices may include, for example, a removable storage unit and an interface. Examples of such may include a program cartridge and cartridge interface (such as, e.g., but not limited to, those found in video game devices), a removable memory chip (such as, e.g., but not limited to, an erasable programmable read only memory (EPROM), or programmable read only memory (PROM)) and associated socket, and other removable storage units and interfaces, which may allow software and data to be transferred from the removable storage unit to the computer system.
Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.
The computer may also include an input device, which may include any mechanism or combination of mechanisms that may permit information to be input into the computer system from, e.g., a user. The input device may include logic configured to receive information for the computer system from, e.g., a user. Examples of the input device may include, e.g., but not limited to, a mouse, pen-based pointing device, or other pointing device such as a digitizer, a touch sensitive display device, and/or a keyboard or other data entry device (none of which are labeled). Other input devices may include, e.g., but not limited to, a biometric input device, a video source, an audio source, a microphone, a web cam, a video camera, and/or another camera. The input device may communicate with a processor either wired or wirelessly.
The computer may also include output devices, which may include any mechanism or combination of mechanisms that may output information from a computer system. An output device may include logic configured to output information from the computer system. Embodiments of output device may include, e.g., but not limited to, display, and display interface, including displays, printers, speakers, cathode ray tubes (CRTs), plasma displays, light-emitting diode (LED) displays, liquid crystal displays (LCDs), vacuum fluorescent displays (VFDs), surface-conduction electron-emitter displays (SEDs), field emission displays (FEDs), etc. The computer may include input/output (I/O) devices such as, e.g., (but not limited to) communications interface, cable and communications path, etc. These devices may include, e.g., but are not limited to, a network interface card, and/or modems. The output device may communicate with the processor either wired or wirelessly. A communications interface may allow software and data to be transferred between the computer system and external devices.
The term “data processor” is intended to have a broad meaning that includes one or more processors, such as, e.g., but not limited to, processors that are connected to a communication infrastructure (e.g., but not limited to, a communications bus, cross-over bar, interconnect, or network, etc.). The term data processor may include any type of processor, microprocessor and/or processing logic that may interpret and execute instructions, including application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). The data processor may comprise a single device (e.g., a single core) and/or a group of devices (e.g., multi-core). The data processor may include logic configured to execute computer-executable instructions configured to implement one or more embodiments. The instructions may reside in main memory or secondary memory. The data processor may also include multiple independent cores, such as a dual-core processor or a multi-core processor. The data processors may also include one or more graphics processing units (GPU) which may be in the form of a dedicated graphics card, an integrated graphics solution, and/or a hybrid graphics solution. Various illustrative software embodiments may be described in terms of this illustrative computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement the invention using other computer systems and/or architectures.
The term “data storage device” is intended to have a broad meaning that includes removable storage drive, a hard disk installed in hard disk drive, flash memories, removable discs, non-removable discs, etc. In addition, it should be noted that various electromagnetic radiation, such as wireless communication, electrical communication carried over an electrically conductive wire (e.g., but not limited to twisted pair, CAT5, etc.) or an optical medium (e.g., but not limited to, optical fiber) and the like may be encoded to carry computer-executable instructions and/or computer data that implement embodiments of the invention on, e.g., a communication network. These computer program products may provide software to the computer system. It should be noted that a computer-readable medium that comprises computer-executable instructions for execution in a processor may be configured to store various embodiments of the present invention.
The term “network” is intended to include any communication network, including a local area network (“LAN”), a wide area network (“WAN”), an Intranet, or a network of networks, such as the Internet.
The term “software” is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement a software invention described here is within the scope of the invention. In some embodiments, the software programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.
The various embodiments described above are provided by way of illustration only and should not be construed to limit the scope of the disclosure. Various modifications and changes may be made to the principles described herein without following the example embodiments and applications illustrated and described herein, and without departing from the spirit and scope of the disclosure.