The present disclosure generally relates to messaging, such as might be used to communicate information to end users of a product or service. More specifically, but not by way of limitation, the present disclosure relates to statistically based techniques for evaluating various messaging treatments on an experimental basis in order to optimize message content, format, delivery channel, etc.
A/B tests can be used for conducting experiments to determine which of two alternative commercial treatments, for example, alternative messages or kinds of messages to users or consumers, provides the best experience for the audience. The term “best” in this context can be defined differently depending on a goal to be achieved. For example, “best” can be defined in terms of some metric of significance, for example, the portion of visitors to a web site or recipients of the message that respond positively in some way, or the average time that a user spends on a web site. The term “message” can refer to a text message, email message, push message, instant message, or the like. The term “message” can also refer to a web site, web page, or a portion thereof, considering design, color scheme, descriptive text, fonts, or any other characteristic. In order to conduct an A/B test of messaging treatments, two different messages are provided, and each is forwarded to or otherwise provided to a different group of recipients. Data describing the responses from within the two groups is collected and compared. This analysis can be used in determining the best alternative as between the two messages.
Certain aspects and features of the present disclosure relate to providing anytime-valid confidence sequences for multiple treatments. For example, a method involves transmitting each of multiple test messages to an independent group of recipients, and assaying, over time, using a response module, a metric corresponding to a message response from the independent group of recipients for each of the test messages. The method further involves deriving, iteratively over time using a difference module, a comparative difference between an assayed value of the metric for the message response and a baseline value of the metric. The method also involves estimating, iteratively over time using a variance module, a variance of an average of the metric for the test messages. The method involves calculating, iteratively over time using a confidence module and based on the variance and an error-corrected p-value normalized within confidence bounds, a current confidence value corresponding to a current difference value for the comparative difference. The method additionally involves displaying, while updating over time using an interface module, the current confidence value to produce a confidence sequence.
Other embodiments include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of a method.
This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this disclosure, any or all drawings, and each claim.
Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings, where:
Messaging treatments can be tested and compared in pairs, by producing two different messages and forwarding or otherwise providing each message to a different group of recipients. One message is a baseline, perhaps a messaging treatment currently in use, and the other is a proposed, new message or a new type of message. Thus, a messaging treatment is compared against a control. Data describing the responses from within the two groups can be collected, characterized, and compared.
Such A/B testing can be conducted so as to provide an anytime-valid confidence sequence (ACS). An ACS provides a statistically usable confidence value for a current result of the test while the test is still running by controlling the type I statistical error (i.e., the rejection of a null hypothesis that is actually true). An analyst can explore the results of the A/B test in real time, continuously monitoring and evaluating whether there is enough data to stop the test. ACS operates in contrast to normal confidence values, which are designed to control type I error only after a certain portion of the test has been completed, for example, at a pre-specified time when a certain sample size for the A/B test is reached.
It is desirable to test multiple treatments (messages) at once, to improve efficiency and reduce the amount of time required to test new messaging treatments. However, a real-time display of changing metrics and confidence values for large numbers of messaging treatments can be cumbersome. Additionally, ACS may not provide accurate confidence values when more than two messaging treatments are tested for comparison at the same time. Firstly, each hypothesis test evolves over time; thus, comparisons may be performed multiple times over time. Secondly, more than two comparisons are being performed across treatment arms; where there is one control arm and k treatment arms, there are k pairwise comparisons with the control. When the null hypothesis is true for every test in a collection of hypothesis tests (i.e., the treatment's effect is zero over time and for all treatment arms), the probability that at least one of the test messages is erroneously found to be a significant improvement over another grows toward one as the number of comparisons increases. In addition to producing potentially inaccurate results, the growing numbers of statistical comparisons and probabilities result in an increasing computational load as a test is run, resulting in high latency in computing and displaying results.
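To make this growth concrete, the following minimal sketch tabulates the chance of at least one false positive. It assumes, purely for illustration, that the k comparisons are independent and each is run once at significance level α, in which case the probability is 1 − (1 − α)^k. The function name is hypothetical.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false positive among k independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    # The uncorrected error rate climbs quickly with the number of arms.
    for k in (1, 3, 10, 50):
        print(f"k={k:>2}  P(at least one false positive)={familywise_error_rate(0.05, k):.3f}")
```

Repeated looks at each comparison over time inflate the rate further, which this simplified calculation does not model.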
Current paths to confidence sequences for messaging treatments are thus error-prone, computationally burdensome, or overly restrictive with respect to simultaneous testing. Embodiments described herein address these issues by providing a process that controls and/or corrects statistical error when multiple messaging treatments are being tested together. An estimated variance of the average treatment effect is computed and initially used to calculate the anytime-valid confidence sequence (ACS). A p-value is determined within confidence bounds for each messaging treatment based on the initially calculated ACS. The p-value can be corrected to control the type I error, using an error correction module to provide, as examples, a Bonferroni correction or a Benjamini-Hochberg procedure.
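The two corrections named above can be sketched as follows. This is a minimal illustration of the standard textbook procedures, not the error correction module itself; the function names are hypothetical, and the input is assumed to be one p-value per non-baseline messaging treatment.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (step-up procedure).

    Each p-value is scaled by m / rank, and monotonicity is enforced by taking
    a running minimum while walking from the largest p-value to the smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending by p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03]  # illustrative values, one per treatment arm
    print("Bonferroni:", bonferroni(raw))
    print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically retains more power when many treatment arms are tested.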
The use of corrected p-values normalized to be within confidence bounds provides a process that produces statistically valid, current confidence values irrespective of the number of messaging treatments being simultaneously tested. The confidence values are updated over time, and current values can be observed at any time during an experiment. The differences between treatments and the confidence values can be displayed in alignment with each other and scrolled as required by the number of messaging treatments being tested, while the values displayed are iteratively updated over time.
For example, an analytics application is executed on a computing system and provides testing and related statistical evaluation in connection with various messaging treatments to determine which one produces the most desirable results in terms of consumer response. The messages can be stored, formatted and sent from a communication server or from the same computing system that is used to execute the analytics application. Once a test is started, the analytics application causes the test messages to be sent. Each test message can be sent to an independent group of recipients over some period of time. The analytics application programmatically evaluates a metric related to message responses over time and determines a difference in the metric for each of several unique messages as compared to a baseline message. The analytics application also determines a confidence value for each of the several messages and can display these dynamically and sequentially over time. The analytics application can also display the current difference value, or “lift,” updating any or all of these values over time while maintaining the accuracy of the values.
In some examples, the analytics application can be configured to display valid current confidence values and current difference values along with test message identifiers in visual alignment on a display device. The display device can be configured by the analytics application to be scrollable while being iteratively updated over time with the sequence of values.
The use of normalized p-values provides sequences of statistically valid, current confidence values irrespective of the number of messaging treatments being tested. An analytics application as described herein can also control type I error by using an error correction module to correct the p-values using statistical procedures. The entire process is computationally lightweight and thus scales easily to large numbers of simultaneous treatment arms, while still providing a low-latency display of accurate, live results. An analytics application may optionally be configured to identify, based on test results, a best message and automatically transmit that best message to an expanded group of recipients.
A “message,” as the term is used herein, can include any electronic communication. For example, a message can be a text message, an email message, a push message, or any message that is sent and received through a typical messaging application or the web. A message may also be an audio message sent through the audio feature of an application or via a telephone call.
A message can also be a web site or a portion of a web site, and testing may involve varying text, color schemes, images, or any other aspect of a web page. In such a case, transmitting the message may involve transmitting different versions of one or more web pages to different web browsers corresponding to various users.
As used herein, the terms “p-value,” “variance,” “confidence,” and their values have the meanings normally understood in the field of statistics. An “error-corrected p-value” is a p-value that has been adjusted to control type I statistical error. A “confidence sequence” is a recorded stream of confidence values stored for reference in a memory device and/or displayed, either sequentially or simultaneously. An anytime-valid confidence sequence (ACS) is a stream of statistically usable confidence values for a live, current result of a test even if the result is an intermediate result. The confidence values in an ACS can be derived any time from early on, when only a few responses have been received, through the conclusion of the test.
The term “treatment” as used herein is a plan for approaching potential customers or users in the normal course of business. A “messaging treatment” as the term is used herein refers to the content of a message to potential customers or users, as well as the manner in which it will be communicated, and optionally additional parameters regarding its application such as timing and duration.
The term “dynamically” as used herein, for example, to describe the display of results, refers to values being provided in such a manner that the values can be changed and updated continuously during a test. The term “selectively,” for example, as used in reference to “selectively displaying,” refers to an operation, such as displaying a certain value, which can take place or not based on a configuration parameter, or a selection made through an input device.
Still referring to
In addition to computing device 101, computing environment 100 includes computing device 146, which in this example is a mobile device receiving message 107a from among messages 107. Computing device 146 is connected to the communication server 106 through network 104. Computing environment 100 also includes computing device 148, which in this example is another mobile device, in this case receiving message 107b from among messages 107. Computing device 148 is also connected to the communication server 106 through network 104. Each of computing device 146 and computing device 148 receives a test message directed to a different independent group of recipients as part of a current test. Either or both of computing device 101 and communication server 106 can be implemented as either real or virtual (e.g., cloud-based) computing devices and can be implemented on any number of computing platforms.
Continuing with
To define a confidence sequence for an individual messaging treatment, let $\hat{\mu}_n$ be the sample mean and $\sigma_n$ the sample standard deviation after n samples have been recorded. Then for any pre-specified constant ρ,
forms a (1-α) confidence sequence (CS) for the true mean, μ. The parameter ρ is a free parameter that is tuned. It has been found that a value of $\rho^2 = 10^{-2.8}$ works well. The confidence sequence bounds for a messaging treatment can be determined from three running totals for that treatment: the number of users, the sum of the metric (for example, lift), and the sum of the squared metric.
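A minimal sketch of computing such bounds from the three running totals follows. The half-width formula used here is an assumption: it is the asymptotic confidence-sequence form known from the statistics literature, $\hat{\sigma}_n \sqrt{\tfrac{2(n\rho^2+1)}{n^2\rho^2}\log\tfrac{\sqrt{n\rho^2+1}}{\alpha}}$, and the exact expression intended above may differ. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

RHO_SQ = 10 ** -2.8  # tuning value quoted in the text


@dataclass
class RunningStats:
    """Sufficient statistics for one treatment: count, sum, and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self) -> float:
        return self.total / self.n

    def std(self) -> float:
        # Sample standard deviation from the running totals; needs n >= 2.
        var = (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)
        return math.sqrt(max(var, 0.0))


def cs_half_width(n: int, std: float, alpha: float = 0.05,
                  rho_sq: float = RHO_SQ) -> float:
    """Half-width of an assumed asymptotic (1 - alpha) confidence sequence."""
    scale = 2.0 * (n * rho_sq + 1.0) / (n * n * rho_sq)
    return std * math.sqrt(scale * math.log(math.sqrt(n * rho_sq + 1.0) / alpha))


if __name__ == "__main__":
    stats = RunningStats()
    for value in [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]:  # illustrative data
        stats.add(value)
    w = cs_half_width(stats.n, stats.std())
    print(f"mean={stats.mean():.3f}  bounds=({stats.mean() - w:.3f}, {stats.mean() + w:.3f})")
```

Because only three totals are kept per treatment, the bounds can be refreshed on every new response without revisiting earlier data. The bounds are wider than a fixed-sample interval at the same n, which is the price of remaining valid at every look.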
At block 212 the computing device displays, dynamically while updating over time, at least the current confidence value and the current difference value. For example, a dynamic display 136 can be output through the interface module 130 to the presentation device 108. In some examples, the displayed output at block 212 of process 200 is scrollable and can be updated while scrolling for optimized viewing and review of the valid confidence sequences for each treatment. These confidence values can be observed at any time during the experiment; the test does not need to be completed for valid confidence values to be displayed. A screenshot of an example display will be discussed below with respect to
For processing the conversions, it can be assumed that some recipients “convert” multiple times. The number of conversions at any given time can be modeled as the Poisson distribution:
with $\lambda_i$ varying. Note that in some contexts, the binarized conversion rate of this model may be of greater interest. The binarized conversion rate $\rho_i$ is the fraction of users that convert, and is related to the average number of conversions $\lambda_i$ by $\rho_i = 1 - e^{-\lambda_i}$.
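The relationship between the two rates can be checked with a short sketch. It assumes the standard Poisson probability mass function, $P(X = k) = \lambda^k e^{-\lambda} / k!$, so that the fraction of users with at least one conversion is one minus the probability of zero conversions. The function names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k conversions when the average is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def binarized_rate(lam: float) -> float:
    """Fraction of users converting at least once: 1 - P(X = 0)."""
    return 1.0 - math.exp(-lam)


if __name__ == "__main__":
    for lam in (0.05, 0.5, 2.0):  # illustrative average conversion counts
        print(f"lambda={lam}: binarized rate={binarized_rate(lam):.4f}, "
              f"1 - pmf(0)={1.0 - poisson_pmf(0, lam):.4f}")
```

For small averages the two rates nearly coincide, since 1 − e^{−λ} ≈ λ when λ is close to zero; they diverge as repeat conversions become common.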
For processing conversions to produce anytime-valid confidence sequences, it can be assumed that the time delay between an assignment event and a conversion event follows an exponential distribution, with one day being the average response time, and that these characteristics do not vary between treatment arms. Thus, if assignment events (messages) are sent uniformly over the course of the campaign, conversion events will be spread out exponentially over time after that.
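Under these assumptions, the share of eventual conversions that has arrived by a given day has a closed form, sketched below. The function names and the 14-day campaign length are hypothetical; only the exponential delay with a one-day mean and the uniform send schedule come from the text.

```python
import math


def delay_cdf(days_since_send: float, mean_delay_days: float = 1.0) -> float:
    """Probability that a conversion has occurred within the given delay."""
    if days_since_send <= 0:
        return 0.0
    return 1.0 - math.exp(-days_since_send / mean_delay_days)


def observed_fraction(t: float, campaign_days: float,
                      mean_delay_days: float = 1.0) -> float:
    """Expected share of all eventual conversions observed by day t.

    Messages are sent uniformly over [0, campaign_days]; the result is the
    delay CDF averaged over the send times that precede t.
    """
    if t <= 0:
        return 0.0
    u = min(t, campaign_days)  # last send time that can contribute by day t
    tau = mean_delay_days
    integral = u - tau * (math.exp(-(t - u) / tau) - math.exp(-t / tau))
    return integral / campaign_days


if __name__ == "__main__":
    for day in (1, 7, 14, 21):
        print(f"day {day:>2}: {observed_fraction(day, campaign_days=14):.3f}")
```

The calculation illustrates why intermediate results lag the true totals: conversions from the most recently sent messages have not yet had time to arrive.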
The calculation of p-values and confidence values in some examples can be accomplished in four operations. Firstly, storage is allocated for computations with respect to a baseline treatment. Thus, if a different unique test message is sent to each of four groups of recipients for four messaging treatments, there will be three p-values at any given time, since these are calculated with respect to a treatment that is a baseline treatment. For example, the baseline treatment may be selected as the treatment already in use. Secondly, a 95% confidence sequence is computed for each treatment effect. An inverse propensity weighted estimator can be used for this computation. Thirdly, confidence bounds are used to estimate a sampling distribution for each treatment effect to restrict and normalize the variance and p-values. Finally, each p-value is computed based on the probability of an observation being more extreme than the actual observed difference in means.
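The first two operations can be illustrated with a minimal sketch of an inverse propensity weighted estimate of each treatment effect relative to the baseline. It assumes each recipient is assigned to an arm with a known probability (the propensity); the arm labels, data, and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def ipw_arm_means(observations: List[Tuple[str, float]],
                  propensities: Dict[str, float]) -> Dict[str, float]:
    """Inverse propensity weighted mean outcome per arm.

    Each outcome is divided by the probability of assignment to its arm,
    and the weighted total is averaged over ALL recipients, not just the arm.
    """
    n_total = len(observations)
    totals = {arm: 0.0 for arm in propensities}
    for arm, outcome in observations:
        totals[arm] += outcome / propensities[arm]
    return {arm: total / n_total for arm, total in totals.items()}


def ipw_effects(observations: List[Tuple[str, float]],
                propensities: Dict[str, float],
                baseline: str) -> Dict[str, float]:
    """Estimated effect of every non-baseline arm relative to the baseline."""
    means = ipw_arm_means(observations, propensities)
    return {arm: m - means[baseline] for arm, m in means.items() if arm != baseline}


if __name__ == "__main__":
    props = {"baseline": 0.25, "A": 0.25, "B": 0.25, "C": 0.25}
    data = [("baseline", 0.0), ("A", 1.0), ("B", 0.0), ("C", 1.0),
            ("baseline", 1.0), ("A", 1.0), ("B", 0.0), ("C", 0.0)]
    # Four arms yield three effects, hence three p-values.
    print(ipw_effects(data, props, baseline="baseline"))
```

With equal assignment probabilities the estimate reduces to a simple difference in arm means; the weighting matters when arms receive unequal shares of traffic.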
At block 802 of process 800, messaging treatments are generated and stored. These messaging treatments may be stored in the computing device running the analytics application or in the communication server. They may be generated based on input received from input device 140. At block 804, assuming these messaging treatments include messages to be sent as email, push messages, SMS, or similar techniques, processing proceeds to block 806 where the computing device transmits test messages to groups of recipients in a manner like that described with respect to block 202 of
Staying with
At block 818 of
At block 820 in
To calculate p-values, and in turn, confidence values, a confidence sequence for the difference in the metric values for the baseline messaging treatment and each individual treatment can be determined. The confidence sequence can then be “inverted” to find a p-value, using a normal distribution as the sampling distribution of the mean difference. Consider two treatments with treatment IDs 0 and 1, and $N = N_0 + N_1$ total visitors across the two treatments. In terms of the sample means $\hat{\mu}_0$ and $\hat{\mu}_1$ and standard deviations $\hat{\sigma}_0$ and $\hat{\sigma}_1$, the confidence sequence for the difference is given by:
To find a p-value, the confidence bounds given by a confidence sequence can be interpreted as the normalizing factor for the test statistic of interest. For a regular hypothesis test for the difference in means, the test statistic is defined as:
where $\hat{\sigma}_p$ is the pooled sample standard deviation. For large enough sample sizes, the t-test and the z-test are equivalent, and the p-value can be defined in terms of the cumulative distribution of the normal:
Then, a 1-α confidence interval for the difference in means can be given by:
where
is a value for the standard normal. As an example, for α=0.05, z≈1.96. To derive an equivalent “always valid” p-value, the (1-α) confidence sequence is used as:
where $I'_N$ is the long expression for $CS_{1-\alpha}$ just given. An analogous relationship can be created between the confidence bounds and the test statistic as shown below. The denominator is the anytime-valid variance:
The p-value is then given by:
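A minimal sketch of this inversion follows, under stated assumptions: the p-value is two-sided, the pooled standard deviation uses the usual degrees-of-freedom weighting, and the half-width of the (1-α) confidence sequence for the difference is supplied by the caller rather than derived here. Dividing that half-width by the standard normal critical value gives the anytime-valid standard error used as the denominator. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def fixed_sample_p_value(mean0: float, mean1: float, sd0: float, sd1: float,
                         n0: int, n1: int) -> float:
    """Classical two-sided z-test p-value for a difference in means."""
    pooled_var = ((n0 - 1) * sd0 ** 2 + (n1 - 1) * sd1 ** 2) / (n0 + n1 - 2)
    std_err = math.sqrt(pooled_var) * math.sqrt(1.0 / n0 + 1.0 / n1)
    z = (mean1 - mean0) / std_err
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def always_valid_p_value(diff: float, cs_half_width: float,
                         alpha: float = 0.05) -> float:
    """Invert a (1 - alpha) confidence sequence for the difference into a p-value.

    The half-width plays the role of z_crit * standard error, so the
    anytime-valid standard error is cs_half_width / z_crit.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    anytime_std_err = cs_half_width / z_crit
    z = diff / anytime_std_err
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    # Illustrative numbers only.
    print("fixed-sample p:", fixed_sample_p_value(0.10, 0.12, 0.30, 0.32, 5000, 5000))
    print("always-valid p:", always_valid_p_value(diff=0.02, cs_half_width=0.03))
```

By construction, the resulting p-value falls below α exactly when the confidence sequence for the difference excludes zero, so the displayed p-value and the displayed bounds never disagree.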
Continuing with
In this example, at least the current confidence value and the current difference value are displayed, and the values are updated so that confidence values are sequentially displayed until the end of the experiment. In some examples, a display window or GUI can be scrolled to view messaging treatments and the values can continue to be updated. In some examples, the analytics application can be configured to display valid current confidence values and difference values along with test message identifiers in visual alignment on a display device such as presentation device 108. The display device can be configured by the analytics application to be scrollable while being iteratively updated over time with the sequence of values.
In some embodiments, at any time up to and including the conclusion of the test, the best test message can be selected and deployed as appropriate, for example, through input to computing device 101 through input device 140. In other embodiments, this selection and deployment can take place programmatically. For example, at block 828 of process 800 in
Staying with
Still referring to
The system 1000 of
Staying with
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “generating,” “assaying,” “processing,” “computing,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more implementations of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “configured to” or “based on” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. The endpoints of comparative limits are intended to encompass the notion of equality. Thus, expressions such as “more than” should be interpreted to mean “more than or equal to.”
Where devices, systems, components or modules are described as being configured to perform certain operations or functions, such configuration can be accomplished, for example, by designing electronic circuits to perform the operation, by programming programmable electronic circuits (such as microprocessors) to perform the operation such as by executing computer instructions or code, or processors or cores programmed to execute code or instructions stored on a non-transitory memory medium, or any combination thereof. Processes can communicate using a variety of techniques including but not limited to conventional techniques for inter-process communications, and different pairs of processes may use different techniques, or the same pair of processes may use different techniques at different times. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.