There is a growing wave towards personalized decision-making, where the aim is to select the optimal intervention for each unit from a collection of interventions. In policy evaluation, for instance, one may want to design a governmental policy (intervention) that is particularly suited to the socio-economic realities of a geographic location (unit). The key challenge in doing so, and indeed the fundamental problem of causal inference, is that one often only gets to observe a unit undergo a single intervention or stay under control (i.e., no intervention). This is true not only in observational studies such as policy evaluation, but also in experimental settings such as clinical trials or A/B testing in e-commerce.
Ideally, one would like to infer the counterfactual outcome of each unit under any intervention. The sub-problem of estimating what would have happened to a “treated” unit (i.e., one that undergoes an intervention) under control has a rich literature within econometrics and beyond. A prominent framework to do so within the panel data setting, where one gets repeated measurements of a unit across time, is synthetic controls (SC). At its core, SC builds a synthetic model of a treated unit as a weighted combination of control units to estimate this counterfactual outcome of interest. As an example, consider a canonical case study within the SC literature that evaluates the impact of tobacco legislations (interventions) on tobacco consumption within states in the United States (units). In particular, to assess the effect of California's Proposition 99, a large-scale tobacco control program that included raising taxes on cigarettes by 25 cents and other non-fiscal measures, the authors ask the question: “what would have happened to California in the absence of any tobacco control legislation (control)?” By expressing California as a weighted combination of control states (e.g., Colorado, Nevada, etc.), the authors find that there was a marked fall in tobacco consumption in California following Proposition 99, relative to the constructed “synthetic California”. Given its broad applicability, the SC framework has been methodologically extended, analyzed, and applied to answer similar questions in numerous, diverse settings. Notably, SC has even been regarded as “one of the most important innovations in the policy evaluation literature in the last 15 years.”
However, towards the broader goal of personalized decision-making, one needs to answer counterfactual questions beyond what would have happened under control. Continuing with the example above, these may include “what would have happened to Colorado (a control state) had it implemented a program similar to Proposition 99?” or “what would have happened to California had it instead raised cigarette taxes by 50 cents or more as in New York?”. In essence, this boils down to answering what would have happened to a unit under any intervention, rather than just under control. Indeed, extending SC to overcome this challenge has been posed as an open question. Embodiments of the present disclosure provide a meaningful answer to this question.
Embodiments of the present disclosure extend the SC framework to the multiple intervention setting, where unit-specific potential outcomes are estimated under any intervention averaged over the post-intervention period. In doing so, they can provide novel identification and inference results for SC. The synthetic interventions (SI) framework disclosed herein also presents a new perspective on learning with distribution shifts, an active area of research in econometrics and machine learning.
The structures and techniques disclosed herein find application in many different technical fields. For example, one application is in the context of healthcare and the life sciences, where it is feasible to perform multiple experiments on the same unit (e.g., cell type or patient subgroup), but the number of experiments is severely constrained by financial and ethical considerations. Thus, embodiments of the present disclosure can be used to generate synthetic interventions data for use in, for example, making treatment decisions. Another example includes “physical” experiments, such as determining what works better for different stores within a chain of retail stores. Another example includes testing different user features in online computer systems such as e-commerce systems. As such, disclosed embodiments can find practical application in the improvement of computer systems by facilitating, for example, A/B testing of such systems to identify not only an optimal set of user features to provide within the system, but also a set of features that can improve the performance and efficiency of such systems (e.g., by identifying those features which reduce user interactions and thereby reduce computer network and processor usage).
According to one aspect of the disclosure, a method implemented on a computing device for generating synthetic data for a target unit had the target unit undergone a subject intervention can include: identifying, from first and second data, interventions common to the target unit and one or more of a plurality of donor units as filtered donor units, the first data corresponding to the target unit under one or more interventions, the second data corresponding to the plurality of donor units each under one or more interventions; identifying, from the first data, third data corresponding to the target unit under the common interventions; identifying, from the second data, fourth data corresponding to the filtered donor units under the common interventions; identifying, from the second data, fifth data corresponding to the filtered donor units under the subject intervention; generating, from the third and fourth data, a learned model representing a relationship between the target unit and the filtered donor units; applying the learned model to the fifth data to generate the synthetic data; and outputting, by the computing device, the synthetic data.
In some embodiments, the method can further include retrieving the first and second data from a database. In some embodiments, identifying the interventions common to the target unit and one or more of a plurality of donor units can include filtering the first and second data to identify observations associated with common interventions between the target unit and the plurality of donor units that maximizes the minimum between a number of satisfactory observations and a number of the filtered donor units. In some embodiments, generating the learned model can include performing principal component regression (PCR) between the third and fourth data to generate a learned model that defines the unique minimum-norm linear relationship between the target unit and the filtered donor units.
In some embodiments, outputting the synthetic data can include storing the synthetic data in a database. In some embodiments, outputting the synthetic data can include transmitting the synthetic data to another computing device. In some embodiments, the another computing device may be part of a system configured to automate one or more decision-making processes using the synthetic data.
According to another aspect of the disclosure, a system for generating synthetic data for a target unit had the target unit undergone a subject intervention can include: a database configured to store first data corresponding to the target unit under one or more interventions and second data corresponding to a plurality of donor units each under one or more interventions; and a computing device comprising instructions. When executed, the instructions can cause the computing device to execute a process that corresponds to any embodiments of the aforementioned method.
The manner of making and using the disclosed subject matter may be appreciated by reference to the detailed description in connection with the drawings, in which like reference numerals identify like elements.
The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.
Consider a panel data setting with observations of N units over T time periods. Each unit undergoes one of D interventions at time period T0, with 1≤T0<T, prior to which all units are under control. Presented herein is synthetic interventions (SI), a framework to estimate counterfactual outcomes of each unit under each of the D interventions, averaged over the post-intervention time periods. The present disclosure proves identification of this causal parameter under a latent factor model across time, units, and interventions. The present disclosure furnishes an estimator for this causal parameter and establishes its consistency and asymptotic normality. In doing so, novel identification and inference results for the synthetic controls (SC) literature are provided. The present disclosure introduces a hypothesis test to validate when to use SI (and thereby SC). Through simulations and an empirical case-study, the efficacy of the SI framework is demonstrated.
Turning to
First database (or “observations database”) 104 can store observation data associated with one or more units (e.g., geographic areas) having undergone one or more interventions/treatments. That is, observations database 104 can be configured to store, for one or more units, data associated with all interventions that unit went through. The observation data may be organized such that database 104 can be queried for (a) all observations associated with a particular unit (e.g., a so-called “target unit”); and (b) all observations associated with any units (e.g., so-called “donor units”) that underwent a particular intervention. Second database (or “SI database”) 106 can store SI data generated by the SI module 102. Third database (or “intermediate database”) 116 can store intermediate or temporary data prepared by SI module 102 and used in the process of generating SI data. In the example shown, intermediate database 116 includes a first intermediate table 116a, a second intermediate table 116b, and a third intermediate table 116c.
SI module 102 and/or a given submodule 108, 110, 112, 114 can include electronic circuitry that performs the functions and operations described herein in conjunction therewith. The functions and operations can be hard coded into the electronic circuit or soft coded by way of instructions held in a memory device. The functions and operations can be performed using digital values or using analog signals. In some embodiments, a SI module 102 and/or a submodule 108, 110, 112, 114 can be embodied in an application specific integrated circuit (ASIC), which can be an analog ASIC or a digital ASIC, in a microprocessor with associated program memory and/or in a discrete electronic circuit, which can be analog or digital. In some embodiments, SI module 102 may be implemented as computer program instructions to perform functions and operations described herein. System 100 can include a memory (not shown) to store said computer program instructions and a computer processor (also not shown) to execute said computer program instructions.
A brief overview of the operation of system 100 is provided next. A more complete and detailed description of structures and techniques that can be implemented within system 100 for generating SI data is provided further below.
System 100 can generate synthetic data for a particular unit (“target unit”) had that unit undergone a particular intervention (“subject intervention”) using one or more of the following steps. Observations database 104 can be prepared to store, for each of one or more units, data associated with one or more interventions that unit went through. These one or more units can include, for example, the target unit and various other units.
Data preparation submodule 108 can query observations database 104 for observations associated with the target unit and also for observations associated with other units that underwent the subject intervention, referred to herein as “donor units.” Data preparation submodule 108 can filter the data so as to identify the collection of observations associated with common interventions between the target unit and the relevant donor units. For example, data preparation submodule 108 can filter the data so as to identify the collection of observations associated with common interventions between the target unit and the relevant donor units that maximizes the minimum between (i) the number of satisfactory observations and (ii) the number of filtered donor units. As used herein, the term “satisfactory observation” refers to an observation of a donor unit at a point along a given dimension where a measurement/observation also exists for the target unit at the same point along the same dimension. As one example, assuming observations are taken over the dimension of time (i.e., measured over time), if a donor unit has observations taken at times t0 to t100 and the target unit has observations taken at times t0 to t99, then only the donor unit's observations between times t0 and t99 are said to be satisfactory for the purpose of identifying observations associated with common interventions between the target unit and the relevant donor units. Next, data preparation submodule 108 can populate first intermediate table 116a to include data pertaining to the common interventions associated with the target unit and filtered donor units from the previous step. Data preparation submodule 108 can also populate second intermediate table 116b to include data pertaining to the interventions associated with the filtered donor units under the subject intervention. In some embodiments, data preparation submodule 108 can create intermediate tables 116a, 116b prior to populating them. In other embodiments, intermediate tables 116a, 116b may be preexisting.
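One possible reading of the max-min filtering step can be sketched as follows. This is a greedy heuristic offered for illustration only: it is not necessarily the procedure used by data preparation submodule 108, and it is not guaranteed to find the optimum. The function name `filter_donors`, the dictionary layout, and the toy unit labels are all assumptions.

```python
def filter_donors(target_points, donor_points):
    """Greedy sketch: choose donors so that
    min(#satisfactory observations, #filtered donors) is large.

    target_points: set of observation points (e.g., time indices) of the target unit.
    donor_points:  dict mapping a donor id to that donor's set of observation points.
    Returns (score, chosen donor ids, common observation points).
    """
    # Visit donors in order of decreasing overlap with the target unit.
    order = sorted(donor_points, key=lambda j: -len(target_points & donor_points[j]))
    common = set(target_points)
    chosen = []
    best = (0, [], set())
    for j in order:
        # Points observed for the target and for every donor chosen so far.
        common = common & donor_points[j]
        chosen.append(j)
        score = min(len(common), len(chosen))  # the max-min objective
        if score > best[0]:
            best = (score, list(chosen), set(common))
    return best

# Toy data (hypothetical): the target is observed at times 0..9.
target = set(range(10))
donors = {
    "A": set(range(10)),
    "B": set(range(9)),
    "C": set(range(8)),
    "D": set(range(5, 10)),
    "E": set(range(3)),
}
score, kept, common = filter_donors(target, donors)
print(score, kept, sorted(common))
```

In this toy case, adding donor "D" would shrink the shared observation points to three without raising the objective, so the sketch stops at three donors sharing eight points.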
Data validation submodule 110 can validate the data prepared by data preparation submodule 108 using one or more diagnostic tests. For example, data validation submodule 110 may create/populate a third intermediate table 116c that concatenates data prepared by data preparation submodule 108 (e.g., data within the first and second intermediate tables 116a, 116b) in a column-wise manner and then perform a singular value decomposition of the concatenated data and inspect its spectral profile. If the data in the third intermediate table 116c does not exhibit low-dimensional structure, then it may need to be pre-processed, e.g., by applying an autoencoder to identify a new low-dimensional representation of the data in the third intermediate table 116c. In some embodiments, data validation submodule 110 may perform a subspace inclusion hypothesis test on data prepared by data preparation submodule 108. If the hypothesis test passes, then data validation submodule 110 may determine whether or not accurate synthetic data can be generated for the target unit under the subject intervention. If it is determined that accurate synthetic data can be generated, SI module 102 may proceed to generate and output synthetic data as described next. Otherwise, SI module 102 may output an indication (e.g., an error signal or message) that accurate synthetic data cannot be generated based on the data available in observations database 104.
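The spectral-profile diagnostic described above can be sketched as follows. The sketch assumes that rows index donor units and columns index measurements, so that column-wise concatenation is `np.hstack`; the function name `spectral_profile`, the 99% energy cutoff, and the simulated tables are illustrative assumptions rather than part of the disclosure.

```python
import numpy as np

def spectral_profile(table_a, table_b, energy=0.99):
    """Concatenate two tables column-wise and summarize the singular value spectrum.

    Returns the singular values, the cumulative share of squared singular
    values, and the smallest number of components reaching `energy`.
    """
    combined = np.hstack([table_a, table_b])       # stands in for table 116c
    s = np.linalg.svd(combined, compute_uv=False)  # singular values, descending
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy) + 1)
    return s, cum, k

# Simulated low-rank data plus small noise (hypothetical).
rng = np.random.default_rng(0)
units, rank = 30, 3
loadings = rng.normal(size=(units, rank))
table_a = loadings @ rng.normal(size=(rank, 40)) + 0.01 * rng.normal(size=(units, 40))
table_b = loadings @ rng.normal(size=(rank, 20)) + 0.01 * rng.normal(size=(units, 20))

s, cum, k = spectral_profile(table_a, table_b)
print("components needed:", k)
print("leading singular values:", s[: rank + 2].round(2))
```

A sharp drop after the first few singular values suggests low-dimensional structure; a slowly decaying spectrum suggests the pre-processing mentioned above may be needed.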
Synthetic data generation submodule 112 may generate SI data by choosing a pre-defined training error tolerance ε and then following the synthetic data generation procedure described below in Sections 3 and 9. Briefly, synthetic data generation submodule 112 can generate a learned model using data in the first intermediate table 116a. The learned model may represent a relationship (e.g., a unique minimum-norm linear relationship) between the target unit and the filtered donor units. Various types of machine learning (ML) techniques can be used to generate the learned model as described below. Next, data generation submodule 112 can apply the learned model to the data in the second intermediate table 116b to generate SI data corresponding to an outcome of the target unit under the subject intervention. The generated SI data can be stored in SI database 106. In some embodiments, generated SI data may be transmitted, exported, or otherwise provided to another system (e.g., a remote system) where it can be used to assist with, or automate, various decision-making processes. For example, system 100 can transmit generated SI data to one or more online commerce platforms where the data can be used to decide which types of promotions should be targeted to which types of customers. As another example, system 100 can transmit generated SI data to a personalized medicine system/platform where the data can be used to decide which among different drug therapies is better for a given patient or group of patients. In some embodiments, observations database 104 may be populated from the same other system(s) such that both the input and output of system 100 are connected to the other system(s). In some embodiments, system 100 may include an application programming interface (API) via which the other system(s) can send observations data to system 100 and/or retrieve generated SI data from system 100.
Accuracy diagnostics submodule 114, which may be omitted in some embodiments, can determine, generally, the accuracy of the steps performed by submodules 108-112 and/or, in particular, the accuracy of the SI data generated by submodule 112 in conjunction with submodules 108, 110. In some embodiments, accuracy diagnostics submodule 114 can perform “cross-validation” to investigate whether the steps described above are successful in recreating one or more observed datasets from observations database 104. More formally, each donor unit may be iteratively assigned to be the target unit, and the remaining donor units may then be used to form the donor group for that particular iteration. In this case, the temporary target unit's outcomes under the subject intervention are actually observed (i.e., one has access to the “synthetic” data one is trying to reproduce). The same procedure described above can then be carried out with the extra validation of measuring the prediction error between the generated SI data (i.e., the data generated by applying the learned model to the second intermediate table 116b populated using the temporary donor group) and the observations associated with the temporary target unit under the subject intervention.
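The leave-one-out style of cross-validation described above can be sketched as follows. The sketch assumes that rows index time and columns index donor units, and it uses principal component regression as the learned model; the function names `pcr_weights` and `leave_one_out_errors`, the error metric (absolute difference of post-intervention averages), and the simulated data are illustrative assumptions.

```python
import numpy as np

def pcr_weights(y_pre_target, Y_pre_donors, k):
    """Principal component regression weights using the top-k singular components."""
    U, s, Vt = np.linalg.svd(Y_pre_donors, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ y_pre_target) / s[:k])

def leave_one_out_errors(Y_pre, Y_post, k):
    """Treat each donor in turn as a temporary target and compare the predicted
    and observed post-intervention averages."""
    n_units = Y_pre.shape[1]
    errors = np.empty(n_units)
    for i in range(n_units):
        others = np.delete(np.arange(n_units), i)       # temporary donor group
        w = pcr_weights(Y_pre[:, i], Y_pre[:, others], k)
        predicted = np.mean(Y_post[:, others] @ w)      # generated SI value
        observed = np.mean(Y_post[:, i])                # value actually observed
        errors[i] = abs(predicted - observed)
    return errors

# Simulated rank-2 panel with small noise (hypothetical).
rng = np.random.default_rng(0)
T0, T1, n_units, r = 30, 10, 20, 2
V = rng.normal(size=(n_units, r))
U_pre, U_post = rng.normal(size=(T0, r)), rng.normal(size=(T1, r))
sigma = 0.05
Y_pre = U_pre @ V.T + sigma * rng.normal(size=(T0, n_units))
Y_post = U_post @ V.T + sigma * rng.normal(size=(T1, n_units))

errors = leave_one_out_errors(Y_pre, Y_post, k=r)
print(errors.round(3))
```

Large errors for many temporary targets would indicate that the pipeline is not reproducing observed data well for the chosen donor group or number of components.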
A formal description of SI structures and techniques is now provided. Such structures and techniques can be implemented, for example, within the system 100 of
Some standard notation used herein is now discussed. For a matrix A∈ℝ^{a×b}, denote its transpose as A′∈ℝ^{b×a}. Denote the operator (spectral) and Frobenius norms of A as ∥A∥op and ∥A∥F, respectively. The columnspace (or range) of A is the span of its columns, which is denoted as ℛ(A)={v∈ℝ^a : v=Ax, x∈ℝ^b}. The rowspace of A, given by ℛ(A′), is the span of its rows. Recall that the nullspace of A is the set of vectors that are mapped to zero under A. For any vector v∈ℝ^a, let ∥v∥p denote its ℓp-norm. Further, define the inner product between vectors v, x∈ℝ^a as $\langle v, x\rangle = \sum_{\ell=1}^{a} v_\ell x_\ell$. If v is a random variable, define its sub-Gaussian (Orlicz) norm as ∥v∥ψ2. Let [a]={1, . . . , a} for any integer a.
Let f and g be two functions defined on the same space. In this disclosure, it is said that f(n)=O(g(n)) if and only if there exist a positive real number M and a real number n0 such that for all n≥n0, |f(n)|≤M|g(n)|. Analogously, one can say: f(n)=Θ(g(n)) if and only if there exist positive real numbers m, M and a real number n0 such that for all n≥n0, m|g(n)|≤|f(n)|≤M|g(n)|; f(n)=o(g(n)) if for any m>0, there exists n0 such that for all n≥n0, |f(n)|≤m|g(n)|.
The following formal description adopts the standard notations and definitions for stochastic convergences. As such, denote $\xrightarrow{d}$ and $\xrightarrow{p}$ as convergences in distribution and probability, respectively. This disclosure also makes use of Op and op, which are probabilistic versions of the commonly used deterministic O and o notations. More formally, for any sequence of random vectors Xn, here it is said Xn=Op(an) if for every ε>0, there exist constants Cε and nε such that ℙ(∥Xn∥2>Cεan)<ε for every n≥nε; equivalently, here it is said that (1/an)Xn is “uniformly tight” or “bounded in probability”. Similarly, Xn=op(an) if for all ε, ε′>0, there exists nε such that ℙ(∥Xn∥2>ε′an)<ε for every n≥nε. Therefore, $X_n = o_p(1) \iff X_n \xrightarrow{p} 0$.
Additionally, this disclosure uses the “plim” probability limit operator: $\operatorname{plim} X_n = a \iff X_n \xrightarrow{p} a$.
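Here it is said that a sequence of events ℰn, indexed by n, holds “with high probability” (w.h.p.) if ℙ(ℰn)→1 as n→∞, i.e., for any ε>0, there exists an nε such that for all n>nε, ℙ(ℰn)>1−ε. More generally, a multi-indexed sequence of events $\mathcal{E}_{n_1,\dots,n_d}$, with indices $n_1,\dots,n_d$ and $d\ge 1$, is said to hold w.h.p. if $\mathbb{P}(\mathcal{E}_{n_1,\dots,n_d})\to 1$ as $\min\{n_1,\dots,n_d\}\to\infty$.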
Consider a panel data setting with N≥1 units, T>1 time periods, and D≥1 interventions (or treatments). Throughout the following discussion, units are indexed with n∈[N]≔{1, . . . , N}, time is indexed with t∈[T], and interventions are indexed with d∈{0, . . . , D−1}. The causal framework of Neyman, J. (1923), “Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes,” and Rubin, D. B. (1974), “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, 66:688-701, can be followed, where the random variable (r.v.) Ytn(d)∈ℝ denotes the potential outcome of unit n at time t under intervention d. Denote d=0 as control, i.e., Ytn(0) is the potential outcome for unit n at time t if no intervention occurs.
Regarding pre- and post-intervention observations, consider a data setup similar to that considered in the SC literature. In particular, consider T0 with 1≤T0<T as the intervention point, i.e., prior to T0, all N units are under control, and after T0, each unit receives exactly one of the D interventions (including control); define T1=T−T0. This partitions the time horizon into a pre- and post-intervention period.
Let the r.v. D(n)∈{0, . . . , D−1} denote the intervention assignment of unit n, and let the r.v. 𝒟={D(1), . . . , D(N)} denote the collection of intervention assignments across units. Group the units by the intervention they receive during the post-intervention period, i.e., the r.v. ℐ(d)={n : D(n)=d for t>T0} denotes the subgroup of units that receive intervention d, and Nd=|ℐ(d)| denotes its size. For all d, observe that both ℐ(d) and Nd are deterministic conditioned on 𝒟. The observation of unit n at time t, denoted as Ytn, obeys the following distributional assumption.
Assumption 1 (SUTVA): For each unit n, define Ytn=Ytn(0) for t≤T0, and Ytn=Ytn(d) for t>T0 if D(n)=d.
A goal of the present disclosure is to estimate the causal parameter θn(d), which represents unit n's potential outcomes under intervention d averaged over the post-intervention period. θn(d) is formally defined in (3), given below.
A challenge in estimating θn(d) vs. θn(0) is the difference in data availability. Identification of θn(d) across all (n,d) complements the existing SC literature, which focuses on estimating {θn(0) : n∉ℐ(0)}, i.e., the counterfactual average potential outcome under control for a treated unit, otherwise referred to as the treatment effect on the treated. In contrast, within the SI framework, even if unit n remains under control for t>T0, it is of interest to produce its counterfactuals under any intervention d≠0. A question that arises is whether θn(d) across (n,d) can be estimated by simply fitting N×D separate SC estimators, one for each pair. This is not possible since pre-intervention outcomes for each unit are only observed under control; refer to the data setup summarized in Assumption 1 and
Next, the SI causal framework is presented, the causal parameter of interest is formally introduced, and an identification result for this parameter is established.
Certain causal assumptions made herein are stated below. One structural assumption is that the potential outcomes follow a latent factor model.
Assumption 2 (Latent factor model): For all (t, n, d),
$$Y_{tn}^{(d)} = \langle u_t^{(d)}, v_n \rangle + \varepsilon_{tn}. \tag{1}$$
Here, the r.v. ut(d)∈ℝ^r is a latent factor specific to a particular time t and intervention d; the r.v. vn∈ℝ^r is a latent factor specific to unit n (i.e., factor loading); r≥1 is the dimension of the latent space; and εtn∈ℝ is the idiosyncratic shock for unit n at time t.
Of note, latent factors {ut(d),vn} are unobserved. Further, vn is specific to unit n, i.e., each unit can have a different latent factor. An analogous statement holds for ut(d) with respect to time t and intervention d.
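To make the data setup concrete, the following sketch simulates potential outcomes from the latent factor model in (1) and then forms observations according to Assumption 1. Gaussian factors and shocks, uniformly random assignments, the array layout, and all variable names are assumptions chosen for illustration; the disclosure does not prescribe these distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, T0, D, r = 12, 20, 14, 3, 2

v = rng.normal(size=(N, r))            # unit factors v_n
u = rng.normal(size=(T, D, r))         # time-intervention factors u_t^(d)
eps = 0.1 * rng.normal(size=(T, N))    # idiosyncratic shocks eps_tn

# Potential outcomes Y[t, n, d] = <u_t^(d), v_n> + eps_tn, as in (1).
signal = np.einsum("tdr,nr->tnd", u, v)
Y_potential = signal + eps[:, :, None]

# Intervention assignment D(n) for the post-intervention period.
assign = rng.integers(0, D, size=N)

# Observations per Assumption 1: control (d=0) for the first T0 periods,
# then the assigned intervention. Array index t=0 corresponds to period 1.
t_idx = np.arange(T)[:, None]
n_idx = np.arange(N)[None, :]
post = Y_potential[t_idx, n_idx, assign[None, :]]   # shape (T, N)
Y_obs = np.where(t_idx < T0, Y_potential[:, :, 0], post)

print(Y_obs.shape, assign)
```

Only `Y_obs` and `assign` would be available in practice; `u`, `v`, and the unobserved slices of `Y_potential` are what the SI framework aims to reason about.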
Assumption 3 (Linear span inclusion): Given unit-intervention pair (n,d) and conditioned on 𝒟, vn∈span({vj : j∈ℐ(d)}), i.e., there exists w(n,d)∈ℝ^{Nd} such that $v_n = \sum_{j\in\mathcal{I}^{(d)}} w_j^{(n,d)} v_j$.
It can be seen that Assumption 2 “implies” Assumption 3. Given Assumption 2, Assumption 3 may be seen as rather weak. Consider the matrix of unit latent factors [vi : i∈ℐ(d)∪{n}]∈ℝ^{(Nd+1)×r}.
Assumption 4 (Conditional mean independence): For all (t,n), conditioned on a particular sampling of the latent time-intervention and unit factors, let εtn be mean independent of 𝒟. Equivalently, 𝔼[εtn|ℒℱ]=𝔼[εtn|ℰ]=0, where
$$\mathcal{LF} := \{u_t^{(d)}, v_n : t\in[T],\ n\in[N],\ d\in\{0,\dots,D-1\}\} \quad\text{and}\quad \mathcal{E} := \{\mathcal{LF}, \mathcal{D}\}. \tag{2}$$
As an example, consider the following conditions: (i) 𝔼[εtn|ℒℱ]=0; (ii) εtn ⊥⊥ 𝒟 | ℒℱ. Together, they imply Assumption 4, i.e., Assumption 4 is weaker. These two conditions on εtn are discussed herein as they may be more interpretable than Assumption 4. Since (i) is self-explanatory, the focus here is on (ii) and on comparing it with a standard analogous assumption in the literature known as “selection on observables”. Conditioned on observable covariates corresponding to time, units, and/or interventions, “selection on observables” assumes that the potential outcomes are independent of the treatment assignments. Many identification arguments crucially rely on this conditional independence. Here, given Assumption 2 and conditioned on the latent factors ℒℱ, the only remaining randomness in the potential outcome Ytn(d) is due to εtn. Hence, (ii) implies that conditioned on the latent factors, the potential outcomes are independent of 𝒟. As such, these latent factors can be thought of as “unobserved covariates” and (ii) can be interpreted as “selection on latent factors”.
In summary, Assumptions 1 to 4 suggest that the observations are generated as follows:
Below, the target causal parameter θn(d) is formally defined, and the key identification result in Theorem 2.1 is stated.
For each (n,d), one may be interested in identifying and estimating
$$\theta_n^{(d)} := \frac{1}{T_1}\sum_{t>T_0} \mathbb{E}\big[Y_{tn}^{(d)} \,\big|\, \{u_t^{(d)}, v_n : t>T_0\}\big], \tag{3}$$
i.e., unit n's expected potential outcomes under intervention d averaged over the post-intervention period. The expectation in (3) is taken over εtn for t>T0, and is conditioned on unit n's specific latent factor vn and the time-intervention latent factors {ut(d) : t>T0}.
Regarding identification, it can be established that for each (n,d) pair, given knowledge of w(n,d), it is feasible to identify the causal parameter θn(d) under Assumptions 1 to 4. Moreover, a stronger result can be proved: the identification of 𝔼[Ytn(d)|ut(d), vn] for any t.
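Theorem 2.1: For a given unit-intervention pair (n,d), let Assumptions 1 to 4 hold. Then, given knowledge of w(n,d), one has
$$\mathbb{E}\big[Y_{tn}^{(d)} \,\big|\, u_t^{(d)}, v_n\big] = \sum_{j\in\mathcal{I}^{(d)}} w_j^{(n,d)}\, \mathbb{E}[Y_{tj} \mid \mathcal{E}] \quad \text{for all } t>T_0, \tag{4}$$
$$\theta_n^{(d)} = \frac{1}{T_1}\sum_{t>T_0}\sum_{j\in\mathcal{I}^{(d)}} w_j^{(n,d)}\, \mathbb{E}[Y_{tj} \mid \mathcal{E}]. \tag{5}$$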
As a consequence of Theorem 2.1, one has $\mathbb{E}[Y_{tn}^{(d)}] = \mathbb{E}\big[\sum_{j\in\mathcal{I}^{(d)}} w_j^{(n,d)} Y_{tj}\big]$, where the expectation is now taken over εtn and ℰ, i.e., no longer conditioned on the latent factors and the treatment assignment. This relation follows from (3) and the tower law for expectations, i.e., for any two r.v.'s A and B, one has 𝔼[A]=𝔼[𝔼[A|B]].
Theorem 2.1 allows one to express θn(d) in terms of a linear combination of observed quantities {Ytj : t>T0, j∈ℐ(d)} in expectation. The coefficients associated with this linear combination are denoted by the vector w(n,d). This is the key quantity that needs to be estimated, which in turn, allows θn(d) to be estimated. One such estimator is provided below; without loss of generality, a particular (n,d) pair is considered.
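The reasoning behind this linear representation can be traced in a few lines. The following is a sketch of the argument under Assumptions 1 to 4, not a full proof; here $\mathcal{I}^{(d)}$ denotes the set of units that receive intervention d, $\mathcal{LF}$ the collection of latent factors, and $\mathcal{E}$ the latent factors together with the intervention assignments.

```latex
\begin{align*}
\mathbb{E}\big[Y_{tn}^{(d)} \mid u_t^{(d)}, v_n\big]
  &= \langle u_t^{(d)}, v_n \rangle + \mathbb{E}\big[\varepsilon_{tn} \mid u_t^{(d)}, v_n\big]
     && \text{(Assumption 2)} \\
  &= \langle u_t^{(d)}, v_n \rangle
     && \text{(Assumption 4 and the tower law)} \\
  &= \sum_{j \in \mathcal{I}^{(d)}} w_j^{(n,d)} \langle u_t^{(d)}, v_j \rangle
     && \text{(Assumption 3)} \\
  &= \sum_{j \in \mathcal{I}^{(d)}} w_j^{(n,d)}\, \mathbb{E}\big[Y_{tj}^{(d)} \mid \mathcal{E}\big]
     && \text{(Assumptions 2 and 4)} \\
  &= \sum_{j \in \mathcal{I}^{(d)}} w_j^{(n,d)}\, \mathbb{E}\big[Y_{tj} \mid \mathcal{E}\big],
     \quad t > T_0
     && \text{(Assumption 1)}.
\end{align*}
```

Averaging both sides over the $T_1$ post-intervention periods then expresses $\theta_n^{(d)}$ through expectations of observed outcomes of the units in $\mathcal{I}^{(d)}$. The last step is where the restriction to $t > T_0$ enters, since the units in $\mathcal{I}^{(d)}$ are observed under intervention d only after the intervention point.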
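More notation is now discussed. Throughout this disclosure, let Ypre,n=[Ytn : t≤T0]∈ℝ^{T0} denote the vector of pre-intervention outcomes for unit n. Let $Y_{\text{pre},\mathcal{I}^{(d)}} = [Y_{tj} : t\le T_0,\ j\in\mathcal{I}^{(d)}] \in \mathbb{R}^{T_0\times N_d}$ and $Y_{\text{post},\mathcal{I}^{(d)}} = [Y_{tj} : t> T_0,\ j\in\mathcal{I}^{(d)}] \in \mathbb{R}^{T_1\times N_d}$ denote the pre- and post-intervention outcomes, respectively, of the units in ℐ(d). Define the singular value decomposition (SVD) of Ypre,ℐ(d) as
$$Y_{\text{pre},\mathcal{I}^{(d)}} = \sum_{\ell=1}^{M} \hat{s}_\ell\, \hat{u}_\ell \hat{v}_\ell', \tag{6}$$
where M=min{T0, Nd}, $\hat{s}_\ell\in\mathbb{R}$ are the singular values (arranged in decreasing order), and $\hat{u}_\ell\in\mathbb{R}^{T_0}$, $\hat{v}_\ell\in\mathbb{R}^{N_d}$ are the left and right singular vectors, respectively.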
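The SI estimator is a two-step procedure with only one hyper-parameter k∈[M] that quantifies the number of singular components of Ypre,ℐ(d) to retain. First, the model parameter is estimated as
$$\hat{w}^{(n,d)} = \Big(\sum_{\ell=1}^{k} (1/\hat{s}_\ell)\, \hat{v}_\ell \hat{u}_\ell'\Big) Y_{\text{pre},n}. \tag{7}$$
Second, the causal parameter is estimated as
$$\hat{\theta}_n^{(d)} = \frac{1}{T_1}\sum_{t>T_0}\sum_{j\in\mathcal{I}^{(d)}} \hat{w}_j^{(n,d)}\, Y_{tj}. \tag{8}$$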
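A minimal NumPy sketch of the two-step procedure follows. It assumes rows index time and columns index the donor units that received intervention d; the function name `si_estimate`, the simulated factors, and the noise level are illustrative assumptions, and the second step averages the weighted post-intervention donor outcomes as suggested by Theorem 2.1.

```python
import numpy as np

def si_estimate(y_pre_n, Y_pre_donors, Y_post_donors, k):
    """Two-step sketch: (i) principal component regression weights from
    pre-intervention data, (ii) weighted average of post-intervention donor outcomes."""
    U, s, Vt = np.linalg.svd(Y_pre_donors, full_matrices=False)
    # Keep the top-k singular components: w = sum_l (1/s_l) v_l u_l' y_pre_n.
    w_hat = Vt[:k].T @ ((U[:, :k].T @ y_pre_n) / s[:k])
    # Average over the post-intervention periods of the weighted donor outcomes.
    theta_hat = float(np.mean(Y_post_donors @ w_hat))
    return w_hat, theta_hat

# Simulated example (hypothetical): target factor lies in the span of donor factors.
rng = np.random.default_rng(0)
T0, T1, Nd, r = 40, 15, 25, 3
V = rng.normal(size=(Nd, r))                   # donor unit factors
v_n = V.T @ rng.dirichlet(np.ones(Nd))         # target unit factor
U_pre = rng.normal(size=(T0, r))               # pre-intervention factors (control)
U_post = rng.normal(size=(T1, r))              # post-intervention factors (intervention d)
noise = 0.1
Y_pre_donors = U_pre @ V.T + noise * rng.normal(size=(T0, Nd))
y_pre_n = U_pre @ v_n + noise * rng.normal(size=T0)
Y_post_donors = U_post @ V.T + noise * rng.normal(size=(T1, Nd))

theta_true = float(np.mean(U_post @ v_n))
w_hat, theta_hat = si_estimate(y_pre_n, Y_pre_donors, Y_post_donors, k=r)
print("true:", round(theta_true, 4), "estimate:", round(theta_hat, 4))
```

Note that the target unit's post-intervention outcomes are never used: the weights are learned entirely from pre-intervention data and then applied to the donors' post-intervention outcomes.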
Regarding choosing k, there exist a number of principled heuristics to select the hyper-parameter k, and a few are discussed here. One standard data-driven approach is simply to use cross-validation, where the pre-intervention data is the training set and the post-intervention data is the validation set. Another standard approach is to use a “universal” thresholding scheme that preserves the singular values above a precomputed threshold. Finally, a “human-in-the-loop” approach is to inspect the spectral characteristics of Ypre,ℐ(d), and choose k to be the natural “elbow” point that partitions the singular values into those of large and small magnitudes.
To understand the third approach, recall that Ypre,ℐ(d)=𝔼[Ypre,ℐ(d)]+Epre,ℐ(d), where Epre,ℐ(d)=[εtj : t≤T0, j∈ℐ(d)]. Under the factor model in Assumption 2, 𝔼[Ypre,ℐ(d)] is low-rank. If the noise εtn is reasonably well-behaved (i.e., has sufficiently light tails), then random matrix theory indicates that the singular values corresponding to the noise matrix Epre,ℐ(d) are much smaller in magnitude compared to those of the signal matrix 𝔼[Ypre,ℐ(d)]. Hence, it is likely that a “sharp” threshold or gap exists between the top singular values associated with 𝔼[Ypre,ℐ(d)] and the remaining singular values induced by Epre,ℐ(d). For example, if the rows of Epre,ℐ(d) are sub-Gaussian, then ∥Epre,ℐ(d)∥op=Op(√T0+√Nd). In comparison, if the entries of 𝔼[Ypre,ℐ(d)] are Θ(1) and its nonzero singular values si are of the same magnitude, then si=Θ(√(T0Nd)). For a more detailed exposition of assumptions on the spectra of 𝔼[Ypre,ℐ(d)], refer to Assumption 7.
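The gap between signal and noise singular values can be illustrated numerically. The sketch below automates the “elbow” inspection with one simple heuristic, namely picking the largest ratio between consecutive singular values; this particular rule, the function name `elbow_rank`, and the Gaussian simulation are assumptions for illustration and are not prescribed by the disclosure.

```python
import numpy as np

def elbow_rank(Y):
    """Pick k at the largest ratio between consecutive singular values."""
    s = np.linalg.svd(Y, compute_uv=False)   # descending order
    ratios = s[:-1] / s[1:]
    return int(np.argmax(ratios) + 1), s

rng = np.random.default_rng(2)
T0, Nd, r, sigma = 200, 100, 4, 0.5
signal = rng.normal(size=(T0, r)) @ rng.normal(size=(r, Nd))  # low-rank part
noise = sigma * rng.normal(size=(T0, Nd))                     # idiosyncratic shocks

k, s = elbow_rank(signal + noise)
noise_op = np.linalg.norm(noise, 2)                           # operator norm of the noise

print("chosen k:", k)
print("top singular values:", s[: r + 2].round(1))
print("noise operator norm:", round(noise_op, 1),
      "vs sigma*(sqrt(T0)+sqrt(Nd)):", round(sigma * (np.sqrt(T0) + np.sqrt(Nd)), 1))
```

In this simulation the leading singular values scale like √(T0·Nd) while the noise operator norm scales like √T0+√Nd, which is what makes the elbow visible; with heavier noise or unbalanced signal singular values, the gap can be less pronounced and cross-validation may be preferable.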
Regarding the interpretation of an SI estimator, recall from Theorem 2.1 that a goal is to estimate w(n,d), for which 𝔼[Ypre,n]=𝔼[Ypre,ℐ(d)]w(n,d) holds. However, only the noisy instantiations Ypre,n and Ypre,ℐ(d) are accessible, where the noise is due to εtn. Nevertheless, it is appreciated herein that the approach given by (7), known as principal component regression (PCR), overcomes the challenge of estimating w(n,d) under measurement error. (The problem of learning from noisy covariate observations is known as “error-in-variables” regression.)
PCR is a two-stage procedure that (i) first applies principal component analysis to extract the low-dimensional signal captured by the top k singular components of Ypre,ℐ(d), and (ii) then performs linear regression in the subspace spanned by these k components. If k=rank(𝔼[Ypre,ℐ(d)]), then the subspace spanned by the top k right singular vectors of Ypre,ℐ(d) will be “close” to the rowspace of 𝔼[Ypre,ℐ(d)] via standard perturbation theory arguments. In particular, it is known that the distance between the two subspaces (the distance between subspaces A and B is defined as ∥𝒫A−𝒫B∥op, where 𝒫A is the projection matrix onto A) scales as Op(∥Epre,ℐ(d)∥op/smin), where smin is the smallest nonzero singular value of 𝔼[Ypre,ℐ(d)]. In other words, PCR implicitly de-noises Ypre,ℐ(d) by exploiting the low-rank factor structure of 𝔼[Ypre,ℐ(d)].
Further, even if 𝔼[Ypre,ℐ(d)] were accessible, there remains an issue of identifiability of w(n,d) in the underdetermined (high-dimensional) case where T0<Nd. It is known, however, that the projection of w(n,d) onto the rowspace of 𝔼[Ypre,ℐ(d)], denoted as w̃(n,d), is not only unique, but is also the only relevant component for prediction. To see this, note that 𝔼[Ypre,ℐ(d)]w̃(n,d)=𝔼[Ypre,ℐ(d)]w(n,d). Further, it is noted that w̃(n,d) is the unique minimum ℓ2-norm vector for which this equality holds. This further reinforces the motivation for PCR since (7) enforces ŵ(n,d) to lie within the subspace spanned by the top k right singular vectors of Ypre,ℐ(d). Thus, ŵ(n,d) will be close to the unique w̃(n,d), provided k is aptly chosen. This argument is formalized in Lemma 4.3.
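The role of the rowspace projection can be checked numerically. In the sketch below, the matrix `M` stands in for the (noiseless) expected pre-intervention donor matrix in an underdetermined case; the dimensions, random data, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T0, Nd, r = 10, 30, 3                       # underdetermined: T0 < Nd
M = rng.normal(size=(T0, r)) @ rng.normal(size=(r, Nd))   # rank-r stand-in matrix
w = rng.normal(size=Nd)                     # some weight vector

# Orthonormal basis of the rowspace of M: top-r right singular vectors.
_, _, Vt = np.linalg.svd(M, full_matrices=False)
V_r = Vt[:r].T

# Projection of w onto the rowspace.
w_tilde = V_r @ (V_r.T @ w)

# A different weight vector that differs from w_tilde only in the nullspace of M.
w_other = w_tilde + (np.eye(Nd) - V_r @ V_r.T) @ rng.normal(size=Nd)

print("same predictions:", np.allclose(M @ w_tilde, M @ w), np.allclose(M @ w_other, M @ w))
print("norms (projected, original, other):",
      np.linalg.norm(w_tilde).round(3),
      np.linalg.norm(w).round(3),
      np.linalg.norm(w_other).round(3))
```

All three weight vectors yield the same predictions on `M`, but the projected vector has the smallest norm, which is why restricting the estimate to the span of the top right singular vectors loses nothing for prediction.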
Below, the statistical accuracy (it is noted that all log factors within the results can be removed with careful analysis) of the estimate θ̂n(d) is established. Section 4.1 lists additional assumptions needed to establish the results. In Section 4.2, the estimation error of the model parameter w(n,d) is bounded. In Sections 4.3 and 4.4, consistency and asymptotic normality of the estimate θ̂n(d) are established, respectively. Notably, in the specialized case where d=0, the results described herein contribute new inference guarantees to the SC literature as well.
More notation is now described. For any vector v∈ℝ^a, let C(∥v∥p)=max{1, ∥v∥p}, where ∥v∥p denotes its ℓp-norm. To simplify notation, dependencies on r and σ are henceforth absorbed into the constant within Op, which is defined above.
Additional assumptions required to establish guarantees for the estimation procedure can be stated. Strictly speaking, these assumptions are not needed to establish identification of θn(d) (Theorem 2.1), but are necessary for the estimator used in Section 3.1 to produce {circumflex over (θ)}n(d). These additional assumptions are in some sense “context-specific”, i.e., they depend on the assumptions made on εtn and the procedure chosen to estimate w(n,d). Since Assumptions 5 and 6 are standard and self-explanatory, focus is put on interpreting and justifying Assumptions 7 and 8 below. Towards this, recall the definition of ε, given in (2).
Assumption 5 (Sub-Gaussian shocks): Conditioned on ε, εtn are independent sub-Gaussian r.v.s with 𝔼[εtn²]=σ² and ∥εtn∥ψ2≤Cσ for some constant C>0.
Assumption 6 (Bounded support): Conditioned on ε, 𝔼[Ytn(d)]∈[−1,1]. (The precise bound [−1,1] is without loss of generality, i.e., it can be extended to [a,b] for a,b∈ℝ with a≤b.)
Assumption 7 (Well-balanced spectra): Conditioned on ε and given ℐ(d), the rpre nonzero singular values si of 𝔼[Ypre,ℐ(d)] are well-balanced, i.e., si²=Θ(T0Nd/rpre).
Assumption 8 (Subspace inclusion): Conditioned on ε and given ℐ(d), the rowspace of 𝔼[Ypost,ℐ(d)] lies within that of 𝔼[Ypre,ℐ(d)], i.e., ℛ(𝔼[Ypost,ℐ(d)])⊆ℛ(𝔼[Ypre,ℐ(d)]).
Note Θ(⋅) and ℛ(⋅) used in the Assumptions above were previously defined.
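Assumption 7 requires that the nonzero singular values of the latent matrix, 𝔼[Ypre,ℐ(d)], are well-balanced. A natural setting in which Assumption 7 holds is if its elements 𝔼[Ytn(d)]=Θ(1) and its nonzero singular values satisfy si²=Θ(ζ) for some ζ. Then, for some absolute constant C>0,
\[ \Theta(T_0 N_d) = \big\| \mathbb{E}[Y_{\mathrm{pre},\mathcal{I}^{(d)}}] \big\|_F^2 = \sum_{i=1}^{r_{\mathrm{pre}}} s_i^2 = C \zeta\, r_{\mathrm{pre}} \;\Longrightarrow\; \zeta = \Theta(T_0 N_d / r_{\mathrm{pre}}). \]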
In effect, Assumption 7, or analogous versions of it, is pervasive across many fields. Within the econometrics factor model and matrix completion literatures, Assumption 7 is analogous to incoherence-style conditions. Additionally, Assumption 7 has been shown to hold w.h.p. for the canonical probabilistic generating process used to analyze probabilistic principal component analysis; here, the observations are assumed to be a high-dimensional embedding of a low-rank matrix with independent sub-Gaussian entries. Below, Lemma 4.1 further establishes that under Assumption 5, the singular values of Ypre,ℐ(d) and 𝔼[Ypre,ℐ(d)] must be close.
Lemma 4.1: Let Assumptions 1, 2, 4, 5 hold. Then conditioned on ε, for any t>0 and i≤min{T0,Nd}, |si−ŝi|≤Cσ(√{square root over (T0)}+√{square root over (Nd)}+t) with probability at least 1−2exp(−t²), where C>0 is an absolute constant.
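A bound of this type is commonly obtained by combining Weyl's inequality, which controls the gap between singular values by the operator norm of the noise, with a high-probability bound on that operator norm. The proof is not reproduced in this section, so treating the lemma this way is an assumption. The Python sketch below only checks the deterministic Weyl step on synthetic data; all dimensions and the noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T0, Nd, r, sigma = 100, 80, 4, 0.5                  # illustrative sizes, not from the text

signal = rng.normal(size=(T0, r)) @ rng.normal(size=(r, Nd))   # latent low-rank matrix
noise = sigma * rng.normal(size=(T0, Nd))                      # idiosyncratic shocks
observed = signal + noise

s = np.linalg.svd(signal, compute_uv=False)         # singular values of the latent matrix
s_hat = np.linalg.svd(observed, compute_uv=False)   # singular values of the observations

max_gap = np.max(np.abs(s - s_hat))                 # largest |s_i - s_hat_i| over all i
noise_op_norm = np.linalg.norm(noise, 2)            # operator (spectral) norm of the noise
```

By Weyl's inequality, max_gap never exceeds noise_op_norm, and for Gaussian noise the latter is typically of order σ(√T0+√Nd), which matches the shape of the bound in the lemma.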
Hence, Assumption 7 can be empirically assessed by simply inspecting the spectra of ; refer to
Regarding Assumption 8, in the SI framework, potential outcomes from different interventions are likely to arise from different distributions. Under Assumption 2, this translates to the latent time-intervention factors ut(d
As will be seen, Assumption 8 is the key condition that allows ŵ(n,d), learnt during the pre-intervention (training) period, to “generalize” to the post-intervention (testing) period. In Section 7, simulations are performed to show that {circumflex over (θ)}n(d) is an accurate estimate of θn(d) even if the pre- and post-intervention outcomes come from different distributions, provided Assumption 8 holds; if it does not hold, then {circumflex over (θ)}n(d) is non-trivially biased. Given its importance, in Section 5, a data-driven hypothesis test is provided with provable guarantees to validate when Assumption 8 holds in practice. This hypothesis test can serve as an additional robustness check for the SC framework as well.
Assumption 8 enables the SI estimator to accurately learn the model parameter w(n,d). First, Lemma 4.2 establishes that θ(n,d) can be expressed via
where Vpre are the right singular vectors of [].
Lemma 4.2: For a given unit-intervention pair (n,d), let the setup of Theorem 2.1 and Assumption 8 hold. Then,
Next, it is established that ŵ(n,d), given by (6), is a consistent estimate of {tilde over (w)}(n,d).
Lemma 4.3: For a given unit-intervention pair (n,d), let Assumptions 1 to 8 hold. Further, suppose k=([]), where k is defined as in (6). Then, conditioned on ε,
The following finite-sample guarantee establishes that the estimator described in Section 3 yields a consistent estimate of the causal parameter for a given unit-intervention pair.
Theorem 4.1: For a given unit-intervention pair (n,d), let Assumptions 1 to 8 hold. Suppose k=([]), where k is defined as in (7). Then, conditioned on ε,
Below, it is established that the estimate is asymptotically normal around the true causal parameter. It is then shown how to use this result to construct confidence intervals.
Theorem 4.2: For a given unit-intervention pair (n,d), let the setup of Theorem 4.1 hold. Suppose (i) T0, T1, Nd→∞; (ii) (C(∥{tilde over (w)}(n,d)∥2))2log(T0,Nd)=o(min{T0,Nd}); (iii)
Then, conditioned on ε,
Ignoring dependencies on log factors and {tilde over (w)}(n,d), Theorem 4.2 establishes that if T1=o(min{√{square root over (T0)}, Nd}), then {circumflex over (θ)}n(d) is asymptotically normally distributed around θn(d). Effectively, this condition restricts the number of post-intervention (testing) measurements T1 from growing too quickly with respect to the number of pre-intervention (training) measurements T0 and the size of the donor subgroup Nd.
It is noted that the asymptotic variance scales with σ² and ∥{tilde over (w)}(n,d)∥2². Recall that the former represents the variance of εtn across all (t,n), while the latter measures the size of the projection of w(n,d) onto the rowspace of 𝔼[Ypre,ℐ(d)]. Thus, the asymptotic variance scales with the underlying “model complexity” of the latent factors, given by rank(𝔼[Ypre,ℐ(d)]), which can be much smaller than the ambient dimension Nd.
To construct confidence intervals, an estimate for the asymptotic variance may be required. Consistent with standard practice, σ2 can be estimated using the pre-intervention error:
This estimator can be justified through the following lemma.
Lemma 4.4: Let Assumptions 1 to 7 hold. Suppose k=([]), where k is defined as in (6). Then conditioned on ε,
Recall from Lemma 4.3 that ∥{tilde over (w)}(n,d)∥2 can be estimated from ∥ŵ(n,d)∥2. Hence, {circumflex over (σ)}∥ŵ(n,d)∥2 is an accurate estimate of σ∥{tilde over (w)}(n,d)∥2. Coupling this estimator with Theorem 4.2, confidence intervals for the causal parameter can be created in a straightforward manner. For example, a 95% confidence interval is given by
The “tightness” of this confidence interval is empirically assessed in Section 6.2.
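The variance estimator and the interval itself did not survive in the text above, so the following Python sketch rests on two explicit assumptions: that the noise variance is estimated as the average squared pre-intervention residual, and that the 95% interval takes the form {circumflex over (θ)}n(d) ± 1.96·{circumflex over (σ)}∥ŵ(n,d)∥2/√T1. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def pcr(Y, y, k):
    # Hypothetical helper: least squares restricted to the top-k singular components of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

def si_confidence_interval(Y_pre, y_pre, Y_post, k, z=1.96):
    T0, T1 = Y_pre.shape[0], Y_post.shape[0]
    w_hat = pcr(Y_pre, y_pre, k)
    theta_hat = float(np.mean(Y_post @ w_hat))                        # averaged post-period prediction
    sigma_hat = np.sqrt(np.sum((y_pre - Y_pre @ w_hat) ** 2) / T0)    # ASSUMED form of the estimator
    half_width = z * sigma_hat * np.linalg.norm(w_hat) / np.sqrt(T1)  # ASSUMED form of the interval
    return theta_hat, theta_hat - half_width, theta_hat + half_width

rng = np.random.default_rng(2)
T0, T1, Nd, r, sigma = 200, 20, 50, 3, 0.5           # illustrative sizes
V = rng.normal(size=(Nd, r))
S_pre = rng.normal(size=(T0, r)) @ V.T               # latent pre-intervention donor outcomes
S_post = rng.normal(size=(T1, r)) @ V.T              # latent post-intervention donor outcomes
w = rng.normal(size=Nd) / Nd
Y_pre = S_pre + sigma * rng.normal(size=(T0, Nd))
Y_post = S_post + sigma * rng.normal(size=(T1, Nd))
y_pre = S_pre @ w + sigma * rng.normal(size=T0)      # noisy target unit, pre-intervention

theta_hat, lower, upper = si_confidence_interval(Y_pre, y_pre, Y_post, k=r)
```

The multiplier z=1.96 corresponds to the 95% level; other levels would use the matching standard normal quantile.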
A key condition that enables the theoretical results in Section 4 is Assumption 8. Below, a data-driven hypothesis test to check when this condition holds is provided.
Additional notation is now discussed. Recall rpre=([]), and let rpost=([]). Recall Vpre∈N
Consider the following two hypotheses:
Define the test statistic as {circumflex over (τ)}=∥(I−{circumflex over (V)}pre{circumflex over (V)}′pre){circumflex over (V)}post∥F2. This yields the following test: for any significance level α∈(0,1),
Here, τ(α) is the critical value, which is herein defined for some absolute constant C≥0:
where ϕpre(a)=√{square root over (T0)}+√{square root over (Nd)}+√{square root over (log(1/a))}; ϕpost(a)=√{square root over (T1)}+√{square root over (Nd)}+√{square root over (log(1/a))}; and , are the -th singular values of [] and [], respectively.
Given the choice of {circumflex over (τ)} and τ(α), both Type I and Type II errors of the test are controlled.
Theorem 5.1: Let Assumptions 1, 2, 4, 5 hold. Fix any α∈(0,1). Then conditioned on ε, the Type I error is bounded as ℙ({circumflex over (τ)}>τ(α)|H0)≤α. To bound the Type II error, suppose that the choice of C, given in (11), satisfies
which implies H1 must hold. Then, the Type II error is bounded as ℙ({circumflex over (τ)}≤τ(α)|H1)≤α.
It is noted that the choice of C depends on the underlying distribution of εtn, and can be made explicit for certain classes of distributions. As an example, Corollary 5.1 specializes Theorem 5.1 to when εtn are normally distributed.
Corollary 5.1: Consider the setup of Theorem 5.1 with C=4. Let εtn be normally distributed. Then, ℙ({circumflex over (τ)}>τ(α)|H0)≤α and ℙ({circumflex over (τ)}≤τ(α)|H1)≤α.
Corollary 5.2 specializes Theorem 5.1 under Assumption 7.
Corollary 5.2: Let the setup of Theorem 5.1 hold. Suppose Assumption 7 holds. Further, suppose that conditioned on ε, the rpost nonzero singular values ζi of [] are well-balanced, i.e., ζi2=Θ(T1Nd/rpost). Then,
{circumflex over (τ)} can be interpreted as follows. Consider the noiseless case (i.e., εtn=0), where Vpre and Vpost are perfectly observed. Conditioned on H0, note that ∥(I−VpreV′pre)Vpost∥F=0, while conditioned on H1, ∥(I−VpreV′pre)Vpost∥F>0. Hence, ∥(I−VpreV′pre)Vpost∥F serves as a natural test statistic. Since these quantities are not observed, one can use {circumflex over (V)}pre and {circumflex over (V)}post as proxies.
τ(α) can be interpreted as follows. Again, considering the noiseless case, it is noted that τ(α)=0. More generally, if the spectra of [] and [] are well-balanced, then Corollary 5.2 establishes that τ(α)=o(1), even in the presence of noise. Of note, Corollary 5.1 allows for exact constants in the definition of τ(α) under the Gaussian noise model.
Condition in (12) can be interpreted as follows. It is noted that (12) is not a restrictive condition. Conditioned on H1, observe that rpost>∥VpreV′preVpost∥F2 always holds. If Assumption 7 holds and the nonzero singular values of [] are well-balanced, then the latter two terms on the right-hand side of (12) decay to zero as T0, T1, Nd grow.
Computing τ(α) requires estimating (i) σ2; (ii) rpre, rpost; (iii) sr
A practical heuristic is now discussed. Here, a complementary approach to computing τ(α) as used in the art is provided. To build intuition, observe that {circumflex over (τ)} represents the remaining spectral energy of {circumflex over (V)}post not contained within span({circumflex over (V)}pre). Further, note {circumflex over (τ)} is trivially bounded by rpost since the columns of {circumflex over (V)}post are orthonormal. Thus, one can fix some fraction ρ∈(0,1) and reject H0 if {circumflex over (τ)}>ρ·rpost. In words, if more than a ρ fraction of the spectral energy of {circumflex over (V)}post lies outside the span of {circumflex over (V)}pre, then the alternative test rejects H0.
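The test statistic and the heuristic rejection rule can be sketched in Python as follows. This is a minimal illustration on noiseless synthetic data; the function names, the ranks, and the choice ρ=0.5 are assumptions made for the example and are not prescribed by the text.

```python
import numpy as np

def top_right_singular_vectors(Y, rank):
    # Columns form an orthonormal basis for the span of the top `rank` right singular vectors.
    return np.linalg.svd(Y, full_matrices=False)[2][:rank].T

def subspace_test_statistic(Y_pre, Y_post, r_pre, r_post):
    V_pre = top_right_singular_vectors(Y_pre, r_pre)
    V_post = top_right_singular_vectors(Y_post, r_post)
    residual = V_post - V_pre @ (V_pre.T @ V_post)     # (I - V_pre V_pre') V_post
    return float(np.linalg.norm(residual, "fro") ** 2)

def heuristic_reject(tau_hat, r_post, rho):
    # Reject H0 when more than a rho fraction of the post-period energy lies outside span(V_pre).
    return tau_hat > rho * r_post

rng = np.random.default_rng(3)
Nd, r_pre, r_post = 40, 5, 3                           # illustrative sizes
B = rng.normal(size=(r_pre, Nd))                       # basis of the pre-intervention rowspace
Y_pre = rng.normal(size=(50, r_pre)) @ B
Y_post_in = rng.normal(size=(20, r_post)) @ rng.normal(size=(r_post, r_pre)) @ B  # rowspace included
Y_post_out = rng.normal(size=(20, r_post)) @ rng.normal(size=(r_post, Nd))        # generic rowspace

tau_in = subspace_test_statistic(Y_pre, Y_post_in, r_pre, r_post)
tau_out = subspace_test_statistic(Y_pre, Y_post_out, r_pre, r_post)
```

When the post-intervention rowspace is contained in the pre-intervention rowspace, the statistic is numerically zero; for a generic post-intervention rowspace it is close to its upper bound rpost.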
In this section, illustrative simulations are presented to reinforce the theoretical results disclosed herein.
This section demonstrates that the disclosed estimator is consistent and the rate of convergence matches the theoretical results (Theorem 4.1). In the process, the importance of Assumption 8 is shown in enabling successful estimation of θn(d) in the SI causal framework.
Without loss of generality, consider the binary D=2 setting, i.e., d∈{0,1}, where the aim is to estimate θn(1). Let N1=|(1)|=200 and r=15, where r is defined in Assumption 2. Define the latent unit factors associated with (1) as ∈N
Regarding pre-intervention data, choose T0=200 and rpre=10. Define the latent pre-intervention time factors under control (d=0) as Upre(0)∈T
Regarding post-intervention data, sample two sets of post-intervention time factors under d=1: one will obey Assumption 8, while the other will violate it. To begin, let T1=200. Sample the former set as follows: (i) let P∈T
A motivation for the construction of [], [], [], and [] is now presented. Recall that the SI framework allows potential outcomes from different interventions to be sampled from different distributions. As such, construct [] and [] such that they follow a different distribution to that of []. This allows for studying when the model learnt using pre-intervention data “generalizes” to a post-intervention regime generated from a different distribution.
To highlight the impact of Assumption 8, it can be noted that by construction, Assumption 8 holds w.h.p. between [(ρ)] and [] for every ρ. In contrast, by construction, [=r w.h.p. Notably, since rpre<r, Assumption 8 fails w.h.p. between [(ρ)] and [] for every ρ. All three conditions are empirically verified.
Generate Ypre,n and by adding independent noise entries from a normal distribution with mean zero and variance σ2=0.3 to [Ypre,n] and [], respectively. For every ρ, generate (ρ) and (ρ) by applying the same additive noise model to (ρ)] and [(ρ)], respectively.
It is noted that the described data generating process ensures that Assumptions 1, 4, 5 hold. In addition, Assumptions 6 and 7 are empirically verified. Further, for Assumption 2, it is noted that the pre- and post-intervention (expected) potential outcomes associated with (1) were all generated using ; thus, their variations only arise due to the sampling procedure for their respective latent time-intervention factors. Given that [Ypre,n], θn(1)(ρ), and (ρ) were all defined using w(n,1), Assumption 3 holds.
In the simulation, 100 iterations are performed for each ρ. The potential outcomes, [Ypre,n], [], [(ρ)], [(ρ)], are fixed, but the idiosyncratic shocks are re-sampled every iteration to yield new (random) outcomes. For each iteration, use (Ypre,n, ) to learn ŵ(n,1), as given by (7). Then, use (ρ) and ŵ(n,1) to yield θn(1)(ρ), as given by (8). Similarly, use (ρ) and ŵ(n,1) to yield (ρ). The mean absolute errors (MAEs), |{circumflex over (θ)}n(1) (ρ)−θn(1)(ρ)| and |(ρ)-(ρ)|, are plotted in
As
In this section, it is demonstrated that estimate {circumflex over (θ)}n(d) is well-approximated by a Gaussian or Normal distribution centered around θn(d), even if the pre- and post-intervention potential outcomes follow different distributions.
The binary D=2 intervention model is again considered. Of interest is estimating θn(d) for d∈{0,1}. The data generating process will be such that when d=0, the pre- and post-intervention potential outcomes will obey the same distribution, and when d=1, they will obey different distributions. Nevertheless, a single learned model can be used to “generalize” to both post-intervention settings. As such, a single donor subgroup is considered, but its units are allowed to “undergo” two different interventions during the post-intervention phase. That is, ℐ(0)=ℐ(1) with N0=N1=400. Choosing r=15, define V==∈N
Regarding pre-intervention data, choose T0=400, and define the latent pre-intervention time factors under control as Upre(0)∈T
Regarding post-intervention data, choose T1=20, and generate post-intervention time factors under d=0 and d=1. For d=0, define Upost(0)∈T
Regarding the interpretation of the data generating process, it is noted that [] and [] obey the same distribution to reflect that both sets of potential outcomes are associated with control. In contrast, [] and [] follow different distributions to reflect that the pre- and post-intervention potential outcomes are associated with different interventions; the former with d=0 and the latter with d=1. However, by construction, both [] and [] satisfy Assumption 8 with respect to [] w.h.p., which is empirically verified. Further, note the mean and variance of [], [] are identical.
Regarding observations, generate Ypre,n and = by adding independent noise from a normal distribution with mean zero and variance σ2=0.5 to [Ypre,n] and []=[], respectively. Generate Y by applying the same additive noise model to [] for d∈{0,1}.
As before, Assumptions 1 to 7 hold by construction.
For each d∈{0,1}, perform 5000 iterations, where [Ypre,n], [], [], and [] are fixed throughout, but the idiosyncratic shocks are re-sampled to generate new (random) outcomes. Within each iteration, first use (Ypre,n, ) to fit ŵ(n,0), as in (7). Since the pre-intervention observations are identical across d, it is highlighted that ŵ(n,1)=ŵ(n,0). Next, use and ŵ(n,0) to yield {circumflex over (θ)}n(d) for each d∈{0,1}, as in (8). In other words, learn a single model ŵ(n,0) and apply it to two different post-intervention outcomes, and .
Histogram of estimates are displayed in
Next, it is shown {circumflex over (θ)}n(d) is non-trivially biased when Assumption 8 fails.
Continuing analysis of the binary D=2 intervention model, let N1=400, r=15, and generate ∈N
Regarding pre-intervention data, choose T0=400 and rpre=12. Construct [] using identically to that in Section 6.1.1, such that ([])=rpre w.h.p., which is empirically verified. As before, generate w(n,1)∈N
Regarding post-intervention data, choose T1=20, and define the post-intervention time factors under d=1 as Upost(1)∈T
Regarding observations, generate Ypre,n and by adding independent noise from a normal distribution with mean zero and variance σ2=0.5 to [Ypre,n] and [], respectively. Generate by applying the same noise model to [].
As before, Assumptions 1 to 7 hold by construction.
For this simulation, 5000 iterations are performed, where [Ypre,n], [], [] are fixed, but the shocks are re-sampled. In each iteration, use (Ypre,n, ) to fit ŵ(n,1), and then use and ŵ(n,1) to yield {circumflex over (θ)}n(1). The resulting histogram 706 is displayed in
Now considered is a case study exploring the effect of different discount strategies to increase user engagement in an A/B testing framework for a large e-commerce company, which can be anonymized due to privacy considerations. The results suggest the SI causal framework offers a useful perspective towards designing data-efficient A/B tests or randomized controlled trials (RCTs). In particular, the SI framework allows one to selectively run a small number of experiments, yet estimate a personalized policy for each unit enrolled in the RCT.
For this case study, the company segmented its users into 25 groups (units), with approximately 10000 individual users per group, based on the historical time and money spent by a user on the platform. The aim of the company was to learn how different discount strategies (interventions) changed the engagement of each of the 25 user groups. The strategies were 10%, 30%, and 50% discounts over the regular subscription cost (control).
The A/B test was performed by randomly partitioning users in each of the 25 user groups into 4 subgroups; these subgroups corresponded to either one of the 3 discount strategies or a control group that received a 0% discount. User engagement in these 100 subgroups (25 user groups multiplied by 4 discount strategies) was measured daily over 8 days.
Of note, this web A/B testing case study is particularly suited to validate the SI framework as one can observe the engagement levels of each customer group under each of the three discounts and control, i.e., for each customer group, all four “counterfactual” trajectories are observed. As such, one has access to all potential outcomes, which can be encoded into a tensor Y=[Ytn(d)]∈8×25×4, where each slice Y(d)∈8×25 is a matrix of potential outcomes under discount d.
It can be seen that the operating assumptions in this work likely hold for this case study. First, Assumption 1 likely holds since the same discount d was applied to every subgroup of users within (d) and the discount one customer receives is unlikely to have an effect on another customer. Next, it is re-emphasized that the discounts were randomly assigned; hence, Assumption 4 holds (see the discussion in Section 2 for details). Moreover, the engagement levels were bounded, which implies Assumptions 5 and 6 hold.
To ensure Assumptions 2, 3, and 7 hold, the spectral profile of the tensor Y is studied. Specifically, one can inspect the spectra of the mode-1 and mode-2 unfoldings of Y, shown in
Regarding pre- and post-intervention periods, for each of the 25 user groups, denote the 8 day user engagement trajectories of the subgroups associated with control as the pre-intervention period. Correspondingly, for each of the 25 user groups, denote the 8 day user engagement trajectories associated with the 10%, 30%, and 50% discount coupons as the post-intervention period.
Regarding the choice of donor groups for each intervention, to simulate a data-efficient RCT, randomly partition the 25 user groups into three clusters, denoted as user groups 1-8, 9-16, and 17-25. For the 10% discount coupon strategy, choose user groups 1-8 as the donor pool, and user groups 9-25 as the collection of targets, i.e., observe the post-intervention data under said discount for groups 1-8, but hold out the corresponding data for groups 9-25. In other words, the SI estimator does not get to observe the trajectories of groups 9-25 under a 10% discount. Using the observed trajectories of user groups 1-8, separately apply the SI estimator to create unique synthetic user engagement trajectories for each of the 9-25 user groups under a 10% discount. Use the metric given by (13) below to compare the estimates against the actual trajectories for user groups 9-25 under a 10% discount. Analogously, one can utilize user groups 9-16 and 17-25 as the donor groups for the 30% and 50% discounts, respectively.
To quantify the accuracy of SI's counterfactual predictions, a meaningful baseline is needed to compare against. To that end, one can use the following squared error metric for any (n,d) pair:
The numerator on the right-hand side of (13) represents the squared error associated with the SI estimate {circumflex over (θ)}n(d). The quantity in the denominator, (1/(T1Nd))ΣtΣj∈ℐ(d)Ytj, is referred to herein as the “RCT estimator”; this is the average outcome across all units within subgroup ℐ(d) over the post-intervention period. If the units in an RCT are indeed homogeneous (i.e., they react similarly to intervention d), then the RCT estimator will be a good predictor of θn(d). Therefore, SEn(d)>0 indicates the success of the SI estimator over the RCT baseline, and (13) can be interpreted as a modified R2 statistic with respect to an RCT baseline. In effect, SEn(d) captures the gain from “personalizing” the prediction to the target unit using SI over the natural baseline of taking the average outcome over ℐ(d).
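Since the display for (13) is not reproduced above, the Python sketch below assumes one plausible form consistent with the description: one minus the ratio of the SI squared error to the RCT-estimator squared error, clipped below at zero. Both the clipping and the function name se_metric are assumptions made for illustration.

```python
import numpy as np

def se_metric(y_post_target, theta_hat, Y_post_donors):
    """ASSUMED form of the metric: max(0, 1 - SI squared error / RCT squared error)."""
    theta_obs = np.mean(y_post_target)       # observed average outcome of the target unit
    rct_estimate = np.mean(Y_post_donors)    # average over all donors and post-period measurements
    denom = (theta_obs - rct_estimate) ** 2
    if denom == 0.0:
        raise ValueError("RCT estimator is exact; the ratio is undefined")
    return max(0.0, 1.0 - (theta_obs - theta_hat) ** 2 / denom)

y_target = np.array([1.0, 2.0, 3.0])                          # target average is 2.0
Y_donors = np.array([[3.0, 3.0], [2.0, 4.0], [3.0, 3.0]])     # donor average is 3.0

score_close = se_metric(y_target, 2.1, Y_donors)   # SI estimate near the truth
score_far = se_metric(y_target, 4.0, Y_donors)     # SI estimate worse than the RCT baseline
```

Under this assumed form, a value near one means the SI estimate is far closer to the observed average than the RCT estimator is, while zero means SI does no better than the baseline.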
Under the setup above, apply SI to produce the synthetic “counterfactual” trajectories under the three discounts. Evaluate the accuracy under the 10% discount using only the estimated trajectories of user groups 9-25 (since user groups 1-8 are donors). Similarly, use the estimated trajectories of user groups 1-8 and 17-25 for the 30% discount, and user groups 1-16 for the 50% discount. To mitigate the effects of randomness, this procedure is repeated 100 times. Within each iteration, randomly sample a permutation of the 25 user groups to create different sets of donors (i.e., groups 1-8, 9-16, 17-25).
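The repeated random assignment of donors described above can be sketched as follows. The dictionary keys and the function name split_donors are hypothetical, and groups are indexed from 0 rather than 1.

```python
import numpy as np

def split_donors(n_groups=25, seed=0):
    """Randomly permute the user groups and cut them into three donor clusters."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_groups)
    # Positions 1-8, 9-16, and 17-25 of the permutation serve as donors for each discount.
    donors = {"10%": perm[:8], "30%": perm[8:16], "50%": perm[16:]}
    # Every group that is not a donor for a discount is a held-out target for it.
    targets = {d: np.setdiff1d(perm, idx) for d, idx in donors.items()}
    return donors, targets

# One split per iteration; 100 iterations as in the procedure above.
splits = [split_donors(seed=i) for i in range(100)]
donors, targets = splits[0]
```

Each iteration therefore evaluates 17 targets for the 10% and 30% discounts and 16 targets for the 50% discount.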
TABLE 1 shows the hypothesis test results for the three discount strategies, and the median SEn(d) across the 25 user groups averaged over all 100 iterations, denoted as SE(d). The hypothesis test passes for each discount at a significance level of α=0.05 in every iteration, which indicates Assumption 8 likely holds. Across the three discounts, SI achieves an average SE(d) of at least 0.98 and a standard deviation of at most 0.04. In words, the SI estimator far outperforms the RCT estimator. This indicates significant heterogeneity amongst the user groups in how they respond to discounts, and thus warrants the e-commerce company running separate A/B tests for each of the 25 groups.
In this A/B testing framework, the e-commerce company implemented a total of 100 distinct experiments—one experiment for each of the 25 user groups under each of the 4 interventions (0%, 10%, 30%, and 50% discounts). In contrast, the SI framework only required observations from 50 experiments to produce the estimated post-intervention trajectories. In particular, SI utilized two experiments for each of the 25 user groups: one in the pre-intervention period (under 0% discount) and one in the post-intervention period (under exactly one of the 10%, 30%, 50% discounts). See
More generally, with N units and D interventions, an RCT requires N×D experiments to estimate the optimal “personalized” intervention for every unit. Meanwhile, under Assumptions 1 to 8, SI requires 2N experiments as follows: in the first N experiments, all units are under the same intervention regime, say control (d=0); next, divide all N units into D partitions each of size N/D, and assign intervention d to units in the d-th partition, which leads to another N experiments. Crucially, the number of required experiments does not scale with D, which becomes significant as the number of interventions, i.e., the level of personalization, grows. Also, if pre-intervention data is already being collected, which is common in many settings (e.g., this e-commerce case study), then SI only requires running N experiments. This can be significant when experimentation is costly (e.g., clinical trials).
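The experiment counts above reduce to simple arithmetic, sketched here in Python with hypothetical function names and checked against the case-study figures (25 user groups, 4 interventions).

```python
def rct_experiment_count(n_units, n_interventions):
    # Full RCT: every unit is observed under every intervention.
    return n_units * n_interventions

def si_experiment_count(n_units, pre_data_available=False):
    # SI: one round under a common regime plus one partitioned round; the first
    # round costs nothing extra if pre-intervention data is already being collected.
    return n_units if pre_data_available else 2 * n_units

rct_total = rct_experiment_count(25, 4)
si_total = si_experiment_count(25)
si_total_with_history = si_experiment_count(25, pre_data_available=True)
```

The SI count is independent of the number of interventions, so the gap relative to the RCT count widens as D grows.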
In this section, classical potential outcomes frameworks are reinterpreted through the lens of tensors. Specifically, an order-3 tensor with axes that correspond to time, units, and interventions is considered. Each entry of this tensor is associated with the potential outcome for a specific time, unit, and intervention. Recall
Section 8.1 discusses the connection between these two fields in greater detail. More specifically, Section 8.1 points out how important concepts in causal inference have a related notion in tensor estimation. Section 8.2 shows how low-rank tensor factor models, prevalent in the TE literature, provide an elegant way to encode the structure between time, units, and interventions, while making minimal parametric assumptions. Low-rank tensor factor models may lead to the identification of new causal parameters and guide the design of novel algorithms to estimate said causal parameters. Section 8.3 poses what algorithmic advances are required in the TE literature to allow it to more directly connect with causal inference.
Different observational and experimental studies that are prevalent in causal inference can be equivalently posed as different sparsity patterns within a tensor. Continuing with the notation used in this disclosure, consider the setting with T measurements (which may refer to different metrics or time points), N units, and D interventions. A common thread of these studies is that each of the N units can only experience one, or a small subset, of the possible D interventions, e.g., if a unit is an individual, then it is only feasible to observe her under one intervention. This constraint naturally induces a block sparsity pattern, as exhibited
Now discussed is the relationship between causal inference with different target causal parameters and TE under different error metrics. The first step in causal inference is to define a target causal parameter, while the first step in tensor estimation is to define an error metric between the underlying and estimated tensors. Below, a few important connections between these two concepts are discussed. To begin, consider as the causal parameter the average potential outcome under intervention d across all T measurements and N units (if there is a pre- and post-intervention period, then the target causal parameter is typically restricted to the T1 post-intervention measurements). Then, estimating this parameter can equivalently be posed as requiring a tensor estimation method with a Frobenius-norm error guarantee for the d-th slice of the potential outcomes tensor with dimension T×N, normalized by 1/√{square root over (TN)}. As such, a uniform bound for this causal parameter over all D interventions would in turn require a guarantee over the max (normalized) Frobenius-norm error for each of the D slices. Another causal parameter is unit n's potential outcome under intervention d averaged over all T measurements (recall, this is θn(d)). Analogously, this translates to the ℓ2-norm error guarantee of the n-th column of the d-th tensor slice, normalized by 1/√{square root over (T)}. A uniform bound over all N units for the d-th intervention would then correspond to a 2,∞-norm error for the d-th tensor slice. As a final example, let the target causal parameter be the unit potential outcome under intervention d and measurement t. This would require a TE method with a max-norm (entry-wise) error guarantee for the d-th matrix slice. As above, a uniform bound over all measurements, units, and interventions corresponds to a max-norm error over the entire tensor.
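The error metrics named above can be written out concretely. The Python sketch below assumes the tensor is stored as a NumPy array with axes (time, unit, intervention); the function names are hypothetical.

```python
import numpy as np

def slice_frobenius_error(Y, Y_hat, d):
    # Frobenius error of the d-th slice, normalized by 1/sqrt(T*N).
    T, N, _ = Y.shape
    return np.linalg.norm(Y[:, :, d] - Y_hat[:, :, d], "fro") / np.sqrt(T * N)

def column_l2_error(Y, Y_hat, n, d):
    # l2 error of unit n's column in the d-th slice, normalized by 1/sqrt(T).
    return np.linalg.norm(Y[:, n, d] - Y_hat[:, n, d]) / np.sqrt(Y.shape[0])

def slice_two_inf_error(Y, Y_hat, d):
    # 2,infinity-norm: the largest normalized column l2 error within the d-th slice.
    diff = Y[:, :, d] - Y_hat[:, :, d]
    return np.max(np.linalg.norm(diff, axis=0)) / np.sqrt(Y.shape[0])

def max_norm_error(Y, Y_hat):
    # Entry-wise (max-norm) error over the entire tensor.
    return np.max(np.abs(Y - Y_hat))

Y = np.zeros((4, 3, 2))                 # T=4 measurements, N=3 units, D=2 interventions
Y_hat = Y.copy()
Y_hat[0, 1, 0] = 2.0                    # a single erroneous entry in slice d=0

avg_error = abs(np.mean(Y_hat[:, :, 0] - Y[:, :, 0]))   # error in the slice-average parameter
```

By the Cauchy-Schwarz inequality, the error in an averaged causal parameter is at most the corresponding normalized norm, which is why a norm guarantee on the tensor estimate carries over to the causal parameter.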
Regarding a tensor factor model, one can start by introducing a low-rank tensor factor model, which is a natural generalization of the traditional factor model considered in the panel data literature. More formally, let Y=[Ytn(d)]∈ℝ^{T×N×D} denote an order-3 tensor of potential outcomes. A tensor factor model over Y admits the following decomposition:
\[ Y_{tn}^{(d)} = \sum_{\ell=1}^{r} u_{t\ell}\, v_{n\ell}\, \lambda_{d\ell} + \varepsilon_{tn}, \tag{14} \]
where r is the canonical polyadic (CP) rank, ut∈ℝ^r is a latent time (or more generally, measurement) factor, vn∈ℝ^r is a latent unit factor, and λd∈ℝ^r is a latent intervention factor. Note that if attention is restricted to a matrix (e.g., restricting the potential outcomes to a particular time, unit, or intervention slice), then the CP rank and standard matrix rank match. Importantly, the factorization in Assumption 2 is implied by the factorization assumed by a low-rank tensor as given in (14). In particular, Assumption 2 does not require the additional factorization of the (time, intervention) factor ut(d) into ut and λd, where ut is a time-specific factor and λd is an intervention-specific factor.
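To make the relationship between the tensor factor model and the matrix factorization of a single intervention slice concrete, the Python sketch below builds a noiseless CP tensor and checks that each slice factorizes through a combined (time, intervention) factor. The dimensions echo the case study (8×25×4); the CP rank r=3, the entry-wise product used for the combined factor, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
T, N, D, r = 8, 25, 4, 3                       # r is an assumed CP rank

U = rng.normal(size=(T, r))                    # latent time factors u_t
V = rng.normal(size=(N, r))                    # latent unit factors v_n
Lam = rng.normal(size=(D, r))                  # latent intervention factors lambda_d

# Noiseless part of the model: sum over l of u_tl * v_nl * lambda_dl.
Y_mean = np.einsum("tl,nl,dl->tnd", U, V, Lam)

d = 2
U_d = U * Lam[d]                               # combined factor: entry-wise product of u_t and lambda_d
slice_from_factors = U_d @ V.T                 # matrix factor model for the d-th slice

slice_ranks = [np.linalg.matrix_rank(Y_mean[:, :, j]) for j in range(D)]
```

Every intervention slice has matrix rank at most r and shares the same unit factors V, which is the structure that lets a model learned on one slice transfer to another.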
The implicit factorization of [Ytn(d)] given in (1) allows the SI causal framework to identify and estimate θn(d) for any d, i.e., beyond d=0. In particular, this factorization enables a model to be learned under control yet transferred to another intervention regime. An added benefit, as previously stated, is that this also precludes the need for covariate information with respect to time, units, or interventions. That is, directly learning on the observed outcomes (and appropriately de-noising) exploits this latent structure between the dimensions of the tensor to impute missing potential outcomes. In contrast, traditional methods that learn across interventions require access to meaningful covariate information.
Described next is a procedure for generating synthetic interventions data. Embodiments of the procedure can be implemented within the system 100 of
During a data preparation phase of the procedure, one or more of the following steps may be performed. First, prepare a database where for each unit, data associated with all interventions that unit goes through is collected or otherwise prepared. Next, query a database (e.g., observations database 104 of
During a data validation phase of the procedure, one or more of the following steps may be performed. First, create a new table that concatenates X1 and X2, i.e., X=[X1, X2] in a column-wise manner. Next, perform the singular value decomposition of X and inspect its spectral profile. Check to see if X exhibits low-dimensional structure (e.g., using the techniques described above in Section 3). If X does not exhibit low-dimensional structure, then optionally pre-process X (e.g., apply an autoencoder to X to identify a new low-dimensional representation of X). Next, perform the subspace inclusion hypothesis test on X1, X2 detailed in Section 5, above. If the hypothesis test passes, then there may be sufficiently strong diagnostic evidence (as established in Section 4, above), that one can accurately create synthetic data for target unit i under subject intervention d. In some embodiments, one or more steps of the data validation phase may be omitted.
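One simple way to inspect the spectral profile is to look at the cumulative share of squared singular values. The Python sketch below assumes that X1 and X2 store measurements as rows and donor units as columns, so the concatenation stacks along the measurement axis; this layout, the 99% energy threshold, and the function names are assumptions for illustration only.

```python
import numpy as np

def cumulative_spectral_energy(X):
    # Fraction of total squared singular value mass captured by the top-j components.
    s = np.linalg.svd(X, compute_uv=False)
    return np.cumsum(s ** 2) / np.sum(s ** 2)

def effective_rank(X, energy=0.99):
    # Smallest number of components whose cumulative energy reaches the threshold.
    return int(np.searchsorted(cumulative_spectral_energy(X), energy) + 1)

rng = np.random.default_rng(6)
T0, T1, N = 40, 20, 20                                   # illustrative sizes
Q_left, _ = np.linalg.qr(rng.normal(size=(T0 + T1, 3)))  # orthonormal left factors
Q_right, _ = np.linalg.qr(rng.normal(size=(N, 3)))       # orthonormal right factors
signal = Q_left @ np.diag([10.0, 8.0, 6.0]) @ Q_right.T  # rank-3 signal
X_full = signal + 0.01 * rng.normal(size=(T0 + T1, N))   # small additive noise

X1, X2 = X_full[:T0], X_full[T0:]                        # ASSUMED layout: rows are measurements
X = np.vstack([X1, X2])
k = effective_rank(X)
```

A sharp elbow in the cumulative energy curve, with a small k relative to the matrix dimensions, is the kind of low-dimensional structure the validation step looks for.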
During a synthetic data generation phase of the procedure, one or more of the following steps may be performed. First, choose a pre-defined training error tolerance ε. Next, follow the synthetic data generation procedure detailed in Section 3, above. One or more ML techniques can be used to generate a learned model representing a relationship between the target unit and the filtered donor units. In some embodiments, perform principal component regression (PCR) between y1 and X1 to yield {circumflex over (β)}, which defines the unique minimum-norm linear relationship between the target unit i and filtered donor units. Note that Section 3, above, lists numerous principled methods to choose the sole hyper-parameter of PCR. More generally, one could use a different machine learning (ML) algorithm (parametric or nonparametric) to learn a relationship between y1 and X1 (e.g., using a neural network or a random forest). Denote the learned model between y1 and X1 as {circumflex over (f)}.
Next, intermediary synthetic data model validation can be performed. For example, compute the training error between the observations y1 and estimates X1β̂; if the error is below ε, then proceed to the next step as the learnt model between the target unit and donor units is satisfactory, with respect to prediction, for the practitioner (this also demonstrates that the underlying data likely satisfies the desirable properties for this estimation procedure, as described above in Sections 3 and 4). In some embodiments, the intermediary synthetic data model validation may be omitted.
Next, synthetic data associated with target unit i under subject intervention d can be generated as X2β̂; from here, the practitioner can apply further procedures on top of the estimates X2β̂ (or more generally, f̂(X2)), e.g., by computing the mean, which expresses the average counterfactual outcome of target unit i under subject intervention d.
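The intermediary validation and the generation step can be sketched together as follows. This is an illustration under stated assumptions: the training error is taken to be the root-mean-square error (the text does not fix a metric), the coefficient vector is supplied by whichever fitting method was used upstream, and the function name `generate_synthetic` is hypothetical.

```python
import numpy as np

def generate_synthetic(X1, y1, X2, beta, eps):
    """Check the training error against eps, then generate synthetic outcomes.

    Returns (synthetic data X2 @ beta, its mean, training error).
    """
    # Assumed metric: root-mean-square error on the training observations.
    train_err = float(np.sqrt(np.mean((y1 - X1 @ beta) ** 2)))
    if not train_err < eps:
        raise ValueError(
            f"training error {train_err:.3g} is not below tolerance {eps:.3g}"
        )
    synthetic = X2 @ beta  # estimates for the target under the subject intervention
    return synthetic, float(np.mean(synthetic)), train_err
```

Raising an error when the tolerance is not met is a design choice for the sketch; an implementation could instead return a flag, since the validation step may be omitted in some embodiments.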
In some embodiments, the procedure can further include diagnosing the accuracy of the generated synthetic data. For example, one can perform “cross-validation” studies to investigate whether the steps listed above are successful in recreating the observed dataset. More formally, each donor unit is iteratively assigned to be the target unit, and the remaining donor units then form the donor group for that particular iteration. That is, in this case, one can observe the temporary target unit's observations under the subject intervention d (i.e., have access to the “synthetic” data one is trying to reproduce). The same procedure described above is then carried out with the extra validation of measuring the prediction error between X2β̂ (more generally, f̂(X2)), where X2 is now defined over the temporary donor group, and the observations associated with the temporary target unit under subject intervention d.
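The cross-validation study described above might be sketched as a leave-one-donor-out loop. The sketch assumes columns of X1 and X2 index donor units, uses root-mean-square error as the prediction error, and substitutes a pseudoinverse-based minimum-norm least-squares fit for PCR to stay short; any other fitting function with the same signature could be passed in. All names are hypothetical.

```python
import numpy as np

def min_norm_fit(X, y, rcond=1e-10):
    """Minimum-norm least-squares fit (a stand-in for PCR in this sketch)."""
    return np.linalg.pinv(X, rcond=rcond) @ y

def leave_one_donor_out(X1, X2, fit=min_norm_fit):
    """Treat each donor in turn as the temporary target and score the prediction.

    X1: donors under the common interventions (rows: measurements, cols: donors).
    X2: donors under the subject intervention, with the same column order.
    Returns one prediction error per donor.
    """
    n = X1.shape[1]
    errors = np.empty(n)
    for j in range(n):
        others = np.delete(np.arange(n), j)    # temporary donor group
        beta = fit(X1[:, others], X1[:, j])    # learn from common interventions
        pred = X2[:, others] @ beta            # predict under subject intervention
        errors[j] = np.sqrt(np.mean((pred - X2[:, j]) ** 2))
    return errors
```

Uniformly small errors across donors give some evidence that the procedure can recreate observed data; large errors for particular donors may flag units that the remaining donor group does not represent well.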
At block 1102, interventions common to the target unit and one or more of a plurality of donor units can be identified from first and second data. The first data may correspond to the target unit under one or more interventions and the second data may correspond to the plurality of donor units each under one or more interventions. The first and second data can be retrieved, for example, from a database (e.g., observations database 104 of
At block 1104, third data corresponding to the target unit under the common interventions can be obtained or otherwise identified from the first data. At block 1106, fourth data corresponding to the filtered donor units under the common interventions can be obtained/identified from the second data. At block 1108, fifth data corresponding to the filtered donor units under the subject intervention can be obtained/identified from the second data. The data obtained/identified at blocks 1104, 1106, 1108 can be stored, for example, within a database (e.g., intermediate database 116 of
At block 1110, a learned model can be generated based on the third and fourth data (from blocks 1104, 1106). The learned model may represent a relationship between the target unit and the filtered donor units. In some embodiments, principal component regression (PCR) can be performed between the third and fourth data to generate a learned model that defines the unique minimum-norm linear relationship between the target unit and the filtered donor units. Other ML techniques can be used as previously discussed.
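The data selection in blocks 1102 through 1108, which feeds the model fitting of block 1110, can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the disclosure: observations are held in plain dictionaries keyed by intervention, the filtered donor units are taken to be the donors observed under the subject intervention, and the common interventions are those shared by the target and every filtered donor. The function name `assemble` is hypothetical.

```python
import numpy as np

def assemble(target_obs, donor_obs, subject):
    """Build the third, fourth, and fifth data from per-intervention observations.

    target_obs: {intervention: 1-D array} for the target unit.
    donor_obs:  {unit: {intervention: 1-D array}} for the donor units.
    subject:    the subject intervention to be synthesized for the target.
    """
    # Assumed filter: keep donors that were observed under the subject intervention.
    filtered = sorted(u for u, obs in donor_obs.items() if subject in obs)
    if not filtered:
        raise ValueError("no donor unit was observed under the subject intervention")
    # Block 1102 (assumed reading): interventions shared by target and all filtered donors.
    common = set(target_obs) - {subject}
    for u in filtered:
        common &= set(donor_obs[u])
    common = sorted(common)
    if not common:
        raise ValueError("target and filtered donors share no intervention")
    # Block 1104: target under the common interventions.
    third = np.concatenate([np.asarray(target_obs[c], dtype=float) for c in common])
    # Block 1106: filtered donors under the common interventions (one column per donor).
    fourth = np.column_stack([
        np.concatenate([np.asarray(donor_obs[u][c], dtype=float) for c in common])
        for u in filtered
    ])
    # Block 1108: filtered donors under the subject intervention.
    fifth = np.column_stack(
        [np.asarray(donor_obs[u][subject], dtype=float) for u in filtered]
    )
    return common, filtered, third, fourth, fifth
```

With this layout, the third and fourth data play the roles of y1 and X1 in the model fitting of block 1110, and the fifth data plays the role of X2 when the learned model is applied.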
At block 1112, the learned model can be applied to the fifth data (from block 1108) to generate the synthetic data. Techniques for applying the learned model to generate synthetic data are described above in Section 9. The generated synthetic data can then be stored (e.g., within database 106 of
The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine-readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or another unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices, or magnetic disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
In the foregoing detailed description, various features are grouped together in one or more individual embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that each claim requires more features than are expressly recited therein. Rather, inventive aspects may lie in less than all features of each disclosed embodiment.
References in the specification to “one embodiment,” “an embodiment,” “some embodiments,” or variants of such phrases indicate that the embodiment(s) described can include a particular feature, structure, or characteristic, but not every embodiment necessarily includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment(s). Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
The disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. Therefore, the claims should be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.
Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter.
All publications and references cited herein are expressly incorporated herein by reference in their entirety.
This application claims the benefit under 35 U.S.C. § 119 of U.S. Provisional Patent Application No. 63/209,567 filed on Jun. 11, 2021, which is hereby incorporated by reference herein in its entirety.
This invention was made with Government support under Grant No. CMMI1462158 and CNS1523546 awarded by the National Science Foundation. The Government has certain rights in the invention.
Number | Date | Country
---|---|---
63209567 | Jun 2021 | US