**Summary:** In a block-randomized controlled trial, the individuals in a study are divided based on important characteristics (for example age group, sex, or smoking status). By stratifying the data in this way prior to randomization, the researchers can reduce the heterogeneity of their treatment and control groups, thereby reducing the variance of their estimate of the effect of their treatment. The `stratamatch`

package is designed to extend this approach to the observational setting by providing functions to allow users to easily divide an observational data set into strata of observations with similar baseline characteristics. The treated and control individuals within each stratum of the data can then by matched (for example, based on propensity score), in order to recapitulate the block-randomized trial from observational data. In order to select an effective stratification scheme for the data, `stratamatch`

applies a *pilot design* approach to estimate a quantity called the prognostic score, and divides the data into blocks of individuals with similar scores. The potential benefits of such an approach are twofold. First, stratifying the data effectively allows for more computationally efficient matching of large data sets. Second, early studies of the prognostic score show that using the prognostic score to inform the matching process may help reduce the variance in the estimated treatment effect and increase the robustness of an observational study to bias from unmeasured confounding factors.

This document contains the following sections

- Introduction and Background
- Generating an Example Data Set
- How to Stratify a Data Set
- Diagnostics for Stratified Data
- Matching Stratified Data

After briefly introducing the goals of stratamatch and the approaches it implements on a statistical level, this document will walk through an example of how stratamatch might be used on a toy data set. We will show how to stratify a data set, how to diagnose the quality of the stratification, and how to match a data set within strata.

In a fully-randomized controlled experiment, treatment assignments are made independently of each individual’s baseline covariates, allowing for unbiased estimation of the treatment effect. In observational studies, researchers may attempt to account for differences in their subject’s baseline covariates by matching treated and control subjects with similar characteristics. In particular, the popular approach of propenisty score matching pairs individuals who had similar estimated probabilities of recieving the treatment based on their baseline characteristics. By balancing the treated and control groups in this way, the propensity score matching approach attempts to coerce the data set into a form that resembles a randomized controlled trial. Importantly, however, propensity score matching can only address bias to due *measured* baseline covariates. Imbalances in unmeasured baseline covariates may still bias the effect estimate after propensity score matching. For this reason, researchers often carry out a sensitivity analysis to address concerns about unmeasured confounding.

However, the fully-randomized experiment is not the only experimental study design. In a *block-randomized controlled experiment*, subjects are stratified based on important covariates (e.g. sex, age group, smoking status) before randomization into treatment groups. This helps reduce the heterogeneity between treatment and control groups. In the experimental context, reducing the heterogeneity between compared groups helps to reduce the variance of the effect estimator. In observational settings however, this reduction in heterogeneity has the added benefit of reducing the sensitivity of the study results to unobserved confounding (see Rosenbaum (2005), for a great discussion on this point). Moreover, if the sample size of an observational study is very large, stratifying the sample and matching separately within the strata in this way may be much faster than matching the entire sample at once. Stratamatch helps to facilitate this process by supplying tools stratify an observational dataset and match within each stratum, thereby emulating an observational version of the block-randomized controlled trial.

Once we have decided to stratify our data set, the question becomes: how should strata be determined? One option is to select the covariates on which to stratify by hand based on expert knowledge. Another option is to use knowledge from previous experiments to select the best substrata. In the experimental setting, this may be done with a *pilot study*. Using this approach, researchers set aside some of their resources before running the main experiment for the purpose of running a smaller, ‘pilot’ experiment. By examining the outcomes of the pilot study, they can gather information which they can use to inform the design of the main experiment. Importantly, after the pilot study is run, the individuals in the pilot experiment are not reused in the main experiment, so the observations of the outcomes from this earlier analysis are not allowed to bias the results of the main study.

Aikens, Greaves, and Baiocchi (2020) extend the idea of the pilot study to the observational setting. Using an observational *pilot design*, the researchers may set aside a ‘pilot set’ of their data. Outcome information on this pilot set can be observed, and the information gained can be used to inform the study design. Subsequently, in order to preserve the separation of the study design from the study analysis, the observations from the pilot set are omitted from the main analysis.

Stratamatch uses a pilot design to estimate a quantity called the prognostic score (Hansen (2008)), defined here as an individual’s expected outcome under the control assignment, based on their baseline covariates. Balancing observational data sets based on the prognostic score can the reduce heterogeneity between matched individuals, decreasing variance and diminishing the sensitivity of the study results to unobserved confounding (See Aikens, Greaves, and Baiocchi (2020), Antonelli et al. (2018), and Leacy and Stuart (2014)). Moreover, since the prognostic score is often continuous, strata can be easily determined using prognostic score quantiles to select evenly sized bins for the data. This circumvents common problems with stratification based on expert knowledge, since that process often generates strata which are too large, too small, or too poorly balanced between treatment and control observations.

The stratamatch function, `auto_stratify`

, carries out the prognostic score stratification pilot design described above. Although there are many additional options available when running this function, the most basic procedure does the following:

Partition the data set into a pilot data set and an analysis data set

Fit a model for the prognostic score from the observations in the pilot set

Estimate prognostic scores for the analysis set using the prognostic model

Stratify the analysis set based on prognostic score quantiles.

Once the data set has been stratified, `stratamatch`

suggests several diagnostic plots and tables that can be used to assess the size and balance of the strata produced. If the strata are satisfactory, the treatment and control individuals can then be matched. At this point, the reader may choose to match the data set within each stratum using the `strata_match`

function, or they may select and implement their own matching scheme.

We demonstrate the functionality of stratamatch with a toy example. The `stratamatch`

package contains its own function, `make_sample_data`

, for just this purpose.

```
library(stratamatch)
set.seed(125)
# make sample data set of 5000 observations
dat <- make_sample_data(n = 5000)
# print the first few rows of the sample_data
head(dat)
```

```
## X1 X2 B1 B2 C1 treat outcome
## 1 0.93332697 0.22256993 1 1 b 0 0
## 2 -0.52503178 1.07623706 1 0 c 1 0
## 3 1.81443979 2.83989785 1 1 b 0 0
## 4 0.08304562 0.04686229 0 0 a 1 0
## 5 0.39571880 0.42609036 0 0 b 0 0
## 6 -2.19366962 1.28452442 1 1 c 0 1
```

Let’s assume that the data in `dat`

are an observational data set, and we would like to estimate the effect of some binary treatment on a binary outcome. Suppose the column `treat`

in this data gives the treatment assigment (where a `0`

means untreated and a `1`

means treated), and the column `outcome`

gives information on who had the outcome of interest and who didn’t (similarly, let’s call `0`

the negative outcome and `1`

the positive outcome). However, since this is an observational data set, we didn’t get to choose who got the treatment and who didn’t. Instead, individuals self-selected into treatment or control groups, perhaps in some way which is influenced by their background characteristics. Additionally, an individual’s probability of having the positive outcome may be influenced by the treatment or by other baseline factors, but we don’t know which. In addition to treatment assignments and outcomes, we have measured five baseline covariates for each individual: `X1`

, `X2`

, `B1`

, `B2`

, and `C1`

. `X1`

and `X2`

are continuous, `B1`

and `B2`

are binary, and `C1`

is a categorical variable which takes on possible values `"a"`

, `"b"`

, and `"c"`

.

We should also note that `dat`

is a fairly large data set (5000 observations), but only about 1/5 of the individuals in the data set recieved the treatment.

Suppose we wanted to stratify the data manually based on covariates that our experts have selected. Importantly, we can only stratify the data based on categorical or binary variables. This means that we cannot use `X1`

or `X2`

directly (this would cause `manual_stratify`

to throw an error). Below, I stratify the data set based on `C1`

, `B1`

, and `B2`

.

```
# manually stratify dat based on categorial and binary variables
m.strat <- manual_stratify(data = dat, treat ~ B1 + B2 + C1)
# try printing the result
m.strat
```

```
## manual_strata object from package stratamatch.
##
## Function call:
## manual_stratify(data = dat, strata_formula = treat ~ B1 + B2 +
## C1)
##
## Analysis set dimensions: 5000 X 8
```

```
## Number of strata: 12
##
## Min size: 33 Max size: 1533
##
## Strata with Potential Issues: 8
```

Above, we see that `manual_stratify`

returns a `manual strata`

object. The printed summary above shows that `manual_stratify`

has divided our dataset according to our instructions and produced 12 strata.

We can extract important information from `m.strat`

with `$`

.

The most important piece of `m.strat`

is the analysis set. This is the stratified data which we can later match and use for effect estimation.

```
## # A tibble: 6 x 8
## X1 X2 B1 B2 C1 treat outcome stratum
## <dbl> <dbl> <int> <int> <chr> <int> <int> <int>
## 1 0.933 0.223 1 1 b 0 0 11
## 2 -0.525 1.08 1 0 c 1 0 9
## 3 1.81 2.84 1 1 b 0 0 11
## 4 0.0830 0.0469 0 0 a 1 0 1
## 5 0.396 0.426 0 0 b 0 0 2
## 6 -2.19 1.28 1 1 c 0 1 12
```

Since we didn’t use a pilot design for this example, our analysis set just has all of the same data in `dat`

that we put in to `manual_stratify`

. The only difference is that now there is an additional column, `stratum`

, which says which stratum each individual has been sorted into.

In contrast to `manual_stratify`

, the `auto_stratify`

function automatically creates the stratified data set based on estimated prognostic scores. In this process, we specify what model we want to use for the prognostic score (the `prognosis`

argument) and what percent of the control reserve we want to use to do the fitting (the `pilot_fraction`

) argument. We also need to tell `auto_stratify`

which column of our data set designates the treatment assignment (`treat`

). The final argument, `size,`

specifies about how large we would like our strata to be. I pick `size = 400`

here so that we get roughly the same number of strata as when we manually stratified our data set.

In practice, `pilot_fraction`

and `size`

have defaults, so we don’t need to specify them every time. Instead of `pilot_fraction`

, we could specify a `pilot_size`

to get a pilot set of approximately that number of observations. Run `help(auto_stratify)`

for more information.

```
a.strat <- auto_stratify(dat, treat = "treat", prognosis = outcome ~ X1 + X2,
pilot_fraction = 0.1, size = 400)
```

`## Constructing a pilot set by subsampling 10% of controls.`

`## Fitting prognostic model via logistic regression: outcome ~ X1 + X2`

```
## auto_strata object from package stratamatch.
##
## Function call:
## auto_stratify(data = dat, treat = "treat", prognosis = outcome ~
## X1 + X2, size = 400, pilot_fraction = 0.1)
##
## Analysis set dimensions: 4619 X 8
##
## Pilot set dimensions: 381 X 7
##
## Prognostic Score Formula:
## outcome ~ X1 + X2
```

```
## Number of strata: 12
##
## Min size: 384 Max size: 385
##
## Strata with Potential Issues: 2
```

Our call to `auto_stratify`

has created `auto_strata`

object. An `auto_strata`

object is much like a `manual_strata`

object, except that it contains somewhat more information, since the stratification process was done differently. Importantly, `auto_strata`

has partitioned 10% of the individuals in the control group to be in the pilot set. These individuals were used to build the prognostic score, and were extracted from the analysis set. In order to prevent overfitting, the observations in the pilot set should not be included in later analyses, and should not be reused when making the final effect estimate.

We can access the pilot set and the analysis set using `$`

with `a.strat`

. The command `a.strat$analysis_set`

gives the analysis set and `a.strat$pilot_set`

gives the pilot set. We can also view the prognostic score estimates for the individuals in the analysis set with `a.strat$prognostic_scores`

.

Suppose you’re interested in partitioning your data into a pilot set and an analysis set, but you’d like to do other things with it after that - for example perhaps you’d like to fit a more complicated model for prognostic score than a logistic regression. The `split_pilot_set`

will split your pilot and analysis set for you based on specified preferences. For more information, run `help(split_pilot_set)`

.

Above, we showed the default method used by `auto_stratify`

. By providing other arguments, it is possible to construct the strata using various other methods. For example, one could fit the prognostic scores for the analysis set separately and provide them as an argument to `auto_stratify`

. Alternatively, the researcher could specify their own pilot set that they would like to use to build the prognostic model. Run `help(auto_stratify)`

to see all of the possible inputs to this function.

Both `manual_stratify`

and `auto_stratify`

objects have a strata table, which shows the rules defining each stratum. Just like when we extracted the analysis set, we can extract the `strata_table`

with `$`

. Below, I show the strata table for `m.strat`

, our manually stratified data. Each row describes the definitions used to make one stratum. For example, below we see that stratum 1 contains all individuals who have `B1 = 0`

and `B2 = 0`

, and `C1 = "a"`

:

```
## # A tibble: 12 x 5
## # Groups: B1, B2 [4]
## B1 B2 C1 stratum size
## <int> <int> <chr> <int> <int>
## 1 0 0 a 1 359
## 2 0 0 b 2 360
## 3 0 0 c 3 1492
## 4 0 1 a 4 39
## 5 0 1 b 5 51
## 6 0 1 c 6 150
## 7 1 0 a 7 175
## 8 1 0 b 8 40
## 9 1 0 c 9 33
## 10 1 1 a 10 1533
## 11 1 1 b 11 383
## 12 1 1 c 12 385
```

The strata table for an automatically stratified data set is somewhat different:

```
## # A tibble: 12 x 3
## stratum quantile_bin size
## <int> <chr> <int>
## 1 1 [0.0461,0.241) 385
## 2 2 [0.2409,0.307) 385
## 3 3 [0.3068,0.352) 385
## 4 4 [0.3524,0.400) 385
## 5 5 [0.4004,0.445) 385
## 6 6 [0.4454,0.482) 385
## 7 7 [0.4825,0.526) 385
## 8 8 [0.5264,0.569) 385
## 9 9 [0.5685,0.615) 385
## 10 10 [0.6149,0.668) 385
## 11 11 [0.6676,0.735) 385
## 12 12 [0.7353,0.946] 384
```

The strata table for `a.strat`

shows what prognostic score quantiles were used to divide the strata. For example, individuals in stratum 1 have the lowest prognostic scores. This means that, if the individuals in stratum 1 were in the control group, we would predict them to have a very low probability of having the positive outcome, based solely on their values of `X1`

and `X2`

. The a vector of the cutoff values used to divide the strata can be extracted from `a.strat`

with `extract_cut_points`

.

Another important diagnostic for both manually and automatically stratified data is the `issue_table`

.

```
## # A tibble: 12 x 6
## Stratum Treat Control Total Control_Proportion Potential_Issues
## <int> <int> <int> <int> <dbl> <chr>
## 1 1 125 234 359 0.652 none
## 2 2 105 255 360 0.708 none
## 3 3 451 1041 1492 0.698 none
## 4 4 2 37 39 0.949 Too few samples; Small treat:…
## 5 5 3 48 51 0.941 Too few samples; Small treat:…
## 6 6 12 138 150 0.92 Small treat:control ratio
## 7 7 104 71 175 0.406 none
## 8 8 25 15 40 0.375 Too few samples
## 9 9 18 15 33 0.455 Too few samples
## 10 10 241 1292 1533 0.843 Small treat:control ratio
## 11 11 51 332 383 0.867 Small treat:control ratio
## 12 12 55 330 385 0.857 Small treat:control ratio
```

This table shows each of the strata that we obtained through manual stratification and some of the potential issues we may face when trying to match within these strata. In particular, this table highlights some strata which are too small and/or have many more treated individuals than controls. The `Total`

column shows the total number of observations in each stratum. Some of the strata here are quite a bit larger than others. In practice, data sets of about 4000 start to take a long time to match, so that’s about the maximum size we would like for our strata. In data sets of fewer than 100 individuals, it can be hard to find good matches between our treated and control individuals.

One benefit of automatic stratification is that it tends to create strata of roughly equal size, so that no strata are extremely small or extremely large. We can see this by comparing the `Total`

column in the issue tables of the manually and automatically stratified data sets:

```
## # A tibble: 12 x 6
## Stratum Treat Control Total Control_Proportion Potential_Issues
## <int> <int> <int> <int> <dbl> <chr>
## 1 1 133 252 385 0.655 none
## 2 2 131 254 385 0.660 none
## 3 3 110 275 385 0.714 none
## 4 4 112 273 385 0.709 none
## 5 5 106 279 385 0.725 none
## 6 6 124 261 385 0.678 none
## 7 7 78 307 385 0.797 none
## 8 8 75 310 385 0.805 Small treat:control ratio
## 9 9 97 288 385 0.748 none
## 10 10 81 304 385 0.790 none
## 11 11 84 301 385 0.782 none
## 12 12 61 323 384 0.841 Small treat:control ratio
```

Plots are an important way that we can check how our statification went. There are two types of plots that we can make automatically with either a `manual_strata`

object or an `auto_strata`

object. The first is a size-ratio plot. This plots each stratum in the analysis set based on its size and the percentage of control observations. If a stratum is too small or is very imbalanced in its treat:control ratio, finding quality matches may be difficult (shaded yellow areas). If a stratum is too large, matching may be very computationally taxing (shaded red areas). We can see here that a few strata from our manual stratification are quite small, and some have nearly entirely control individuals.

```
# plot(m.strat, type = "SR") will give the same output
# plot(m.strat, label = TRUE) will allow the user to click points to label them
```

We can compare this with a size-ratio plot from automatic stratification:

This plot clearly shows that the strata generated by auto_stratify are all about the same size. However, some of the strata still have a treat:control ratio much less than 1:1. Some of this is unavoidable because our data set naturally had a treat:control ratio of about 1:5. However, we should try to avoid having very extreme treat:control ratios in any of our strata.

The second plot we can make to diagnose issues both manually and automatically stratified data is an overlap histogram. Below, I’ve made an overlap histogram for all strata, and an overlap histogram for stratum 3. This shows the propensity score distributions of the treated and control populations. In order to make this histogram, we must tell the `plot`

function what the propensity scores are for our data set by specifying the `propensity`

argument. We have three options: `propensity`

can be a propensity score formula (which will then be fit on the analysis set), a `glm`

modeling propensity score, or a vector of propensity scores. Here, I just pass in a formula.

```
# propensity score histograms for all strata from manually stratified data
plot(m.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2)
```

```
# propensity score histograms for all strata from manually stratified data
plot(m.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 3)
```

A similar command works for automatically stratified data. :

In addition to the two plot types above, there are some plots that can be used to visualize `auto_strata`

objects only: `"residual"`

and `"AC"`

plots.

Setting `type = "AC"`

in the plot command produces a Assignment-Control plot. This shows each individual in terms of their estimated propensity score and their estimated prognostic score (for more details, see Aikens, Greaves, and Baiocchi (2020)). This allows us to check how well the treated and control samples overlap not only in their probability for treatment but in their expected outcome under the control assignment. Just as with the overlap histograms, we need to provide information on the propensity score. Below, I use the same propensity formula as in the histogram above. The first plot shows all of the strata, with lines indicating the barriers between strata, while the second plot looks specifically at stratum 3 by specifying the `stratum`

argument. Apparent striations or groupings in assignment control plots like the ones below are common; they appear when some binary or categorical covariate has a strong association with either the outcome or the treatment assignment.

Finally, if we fit a prognostic model, then we can run diagnostics on that prognostic model with `type = "residual"`

. This is essentially just a wrapper for the `glm`

plotting method. It produces all the familiar diagnostic plots that we can obtain from calling `plot`

on a fitted model from `glm`

.

The `plot`

command with `type = "residual"`

gives us a one-line command that shows us standard diagnostic plots for the prognostic score model that we fit on the pilot set. If we want to run extra diagnostics on our prognostic model, we can do that too. Our `a.strat`

stores a copy of the prognostic score model, which we can extract in the usual way with `$`

. By extracting the prognostic model (`prognostic_model`

), we can run any of the usual diagnostics for our fitted model.

```
# extract prognostic model from a.strat
progmod <- a.strat$prognostic_model
# as an example, summarize model coefficients
summary(progmod)
```

```
##
## Call:
## glm(formula = prognostic_formula, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7627 -1.0509 -0.6498 1.0808 1.9692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.17106 0.15821 -1.081 0.280
## X1 -0.77962 0.12341 -6.318 2.66e-10 ***
## X2 0.09660 0.09498 1.017 0.309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 527.73 on 380 degrees of freedom
## Residual deviance: 480.76 on 378 degrees of freedom
## AIC: 486.76
##
## Number of Fisher Scoring iterations: 4
```

Based on the coefficients from the prognostic model, we can tell that our prognostic score estimates are primarily determined by each individual’s `X2`

baseline value.

Now that we have stratified the data, we can match the individuals within each stratum.

The most flexible strategy - although occaisionally more complex - is for the researcher to do this step by hand, using the strata designations in the analysis set returned by either one of the stratification methods. If the researcher plans to do something very specific or nuanced in the matching process, this may be the best approach. For example, users proficient with the matching package `optmatch`

(Hansen and Klopfer (2006)) will note that adding `+ strata(stratum)`

to the matching formula supplied to optmatch will request that optmatch perform the matching within each stratum in the analysis set (see the optmatch documentation for further details). Another approach is to divide the `analysis_set`

into separate data frames and match on those individually, perhaps distributing over serval computing clusters or using a different matching scheme in each stratum (for example different \(k\) in \(1:k\) matching)

If the goal is something more simple, e.g. a 1-to-1 or 1-to-\(k\) optimal propensity score matching within strata, the `strata_match`

function can do that automatically. Below, I match two control individuals to each treatment individual within strata in my analysis set provided by a.strat. I use the propensity score formula `treat ~ X1 + X2 + B1 + B2`

.

```
# match the automatically stratified data
mymatch <- strata_match(a.strat, treat ~ X1 + X2 + B1 + B2, k = 1)
```

`## This function makes essential use of the optmatch package, which has an academic license.`

`## For more information, run optmatch::relaxinfo()`

`## Fitting propensity model: treat ~ X1 + X2 + B1 + B2`

`## `

```
## Structure of matched sets:
## 1:1 0:1
## 1192 2235
## Effective Sample Size: 1192
## (equivalent number of matched pairs).
```

This function in turn calls `optmatch`

to do the matching, stratified by the strata assignments. It returns an optmatch matching. Inessence, each individual is given an identification code for its match. `<NA>`

indicates that the individual was not matched, and an identifier such as `3.15`

indicates that this individual was in stratum 3 and it was placed in match 15.

```
# add match information as a column in the data set
matched_data <- a.strat$analysis_set
matched_data$match <- as.character(mymatch)
head(matched_data)
```

```
## X1 X2 B1 B2 C1 treat outcome stratum match
## 1 0.93332697 0.22256993 1 1 b 0 0 2 <NA>
## 2 -0.52503178 1.07623706 1 0 c 1 0 9 9.22
## 3 1.81443979 2.83989785 1 1 b 0 0 1 <NA>
## 4 0.08304562 0.04686229 0 0 a 1 0 5 5.82
## 5 -0.36031653 1.69169297 0 0 c 0 1 8 <NA>
## 6 0.14285392 0.68549042 1 1 a 0 0 6 <NA>
```

Effect estimation can then be done using the matched analysis set via researcher’s method of choice.

Aikens, Rachael C, Dylan Greaves, and Michael Baiocchi. 2020. “A Pilot Design for Observational Studies: Using Abundant Data Thoughtfully.” *Statistics in Medicine*. Wiley Online Library.

Antonelli, Joseph, Matthew Cefalu, Nathan Palmer, and Denis Agniel. 2018. “Doubly Robust Matching Estimators for High Dimensional Confounding Adjustment.” *Biometrics* 74 (4). Wiley Online Library: 1171–9.

Hansen, Ben B. 2008. “The Prognostic Analogue of the Propensity Score.” *Biometrika* 95 (2). Oxford University Press: 481–88.

Hansen, Ben B., and Stephanie Olsen Klopfer. 2006. “Optimal Full Matching and Related Designs via Network Flows.” *Journal of Computational and Graphical Statistics* 15 (3): 609–27.

Leacy, Finbarr P, and Elizabeth A Stuart. 2014. “On the Joint Use of Propensity and Prognostic Scores in Estimation of the Average Treatment Effect on the Treated: A Simulation Study.” *Statistics in Medicine* 33 (20). Wiley Online Library: 3488–3508.

Rosenbaum, Paul R. 2005. “Heterogeneity and Causality: Unit Heterogeneity and Design Sensitivity in Observational Studies.” *The American Statistician* 59 (2). Taylor & Francis: 147–52.