This vignette shows how to perform Bayesian leave-one-out cross-validation (LOO-CV) using the mixture estimators proposed in the paper Silva and Zanella (2022). These estimators have shown to be useful in presence of outliers but also, and especially, in high-dimensional settings where the model features many parameters. In these contexts it can happen that a large portion of observations lead to high values of Pareto-\(k\) diagnostics and potential instability of PSIS-LOO estimators.

For this illustration we consider a high-dimensional Bayesian
Logistic regression model applied to the *Voice* dataset.

This is the Stan code for a logistic regression model with regularized horseshoe prior. The code includes an if statement to include a code line needed later for the MixIS approach.

```
# Note: some syntax used in this program requires RStan >= 2.26 (or CmdStanR)
# To use an older version of RStan change the line declaring `y` to:
# int<lower=0,upper=1> y[N];
stancode_horseshoe <- "
data {
int <lower=0> N;
int <lower=0> P;
array[N] int <lower=0, upper=1> y;
matrix [N,P] X;
real <lower=0> scale_global;
int <lower=0,upper=1> mixis;
}
transformed data {
real<lower=1> nu_global=1; // degrees of freedom for the half-t priors for tau
real<lower=1> nu_local=1; // degrees of freedom for the half-t priors for lambdas
// (nu_local = 1 corresponds to the horseshoe)
real<lower=0> slab_scale=2;// for the regularized horseshoe
real<lower=0> slab_df=100; // for the regularized horseshoe
}
parameters {
vector[P] z; // for non-centered parameterization
real <lower=0> tau; // global shrinkage parameter
vector <lower=0>[P] lambda; // local shrinkage parameter
real<lower=0> caux;
}
transformed parameters {
vector[P] beta;
{
vector[P] lambda_tilde; // 'truncated' local shrinkage parameter
real c = slab_scale * sqrt(caux); // slab scale
lambda_tilde = sqrt( c^2 * square(lambda) ./ (c^2 + tau^2*square(lambda)));
beta = z .* lambda_tilde*tau;
}
}
model {
vector[N] means=X*beta;
vector[N] log_lik;
target += std_normal_lpdf(z);
target += student_t_lpdf(lambda | nu_local, 0, 1);
target += student_t_lpdf(tau | nu_global, 0, scale_global);
target += inv_gamma_lpdf(caux | 0.5*slab_df, 0.5*slab_df);
for (n in 1:N) {
log_lik[n]= bernoulli_logit_lpmf(y[n] | means[n]);
}
target += sum(log_lik);
if (mixis) {
target += log_sum_exp(-log_lik);
}
}
generated quantities {
vector[N] means=X*beta;
vector[N] log_lik;
for (n in 1:N) {
log_lik[n] = bernoulli_logit_lpmf(y[n] | means[n]);
}
}
"
```

The *LSVT Voice Rehabilitation Data Set* (see link
for details) has \(p=312\) covariates
and \(n=126\) observations with binary
response. We construct data list for Stan.

```
data(voice)
y <- voice$y
X <- voice[2:length(voice)]
n <- dim(X)[1]
p <- dim(X)[2]
p0 <- 10
scale_global <- 2*p0/(p-p0)/sqrt(n-1)
standata <- list(N = n, P = p, X = as.matrix(X), y = c(y), scale_global = scale_global, mixis = 0)
```

Note that in our prior specification we divide the prior variance by the number of covariates \(p\). This is often done in high-dimensional contexts to have a prior variance for the linear predictors \(X\beta\) that remains bounded as \(p\) increases.

LOO-CV computations are challenging in this context due to high-dimensionality of the parameter space. To show that, we compute PSIS-LOO estimators, which require sampling from the posterior distribution, and inspect the associated Pareto-\(k\) diagnostics.

```
chains <- 4
n_iter <- 2000
warm_iter <- 1000
stanmodel <- stan_model(model_code = stancode_horseshoe)
fit_post <- sampling(stanmodel, data = standata, chains = chains, iter = n_iter, warmup = warm_iter, refresh = 0)
loo_post <-loo(fit_post)
```

```
Computed from 4000 by 126 log-likelihood matrix.
Estimate SE
elpd_loo -42.2 7.0
p_loo 23.7 5.1
looic 84.4 14.0
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.0]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 99 78.6% 56
(0.7, 1] (bad) 20 15.9% <NA>
(1, Inf) (very bad) 7 5.6% <NA>
See help('pareto-k-diagnostic') for details.
```

As we can see the diagnostics signal either “bad” or “very bad” Pareto-\(k\) values for roughly \(15-30\%\) of the observations which is a significant portion of the dataset.

We now compute the mixture estimators proposed in Silva and Zanella
(2022). These require to sample from the following mixture of
leave-one-out posteriors \[\begin{equation}
q_{mix}(\theta) = \frac{\sum_{i=1}^n
p(y_{-i}|\theta)p(\theta)}{\sum_{i=1}^np(y_{-i})}\propto
p(\theta|y)\cdot \left(\sum_{i=1}^np(y_i|\theta)^{-1}\right).
\end{equation}\] The code to generate a Stan model for the above
mixture distribution is the same to the one for the posterior, just
enabling one line of code with a *LogSumExp* contribution to
account for the last term in the equation above.

```
if (mixis) {
target += log_sum_exp(-log_lik);
}
```

We sample from the mixture and collect the log-likelihoods term.

```
standata$mixis <- 1
fit_mix <- sampling(stanmodel, data = standata, chains = chains, iter = n_iter, warmup = warm_iter, refresh = 0, pars = "log_lik")
log_lik_mix <- extract(fit_mix)$log_lik
```

We now compute the mixture estimators, following the numerically stable implementation in Appendix A.2 of Silva and Zanella (2022). The code below makes use of the package “matrixStats”.

To evaluate the performance of mixture estimators (MixIS) we also
generate *benchmark values*, i.e. accurate approximations of the
LOO predictives \(\{p(y_i|y_{-i})\}_{i=1,\dots,n}\), obtained
by brute-force sampling from the leave-one-out posteriors directly,
getting \(90k\) samples from each and
discarding the first \(10k\) as warmup.
This is computationally heavy, hence we have saved the results and we
just load them in the current vignette.

We can then compute the root mean squared error (RMSE) of the PSIS and mixture estimators relative to such benchmark values.

```
elpd_psis <- loo_post$pointwise[,1]
print(paste("RMSE(PSIS) =",round( sqrt(mean((elpd_loo-elpd_psis)^2)) ,2)))
```

`[1] "RMSE(PSIS) = 0.08"`

`[1] "RMSE(MixIS) = 0.04"`

Here mixture estimator provides a reduction in RMSE. Note that this value would increase with the number of samples drawn from the posterior and mixture, since in this example the RMSE of MixIS will exhibit a CLT-type decay while the one of PSIS will converge at a slower rate (this can be verified by running the above code with a larger sample size; see also Figure 3 of Silva and Zanella (2022) for analogous results).

We then compare the overall ELPD estimates with the brute force one.

`[1] "ELPD (PSIS)= -42.18"`

`[1] "ELPD (MixIS)= -45.03"`

`[1] "ELPD (brute force)= -45.63"`

In this example, MixIS provides a more accurate ELPD estimate closer to the brute force estimate, while PSIS severely overestimates the ELPD. Note that low accuracy of the PSIS ELPD estimate is expected in this example given the large number of large Pareto-\(k\) values. In this example, the accuracy of MixIS estimate will also improve with bigger MCMC sample size.

More generally, mixture estimators can be useful in situations where standard PSIS estimators struggle and return many large Pareto-\(k\) values. In these contexts MixIS often provides more accurate LOO-CV and ELPD estimates with a single sampling routine (i.e. with a cost comparable to sampling from the original posterior).

Silva L. and Zanella G. (2022). Robust leave-one-out cross-validation for high-dimensional Bayesian models. Preprint at arXiv:2209.09190

Vehtari A., Gelman A., and Gabry J. (2017). Practical Bayesian model
evaluation using leave-one-out cross-validation and WAIC. *Statistics
and Computing*, 27(5), 1413–1432. Preprint at arXiv:1507.04544

Vehtari A., Simpson D., Gelman A., Yao Y., and Gabry J. (2022). Pareto smoothed importance sampling. Preprint at arXiv:1507.02646