Using Subsemble-Style Estimation to Model Spatial Parameters in GLMMs with Non-Disjoint Samples

Introduction

For large scale spatial data, estimation of spatial parameter estimates can be challenging.

  • Computational difficulty
  • Uncertainty of parameter estimates
  • Scale of the parameter estimates (non-normal response)
  • Estimation of covariates

Subsemble estimation

Subsemble estimation was proposed in 2014 as a method for fitting and prediction in statistical models for big data (Sapp, van der Laan, & Canny, 2014). The idea behind subsemble estimation is simple: partition the full dataset into subsets of observations, fit the statistical model to the subset, and use ensemble methods to combine fitted models and predictions. Subsemble estimation, and other subset estimation methods, are attractive possibilities for quick modeling of large spatially correlated data, provided the partitioning can be done in such a way to preserve the original spatial structure.

Subsemble estimation combines parameter estimates from subset models while simultaneously allowing adjustments for the quality of each subset model. Subsemble estimation improves on bagging (Breiman, 1996, 1998) by ensuring that every observation in the original data set is used in the partitioning. Subsemble techniques are also flexible enough to allow for a variety of model recombination algorithms, instead of just simple averaging.

The basic subsemble estimation algorithm is as follows (Sapp et al., 2014):

  1. Partition the \(n\) observations into \(J\) disjoint subsets or folds. Apply an estimation algorithm, \(\psi_j\), to each subset \(j\). For example, a simple linear regression model might be fit to each of the disjoint subsets. This results in \(J\) subset-specific estimators for each parameter in the model..
  2. Combine the \(J\) subset-specific estimators using \(k\)-fold cross validation.

Using subsemble estimation to fit spatial models could provide advantages in two areas:

  1. Computational complexity. Estimating the spatial structure parameters (range, sill) requires inverting the \(n \times n\) distance matrix. Sparsity techniques can help reduce the size of the problem, but this can still be difficult. Additionally, using disjoint subsets of the data allows the model parameters to be estimated using parallel computing, which drastically decreases computational time.
  2. Second, the effect of extreme values can be minimized, since these extremes will not appear in all subsets. This also applies to the parameter estimates. Spatial parameters like the range and sill are prone to overestimation and severe skew. If representative spatial samples are used, then this method would also provide some information about the variability of the spatial parameters in the model.

Subsampling

So, how best to generate such subsamples?

The simplest approach: simple random sampling may result in samples that:

  1. Do not preserve the spatial structure,
  2. Over-emphasize the spatial structure,
  3. Do not include enough data points to reliably estimate the spatial structure.

Spatially balanced samples

An alternative approach, spatially balanced sampling, was selected as a possible alternative to simple random sampling. Spatially balanced sampling (Grafström, 2012) uses the simple pivotal method (Deville and Tille, 1998) to build a negative correlation between the inclusion indicators for observations that are close in distance.

Spatially balanced sampling algorithm:

Let \(U\) be a spatial population of \(N\) points. Each point \(i\) has a known location and an inclusion probability of being selected for the sample of \(0 < \pi_i \le 1\). Let \(n\) be the desired sample size.

\[n = \sum_{i \in U} \pi_i\]

  1. Randomly choose a data point \(i\).
  2. Choose unit \(j\), a nearest neighbor to \(i\). If two or more units have the same distance to \(i\), then randomly choose between them with equal probability.
  3. If \(j\) has \(i\) as its nearest neighbor, update the inclusion probabilities as follows below. Otherwise, repeat.
  4. If all data points have been addressed, stop.

Inclusion rule: Let \(\pi_i\) and \(\pi_j\) represent the inclusion probabilities of two points, \(i\) and \(j\).

  • If \(\pi_i + \pi_j <1\),

\[ (\pi_{i}',\pi_{j}')=\begin{cases} (0,\pi_{i}+\pi_{j}) & with\ probability\ \frac{\pi_{j}}{\pi_{i}+\pi_{j}}\\ (\pi_{i}+\pi_{j},0) & with\ probability\ \frac{\pi_{i}}{\pi_{i}+\pi_{j}} \end{cases} \]

  • If \(pi_i + \pi_j \ge 1\)

\[ (\pi_{i}',\pi_{j}')=\begin{cases} (1,\pi_{i}+\pi_{j}-1) & with\ probability\ \frac{1-\pi_{j}}{2-(\pi_{i}+\pi_{j})}\\ (\pi_{i}+\pi_{j}-1,1) & with\ probability\ \frac{1-\pi_{i}}{2-(\pi_{i}+\pi_{j})} \end{cases} \]

Spatially balanced sampling is more computationally intensive (can take several hours for \(N > 100,000\)), however estimates obtained using spatially balanced samples may be more accurate than those using simple random samples for spatial data.

However, spatially balanced sampling has the disadvantage that it is not a true subsemble sample. There is no guarantee that repeated spatially balanced samples will lead to disjoint partitions, in fact, it’s very likely that they will not. This may affect the true performance of the technique, since many of the previously proved results regarding the accuracy of subsemble estimation may not apply for overlapping samples.

Questions

There are two major questions that will be addressed in this work:

Question 1: Does subsemble-style estimation return accurate estimates of the spatial structure parameters and covariate coefficients, when samples are not disjoint?

Question 2: Does the accuracy of the parameter estimates depend on the:

  1. Sampling technique considered?
  2. Distribution of the response variable?
  3. Magnitude of the spatial parameters?

Experimental design

Data was simulated using six combinations of spatial parameters, two response distributions (Normal and Poisson), and two sampling schemes (simple random sampling and spatially balanced sampling), for a total of 24 simulation conditions.

Two covariates were used to generate the response variable, one assumed to follow a \(Uniform(0, 1)\) distribution and one assumed to follow a \(Bernoulli(p=0.5)\) distribution.

\[X_1 \sim Uniform(0, 1)\]

\[X_2 \sim Bernoulli(p=0.5)\]

\[\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0.5 + 0.25 X_1 + 0.5 X_2\]

The spatial structure was modeled following an Exponential structure with spatial sill 1 or 2, and range 5, 10, or 20. An additional random error term was assumed to follow a \(Normal(0, 1)\) distribution. Using this specification, a response variable was simulated to follow either a normal or Poisson distribution.

Normal response

For a normal response model:

\[Y = Normal(\mu, \sigma^2=1) + z_i\]

where \(z_i\) is the spatial error term.

Poisson response

For a Poisson data:

\[\eta = \mu + z_i\]

\[Y = Poisson(ln(\eta))\]

where \(z_i\) is again the spatial error term.

This design structure was intended to result in two scenarios: a weak-structure, but norrmally distributed scenario; and a count-valued, highly right-skewed distribution.

Results: Normal response

Distribution of parameter estimates

95% Wald CI

Nominal coverage for 95% Wald confidence intervals:

Simulation Sampling True Sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 0.41 0.83 0.83 0.09 0.41 0.71
2 SRS 1 5 0.31 0.82 0.86 0.1 0.5 0.47
3 SBS 1 10 0.18 0.85 0.88 0.16 0.22 0.39
4 SRS 1 10 0.18 0.85 0.89 0.36 0.24 0.58
5 SBS 1 20 0.12 0.82 0.8 0.37 0.15 0.31
6 SRS 1 20 0.13 0.8 0.85 0.26 0.19 0.41
7 SBS 2 5 0.29 0.84 0.85 0.44 0.35 0.72
8 SRS 2 5 0.37 0.85 0.79 0.47 0.49 0.66
9 SBS 2 10 0.18 0.87 0.86 0.35 0.25 0.47
10 SRS 2 10 0.15 0.92 0.83 0.4 0.26 0.49
11 SBS 2 20 0.11 0.78 0.9 0.3 0.13 0.29
12 SRS 2 20 0.08 0.88 0.85 0.15 0.1 0.32

95% Wald CI with 5% trim

Nominal coverage for 95% Wald confidence intervals using 5% trimming:

Simulation Sampling True Sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 0.36 0.77 0.77 0.09 0.31 0.36
2 SRS 1 5 0.3 0.78 0.83 0.1 0.35 0.45
3 SBS 1 10 0.16 0.82 0.84 0.16 0.16 0.2
4 SRS 1 10 0.18 0.81 0.87 0.35 0.19 0.34
5 SBS 1 20 0.12 0.77 0.76 0.36 0.13 0.17
6 SRS 1 20 0.13 0.77 0.83 0.23 0.15 0.22
7 SBS 2 5 0.27 0.8 0.82 0.39 0.28 0.34
8 SRS 2 5 0.33 0.77 0.72 0.45 0.31 0.43
9 SBS 2 10 0.17 0.84 0.78 0.3 0.18 0.28
10 SRS 2 10 0.13 0.86 0.82 0.38 0.23 0.35
11 SBS 2 20 0.1 0.73 0.86 0.29 0.11 0.23
12 SRS 2 20 0.08 0.83 0.82 0.15 0.09 0.23

95% Wald CI with 10% trim

Nominal coverage for 95% Wald confidence intervals using 10% trimming:

Simulation Sampling True Sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 0.46 0.88 0.9 0.16 0.32 0.56
2 SRS 1 5 0.36 0.87 0.91 0.15 0.36 0.67
3 SBS 1 10 0.23 0.9 0.92 0.17 0.17 0.25
4 SRS 1 10 0.2 0.89 0.92 0.44 0.24 0.37
5 SBS 1 20 0.12 0.9 0.86 0.47 0.15 0.21
6 SRS 1 20 0.15 0.88 0.92 0.26 0.19 0.19
7 SBS 2 5 0.32 0.9 0.91 0.43 0.37 0.44
8 SRS 2 5 0.41 0.89 0.86 0.46 0.37 0.49
9 SBS 2 10 0.17 0.9 0.89 0.33 0.2 0.26
10 SRS 2 10 0.16 0.96 0.86 0.42 0.25 0.31
11 SBS 2 20 0.11 0.86 0.9 0.31 0.12 0.26
12 SRS 2 20 0.08 0.92 0.9 0.17 0.1 0.29

Quantile intervals

Nominal coverage for 90% percentile intervals:

Simulation Sampling True sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 1 1 1 1 1 1
2 SRS 1 5 1 1 1 1 1 1
3 SBS 1 10 0.97 1 1 1 0.92 0.99
4 SRS 1 10 0.99 1 1 1 0.99 1
5 SBS 1 20 0.76 1 1 0.96 0.76 0.96
6 SRS 1 20 0.85 1 1 0.96 0.81 1
7 SBS 2 5 1 1 1 1 1 1
8 SRS 2 5 1 1 1 1 1 1
9 SBS 2 10 0.96 1 1 0.99 0.94 1
10 SRS 2 10 0.99 1 1 1 0.96 0.99
11 SBS 2 20 0.66 1 1 0.94 0.69 0.96
12 SRS 2 20 0.63 1 1 0.93 0.74 0.98

Results: Poisson response

Distribution of parameter estimates

95% Wald CI

Nominal coverage for 95% Wald confidence intervals:

Simulation Sampling True Sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 0.38 0.9 0.84 0.2 0.19 0.35
2 SRS 1 5 0.31 0.91 0.84 0.24 0.36 0.54
3 SBS 1 10 0.22 0.87 0.83 0.12 0.22 0.3
4 SRS 1 10 0.16 0.86 0.86 0.1 0.2 0.39
5 SBS 1 20 0.1 0.79 0.85 0.03 0.15 0.3
6 SRS 1 20 0.11 0.85 0.87 0.06 0.09 0.34
7 SBS 2 5 0.27 0.84 0.8 0.12 0.26 0.38
8 SRS 2 5 0.3 0.86 0.89 0.24 0.34 0.56
9 SBS 2 10 0.12 0.81 0.79 0.07 0.13 0.18
10 SRS 2 10 0.19 0.96 0.83 0.18 0.16 0.31
11 SBS 2 20 0.08 0.86 0.91 0.03 0.07 0.31
12 SRS 2 20 0.06 0.84 0.89 0.06 0.11 0.34

95% Wald CI with 5% trim

Nominal coverage for 95% Wald confidence intervals using 5% trimming:

Simulation Sampling True Sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 0.36 0.82 0.86 0.16 0.16 0.29
2 SRS 1 5 0.3 0.78 0.83 0.1 0.35 0.45
3 SBS 1 10 0.19 0.83 0.8 0.12 0.21 0.28
4 SRS 1 10 0.16 0.83 0.82 0.08 0.17 0.33
5 SBS 1 20 0.1 0.76 0.83 0.03 0.11 0.29
6 SRS 1 20 0.09 0.84 0.83 0.06 0.07 0.29
7 SBS 2 5 0.25 0.78 0.8 0.11 0.25 0.31
8 SRS 2 5 0.25 0.85 0.85 0.23 0.33 0.54
9 SBS 2 10 0.12 0.77 0.76 0.07 0.07 0.15
10 SRS 2 10 0.16 0.91 0.8 0.28 0.16 0.29
11 SBS 2 20 0.07 0.84 0.86 0.01 0.06 0.26
12 SRS 2 20 0.07 0.83 0.85 0.04 0.1 0.31

95% Wald CI with 10% trim

Nominal coverage for 95% Wald confidence intervals using 10% trimming:

Simulation Sampling True Sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 0.43 0.94 0.87 0.21 0.2 0.33
2 SRS 1 5 0.36 0.87 0.91 0.15 0.36 0.67
3 SBS 1 10 0.23 0.91 0.87 0.15 0.24 0.34
4 SRS 1 10 0.19 0.93 0.92 0.11 0.2 0.35
5 SBS 1 20 0.12 0.86 0.86 0.07 0.12 0.39
6 SRS 1 20 0.12 0.9 0.9 0.08 0.11 0.36
7 SBS 2 5 0.33 0.84 0.89 0.15 0.29 0.41
8 SRS 2 5 0.29 0.91 0.92 0.28 0.33 0.57
9 SBS 2 10 0.15 0.86 0.82 0.08 0.09 0.2
10 SRS 2 10 0.21 0.95 0.91 0.22 0.22 0.34
11 SBS 2 20 0.09 0.89 0.93 0.02 0.08 0.28
12 SRS 2 20 0.08 0.9 0.92 0.05 0.1 0.35

Quantile intervals

Nominal coverage for 90% percentile intervals:

Simulation Sampling True sill True Range Beta_0 Beta_1 Beta_2 Sill Range Range_Sill
1 SBS 1 5 1 1 1 0.98 0.98 1
2 SRS 1 5 1 1 1 1 1 1
3 SBS 1 10 0.96 1 1 0.87 0.88 1
4 SRS 1 10 1 1 1 1 0.05 0.32
5 SBS 1 20 0.71 1 1 0.6 0.73 1
6 SRS 1 20 0.77 1 1 0.68 0.71 1
7 SBS 2 5 1 1 1 0.98 0.99 1
8 SRS 2 5 0.98 1 1 0.98 0.99 1
9 SBS 2 10 0.9 1 1 0.79 0.82 0.99
10 SRS 2 10 0.89 1 1 0.83 0.92 1
11 SBS 2 20 0.7 1 1 0.45 0.63 1
12 SRS 2 20 0.62 1 1 0.64 0.7 0.99

Discussion and limitations

  • Approximate confidence intervals do a really poor job estimating spatial structure using subsampling.
  • When spatial structure in the data is weak and the response variable is normally distributed, spatially balanced sampling leads to inflated estimates of the regression coefficients (\(\beta_0, \beta_1, \beta_2\)) and the spatial sill. The spatial range is accurately estimated even under weak structure.
  • When spatial structure in the data is weak and the response variable is Poisson distributed, spatially balanced sampling yields more accurate results for the regression coefficients, but underestimates the spatial structure.
  • For higher range values, the regression coefficients are better estimated using both spatially balanced sampling and simple random sampling.
  • Parameter estimates obtained using spatially balanced sampling are more variable than those from simple random sampling.
  • Quantile-based intervals provide the best coverage for the spatial parameters, but are overly conservative for the covariate parameters.

Limitations

Overall, both methods showed poor coverage for the parameters using mean weighting, inverse MSE weighting. Intervals built using trimmed estimates performed slightly better, but still failed to achieve the target confidence level.

A major assumption of subsemble estimation is that the subsamples are disjoint, which spatially balanced sampling fails to achieve. One possible technique for accomplishing this might be to “stratify” the spatial random field, and select observations within each strata to allocate to each of the subsamples.

Citations

Acknowledgements

This work was supported by the Dr. George F. Haddix President’s Faculty Research Fund at Creighton University.

Aimee Schwab-McCoy, Ph.D; Mark May
Creighton University
Department of Mathematics

Conference on Statistical Practice
February 22, 2020