Package 'SeBR'

Title: Semiparametric Bayesian Regression Analysis
Description: Monte Carlo sampling algorithms for semiparametric Bayesian regression analysis. These models feature a nonparametric (unknown) transformation of the data paired with widely-used regression models including linear regression, spline regression, quantile regression, and Gaussian processes. The transformation enables broader applicability of these key models, including for real-valued, positive, and compactly-supported data with challenging distributional features. The samplers prioritize computational scalability and, for most cases, Monte Carlo (not MCMC) sampling for greater efficiency. Details of the methods and algorithms are provided in Kowal and Wu (2024) <doi:10.1080/01621459.2024.2395586>.
Authors: Dan Kowal [aut, cre, cph]
Maintainer: Dan Kowal <[email protected]>
License: MIT + file LICENSE
Version: 1.0.0
Built: 2025-03-10 21:27:36 UTC
Source: https://github.com/drkowal/sebr

Help Index


Bayesian bootstrap posterior sampler for the CDF

Description

Compute one Monte Carlo draw from the Bayesian bootstrap (BB) posterior distribution of the cumulative distribution function (CDF).

Usage

bb(y)

Arguments

y

the data from which to infer the CDF (preferably sorted)

Details

Assuming the data y are iid from an unknown distribution, the Bayesian bootstrap (BB) is a nonparametric model for this distribution. The BB is a limiting case of a Dirichlet process prior (without any hyperparameters) that admits direct Monte Carlo (not MCMC) sampling.

This function computes one draw from the BB posterior distribution for the CDF Fy.

Value

a function that can evaluate the sampled CDF at any argument(s)

Note

This code is inspired by ggdist::weighted_ecdf.

Examples

# Simulate data:
y = rnorm(n = 100)

# One draw from the BB posterior:
Fy = bb(y)

class(Fy) # this is a function
Fy(0) # some example use (for this one draw)
Fy(c(.5, 1.2))

# Plot several draws from the BB posterior distribution:
ys = seq(-3, 3, length.out=1000)
plot(ys, ys, type='n', ylim = c(0,1),
     main = 'Draws from BB posterior', xlab = 'y', ylab = 'F(y)')
for(s in 1:50) lines(ys, bb(y)(ys), col='gray')

# Add ECDF for reference:
lines(ys, ecdf(y)(ys), lty=2)

Bayesian Gaussian processes with a Box-Cox transformation

Description

MCMC sampling for Bayesian Gaussian process regression with a (known or unknown) Box-Cox transformation.

Usage

bgp_bc(
  y,
  locs,
  X = NULL,
  covfun_name = "matern_isotropic",
  locs_test = locs,
  X_test = NULL,
  nn = 30,
  emp_bayes = TRUE,
  lambda = NULL,
  sample_lambda = TRUE,
  nsave = 1000,
  nburn = 1000,
  nskip = 0
)

Arguments

y

n x 1 response vector

locs

n x d matrix of locations

X

n x p design matrix; if unspecified, use intercept only

covfun_name

string name of a covariance function; see ?GpGp

locs_test

n_test x d matrix of locations at which predictions are needed; default is locs

X_test

n_test x p design matrix for test data; default is X

nn

number of nearest neighbors to use; default is 30 (larger values improve the approximation but increase computing cost)

emp_bayes

logical; if TRUE, use a (faster!) empirical Bayes approach for estimating the mean function

lambda

Box-Cox transformation; if NULL, estimate this parameter

sample_lambda

logical; if TRUE, sample lambda, otherwise use the fixed value of lambda above or the MLE (if lambda unspecified)

nsave

number of MCMC iterations to save

nburn

number of MCMC iterations to discard

nskip

number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw

Details

This function provides Bayesian inference for transformed Gaussian processes. The transformation is parametric from the Box-Cox family, which has one parameter lambda. That parameter may be fixed in advance or learned from the data. For computational efficiency, the Gaussian process parameters are fixed at point estimates, and the latent Gaussian process is only sampled when emp_bayes = FALSE.

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the posterior predictive mean at the test points locs_test

  • fit_gp the fitted GpGp_fit object, which includes covariance parameter estimates and other model information

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at locs_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • post_lambda nsave posterior samples of lambda

  • model: the model fit (here, bgp_bc)

as well as the arguments passed in.

Note

Box-Cox transformations may be useful in some cases, but in general we recommend the nonparametric transformation (with Monte Carlo, not MCMC sampling) in sbgp.

Examples

# Simulate some data:
n = 200 # sample size
x = seq(0, 1, length = n) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Package we use for fast computing w/ Gaussian processes:
library(GpGp)

# Fit a Bayesian Gaussian process with Box-Cox transformation:
fit = bgp_bc(y = y, locs = x)
names(fit) # what is returned
coef(fit) # estimated regression coefficients (here, just an intercept)
class(fit$fit_gp) # the GpGp object is also returned
round(quantile(fit$post_lambda), 3) # summary of unknown Box-Cox parameter

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y,y),
     xlab = 'x', ylab = 'y', main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)),c(pi_y[,2], rev(pi_y[,1])),col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)

Bayesian linear model with a Box-Cox transformation

Description

MCMC sampling for Bayesian linear regression with a (known or unknown) Box-Cox transformation. A g-prior is assumed for the regression coefficients.

Usage

blm_bc(
  y,
  X,
  X_test = X,
  psi = length(y),
  lambda = NULL,
  sample_lambda = TRUE,
  nsave = 1000,
  nburn = 1000,
  nskip = 0,
  verbose = TRUE
)

Arguments

y

n x 1 response vector

X

n x p matrix of predictors

X_test

n_test x p matrix of predictors for test data; default is the observed covariates X

psi

prior variance (g-prior)

lambda

Box-Cox transformation; if NULL, estimate this parameter

sample_lambda

logical; if TRUE, sample lambda, otherwise use the fixed value of lambda above or the MLE (if lambda unspecified)

nsave

number of MCMC iterations to save

nburn

number of MCMC iterations to discard

nskip

number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw

verbose

logical; if TRUE, print time remaining

Details

This function provides fully Bayesian inference for a transformed linear model via MCMC sampling. The transformation is parametric from the Box-Cox family, which has one parameter lambda. That parameter may be fixed in advance or learned from the data.

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the posterior predictive mean at the test points X_test

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • post_lambda nsave posterior samples of lambda

  • post_sigma nsave posterior samples of sigma

  • model: the model fit (here, blm_bc)

as well as the arguments passed in.

Note

Box-Cox transformations may be useful in some cases, but in general we recommend the nonparametric transformation (with Monte Carlo, not MCMC sampling) in sblm.

Examples

# Simulate some data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'step')
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

hist(y, breaks = 25) # marginal distribution

# Fit the Bayesian linear model with a Box-Cox transformation:
fit = blm_bc(y = y, X = X, X_test = X_test)
names(fit) # what is returned
round(quantile(fit$post_lambda), 3) # summary of unknown Box-Cox parameter

Bayesian quantile regression

Description

MCMC sampling for Bayesian quantile regression. An asymmetric Laplace distribution is assumed for the errors, so the regression model targets the specified quantile. A g-prior is assumed for the regression coefficients.

Usage

bqr(
  y,
  X,
  tau = 0.5,
  X_test = X,
  psi = length(y),
  nsave = 1000,
  nburn = 1000,
  nskip = 0,
  verbose = TRUE
)

Arguments

y

n x 1 response vector

X

n x p matrix of predictors

tau

the target quantile (between zero and one)

X_test

n_test x p matrix of predictors for test data; default is the observed covariates X

psi

prior variance (g-prior)

nsave

number of MCMC iterations to save

nburn

number of MCMC iterations to discard

nskip

number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw

verbose

logical; if TRUE, print time remaining

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the estimated tau-th quantile at test points X_test

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test

  • model: the model fit (here, bqr)

as well as the arguments passed in.

Note

The asymmetric Laplace distribution is advantageous because it links the regression model (X%*%theta) to a pre-specified quantile (tau). However, it is often a poor model for observed data, and the semiparametric version sbqr is recommended in general.

Examples

# Simulate some heteroskedastic data (no transformation):
dat = simulate_tlm(n = 100, p = 5, g_type = 'box-cox', heterosked = TRUE, lambda = 1)
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

# Target this quantile:
tau = 0.05

# Fit the Bayesian quantile regression model:
fit = bqr(y = y, X = X, tau = tau, X_test = X_test)
names(fit) # what is returned

# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
     xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
  lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
        col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0),  # ECDF of testing data
     col='black', type = 's', lwd = 3)

# The posterior predictive checks usually do not pass!
# try ?sbqr instead...

Bayesian spline model with a Box-Cox transformation

Description

MCMC sampling for Bayesian spline regression with a (known or unknown) Box-Cox transformation.

Usage

bsm_bc(
  y,
  x = NULL,
  x_test = NULL,
  psi = NULL,
  lambda = NULL,
  sample_lambda = TRUE,
  nsave = 1000,
  nburn = 1000,
  nskip = 0,
  verbose = TRUE
)

Arguments

y

n x 1 response vector

x

n x 1 vector of observation points; if NULL, assume equally-spaced on [0,1]

x_test

n_test x 1 vector of testing points; if NULL, assume equal to x

psi

prior variance (inverse smoothing parameter); if NULL, sample this parameter

lambda

Box-Cox transformation; if NULL, estimate this parameter

sample_lambda

logical; if TRUE, sample lambda, otherwise use the fixed value of lambda above or the MLE (if lambda unspecified)

nsave

number of MCMC iterations to save

nburn

number of MCMC iterations to discard

nskip

number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw

verbose

logical; if TRUE, print time remaining

Details

This function provides fully Bayesian inference for a transformed spline model via MCMC sampling. The transformation is parametric from the Box-Cox family, which has one parameter lambda. That parameter may be fixed in advance or learned from the data.

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the posterior predictive mean at the test points x_test

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at x_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • post_lambda nsave posterior samples of lambda

  • model: the model fit (here, bsm_bc)

as well as the arguments passed in.

Note

Box-Cox transformations may be useful in some cases, but in general we recommend the nonparametric transformation (with Monte Carlo, not MCMC sampling) in sbsm.

Examples

# Simulate some data:
n = 100 # sample size
x = sort(runif(n)) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Fit the Bayesian spline model with a Box-Cox transformation:
fit = bsm_bc(y = y, x = x)
names(fit) # what is returned
round(quantile(fit$post_lambda), 3) # summary of unknown Box-Cox parameter

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y,y),
     xlab = 'x', ylab = 'y', main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)),c(pi_y[,2], rev(pi_y[,1])),col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)

Estimate the remaining time in the MCMC based on previous samples

Description

Estimate the remaining time in the MCMC based on previous samples

Usage

computeTimeRemaining(nsi, timer0, nsims, nrep = 1000)

Arguments

nsi

Current iteration

timer0

Initial timer value, returned from proc.time()[3]

nsims

Total number of simulations

nrep

Print the estimated time remaining every nrep iterations

Value

A printout of the estimated time remaining, updated every nrep iterations
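
Examples

A minimal sketch of typical usage inside a sampling loop; the loop body is hypothetical:

# Set up before the loop:
nsims = 5000 # total number of simulations
timer0 = proc.time()[3] # initial timer value
for(nsi in 1:nsims){
  # ... one sampling iteration would go here ...
  computeTimeRemaining(nsi, timer0, nsims, nrep = 1000) # prints every 1000 iterations
}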


Posterior sampling algorithm for the HBB concentration hyperparameters

Description

Compute Monte Carlo draws from the (marginal) posterior distribution of the concentration hyperparameters of the hierarchical Bayesian bootstrap (hbb). The HBB is a nonparametric model for group-specific distributions; each group has a concentration parameter, where larger values encourage more shrinkage toward a common distribution.

Usage

concen_hbb(
  groups,
  shape_alphas = NULL,
  rate_alphas = NULL,
  nsave = 1000,
  ngrid = 500
)

Arguments

groups

the group assignments in the observed data

shape_alphas

(optional) shape parameter for the Gamma prior

rate_alphas

(optional) rate parameter for the Gamma prior

nsave

(optional) number of Monte Carlo simulations

ngrid

(optional) number of grid points

Details

The concentration hyperparameters are assigned independent Gamma(shape_alphas, rate_alphas) priors. This function uses a grid approximation to the marginal posterior with the goal of producing a simple algorithm. Because this is a *marginal* posterior sampler, it can be used with the hbb sampler (which conditions on alphas) to provide a joint Monte Carlo (not MCMC) sampling algorithm for the concentration hyperparameters, the group-specific CDFs, and the common CDF. Note that diffuse priors on alphas tend to put posterior mass on large values, which leads to more aggressive shrinkage toward the common distribution (complete pooling). For moderate shrinkage, we use the default values shape_alphas = 30*K and rate_alphas = 1, where K is the number of groups.

Value

nsave x K samples of the concentration hyperparameters corresponding to the K groups

References

Oganisian et al. (https://doi.org/10.1515/ijb-2022-0051)

Examples

# Dimensions:
n = 500 # number of observations
K = 3 # number of groups

# Assign groups w/ unequal probabilities:
ugroups = paste('g', 1:K, sep='') # groups
groups = sample(ugroups,
                size = n,
                replace = TRUE,
                prob = 1:K) # unequally weighted (unnormalized)

# Summarize:
table(groups)/n

# Marginal posterior sampling for alpha:
post_alpha = concen_hbb(groups)

# Summarize: posterior distributions
for(c in 1:K) {
  hist(post_alpha[,c],
       main = paste("Concentration parameter: group", ugroups[c]),
       xlim = range(post_alpha))
  abline(v = mean(post_alpha[,c]), lwd=3) # posterior mean
}
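
# A sketch with a more diffuse Gamma prior (the shape_alphas and rate_alphas
# values here are illustrative): diffuse priors tend to favor larger
# concentrations, i.e., more shrinkage toward the common distribution
post_alpha_diffuse = concen_hbb(groups,
                                shape_alphas = 1,
                                rate_alphas = 0.01)
colMeans(post_alpha_diffuse) # compare to colMeans(post_alpha)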

Grid contraction

Description

Contract the grid by removing points at which the function evaluations pass the lower or upper thresholds; the corresponding z values are dropped. Optionally, points are added back to restore (approximately) the original grid length.

Usage

contract_grid(z, Fz, lower, upper, add_back = TRUE, monotone = TRUE)

Arguments

z

grid points (ordered)

Fz

function evaluated at those grid points

lower

lower threshold at which to check Fz

upper

upper threshold at which to check Fz

add_back

logical; if TRUE, expand the grid to (about) the original size

monotone

logical; if TRUE, enforce monotonicity on the expanded grid

Value

a list containing the grid points z and the (interpolated) function Fz at those points
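
Examples

A minimal sketch, assuming this (internal) grid utility is accessible; the grid, CDF values, and thresholds are illustrative, and the returned element names are assumed from the Value description:

z = seq(-10, 10, length.out = 200) # ordered grid, wider than needed
Fz = pnorm(z) # function (here, a CDF) evaluated at the grid points
out = contract_grid(z, Fz, lower = 0.001, upper = 0.999)
range(out$Fz) # evaluations now within (about) the thresholds
length(out$z) # roughly the original length (add_back = TRUE)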


Compute the latent data CDF

Description

Assuming a Gaussian latent data distribution (given x), compute the CDF on a grid of points

Usage

Fz_fun(z, weights = NULL, mean_vec = NULL, sd_vec)

Arguments

z

vector of points at which the CDF of z is evaluated

weights

n-dimensional vector of weights; if NULL, assume 1/n

mean_vec

n-dimensional vector of means; if NULL, assume mean zero

sd_vec

n-dimensional vector of standard deviations

Value

CDF of z evaluated at z
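
Examples

A minimal sketch, assuming this (internal) helper is accessible and returns the vector of CDF evaluations (per the Value above); the means and standard deviations are illustrative:

n = 10
mean_vec = rnorm(n) # latent means
sd_vec = rep(1, n) # latent standard deviations
z = seq(-4, 4, length.out = 100) # evaluation points
Fz_eval = Fz_fun(z, mean_vec = mean_vec, sd_vec = sd_vec) # weights default to 1/n
plot(z, Fz_eval, type = 'l', xlab = 'z', ylab = 'F_z(z)') # a smooth mixture CDF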


Box-Cox transformation

Description

Evaluate the Box-Cox transformation, a scaled power transformation that preserves continuity in the index lambda at zero. Negative arguments are permitted.

Usage

g_bc(t, lambda)

Arguments

t

argument(s) at which to evaluate the function

lambda

Box-Cox parameter

Value

The evaluation(s) of the Box-Cox function at the given input(s) t.

Note

Special cases include the identity transformation (lambda = 1), the square-root transformation (lambda = 1/2), and the log transformation (lambda = 0).

Examples

# Log-transformation:
g_bc(1:5, lambda = 0); log(1:5)

# Square-root transformation: note the shift and scaling
g_bc(1:5, lambda = 1/2); sqrt(1:5)

Compute the transformation

Description

Given the CDFs of z and y, compute a smoothed function to evaluate the transformation

Usage

g_fun(y, Fy_eval, z, Fz_eval)

Arguments

y

vector of points at which the CDF of y is evaluated

Fy_eval

CDF of y evaluated at y

z

vector of points at which the CDF of z is evaluated

Fz_eval

CDF of z evaluated at z

Value

A smooth monotone function which can be used for evaluations of the transformation.
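
Examples

A minimal sketch, assuming this (internal) helper is accessible; for lognormal data with a standard Gaussian latent CDF, the recovered transformation should resemble the log:

y = sort(rlnorm(100)) # positive data
Fy_eval = ecdf(y)(y) # CDF of y evaluated at y
z = seq(-3, 3, length.out = 100) # latent grid
Fz_eval = pnorm(z) # standard Gaussian latent CDF
g = g_fun(y, Fy_eval, z, Fz_eval)
plot(y, g(y), type = 'l', xlab = 'y', ylab = 'g(y)') # approximately log-shaped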


Approximate inverse transformation

Description

Compute the inverse function of a transformation g based on a grid search.

Usage

g_inv_approx(g, t_grid)

Arguments

g

the transformation function

t_grid

grid of arguments at which to evaluate the transformation function

Value

A function which can be used for evaluations of the (approximate) inverse transformation function.
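
Examples

A minimal sketch, assuming this (internal) helper is accessible; it inverts a Box-Cox transformation and checks a known value:

g = function(t) g_bc(t, lambda = 0.5) # square-root-type transformation
t_grid = seq(0.01, 10, length.out = 500) # grid covering the range of interest
g_inv = g_inv_approx(g, t_grid)
g_inv(g(2)) # should be approximately 2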


Inverse Box-Cox transformation

Description

Evaluate the inverse Box-Cox transformation. Negative values are permitted.

Usage

g_inv_bc(s, lambda)

Arguments

s

argument(s) at which to evaluate the function

lambda

Box-Cox parameter

Value

The evaluation(s) of the inverse Box-Cox function at the given input(s) s.

Note

Special cases include the identity transformation (lambda = 1), the square-root transformation (lambda = 1/2), and the log transformation (lambda = 0).

Examples

# (Inverse) log-transformation:
g_inv_bc(1:5, lambda = 0); exp(1:5)

# (Inverse) square-root transformation: note the shift and scaling
g_inv_bc(1:5, lambda = 1/2); (1:5)^2

Hierarchical Bayesian bootstrap posterior sampler

Description

Compute one Monte Carlo draw from the hierarchical Bayesian bootstrap (HBB) posterior distribution of the cumulative distribution function (CDF) for each group. The common (BB) and group-specific (HBB) weights are also returned.

Usage

hbb(
  y,
  groups,
  sample_alphas = FALSE,
  shape_alphas = NULL,
  rate_alphas = NULL,
  alphas = NULL,
  M = 30
)

Arguments

y

the data from which to infer the group-specific CDFs

groups

the group assignment for each element of y

sample_alphas

logical; if TRUE, sample the concentration hyperparameters from their marginal posterior distribution

shape_alphas

(optional) shape parameter for the Gamma prior on each alphas (if sampled)

rate_alphas

(optional) rate parameter for the Gamma prior on each alphas (if sampled)

alphas

(optional) vector of fixed concentration hyperparameters corresponding to the unique levels in groups (used when sample_alphas = FALSE)

M

a positive scaling term to set a default value of alphas when it is unspecified (alphas = NULL) and not sampled (sample_alphas = FALSE)

Details

Assuming the data y are independent with unknown, group-specific distributions, the hierarchical Bayesian bootstrap (HBB) from Oganisian et al. (https://doi.org/10.1515/ijb-2022-0051) is a nonparametric model for each distribution. The HBB includes hierarchical shrinkage across these groups toward a common distribution (the bb). The HBB admits direct Monte Carlo (not MCMC) sampling.

The shrinkage toward this common distribution is determined by the concentration hyperparameters alphas. Each component of alphas corresponds to one of the groups. Larger values encourage more shrinkage toward the common distribution, while smaller values allow more substantial deviations for that group.

When sample_alphas=TRUE, each component of alphas is sampled from its marginal posterior distribution, assuming independent Gamma(shape_alphas, rate_alphas) priors. This step uses a simple grid approximation to enable efficient sampling that preserves joint Monte Carlo sampling with the group-specific and common distributions. See concen_hbb for details. Note that diffuse priors on alphas tend to produce more aggressive shrinkage toward the common distribution (complete pooling). For moderate shrinkage, we use the default values shape_alphas = 30*K and rate_alphas = 1, where K is the number of groups.

When sample_alphas=FALSE, these concentration hyperparameters are fixed at user-specified values. That can be done by specifying alphas directly. Alternatively, if alphas is left unspecified (alphas = NULL), we adopt the default from Oganisian et al., which sets the cth entry to M*n/nc, where M is user-specified and nc is the number of observations in group c. For further guidance on the choice of M (illustrated in the sketch after this list):

  • M = 0.01/K approximates separate BB's by group (no pooling);

  • M between 10 and 100 gives moderate shrinkage (partial pooling); and

  • M = 100*max(nc) approximates a common BB (complete pooling).
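
A sketch of these settings, using y and groups as defined in the Examples below (the specific M values are illustrative):

K = length(unique(groups)) # number of groups
hbb(y, groups, M = 0.01/K)$alphas # approximately separate BBs (no pooling)
hbb(y, groups, M = 50)$alphas # moderate shrinkage (partial pooling)
hbb(y, groups, M = 100*max(table(groups)))$alphas # approximately a common BB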

Value

a list with the following elements:

  • Fyc: a list of functions where each entry corresponds to a group and that group-specific function can evaluate the sampled CDF at any argument(s)

  • weights_y: sampled weights from the common (BB) distribution (n-dimensional)

  • weights_yc: sampled weights from each of the K groups (K x n)

  • alphas: the (fixed or sampled) concentration hyperparameters

Note

If supplying alphas with distinct entries, make sure that the groups are ordered properly; these entries should match sort(unique(groups)).

References

Oganisian et al. (https://doi.org/10.1515/ijb-2022-0051)

Examples

# Sample size and number of groups:
n = 500
K = 3

# Define the groups, then assign:
ugroups = paste('g', 1:K, sep='') # groups
groups = sample(ugroups, n, replace = TRUE) # assignments

# Simulate the data: iid normal, then add group-specific features
y = rnorm(n = n) # data
for(g in ugroups)
  y[groups==g] = y[groups==g] + 3*rnorm(1) # group-specific

# One draw from the HBB posterior of the CDF:
samp_hbb = hbb(y, groups)

names(samp_hbb) # items returned
Fyc = samp_hbb$Fyc # list of CDFs
class(Fyc) # this is a list
class(Fyc[[1]]) # each element is a function

c = 1 # try: vary in 1:K
Fyc[[c]](0) # some example use (for this one draw)
Fyc[[c]](c(.5, 1.2))

# Plot several draws from the HBB posterior distribution:
ys = seq(min(y), max(y), length.out=1000)
plot(ys, ys, type='n', ylim = c(0,1),
     main = 'Draws from HBB posteriors', xlab = 'y', ylab = 'F_c(y)')
for(s in 1:50){ # some draws

  # BB CDF:
  Fy = bb(y)
  lines(ys, Fy(ys), lwd=3) # plot CDF

  # HBB:
  Fyc = hbb(y, groups)$Fyc

  # Plot CDFs by group:
  for(c in 1:K) lines(ys, Fyc[[c]](ys), col=c+1, lwd=3)
}

# For reference, add the ECDFs by group:
for(c in 1:K) lines(ys, ecdf(y[groups==ugroups[c]])(ys), lty=2)

legend('bottomright', c('BB', paste('HBB:', ugroups)), col = 1:(K+1), lwd=3)

Plot point and interval predictions on testing data

Description

Given posterior predictive samples at X_test, plot the point and interval estimates and compare to the actual testing data y_test.

Usage

plot_pptest(post_ypred, y_test, alpha_level = 0.1)

Arguments

post_ypred

nsave x n_test samples from the posterior predictive distribution at test points X_test

y_test

n_test testing points

alpha_level

alpha-level for prediction intervals

Value

plot of the testing data, point and interval predictions, and a summary of the empirical coverage

Examples

# Simulate some data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'step')

# Fit a semiparametric Bayesian linear model:
fit = sblm(y = dat$y, X = dat$X, X_test = dat$X_test)

# Evaluate posterior predictive means and intervals on the testing data:
plot_pptest(fit$post_ypred, dat$y_test,
            alpha_level = 0.10) # coverage should be about 90%

Rank-based estimation of the linear regression coefficients

Description

For a transformed Gaussian linear model, compute point estimates of the regression coefficients. This approach uses the ranks of the data and does not require the transformation, but must expand the sample size to n^2 and thus can be slow.

Usage

rank_approx(y, X)

Arguments

y

n x 1 response vector

X

n x p matrix of predictors (should not include an intercept!)

Value

the estimated linear coefficients

Examples

# Simulate some data:
dat = simulate_tlm(n = 200, p = 10, g_type = 'step')

# Point estimates for the linear coefficients:
theta_hat = suppressWarnings(
  rank_approx(y = dat$y,
              X = dat$X[,-1]) # remove intercept
) # warnings occur from glm.fit (fitted probabilities 0 or 1)

# Check: correlation with true coefficients
cor(dat$beta_true[-1], # excluding the intercept
    theta_hat)

Semiparametric Bayesian Gaussian processes

Description

Monte Carlo sampling for Bayesian Gaussian process regression with an unknown (nonparametric) transformation.

Usage

sbgp(
  y,
  locs,
  X = NULL,
  covfun_name = "matern_isotropic",
  locs_test = locs,
  X_test = NULL,
  nn = 30,
  emp_bayes = TRUE,
  approx_g = FALSE,
  nsave = 1000,
  ngrid = 100
)

Arguments

y

n x 1 response vector

locs

n x d matrix of locations

X

n x p design matrix; if unspecified, use intercept only

covfun_name

string name of a covariance function; see ?GpGp

locs_test

n_test x d matrix of locations at which predictions are needed; default is locs

X_test

n_test x p design matrix for test data; default is X

nn

number of nearest neighbors to use; default is 30 (larger values improve the approximation but increase computing cost)

emp_bayes

logical; if TRUE, use a (faster!) empirical Bayes approach for estimating the mean function

approx_g

logical; if TRUE, apply large-sample approximation for the transformation

nsave

number of Monte Carlo simulations

ngrid

number of grid points for inverse approximations

Details

This function provides Bayesian inference for a transformed Gaussian process model using Monte Carlo (not MCMC) sampling. The transformation is modeled as unknown and learned jointly with the regression function (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values). For computational efficiency, the Gaussian process parameters are fixed at point estimates, and the latent Gaussian process is only sampled when emp_bayes = FALSE. However, the uncertainty from this term is often negligible compared to the observation errors, and the transformation serves as an additional layer of robustness.

Value

a list with the following elements:

  • coefficients the estimated regression coefficients

  • fitted.values the posterior predictive mean at the test points locs_test

  • fit_gp the fitted GpGp_fit object, which includes covariance parameter estimates and other model information

  • post_ypred: nsave x ntest samples from the posterior predictive distribution at locs_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • model: the model fit (here, sbgp)

as well as the arguments passed in.

Examples

# Simulate some data:
n = 200 # sample size
x = seq(0, 1, length = n) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Package we use for fast computing w/ Gaussian processes:
library(GpGp)

# Fit the semiparametric Bayesian Gaussian process:
fit = sbgp(y = y, locs = x)
names(fit) # what is returned
coef(fit) # estimated regression coefficients (here, just an intercept)
class(fit$fit_gp) # the GpGp object is also returned

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y,y),
     xlab = 'x', ylab = 'y', main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)),c(pi_y[,2], rev(pi_y[,1])),col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)

Semiparametric Bayesian linear model

Description

Monte Carlo sampling for Bayesian linear regression with an unknown (nonparametric) transformation. A g-prior is assumed for the regression coefficients.

Usage

sblm(
  y,
  X,
  X_test = X,
  psi = length(y),
  laplace_approx = TRUE,
  approx_g = FALSE,
  nsave = 1000,
  ngrid = 100,
  verbose = TRUE
)

Arguments

y

n x 1 response vector

X

n x p matrix of predictors

X_test

n_test x p matrix of predictors for test data; default is the observed covariates X

psi

prior variance (g-prior)

laplace_approx

logical; if TRUE, use a normal approximation to the posterior in the definition of the transformation; otherwise the prior is used

approx_g

logical; if TRUE, apply large-sample approximation for the transformation

nsave

number of Monte Carlo simulations

ngrid

number of grid points for inverse approximations

verbose

logical; if TRUE, print time remaining

Details

This function provides fully Bayesian inference for a transformed linear model using Monte Carlo (not MCMC) sampling. The transformation is modeled as unknown and learned jointly with the regression coefficients (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values). The results are typically unchanged whether laplace_approx is TRUE/FALSE; setting it to TRUE may reduce sensitivity to the prior, while setting it to FALSE may speed up computations for very large datasets.

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the posterior predictive mean at the test points X_test

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • model: the model fit (here, sblm)

as well as the arguments passed in.

Examples

# Simulate some data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'step')
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

hist(y, breaks = 25) # marginal distribution

# Fit the semiparametric Bayesian linear model:
fit = sblm(y = y, X = X, X_test = X_test)
names(fit) # what is returned

# Note: this is Monte Carlo sampling, so no need for MCMC diagnostics!

# Evaluate posterior predictive means and intervals on the testing data:
plot_pptest(fit$post_ypred, y_test,
            alpha_level = 0.10) # coverage should be about 90%

# Check: correlation with true coefficients
cor(dat$beta_true[-1],
    coef(fit)[-1]) # excluding the intercept

# Summarize the transformation:
y0 = sort(unique(y)) # posterior draws of g are evaluated at the unique y observations
plot(y0, fit$post_g[1,], type='n', ylim = range(fit$post_g),
     xlab = 'y', ylab = 'g(y)', main = "Posterior draws of the transformation")
temp = sapply(1:nrow(fit$post_g), function(s)
  lines(y0, fit$post_g[s,], col='gray')) # posterior draws
lines(y0, colMeans(fit$post_g), lwd = 3) # posterior mean

# Add the true transformation, rescaled for easier comparisons:
lines(y,
      scale(dat$g_true)*sd(colMeans(fit$post_g)) + mean(colMeans(fit$post_g)), type='p', pch=2)
legend('bottomright', c('Truth'), pch = 2) # annotate the true transformation

# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
     xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
  lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
        col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0),  # ECDF of testing data
     col='black', type = 's', lwd = 3)

Semiparametric Bayesian quantile regression

Description

MCMC sampling for Bayesian quantile regression with an unknown (nonparametric) transformation. As in traditional Bayesian quantile regression, an asymmetric Laplace distribution is assumed for the errors, so the regression model targets the specified quantile. However, these models are often woefully inadequate for describing observed data. We introduce a nonparametric transformation to improve model adequacy while still providing inference for the regression coefficients and the specified quantile. A g-prior is assumed for the regression coefficients.

Usage

sbqr(
  y,
  X,
  tau = 0.5,
  X_test = X,
  psi = length(y),
  laplace_approx = TRUE,
  approx_g = FALSE,
  nsave = 1000,
  nburn = 100,
  ngrid = 100,
  verbose = TRUE
)

Arguments

y

n x 1 response vector

X

n x p matrix of predictors

tau

the target quantile (between zero and one)

X_test

n_test x p matrix of predictors for test data; default is the observed covariates X

psi

prior variance (g-prior)

laplace_approx

logical; if TRUE, use a normal approximation to the posterior in the definition of the transformation; otherwise the prior is used

approx_g

logical; if TRUE, apply large-sample approximation for the transformation

nsave

number of MCMC iterations to save

nburn

number of MCMC iterations to discard

ngrid

number of grid points for inverse approximations

verbose

logical; if TRUE, print time remaining

Details

This function provides fully Bayesian inference for a transformed quantile linear model. The transformation is modeled as unknown and learned jointly with the regression coefficients (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values). The results are typically unchanged whether laplace_approx is TRUE/FALSE; setting it to TRUE may reduce sensitivity to the prior, while setting it to FALSE may speed up computations for very large datasets.

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the estimated tau-th quantile at test points X_test

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test

  • post_qtau: nsave x n_test samples of the tauth conditional quantile at test points X_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • model: the model fit (here, sbqr)

as well as the arguments passed in.

Examples

# Simulate some heteroskedastic data (no transformation):
dat = simulate_tlm(n = 200, p = 10, g_type = 'box-cox', heterosked = TRUE, lambda = 1)
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

# Target this quantile:
tau = 0.05

# Fit the semiparametric Bayesian quantile regression model:
fit = sbqr(y = y, X = X, tau = tau, X_test = X_test)
names(fit) # what is returned

# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
     xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
  lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
        col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0),  # ECDF of testing data
     col='black', type = 's', lwd = 3)

Semiparametric Bayesian spline model

Description

Monte Carlo sampling for Bayesian spline regression with an unknown (nonparametric) transformation.

Usage

sbsm(
  y,
  x = NULL,
  x_test = NULL,
  psi = NULL,
  laplace_approx = TRUE,
  approx_g = FALSE,
  nsave = 1000,
  ngrid = 100,
  verbose = TRUE
)

Arguments

y

n x 1 response vector

x

n x 1 vector of observation points; if NULL, assume equally-spaced on [0,1]

x_test

n_test x 1 vector of testing points; if NULL, assume equal to x

psi

prior variance (inverse smoothing parameter); if NULL, sample this parameter

laplace_approx

logical; if TRUE, use a normal approximation to the posterior in the definition of the transformation; otherwise the prior is used

approx_g

logical; if TRUE, apply large-sample approximation for the transformation

nsave

number of Monte Carlo simulations

ngrid

number of grid points for inverse approximations

verbose

logical; if TRUE, print time remaining

Details

This function provides fully Bayesian inference for a transformed spline regression model using Monte Carlo (not MCMC) sampling. The transformation is modeled as unknown and learned jointly with the regression function (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values). The results are typically unchanged whether laplace_approx is TRUE/FALSE; setting it to TRUE may reduce sensitivity to the prior, while setting it to FALSE may speed up computations for very large datasets.

Value

a list with the following elements:

  • coefficients the posterior mean of the regression coefficients

  • fitted.values the posterior predictive mean at the test points x_test

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at x_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • model: the model fit (here, sbsm)

as well as the arguments passed in.

Examples

# Simulate some data:
n = 100 # sample size
x = sort(runif(n)) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Fit the semiparametric Bayesian spline model:
fit = sbsm(y = y, x = x)
names(fit) # what is returned

# Note: this is Monte Carlo sampling, so no need for MCMC diagnostics!

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y,y),
     xlab = 'x', ylab = 'y', main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)),c(pi_y[,2], rev(pi_y[,1])),col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)

Simulate a transformed linear model

Description

Generate training data (X, y) and testing data (X_test, y_test) for a transformed linear model. The covariates are correlated Gaussian variables. Half of the true regression coefficients are zero and the other half are one. There are multiple options for the transformation, which define the support of the data (see below).

Usage

simulate_tlm(
  n,
  p,
  g_type = "beta",
  n_test = 1000,
  heterosked = FALSE,
  lambda = 1
)

Arguments

n

number of observations in the training data

p

number of covariates

g_type

type of transformation; must be one of beta, step, or box-cox

n_test

number of observations in the testing data

heterosked

logical; if TRUE, simulate the latent data with heteroskedasticity

lambda

Box-Cox parameter (only applies for g_type = 'box-cox')

Details

The transformations vary in complexity and support for the observed data, and include the following options: beta yields marginally Beta(0.1, 0.5) data supported on [0,1]; step generates a locally-linear inverse transformation and produces positive data; and box-cox refers to the signed Box-Cox family indexed by lambda, which generates real-valued data with examples including identity, square-root, and log transformations.

Value

a list with the following elements:

  • y: the response variable in the training data

  • X: the covariates in the training data

  • y_test: the response variable in the testing data

  • X_test: the covariates in the testing data

  • beta_true: the true regression coefficients

  • g_true: the true transformation, evaluated at y

Examples

# Simulate data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'beta')
names(dat) # what is returned
hist(dat$y, breaks = 25) # marginal distribution
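
# The transformation type determines the support of y
# (a sketch using only documented arguments):
range(simulate_tlm(n = 100, p = 5, g_type = 'beta')$y) # in [0,1]
range(simulate_tlm(n = 100, p = 5, g_type = 'step')$y) # positive
range(simulate_tlm(n = 100, p = 5, g_type = 'box-cox', lambda = 1)$y) # real-valued (identity)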

Post-processing with importance sampling

Description

Given Monte Carlo draws from the surrogate posterior, apply sampling importance resampling (SIR) to correct for the true model likelihood.

Usage

sir_adjust(fit, sir_frac = 0.3, nsims_prior = 100, verbose = TRUE)

Arguments

fit

a fitted model object that includes

  • coefficients the posterior mean of the regression coefficients

  • post_theta: nsave x p samples from the posterior distribution of the regression coefficients

  • post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test

  • post_g: nsave posterior samples of the transformation evaluated at the unique y values

  • model: the model fit (sblm or sbsm)

sir_frac

fraction of draws to sample for SIR

nsims_prior

number of draws from the prior

verbose

logical; if TRUE, print time remaining

Details

The Monte Carlo sampling for sblm and sbsm uses a surrogate likelihood for posterior inference, which enables much faster and easier computing. SIR provides a correction for the actual (specified) likelihood. However, this correction step is quite slow and typically does not produce any noticeable discrepancies, even for small sample sizes.

Value

the fitted model object with the posterior draws subsampled based on the SIR adjustment

Note

SIR sampling is done WITHOUT replacement, so sir_frac is typically between 0.1 and 0.5. The nsims_prior draws are used to approximate a prior expectation, but larger values can significantly slow down this function.

Examples

# Simulate some data:
dat = simulate_tlm(n = 50, p = 5, g_type = 'step')
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

hist(y, breaks = 10) # marginal distribution

# Fit the semiparametric Bayesian linear model:
fit = sblm(y = y, X = X, X_test = X_test)
names(fit) # what is returned

# Update with SIR:
fit_sir = sir_adjust(fit)

# Prediction: unadjusted vs. adjusted?

# Point estimates:
y_hat = fitted(fit)
y_hat_sir = fitted(fit_sir)
cor(y_hat, y_hat_sir) # similar

# Interval estimates:
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
pi_y_sir = t(apply(fit_sir$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI

# PI overlap (%):
overlaps = 100*sapply(1:length(y_test), function(i){
  # innermost part
  (min(pi_y[i,2], pi_y_sir[i,2]) - max(pi_y[i,1], pi_y_sir[i,1]))/
    # outermost part
    (max(pi_y[i,2], pi_y_sir[i,2]) - min(pi_y[i,1], pi_y_sir[i,1]))
})
summary(overlaps) # mostly close to 100%

# Coverage of PIs on testing data (should be ~ 90%)
mean((pi_y[,1] <= y_test)*(pi_y[,2] >= y_test)) # unadjusted
mean((pi_y_sir[,1] <= y_test)*(pi_y_sir[,2] >= y_test)) # adjusted

# Plot together with testing data:
plot(y_test, y_test, type='n', ylim = range(pi_y, pi_y_sir, y_test),
     xlab = 'y_test', ylab = 'y_hat', main = paste('Prediction intervals: testing data'))
abline(0,1) # reference line
suppressWarnings(
  arrows(y_test, pi_y[,1], y_test, pi_y[,2],
         length=0.15, angle=90, code=3, col='gray', lwd=2)
) # plot the PIs (unadjusted)
suppressWarnings(
  arrows(y_test, pi_y_sir[,1], y_test, pi_y_sir[,2],
         length=0.15, angle=90, code=3, col='darkgray', lwd=2)
) # plot the PIs (adjusted)
lines(y_test, y_hat, type='p', pch=2) # plot the means (unadjusted)
lines(y_test, y_hat_sir, type='p', pch=3) # plot the means (adjusted)

Univariate Slice Sampler from Neal (2008)

Description

Compute a draw from a univariate distribution using the code provided by Radford M. Neal. The documentation below is also reproduced from Neal (2008).

Usage

uni.slice(x0, g, w = 1, m = Inf, lower = -Inf, upper = +Inf, gx0 = NULL)

Arguments

x0

Initial point

g

Function returning the log of the probability density (plus constant)

w

Size of the steps for creating interval (default 1)

m

Limit on steps (default infinite)

lower

Lower bound on support of the distribution (default -Inf)

upper

Upper bound on support of the distribution (default +Inf)

gx0

Value of g(x0), if known (default is not known)

Value

The point sampled, with its log density attached as an attribute.

Note

The log density function may return -Inf for points outside the support of the distribution. If a lower and/or upper bound is specified for the support, the log density function will not be called outside such limits.
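
Examples

A minimal sketch: one slice-sampling draw from a standard Gaussian, using its log density up to a constant:

log_dens = function(x) -x^2/2 # log of the N(0,1) density, up to a constant
x0 = 0.5 # initial point
x1 = uni.slice(x0, g = log_dens)
x1 # the sampled point
attributes(x1) # the log density is attached as an attribute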