Title: Semiparametric Bayesian Regression Analysis
Description: Monte Carlo sampling algorithms for semiparametric Bayesian regression analysis. These models feature a nonparametric (unknown) transformation of the data paired with widely-used regression models including linear regression, spline regression, quantile regression, and Gaussian processes. The transformation enables broader applicability of these key models, including for real-valued, positive, and compactly-supported data with challenging distributional features. The samplers prioritize computational scalability and, for most cases, Monte Carlo (not MCMC) sampling for greater efficiency. Details of the methods and algorithms are provided in Kowal and Wu (2024) <doi:10.1080/01621459.2024.2395586>.
Authors: Dan Kowal [aut, cre, cph]
Maintainer: Dan Kowal <[email protected]>
License: MIT + file LICENSE
Version: 1.0.0
Built: 2025-03-10 21:27:36 UTC
Source: https://github.com/drkowal/sebr
Compute one Monte Carlo draw from the Bayesian bootstrap (BB) posterior distribution of the cumulative distribution function (CDF).
bb(y)
y |
the data from which to infer the CDF (preferably sorted) |
Assuming the data y are iid from an unknown distribution, the Bayesian bootstrap (BB) is a nonparametric model for this distribution. The BB is a limiting case of a Dirichlet process prior (without any hyperparameters) that admits direct Monte Carlo (not MCMC) sampling. This function computes one draw from the BB posterior distribution for the CDF Fy.
a function that can evaluate the sampled CDF at any argument(s)
This code is inspired by ggdist::weighted_ecdf.
# Simulate data:
y = rnorm(n = 100)

# One draw from the BB posterior:
Fy = bb(y)

class(Fy) # this is a function
Fy(0) # some example use (for this one draw)
Fy(c(.5, 1.2))

# Plot several draws from the BB posterior distribution:
ys = seq(-3, 3, length.out=1000)
plot(ys, ys, type='n', ylim = c(0,1),
     main = 'Draws from BB posterior', xlab = 'y', ylab = 'F(y)')
for(s in 1:50) lines(ys, bb(y)(ys), col='gray')

# Add ECDF for reference:
lines(ys, ecdf(y)(ys), lty=2)
MCMC sampling for Bayesian Gaussian process regression with a (known or unknown) Box-Cox transformation.
bgp_bc( y, locs, X = NULL, covfun_name = "matern_isotropic", locs_test = locs, X_test = NULL, nn = 30, emp_bayes = TRUE, lambda = NULL, sample_lambda = TRUE, nsave = 1000, nburn = 1000, nskip = 0 )
y |
the response vector (observed data to be transformed and modeled) |
locs |
matrix of observation locations (inputs for the Gaussian process) |
X |
matrix of covariates (optional) |
covfun_name |
string name of a covariance function; see ?GpGp |
locs_test |
locations at which to make predictions (default: the observed locations) |
X_test |
matrix of covariates at the test locations (optional) |
nn |
number of nearest neighbors to use; default is 30 (larger values improve the approximation but increase computing cost) |
emp_bayes |
logical; if TRUE, use a (faster!) empirical Bayes approach for estimating the mean function |
lambda |
Box-Cox transformation; if NULL, estimate this parameter |
sample_lambda |
logical; if TRUE, sample lambda, otherwise use the fixed value of lambda above or the MLE (if lambda unspecified) |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
This function provides Bayesian inference for transformed Gaussian processes. The transformation is parametric from the Box-Cox family, which has one parameter lambda. That parameter may be fixed in advance or learned from the data. For computational efficiency, the Gaussian process parameters are fixed at point estimates, and the latent Gaussian process is only sampled when emp_bayes = FALSE.
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the posterior predictive mean at the test points locs_test
fit_gp: the fitted GpGp_fit object, which includes covariance parameter estimates and other model information
post_ypred: nsave x n_test samples from the posterior predictive distribution at locs_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
post_lambda: nsave posterior samples of lambda
model: the model fit (here, bgp_bc)
as well as the arguments passed in.
Box-Cox transformations may be useful in some cases, but in general we recommend the nonparametric transformation (with Monte Carlo, not MCMC sampling) in sbgp.
# Simulate some data:
n = 200 # sample size
x = seq(0, 1, length = n) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Package we use for fast computing w/ Gaussian processes:
library(GpGp)

# Fit a Bayesian Gaussian process with Box-Cox transformation:
fit = bgp_bc(y = y, locs = x)

names(fit) # what is returned
coef(fit) # estimated regression coefficients (here, just an intercept)
class(fit$fit_gp) # the GpGp object is also returned
round(quantile(fit$post_lambda), 3) # summary of unknown Box-Cox parameter

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y, y),
     xlab = 'x', ylab = 'y',
     main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)), c(pi_y[,2], rev(pi_y[,1])), col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)
MCMC sampling for Bayesian linear regression with a (known or unknown) Box-Cox transformation. A g-prior is assumed for the regression coefficients.
blm_bc( y, X, X_test = X, psi = length(y), lambda = NULL, sample_lambda = TRUE, nsave = 1000, nburn = 1000, nskip = 0, verbose = TRUE )
y |
the response vector (observed data to be transformed and modeled) |
X |
matrix of covariates (design matrix) |
X_test |
matrix of covariates at the test points (default: the observed covariates X) |
psi |
prior variance (g-prior) |
lambda |
Box-Cox transformation; if NULL, estimate this parameter |
sample_lambda |
logical; if TRUE, sample lambda, otherwise use the fixed value of lambda above or the MLE (if lambda unspecified) |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
verbose |
logical; if TRUE, print time remaining |
This function provides fully Bayesian inference for a transformed linear model via MCMC sampling. The transformation is parametric from the Box-Cox family, which has one parameter lambda. That parameter may be fixed in advance or learned from the data.
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the posterior predictive mean at the test points X_test
post_theta: nsave x p samples from the posterior distribution of the regression coefficients
post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
post_lambda: nsave posterior samples of lambda
post_sigma: nsave posterior samples of sigma
model: the model fit (here, blm_bc)
as well as the arguments passed in.
Box-Cox transformations may be useful in some cases, but in general we recommend the nonparametric transformation (with Monte Carlo, not MCMC sampling) in sblm.
# Simulate some data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'step')
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

hist(y, breaks = 25) # marginal distribution

# Fit the Bayesian linear model with a Box-Cox transformation:
fit = blm_bc(y = y, X = X, X_test = X_test)
names(fit) # what is returned
round(quantile(fit$post_lambda), 3) # summary of unknown Box-Cox parameter
MCMC sampling for Bayesian quantile regression. An asymmetric Laplace distribution is assumed for the errors, so the regression model targets the specified quantile. A g-prior is assumed for the regression coefficients.
bqr( y, X, tau = 0.5, X_test = X, psi = length(y), nsave = 1000, nburn = 1000, nskip = 0, verbose = TRUE )
y |
the response vector (observed data) |
X |
matrix of covariates (design matrix) |
tau |
the target quantile (between zero and one) |
X_test |
matrix of covariates at the test points (default: the observed covariates X) |
psi |
prior variance (g-prior) |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
verbose |
logical; if TRUE, print time remaining |
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the estimated tau-th quantile at test points X_test
post_theta: nsave x p samples from the posterior distribution of the regression coefficients
post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test
model: the model fit (here, bqr)
as well as the arguments passed in.
The asymmetric Laplace distribution is advantageous because it links the regression model (X%*%theta) to a pre-specified quantile (tau). However, it is often a poor model for observed data, and the semiparametric version sbqr is recommended in general.
# Simulate some heteroskedastic data (no transformation):
dat = simulate_tlm(n = 100, p = 5, g_type = 'box-cox', heterosked = TRUE, lambda = 1)
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

# Target this quantile:
tau = 0.05

# Fit the Bayesian quantile regression model:
fit = bqr(y = y, X = X, tau = tau, X_test = X_test)
names(fit) # what is returned

# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
     xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
  lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
        col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0), # ECDF of testing data
      col='black', type = 's', lwd = 3)

# The posterior predictive checks usually do not pass!
# try ?sbqr instead...
MCMC sampling for Bayesian spline regression with a (known or unknown) Box-Cox transformation.
bsm_bc( y, x = NULL, x_test = NULL, psi = NULL, lambda = NULL, sample_lambda = TRUE, nsave = 1000, nburn = 1000, nskip = 0, verbose = TRUE )
y |
the response vector (observed data to be transformed and modeled) |
x |
vector of observation points (optional) |
x_test |
vector of observation points at which to make predictions (optional) |
psi |
prior variance (inverse smoothing parameter); if NULL, sample this parameter |
lambda |
Box-Cox transformation; if NULL, estimate this parameter |
sample_lambda |
logical; if TRUE, sample lambda, otherwise use the fixed value of lambda above or the MLE (if lambda unspecified) |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
nskip |
number of MCMC iterations to skip between saving iterations, i.e., save every (nskip + 1)th draw |
verbose |
logical; if TRUE, print time remaining |
This function provides fully Bayesian inference for a transformed spline model via MCMC sampling. The transformation is parametric from the Box-Cox family, which has one parameter lambda. That parameter may be fixed in advance or learned from the data.
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the posterior predictive mean at the test points x_test
post_theta: nsave x p samples from the posterior distribution of the regression coefficients
post_ypred: nsave x n_test samples from the posterior predictive distribution at x_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
post_lambda: nsave posterior samples of lambda
model: the model fit (here, bsm_bc)
as well as the arguments passed in.
Box-Cox transformations may be useful in some cases, but in general we recommend the nonparametric transformation (with Monte Carlo, not MCMC sampling) in sbsm.
# Simulate some data:
n = 100 # sample size
x = sort(runif(n)) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Fit the Bayesian spline model with a Box-Cox transformation:
fit = bsm_bc(y = y, x = x)
names(fit) # what is returned
round(quantile(fit$post_lambda), 3) # summary of unknown Box-Cox parameter

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y, y),
     xlab = 'x', ylab = 'y',
     main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)), c(pi_y[,2], rev(pi_y[,1])), col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)
Estimate the remaining time in the MCMC based on previous samples
computeTimeRemaining(nsi, timer0, nsims, nrep = 1000)
nsi |
Current iteration |
timer0 |
Initial timer value (e.g., returned from proc.time()) |
nsims |
Total number of simulations |
nrep |
Print the estimated time remaining every nrep iterations |
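For illustration, a minimal sketch of how this helper might be called inside a sampling loop; the use of proc.time() for timer0 and the loop structure below are assumptions, not part of the documented interface.

# Hypothetical usage inside an MCMC loop (assumes timer0 = proc.time()[3]):
nsims = 5000                      # total number of iterations
timer0 = proc.time()[3]           # start the timer before the loop
for(nsi in 1:nsims){
  # ... one sampling step would go here ...
  computeTimeRemaining(nsi, timer0, nsims, nrep = 1000) # periodic progress update
}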
Table of summary statistics using the function summary
Compute Monte Carlo draws from the (marginal) posterior distribution of the
concentration hyperparameters of the hierarchical Bayesian bootstrap
(hbb
). The HBB is a nonparametric model for group-specific
distributions; each group has a concentration parameter, where
larger values encourage more shrinkage toward a common distribution.
concen_hbb( groups, shape_alphas = NULL, rate_alphas = NULL, nsave = 1000, ngrid = 500 )
groups |
the group assignments in the observed data |
shape_alphas |
(optional) shape parameter for the Gamma prior |
rate_alphas |
(optional) rate parameter for the Gamma prior |
nsave |
(optional) number of Monte Carlo simulations |
ngrid |
(optional) number of grid points |
The concentration hyperparameters are assigned independent Gamma(shape_alphas, rate_alphas) priors. This function uses a grid approximation to the marginal posterior with the goal of producing a simple algorithm. Because this is a *marginal* posterior sampler, it can be used with the hbb sampler (which conditions on alphas) to provide a joint Monte Carlo (not MCMC) sampling algorithm for the concentration hyperparameters, the group-specific CDFs, and the common CDF.
Note that diffuse priors on alphas tend to put posterior mass on large values, which leads to more aggressive shrinkage toward the common distribution (complete pooling). For moderate shrinkage, we use the default values shape_alphas = 30*K and rate_alphas = 1, where K is the number of groups.
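As a sketch of the joint Monte Carlo scheme described above (one draw of the concentration parameters followed by one conditional draw of the CDFs); this pairing and the assumption that the columns of the output follow sort(unique(groups)) are illustrative, and y and groups are assumed to be defined as in the example below.

# Sketch: pair concen_hbb with hbb for one joint Monte Carlo draw
alpha_draw = concen_hbb(groups, nsave = 1)      # one draw of the concentration parameters
samp = hbb(y, groups,
           alphas = as.numeric(alpha_draw))     # one draw of the CDFs given alphas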
nsave x K samples of the concentration hyperparameters corresponding to the K groups
Oganisian et al. (https://doi.org/10.1515/ijb-2022-0051)
# Dimensions:
n = 500 # number of observations
K = 3 # number of groups

# Assign groups w/ unequal probabilities:
ugroups = paste('g', 1:K, sep='') # groups
groups = sample(ugroups,
                size = n,
                replace = TRUE,
                prob = 1:K) # unequally weighted (unnormalized)

# Summarize:
table(groups)/n

# Marginal posterior sampling for alpha:
post_alpha = concen_hbb(groups)

# Summarize: posterior distributions
for(c in 1:K) {
  hist(post_alpha[,c],
       main = paste("Concentration parameter: group", ugroups[c]),
       xlim = range(post_alpha))
  abline(v = mean(post_alpha[,c]), lwd=3) # posterior mean
}
Contract the grid if the evaluation points exceed some threshold. This removes the corresponding z values. We can add points back to achieve the same (approximate) length.
contract_grid(z, Fz, lower, upper, add_back = TRUE, monotone = TRUE)
z |
grid points (ordered) |
Fz |
function evaluated at those grid points |
lower |
lower threshold at which to check Fz |
upper |
upper threshold at which to check Fz |
add_back |
logical; if true, expand the grid to (about) the original size |
monotone |
logical; if true, enforce monotonicity on the expanded grid |
a list containing the grid points z and the (interpolated) function Fz at those points
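A minimal sketch of a call, assuming Fz is supplied as the vector of CDF evaluations on the grid (as implied by the argument descriptions); the grid and thresholds below are illustrative assumptions.

# Hypothetical usage: trim grid points where the CDF is numerically 0 or 1
z = seq(-10, 10, length.out = 200)     # ordered grid
Fz = pnorm(z)                          # CDF evaluated at the grid points
out = contract_grid(z, Fz,
                    lower = 0.001, upper = 0.999) # illustrative thresholds
names(out) # the (contracted, then re-expanded) grid 'z' and interpolated 'Fz'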
Assuming a Gaussian latent data distribution (given x), compute the CDF on a grid of points
Fz_fun(z, weights = NULL, mean_vec = NULL, sd_vec)
z |
vector of points at which the CDF of z is evaluated |
weights |
(optional) weights for the latent Gaussian components |
mean_vec |
(optional) means of the latent Gaussian components |
sd_vec |
standard deviations of the latent Gaussian components |
CDF of z evaluated at z
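A minimal sketch, assuming the latent CDF is a weighted average of Gaussian CDFs with the given means and standard deviations; the specific weights, means, and standard deviations below are illustrative assumptions only.

# Hypothetical usage: evaluate an assumed Gaussian-mixture CDF on a grid
z = seq(-3, 3, length.out = 100)          # evaluation grid
Fz_eval = Fz_fun(z,
                 weights = c(0.5, 0.5),   # illustrative weights
                 mean_vec = c(-1, 1),     # illustrative means
                 sd_vec = c(1, 1))        # illustrative standard deviations
plot(z, Fz_eval, type = 'l', ylab = 'F(z)')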
Evaluate the Box-Cox transformation, which is a scaled power transformation to preserve continuity in the index lambda at zero. Negative values are permitted.
g_bc(t, lambda)
t |
argument(s) at which to evaluate the function |
lambda |
Box-Cox parameter |
The evaluation(s) of the Box-Cox function at the given input(s) t.
Special cases include the identity transformation (lambda = 1), the square-root transformation (lambda = 1/2), and the log transformation (lambda = 0).
# Log-transformation:
g_bc(1:5, lambda = 0); log(1:5)

# Square-root transformation: note the shift and scaling
g_bc(1:5, lambda = 1/2); sqrt(1:5)
Given the CDFs of z and y, compute a smoothed function to evaluate the transformation
g_fun(y, Fy_eval, z, Fz_eval)
y |
vector of points at which the CDF of y is evaluated |
Fy_eval |
CDF of y evaluated at y |
z |
vector of points at which the CDF of z is evaluated |
Fz_eval |
CDF of z evaluated at z |
A smooth monotone function which can be used for evaluations of the transformation.
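A minimal sketch of how the CDF evaluations might be supplied, using a Bayesian bootstrap draw for the CDF of y and a standard normal CDF for z; these particular inputs are illustrative assumptions.

# Hypothetical usage: build a transformation from CDF evaluations
y = rexp(n = 100)                       # some positive data
y0 = sort(unique(y))                    # evaluation points for Fy
Fy_eval = bb(y)(y0)                     # one BB draw of the CDF, evaluated at y0
z0 = seq(-3, 3, length.out = 100)       # evaluation points for Fz
Fz_eval = pnorm(z0)                     # standard normal CDF at z0
g = g_fun(y = y0, Fy_eval = Fy_eval, z = z0, Fz_eval = Fz_eval)
g(y0[1:5])                              # evaluate the (smoothed) transformation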
Compute the inverse function of a transformation g based on a grid search.
g_inv_approx(g, t_grid)
g |
the transformation function |
t_grid |
grid of arguments at which to evaluate the transformation function |
A function which can be used for evaluations of the (approximate) inverse transformation function.
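A minimal sketch, pairing this helper with the Box-Cox transformation from this package; the grid below is an illustrative choice.

# Hypothetical usage: approximate inverse of a known transformation
g = function(t) g_bc(t, lambda = 0.5)   # signed square-root transformation
g_inv = g_inv_approx(g, t_grid = seq(0.01, 10, length.out = 500)) # grid is illustrative
g_inv(g(2.5)) # should be approximately 2.5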
Evaluate the inverse Box-Cox transformation. Negative values are permitted.
g_inv_bc(s, lambda)
s |
argument(s) at which to evaluate the function |
lambda |
Box-Cox parameter |
The evaluation(s) of the inverse Box-Cox function at the given input(s) s.
Special cases include the identity transformation (lambda = 1), the square-root transformation (lambda = 1/2), and the log transformation (lambda = 0).
# (Inverse) log-transformation:
g_inv_bc(1:5, lambda = 0); exp(1:5)

# (Inverse) square-root transformation: note the shift and scaling
g_inv_bc(1:5, lambda = 1/2); (1:5)^2
Compute one Monte Carlo draw from the hierarchical Bayesian bootstrap (HBB) posterior distribution of the cumulative distribution function (CDF) for each group. The common (BB) and group-specific (HBB) weights are also returned.
hbb( y, groups, sample_alphas = FALSE, shape_alphas = NULL, rate_alphas = NULL, alphas = NULL, M = 30 )
y |
the data from which to infer the group-specific CDFs |
groups |
the group assignment for each element of y |
sample_alphas |
logical; if TRUE, sample the concentration hyperparameters from their marginal posterior distribution |
shape_alphas |
(optional) shape parameter for the Gamma prior on each concentration hyperparameter |
rate_alphas |
(optional) rate parameter for the Gamma prior on each concentration hyperparameter |
alphas |
(optional) vector of fixed concentration hyperparameters corresponding to the unique levels in groups |
M |
a positive scaling term to set a default value of alphas when alphas is unspecified |
Assuming the data y are independent with unknown, group-specific distributions, the hierarchical Bayesian bootstrap (HBB) from Oganisian et al. (https://doi.org/10.1515/ijb-2022-0051) is a nonparametric model for each distribution. The HBB includes hierarchical shrinkage across these groups toward a common distribution (the bb). The HBB admits direct Monte Carlo (not MCMC) sampling.
The shrinkage toward this common distribution is determined by the concentration hyperparameters alphas. Each component of alphas corresponds to one of the groups. Larger values encourage more shrinkage toward the common distribution, while smaller values allow more substantial deviations for that group.
When sample_alphas=TRUE, each component of alphas is sampled from its marginal posterior distribution, assuming independent Gamma(shape_alphas, rate_alphas) priors. This step uses a simple grid approximation to enable efficient sampling that preserves joint Monte Carlo sampling with the group-specific and common distributions. See concen_hbb for details. Note that diffuse priors on alphas tend to produce more aggressive shrinkage toward the common distribution (complete pooling). For moderate shrinkage, we use the default values shape_alphas = 30*K and rate_alphas = 1, where K is the number of groups.
When sample_alphas=FALSE, these concentration hyperparameters are fixed at user-specified values. That can be done by specifying alphas directly. Alternatively, if alphas is left unspecified (alphas = NULL), we adopt the default from Oganisian et al., which sets the c-th entry to M*n/nc, where M is user-specified and nc is the number of observations in group c.
For further guidance on the choice of M: M = 0.01/K approximates separate BBs by group (no pooling); M between 10 and 100 gives moderate shrinkage (partial pooling); and M = 100*max(nc) approximates a common BB (complete pooling).
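For illustration, a short sketch of the three ways to handle the concentration hyperparameters; the specific values of M and alphas are illustrative only, and y and groups are assumed to be defined as in the example below.

# Hypothetical usage: fixing or sampling the concentration hyperparameters
samp1 = hbb(y, groups, M = 50)                        # moderate shrinkage via the default rule
samp2 = hbb(y, groups,
            alphas = rep(100, length(unique(groups)))) # user-specified alphas (one per group)
samp3 = hbb(y, groups, sample_alphas = TRUE)          # sample alphas from their marginal posterior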
a list with the following elements:
Fyc: a list of functions, where each entry corresponds to a group and that group-specific function can evaluate the sampled CDF at any argument(s)
weights_y: sampled weights from the common (BB) distribution (n-dimensional)
weights_yc: sampled weights from each of the K groups (K x n)
alphas: the (fixed or sampled) concentration hyperparameters
If supplying alphas with distinct entries, make sure that the groups are ordered properly; these entries should match sort(unique(groups)).
Oganisian et al. (https://doi.org/10.1515/ijb-2022-0051)
# Sample size and number of groups:
n = 500
K = 3

# Define the groups, then assign:
ugroups = paste('g', 1:K, sep='') # groups
groups = sample(ugroups, n, replace = TRUE) # assignments

# Simulate the data: iid normal, then add group-specific features
y = rnorm(n = n) # data
for(g in ugroups) y[groups==g] = y[groups==g] + 3*rnorm(1) # group-specific

# One draw from the HBB posterior of the CDF:
samp_hbb = hbb(y, groups)

names(samp_hbb) # items returned
Fyc = samp_hbb$Fyc # list of CDFs
class(Fyc) # this is a list
class(Fyc[[1]]) # each element is a function

c = 1 # try: vary in 1:K
Fyc[[c]](0) # some example use (for this one draw)
Fyc[[c]](c(.5, 1.2))

# Plot several draws from the HBB posterior distribution:
ys = seq(min(y), max(y), length.out=1000)
plot(ys, ys, type='n', ylim = c(0,1),
     main = 'Draws from HBB posteriors', xlab = 'y', ylab = 'F_c(y)')
for(s in 1:50){ # some draws

  # BB CDF:
  Fy = bb(y)
  lines(ys, Fy(ys), lwd=3) # plot CDF

  # HBB:
  Fyc = hbb(y, groups)$Fyc

  # Plot CDFs by group:
  for(c in 1:K) lines(ys, Fyc[[c]](ys), col=c+1, lwd=3)
}
# For reference, add the ECDFs by group:
for(c in 1:K) lines(ys, ecdf(y[groups==ugroups[c]])(ys), lty=2)
legend('bottomright', c('BB', paste('HBB:', ugroups)), col = 1:(K+1), lwd=3)
Given posterior predictive samples at X_test, plot the point and interval estimates and compare to the actual testing data y_test.
plot_pptest(post_ypred, y_test, alpha_level = 0.1)
post_ypred |
matrix of posterior predictive samples at the test points (draws x test points) |
y_test |
the response variable in the testing data |
alpha_level |
alpha-level for prediction intervals |
plot of the testing data, point and interval predictions, and a summary of the empirical coverage
# Simulate some data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'step')

# Fit a semiparametric Bayesian linear model:
fit = sblm(y = dat$y, X = dat$X, X_test = dat$X_test)

# Evaluate posterior predictive means and intervals on the testing data:
plot_pptest(fit$post_ypred, dat$y_test,
            alpha_level = 0.10) # coverage should be about 90%
For a transformed Gaussian linear model, compute point estimates
of the regression coefficients. This approach uses the ranks of the
data and does not require the transformation, but must expand the
sample size to n^2
and thus can be slow.
rank_approx(y, X)
y |
the response vector (observed data) |
X |
matrix of covariates (design matrix, without an intercept) |
the estimated linear coefficients
# Simulate some data:
dat = simulate_tlm(n = 200, p = 10, g_type = 'step')

# Point estimates for the linear coefficients:
theta_hat = suppressWarnings(
  rank_approx(y = dat$y,
              X = dat$X[,-1]) # remove intercept
) # warnings occur from glm.fit (fitted probabilities 0 or 1)

# Check: correlation with true coefficients
cor(dat$beta_true[-1], # excluding the intercept
    theta_hat)
Monte Carlo sampling for Bayesian Gaussian process regression with an unknown (nonparametric) transformation.
sbgp( y, locs, X = NULL, covfun_name = "matern_isotropic", locs_test = locs, X_test = NULL, nn = 30, emp_bayes = TRUE, approx_g = FALSE, nsave = 1000, ngrid = 100 )
y |
the response vector (observed data to be transformed and modeled) |
locs |
matrix of observation locations (inputs for the Gaussian process) |
X |
matrix of covariates (optional) |
covfun_name |
string name of a covariance function; see ?GpGp |
locs_test |
locations at which to make predictions (default: the observed locations) |
X_test |
matrix of covariates at the test locations (optional) |
nn |
number of nearest neighbors to use; default is 30 (larger values improve the approximation but increase computing cost) |
emp_bayes |
logical; if TRUE, use a (faster!) empirical Bayes approach for estimating the mean function |
approx_g |
logical; if TRUE, apply large-sample approximation for the transformation |
nsave |
number of Monte Carlo simulations |
ngrid |
number of grid points for inverse approximations |
This function provides Bayesian inference for a transformed Gaussian process model using Monte Carlo (not MCMC) sampling. The transformation is modeled as unknown and learned jointly with the regression function (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values).
For computational efficiency, the Gaussian process parameters are fixed at point estimates, and the latent Gaussian process is only sampled when emp_bayes = FALSE. However, the uncertainty from this term is often negligible compared to the observation errors, and the transformation serves as an additional layer of robustness.
a list with the following elements:
coefficients: the estimated regression coefficients
fitted.values: the posterior predictive mean at the test points locs_test
fit_gp: the fitted GpGp_fit object, which includes covariance parameter estimates and other model information
post_ypred: nsave x n_test samples from the posterior predictive distribution at locs_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
model: the model fit (here, sbgp)
as well as the arguments passed in.
# Simulate some data:
n = 200 # sample size
x = seq(0, 1, length = n) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Package we use for fast computing w/ Gaussian processes:
library(GpGp)

# Fit the semiparametric Bayesian Gaussian process:
fit = sbgp(y = y, locs = x)

names(fit) # what is returned
coef(fit) # estimated regression coefficients (here, just an intercept)
class(fit$fit_gp) # the GpGp object is also returned

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y, y),
     xlab = 'x', ylab = 'y',
     main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)), c(pi_y[,2], rev(pi_y[,1])), col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)
Monte Carlo sampling for Bayesian linear regression with an unknown (nonparametric) transformation. A g-prior is assumed for the regression coefficients.
sblm( y, X, X_test = X, psi = length(y), laplace_approx = TRUE, approx_g = FALSE, nsave = 1000, ngrid = 100, verbose = TRUE )
y |
the response vector (observed data to be transformed and modeled) |
X |
matrix of covariates (design matrix) |
X_test |
matrix of covariates at the test points (default: the observed covariates X) |
psi |
prior variance (g-prior) |
laplace_approx |
logical; if TRUE, use a normal approximation to the posterior in the definition of the transformation; otherwise the prior is used |
approx_g |
logical; if TRUE, apply large-sample approximation for the transformation |
nsave |
number of Monte Carlo simulations |
ngrid |
number of grid points for inverse approximations |
verbose |
logical; if TRUE, print time remaining |
This function provides fully Bayesian inference for a transformed linear model using Monte Carlo (not MCMC) sampling. The transformation is modeled as unknown and learned jointly with the regression coefficients (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values).
The results are typically unchanged whether laplace_approx is TRUE/FALSE; setting it to TRUE may reduce sensitivity to the prior, while setting it to FALSE may speed up computations for very large datasets.
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the posterior predictive mean at the test points X_test
post_theta: nsave x p samples from the posterior distribution of the regression coefficients
post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
model: the model fit (here, sblm)
as well as the arguments passed in.
# Simulate some data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'step')
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

hist(y, breaks = 25) # marginal distribution

# Fit the semiparametric Bayesian linear model:
fit = sblm(y = y, X = X, X_test = X_test)
names(fit) # what is returned

# Note: this is Monte Carlo sampling, so no need for MCMC diagnostics!

# Evaluate posterior predictive means and intervals on the testing data:
plot_pptest(fit$post_ypred, y_test,
            alpha_level = 0.10) # coverage should be about 90%

# Check: correlation with true coefficients
cor(dat$beta_true[-1], coef(fit)[-1]) # excluding the intercept

# Summarize the transformation:
y0 = sort(unique(y)) # posterior draws of g are evaluated at the unique y observations
plot(y0, fit$post_g[1,], type='n', ylim = range(fit$post_g),
     xlab = 'y', ylab = 'g(y)', main = "Posterior draws of the transformation")
temp = sapply(1:nrow(fit$post_g), function(s)
  lines(y0, fit$post_g[s,], col='gray')) # posterior draws
lines(y0, colMeans(fit$post_g), lwd = 3) # posterior mean

# Add the true transformation, rescaled for easier comparisons:
lines(y,
      scale(dat$g_true)*sd(colMeans(fit$post_g)) + mean(colMeans(fit$post_g)),
      type='p', pch=2)
legend('bottomright', c('Truth'), pch = 2) # annotate the true transformation

# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
     xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
  lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
        col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0), # ECDF of testing data
      col='black', type = 's', lwd = 3)
MCMC sampling for Bayesian quantile regression with an unknown (nonparametric) transformation. Like in traditional Bayesian quantile regression, an asymmetric Laplace distribution is assumed for the errors, so the regression model targets the specified quantile. However, these models are often woefully inadequate for describing observed data. We introduce a nonparametric transformation to improve model adequacy while still providing inference for the regression coefficients and the specified quantile. A g-prior is assumed for the regression coefficients.
sbqr( y, X, tau = 0.5, X_test = X, psi = length(y), laplace_approx = TRUE, approx_g = FALSE, nsave = 1000, nburn = 100, ngrid = 100, verbose = TRUE )
y |
the response vector (observed data to be transformed and modeled) |
X |
matrix of covariates (design matrix) |
tau |
the target quantile (between zero and one) |
X_test |
matrix of covariates at the test points (default: the observed covariates X) |
psi |
prior variance (g-prior) |
laplace_approx |
logical; if TRUE, use a normal approximation to the posterior in the definition of the transformation; otherwise the prior is used |
approx_g |
logical; if TRUE, apply large-sample approximation for the transformation |
nsave |
number of MCMC iterations to save |
nburn |
number of MCMC iterations to discard |
ngrid |
number of grid points for inverse approximations |
verbose |
logical; if TRUE, print time remaining |
This function provides fully Bayesian inference for a transformed quantile linear model. The transformation is modeled as unknown and learned jointly with the regression coefficients (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values).
The results are typically unchanged whether laplace_approx is TRUE/FALSE; setting it to TRUE may reduce sensitivity to the prior, while setting it to FALSE may speed up computations for very large datasets.
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the estimated tau-th quantile at test points X_test
post_theta: nsave x p samples from the posterior distribution of the regression coefficients
post_ypred: nsave x n_test samples from the posterior predictive distribution at test points X_test
post_qtau: nsave x n_test samples of the tau-th conditional quantile at test points X_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
model: the model fit (here, sbqr)
as well as the arguments passed in.
# Simulate some heteroskedastic data (no transformation):
dat = simulate_tlm(n = 200, p = 10, g_type = 'box-cox', heterosked = TRUE, lambda = 1)
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

# Target this quantile:
tau = 0.05

# Fit the semiparametric Bayesian quantile regression model:
fit = sbqr(y = y, X = X, tau = tau, X_test = X_test)
names(fit) # what is returned

# Posterior predictive checks on testing data: empirical CDF
y0 = sort(unique(y_test))
plot(y0, y0, type='n', ylim = c(0,1),
     xlab='y', ylab='F_y', main = 'Posterior predictive ECDF')
temp = sapply(1:nrow(fit$post_ypred), function(s)
  lines(y0, ecdf(fit$post_ypred[s,])(y0), # ECDF of posterior predictive draws
        col='gray', type ='s'))
lines(y0, ecdf(y_test)(y0), # ECDF of testing data
      col='black', type = 's', lwd = 3)
Monte Carlo sampling for Bayesian spline regression with an unknown (nonparametric) transformation.
sbsm( y, x = NULL, x_test = NULL, psi = NULL, laplace_approx = TRUE, approx_g = FALSE, nsave = 1000, ngrid = 100, verbose = TRUE )
y |
the response vector (observed data to be transformed and modeled) |
x |
vector of observation points (optional) |
x_test |
vector of observation points at which to make predictions (optional) |
psi |
prior variance (inverse smoothing parameter); if NULL, sample this parameter |
laplace_approx |
logical; if TRUE, use a normal approximation to the posterior in the definition of the transformation; otherwise the prior is used |
approx_g |
logical; if TRUE, apply large-sample approximation for the transformation |
nsave |
number of Monte Carlo simulations |
ngrid |
number of grid points for inverse approximations |
verbose |
logical; if TRUE, print time remaining |
This function provides fully Bayesian inference for a transformed spline regression model using Monte Carlo (not MCMC) sampling. The transformation is modeled as unknown and learned jointly with the regression function (unless approx_g = TRUE, which then uses a point approximation). This model applies for real-valued data, positive data, and compactly-supported data (the support is automatically deduced from the observed y values).
The results are typically unchanged whether laplace_approx is TRUE/FALSE; setting it to TRUE may reduce sensitivity to the prior, while setting it to FALSE may speed up computations for very large datasets.
a list with the following elements:
coefficients: the posterior mean of the regression coefficients
fitted.values: the posterior predictive mean at the test points x_test
post_theta: nsave x p samples from the posterior distribution of the regression coefficients
post_ypred: nsave x n_test samples from the posterior predictive distribution at x_test
post_g: nsave posterior samples of the transformation evaluated at the unique y values
model: the model fit (here, sbsm)
as well as the arguments passed in.
# Simulate some data:
n = 100 # sample size
x = sort(runif(n)) # observation points

# Transform a noisy, periodic function:
y = g_inv_bc(
  sin(2*pi*x) + sin(4*pi*x) + rnorm(n, sd = .5),
  lambda = .5) # Signed square-root transformation

# Fit the semiparametric Bayesian spline model:
fit = sbsm(y = y, x = x)
names(fit) # what is returned

# Note: this is Monte Carlo sampling, so no need for MCMC diagnostics!

# Plot the model predictions (point and interval estimates):
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
plot(x, y, type='n', ylim = range(pi_y, y),
     xlab = 'x', ylab = 'y',
     main = paste('Fitted values and prediction intervals'))
polygon(c(x, rev(x)), c(pi_y[,2], rev(pi_y[,1])), col='gray', border=NA)
lines(x, y, type='p')
lines(x, fitted(fit), lwd = 3)
Generate training data (X, y) and testing data (X_test, y_test) for a transformed linear model. The covariates are correlated Gaussian variables. Half of the true regression coefficients are zero and the other half are one. There are multiple options for the transformation, which define the support of the data (see below).
simulate_tlm( n, p, g_type = "beta", n_test = 1000, heterosked = FALSE, lambda = 1 )
n |
number of observations in the training data |
p |
number of covariates |
g_type |
type of transformation; must be one of 'beta', 'step', or 'box-cox' |
n_test |
number of observations in the testing data |
heterosked |
logical; if TRUE, simulate the latent data with heteroskedasticity |
lambda |
Box-Cox parameter (only applies for g_type = 'box-cox') |
The transformations vary in complexity and support for the observed data, and include the following options: beta yields marginally Beta(0.1, 0.5) data supported on [0,1]; step generates a locally-linear inverse transformation and produces positive data; and box-cox refers to the signed Box-Cox family indexed by lambda, which generates real-valued data with examples including identity, square-root, and log transformations.
a list with the following elements:
y: the response variable in the training data
X: the covariates in the training data
y_test: the response variable in the testing data
X_test: the covariates in the testing data
beta_true: the true regression coefficients
g_true: the true transformation, evaluated at y
# Simulate data:
dat = simulate_tlm(n = 100, p = 5, g_type = 'beta')
names(dat) # what is returned
hist(dat$y, breaks = 25) # marginal distribution
Given Monte Carlo draws from the surrogate posterior, apply sampling importance reweighting (SIR) to correct for the true model likelihood.
sir_adjust(fit, sir_frac = 0.3, nsims_prior = 100, verbose = TRUE)
fit |
a fitted model object (e.g., from sblm or sbsm) that includes the posterior draws needed for the adjustment |
sir_frac |
fraction of draws to sample for SIR |
nsims_prior |
number of draws from the prior |
verbose |
logical; if TRUE, print time remaining |
The Monte Carlo sampling for sblm and sbsm uses a surrogate likelihood for posterior inference, which enables much faster and easier computing. SIR provides a correction for the actual (specified) likelihood. However, this correction step is quite slow and typically does not produce any noticeable discrepancies, even for small sample sizes.
the fitted model object with the posterior draws subsampled based on the SIR adjustment
SIR sampling is done WITHOUT replacement, so sir_frac is typically between 0.1 and 0.5. The nsims_prior draws are used to approximate a prior expectation, but larger values can significantly slow down this function.
# Simulate some data:
dat = simulate_tlm(n = 50, p = 5, g_type = 'step')
y = dat$y; X = dat$X # training data
y_test = dat$y_test; X_test = dat$X_test # testing data

hist(y, breaks = 10) # marginal distribution

# Fit the semiparametric Bayesian linear model:
fit = sblm(y = y, X = X, X_test = X_test)
names(fit) # what is returned

# Update with SIR:
fit_sir = sir_adjust(fit)

# Prediction: unadjusted vs. adjusted?

# Point estimates:
y_hat = fitted(fit)
y_hat_sir = fitted(fit_sir)
cor(y_hat, y_hat_sir) # similar

# Interval estimates:
pi_y = t(apply(fit$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI
pi_y_sir = t(apply(fit_sir$post_ypred, 2, quantile, c(0.05, .95))) # 90% PI

# PI overlap (%):
overlaps = 100*sapply(1:length(y_test), function(i){
  # innermost part
  (min(pi_y[i,2], pi_y_sir[i,2]) - max(pi_y[i,1], pi_y_sir[i,1]))/
    # outermost part
    (max(pi_y[i,2], pi_y_sir[i,2]) - min(pi_y[i,1], pi_y_sir[i,1]))
})
summary(overlaps) # mostly close to 100%

# Coverage of PIs on testing data (should be ~ 90%)
mean((pi_y[,1] <= y_test)*(pi_y[,2] >= y_test)) # unadjusted
mean((pi_y_sir[,1] <= y_test)*(pi_y_sir[,2] >= y_test)) # adjusted

# Plot together with testing data:
plot(y_test, y_test, type='n', ylim = range(pi_y, pi_y_sir, y_test),
     xlab = 'y_test', ylab = 'y_hat',
     main = paste('Prediction intervals: testing data'))
abline(0,1) # reference line
suppressWarnings(
  arrows(y_test, pi_y[,1], y_test, pi_y[,2],
         length=0.15, angle=90, code=3, col='gray', lwd=2)
) # plot the PIs (unadjusted)
suppressWarnings(
  arrows(y_test, pi_y_sir[,1], y_test, pi_y_sir[,2],
         length=0.15, angle=90, code=3, col='darkgray', lwd=2)
) # plot the PIs (adjusted)
lines(y_test, y_hat, type='p', pch=2) # plot the means (unadjusted)
lines(y_test, y_hat_sir, type='p', pch=3) # plot the means (adjusted)
Compute a draw from a univariate distribution using the code provided by Radford M. Neal. The documentation below is also reproduced from Neal (2008).
uni.slice(x0, g, w = 1, m = Inf, lower = -Inf, upper = +Inf, gx0 = NULL)
x0 |
Initial point |
g |
Function returning the log of the probability density (plus constant) |
w |
Size of the steps for creating interval (default 1) |
m |
Limit on steps (default infinite) |
lower |
Lower bound on support of the distribution (default -Inf) |
upper |
Upper bound on support of the distribution (default +Inf) |
gx0 |
Value of g(x0), if known (default is not known) |
The point sampled, with its log density attached as an attribute.
The log density function may return -Inf for points outside the support of the distribution. If a lower and/or upper bound is specified for the support, the log density function will not be called outside such limits.
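Since no example accompanies this function, here is a minimal sketch (an assumption, not from the original documentation) that uses the slice sampler to draw from a standard normal density.

# Hypothetical usage: slice sampling from a standard normal distribution
log_dens = function(x) dnorm(x, log = TRUE)  # log-density (up to a constant)
nsave = 1000
draws = numeric(nsave)
x_curr = 0 # initial point
for(s in 1:nsave){
  x_curr = uni.slice(x0 = x_curr, g = log_dens, w = 1)
  draws[s] = x_curr
}
hist(draws, breaks = 30) # should resemble a standard normal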