Custom Plot Confidence Interval
Published:
Manually plotting credible interval of continuous predictor
An example code using the model and data from Juliet and Louis’ ongoing project
Ming Liu
19 December, 2022
Concept
References
This description is based on this popular thread, but I have rewritten for archive and for better understanding. A proper stats textbook could be a better source of validation.
What regression model actually does
Let us define \(x\) as the observed predictor samples, where there are \(n\) number of samples and the average of observed predictor values is \(\bar{x}\). Then, a simple regression model would find the “best” line minimizing the sum of deviation between samples and the line, \[\begin{equation} \tag{1.1} y= a(x-\bar{x})+b, \end{equation}\]
where a is the estimated fixed effect, and b is the intercept at \(\bar{x}\).
Getting the residual standard error of the model
Sadly, no model is perfect and there is always some deviations between actual data and the model. In the language of stats, people call the remaining deviations as ‘residuals’ and this is crucial for us to manually plot the confidence interval. Nevertheless, the raw residual needs to be processed before we can find the range of confidence interval, and the first step is getting the residual standard error of the model.
The formula of residual standard error depends on the distribution assumption of the regression model, some models like Gaussian regression have no additional terms, but Poisson regression have additional terms to calibrate the standard errors. In Gaussian regression, residual standard error is defined as \[\begin{equation} \tag{1.2} \sigma= \sqrt{\frac{\sum{residuals}}{df}}, \end{equation}\]
where residuals are summed across all \(n\) samples, and \(df\) is the degree of freedom. For conventional models, \(df\) is the \(n- (n_{predictors}+1)\), where \(n_{predictors}\) is the number of predictors in the regression model. (Other distributions need double-checking, not sure if anything is different in logistic regression.)
The confidence interval
Now we can consider the standard errors of slope \(a\) and x-bar intercept \(b\). If we know \(b\) is true, the standard error of \(a\) is \(\sigma/\sqrt{n}\). Similarly, if we know \(a\) is true, the standard error of \(b\) is \(\sigma / \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}\).
Assuming we are interested in the uncertainty at \(x=x_{focal}\), the effect of uncertainty in slope would be multiplied by how far it is from the mean \(\bar{x}\), giving \((x_{focal}-\bar{x})\sigma / \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}\). Now the overall effect is just the square root of the sum of the squares of those two things. (Why? Because variances of the uncorrelated things add, and \(a\) and \(b\) are uncorrelated.) So the overall standard error is the square root of the overall variance, and the variance is the sum of the variances of the components, \(\sqrt{(\sigma/\sqrt{n})^{2}+((x_{focal}-\bar{x})\sigma / \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}})^{2}}\). With some simplification, we have the usual term for the standard error of the estimate of the mean value at \(x_{focal}\), \[\begin{equation} \tag{1.3} \sigma \sqrt{\frac{1}{n}+\frac{(x_{focal}-\bar{x})^{2}}{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}}. \end{equation}\]
If you draw expression 1.3 as a function of \(x_{focal}\), you will see it forms a curve like a smaile, with a minimum at \(\bar{x}\). The whole term is what gets added to or subtracted from the fitted line (well, multiple of it is, in order to get a desired confidence interval).
The custom function
You can put this function into your working code and load it before making figures (only need to load it once per session, like all R packages).
# The function for numerically generating UNLINKED
# prediction and confidence interval
CI_custom <- function(model, data, predictor, xmin = 0, xmax = 1,
precision = 100) {
# Model should be a regression model with continuous
# predictor Data shoud be the dataframe you used to
# make the regression model Predictor is a numeric list
# or a column of the data, which is the predictor of
# your model Last modified 13 Dec, 2022 (Ming Liu)
if ((length(model) + length(predictor) + length(data)) >
0) {
if (class(model) == "MCMCglmm")
return("This function does not work with Bayesian stats model.") else {
## Extracting information from the model and
## data
num_vars <- length(coef(model)) # Get the number of variables in the model, for degree of freedom calculation
sample_size <- length(predictor) # Number of observations
residual_se <- (sqrt(sum(model$residuals^2)/(sample_size -
num_vars))) # The residual standard error
## Creating list and empty predictions
length_CI <- precision # Length of the list
xx <- seq(xmin, xmax, length.out = length_CI) # The list of x
yy <- xx * model$coefficients[2] + model$coefficients[1] # The list of estimated y
## Calculate the sum of variations in all x
## observations
x_bar <- mean(predictor) # Average of x of samples
var_x <- 0
for (i in 1:sample_size) var_x <- var_x + (predictor[i] -
x_bar)^2
## Numerically calculate the CI boundaries
ci_min <- rep(NA, length_CI) # Temporary space for minimum of CI
ci_max <- rep(NA, length_CI) # Temporary space for maximum of CI
for (i in 1:length_CI) {
difference <- residual_se * sqrt(1/sample_size +
(xx[i] - x_bar)^2/var_x)
ci_min[i] <- yy[i] - 1.96 * difference
ci_max[i] <- yy[i] + 1.96 * difference
}
## Assemble the plotting data frame
df_forplot <- data.frame(Predictor = xx, Prediction = yy,
CI_low = ci_min, CI_high = ci_max)
return(df_forplot)
}
} else return("Invalid inputs! Please check model, data, and predictor options of the function.")
}
Example code
Read in the model and data
data <- fread("AntDataSimple.txt") # Read in the data
model <- readRDS("AntModelSimple.RDS") # Read the model file (use 'saveRDS' to save a stats model)
head(data)
## animal type colony.size SD_all_workers
## 1: Acromyrmex_echinatior ant 19274.12 0.98712107
## 2: Acromyrmex_octospinosus ant 22005.50 0.28834996
## 3: Acromyrmex_subterraneus ant 4766.20 0.60716543
## 4: Aenictus_dentatus ant 325000.00 0.03397548
## 5: Aenictus_laeviceps ant 92500.00 0.07804005
## 6: Allomerus_octoarticulatus ant 116.00 0.03933460
## All_workers_HW_mean worker_CoV log10.colony.size
## 1: 2.1490000 0.45933973 4.284975
## 2: 2.2842000 0.12623674 4.342531
## 3: 2.1843636 0.27795987 3.678172
## 4: 0.8403333 0.04043096 5.511883
## 5: 0.6837500 0.11413536 4.966142
## 6: 0.4871000 0.08075262 2.064458
class(model) # For some reason phylolm object is not easy to print out in R markdown
## [1] "phylolm"
Run
The function returns a dataframe with four columns: the predictor (x), the prediction (y), and the lower and higher points of confidence interval.
df_plot <- CI_custom(model = model, data = data, predictor = data$log10.colony.size,
xmin = 0, xmax = 10)
head(df_plot)
## Predictor Prediction CI_low CI_high
## 1 0.0000000 -1.391545 -1.565460 -1.217630
## 2 0.1010101 -1.377483 -1.546943 -1.208024
## 3 0.2020202 -1.363422 -1.528448 -1.198396
## 4 0.3030303 -1.349360 -1.509977 -1.188744
## 5 0.4040404 -1.335299 -1.491534 -1.179064
## 6 0.5050505 -1.321237 -1.473119 -1.169356
Plot
This model uses log10 transformation for the response, so let’s transform it back.
df_plot_trs <- df_plot
df_plot_trs$Prediction <- 10^(df_plot$Prediction)
df_plot_trs$CI_low <- 10^(df_plot$CI_low)
df_plot_trs$CI_high <- 10^(df_plot$CI_high)
Now let’s make the figure.
ggplot(NULL) + xlab("Colony size (log 10 scale)") + ylab("Worker covariance matrix") +
coord_cartesian(ylim = c(0, 0.8), xlim = c(1, 7.2)) + geom_ribbon(data = df_plot_trs,
aes(x = Predictor, ymin = CI_low, ymax = CI_high), fill = "grey85") +
geom_point(data = data, aes(x = log10.colony.size, y = worker_CoV),
size = 1) + geom_line(data = df_plot_trs, aes(x = Predictor,
y = Prediction))
Updates
08Dec: MCMC does not have residual standard error!
It turns out Bayesian stats requires simulations from the posterior distribution to get the credible interval estimation. The idea is still creating a list of x sequence, but a ‘noisy model’ is needed for creating the lower and upper boundary of estimated credible interval.
Online resources: