Custom Plot Confidence Interval

This is a historic page made with RMarkdown: manually plotting the confidence interval of a continuous predictor.

Concept

References

This description is based on this popular thread, but I have rewritten it for archiving and for better understanding. A proper statistics textbook would be a better source for validation.

What regression model actually does

Let us define \(x\) as the observed predictor samples, where there are \(n\) samples and the average of the observed predictor values is \(\bar{x}\). Then, a simple regression model finds the “best” line minimizing the sum of squared deviations between the samples and the line, \[\begin{equation} \tag{1.1} y= a(x-\bar{x})+b, \end{equation}\]

where \(a\) is the estimated slope (fixed effect), and \(b\) is the intercept at \(x=\bar{x}\).
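As a quick illustration on simulated data (a sketch; the data and variable names are made up, not from this post's dataset), fitting `lm()` with a centered predictor recovers exactly this parameterization, with \(b\) equal to the mean of \(y\):

```r
# Simulated data: true slope of 2 plus Gaussian noise
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 * x + rnorm(50)
# Centering the predictor gives the y = a*(x - x_bar) + b form;
# the intercept is then the fitted value at x = x_bar, i.e. mean(y)
fit <- lm(y ~ I(x - mean(x)))
coef(fit)  # (Intercept) is b; the other coefficient is the slope a
```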

Getting the residual standard error of the model

Sadly, no model is perfect, and there are always some deviations between the actual data and the model. In the language of statistics, these remaining deviations are called ‘residuals’, and they are crucial for manually plotting the confidence interval. However, the raw residuals need to be processed before we can find the range of the confidence interval, and the first step is getting the residual standard error of the model.

The formula for the residual standard error depends on the distributional assumption of the regression model: some models, like Gaussian regression, have no additional terms, but Poisson regression has additional terms to calibrate the standard errors. In Gaussian regression, the residual standard error is defined as \[\begin{equation} \tag{1.2} \sigma= \sqrt{\frac{\sum{residuals^2}}{df}}, \end{equation}\]

where the squared residuals are summed across all \(n\) samples, and \(df\) is the degrees of freedom. For conventional models, \(df = n- (n_{predictors}+1)\), where \(n_{predictors}\) is the number of predictors in the regression model. (Other distributions need double-checking; I am not sure whether anything differs in logistic regression.)
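Equation 1.2 can be checked against the sigma reported by `summary.lm()`. A sketch on simulated data (the variable names are illustrative):

```r
# Simulated data and a simple regression with one predictor
set.seed(1)
x <- runif(50)
y <- 1 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
# Degrees of freedom: n - (n_predictors + 1), here 50 - 2
df <- length(x) - (1 + 1)
sigma_manual <- sqrt(sum(residuals(fit)^2) / df)
# Should match R's built-in residual standard error
all.equal(sigma_manual, summary(fit)$sigma)
```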

The confidence interval

Now we can consider the standard errors of the slope \(a\) and the intercept \(b\) (at \(x=\bar{x}\)). If we know \(a\) is true, the standard error of \(b\) is \(\sigma/\sqrt{n}\). Similarly, if we know \(b\) is true, the standard error of \(a\) is \(\sigma / \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}\).

Assuming we are interested in the uncertainty at \(x=x_{focal}\), the effect of uncertainty in the slope is multiplied by how far \(x_{focal}\) is from the mean \(\bar{x}\), giving \((x_{focal}-\bar{x})\sigma / \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}\). Because \(a\) and \(b\) are uncorrelated, their variances add, so the overall standard error is the square root of the sum of the squares of the two components, \(\sqrt{(\sigma/\sqrt{n})^{2}+((x_{focal}-\bar{x})\sigma / \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^{2}})^{2}}\). With some simplification, we get the usual expression for the standard error of the estimated mean value at \(x_{focal}\), \[\begin{equation} \tag{1.3} \sigma \sqrt{\frac{1}{n}+\frac{(x_{focal}-\bar{x})^{2}}{\sum_{i=1}^{n} (x_i-\bar{x})^{2}}}. \end{equation}\]

If you draw expression 1.3 as a function of \(x_{focal}\), you will see it forms a curve like a smile, with a minimum at \(\bar{x}\). The whole term is what gets added to or subtracted from the fitted line (well, a multiple of it, in order to get the desired confidence level).
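As a sanity check, expression 1.3 can be compared with R's built-in `predict(..., se.fit = TRUE)` on simulated data (a sketch; the data and focal point are illustrative):

```r
# Simulated data and a simple regression
set.seed(1)
x <- runif(50)
y <- 1 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
x_focal <- 0.25
# Expression 1.3, assembled by hand
s <- summary(fit)$sigma
se_manual <- s * sqrt(1/length(x) +
    (x_focal - mean(x))^2 / sum((x - mean(x))^2))
# R's built-in standard error of the fitted mean at x_focal
se_builtin <- predict(fit, newdata = data.frame(x = x_focal),
    se.fit = TRUE)$se.fit
all.equal(se_manual, unname(se_builtin))
```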

The custom function

You can put this function into your working code and load it before making figures (you only need to load it once per session, just like loading an R package).

# A function for numerically generating the prediction
# and confidence interval of a simple regression model
CI_custom <- function(model, data, predictor, xmin = 0, xmax = 1,
    precision = 100) {
    # Model should be a regression model with a continuous
    # predictor. Data should be the data frame you used to
    # fit the regression model. Predictor is a numeric vector
    # (a column of the data) which is the predictor of
    # your model. Last modified 13 Dec, 2022 (Ming Liu)
    if ((length(model) + length(predictor) + length(data)) >
        0) {
        if (inherits(model, "MCMCglmm"))
            return("This function does not work with Bayesian stats models.") else {
            ## Extracting information from the model and
            ## data
            num_vars <- length(coef(model))  # Number of coefficients (including the intercept), for the degrees of freedom
            sample_size <- length(predictor)  # Number of observations
            residual_se <- sqrt(sum(model$residuals^2)/(sample_size -
                num_vars))  # The residual standard error (Eq. 1.2)
            ## Creating the x sequence and the point predictions
            length_CI <- precision  # Length of the sequence
            xx <- seq(xmin, xmax, length.out = length_CI)  # The sequence of x
            yy <- xx * model$coefficients[2] + model$coefficients[1]  # The estimated y
            ## Calculate the sum of squared deviations across
            ## all x observations
            x_bar <- mean(predictor)  # Average x of the samples
            var_x <- sum((predictor - x_bar)^2)
            ## Numerically calculate the 95% CI boundaries (Eq. 1.3)
            difference <- residual_se * sqrt(1/sample_size +
                (xx - x_bar)^2/var_x)
            ci_min <- yy - 1.96 * difference
            ci_max <- yy + 1.96 * difference
            ## Assemble the plotting data frame
            df_forplot <- data.frame(Predictor = xx, Prediction = yy,
                CI_low = ci_min, CI_high = ci_max)
            return(df_forplot)
        }
    } else return("Invalid inputs! Please check the model, data, and predictor arguments of the function.")
}

Example code

Read in the model and data

library(data.table)  # Provides fread()
data <- fread("AntDataSimple.txt")  # Read in the data
model <- readRDS("AntModelSimple.RDS")  # Read the model file (use 'saveRDS' to save a stats model)
head(data)
##                       animal type colony.size SD_all_workers
## 1:     Acromyrmex_echinatior  ant    19274.12     0.98712107
## 2:   Acromyrmex_octospinosus  ant    22005.50     0.28834996
## 3:   Acromyrmex_subterraneus  ant     4766.20     0.60716543
## 4:         Aenictus_dentatus  ant   325000.00     0.03397548
## 5:        Aenictus_laeviceps  ant    92500.00     0.07804005
## 6: Allomerus_octoarticulatus  ant      116.00     0.03933460
##    All_workers_HW_mean worker_CoV log10.colony.size
## 1:           2.1490000 0.45933973          4.284975
## 2:           2.2842000 0.12623674          4.342531
## 3:           2.1843636 0.27795987          3.678172
## 4:           0.8403333 0.04043096          5.511883
## 5:           0.6837500 0.11413536          4.966142
## 6:           0.4871000 0.08075262          2.064458
class(model)  # For some reason phylolm object is not easy to print out in R markdown
## [1] "phylolm"

Run

The function returns a data frame with four columns: the predictor (x), the prediction (y), and the lower and upper bounds of the confidence interval.

df_plot <- CI_custom(model = model, data = data, predictor = data$log10.colony.size,
    xmin = 0, xmax = 10)
head(df_plot)
##   Predictor Prediction    CI_low   CI_high
## 1 0.0000000  -1.391545 -1.565460 -1.217630
## 2 0.1010101  -1.377483 -1.546943 -1.208024
## 3 0.2020202  -1.363422 -1.528448 -1.198396
## 4 0.3030303  -1.349360 -1.509977 -1.188744
## 5 0.4040404  -1.335299 -1.491534 -1.179064
## 6 0.5050505  -1.321237 -1.473119 -1.169356

Plot

This model uses a log10-transformed response, so let’s transform the predictions back.

df_plot_trs <- df_plot
df_plot_trs$Prediction <- 10^(df_plot$Prediction)
df_plot_trs$CI_low <- 10^(df_plot$CI_low)
df_plot_trs$CI_high <- 10^(df_plot$CI_high)

Now let’s make the figure.

ggplot(NULL) + xlab("Colony size (log 10 scale)") + ylab("Worker coefficient of variation") +
    coord_cartesian(ylim = c(0, 0.8), xlim = c(1, 7.2)) + geom_ribbon(data = df_plot_trs,
    aes(x = Predictor, ymin = CI_low, ymax = CI_high), fill = "grey85") +
    geom_point(data = data, aes(x = log10.colony.size, y = worker_CoV),
        size = 1) + geom_line(data = df_plot_trs, aes(x = Predictor,
    y = Prediction))
The relation between ant colony size and worker polymorphism (coefficient of variance).

Updates

08 Dec: MCMCglmm does not have a residual standard error!

It turns out that Bayesian models require simulations from the posterior distribution to estimate the credible interval. The idea is still to create a sequence of x values, but draws from the posterior (a ‘noisy model’) are needed to create the lower and upper boundaries of the estimated credible interval.
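The idea can be sketched as follows. Note this is not the MCMCglmm API; the intercept and slope draws below are hypothetical, just to illustrate how quantiles of the draw-wise predictions form the credible band:

```r
# Hypothetical posterior draws of intercept (b) and slope (a);
# with a real model these would come from its posterior samples
set.seed(1)
n_draws <- 1000
post <- data.frame(b = rnorm(n_draws, 0.5, 0.05),
                   a = rnorm(n_draws, 1.2, 0.10))
# The x sequence, as in the frequentist version
xx <- seq(0, 1, length.out = 100)
# One predicted line per posterior draw (n_draws x 100 matrix)
pred <- sapply(xx, function(x) post$b + post$a * x)
# The 95% credible band is the 2.5% and 97.5% quantiles per x
ci_low  <- apply(pred, 2, quantile, probs = 0.025)
ci_high <- apply(pred, 2, quantile, probs = 0.975)
```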
