---
title: "Resampling and SMOTE to handle class imbalance"
author: Tyler Grimes
date: '2022-12-28'
abstract: 'Findings: (1) Predicted probabilities are miscalibrated when using resampling techniques. (2) Typical inference methods are invalid. (3) Using the naive threshold of 0.5 leads to misleading performance estimates.'
categories:
- Predictive modeling
- Classification
- Class imbalance
- Calibration
- SMOTE
tags: []
format:
html: default
---
An important consideration in classification problems is class imbalance. Suppose there are two groups, where 99% of observations come from the majority group. A classifier that simply predicts the majority group every time will have 99% accuracy. But clearly this isn't a good model (it's not actually predicting anything!).
Concerned by class imbalance, some modelers might consider strategies to address it, such as majority undersampling, minority oversampling, SMOTE, or many others. But how concerned should we really be? And, if we employ such methods, will they cause any unwanted side effects?
In this note we look at a simple case: a binary outcome with two predictors fit using a logistic regression model. We'll simulate data and see how three approaches to class imbalance compare.
```{r message=FALSE}
library(caret)
library(probably) # For calibration curves.
library(smotefamily) # For SMOTE().
library(gt)
library(broom)
library(tidyverse)
options(digits = 2)
```
# Generating data
We generate a large dataset of $n=5000$ observations, where $\sim10\%$ of observations are from the minority group. There are two predictors, $x$ and $z$, where $x$ has a strong association with the outcome and $z$ is independent of both $x$ and $y$.
::: {.callout-note}
In practice, the study/experimental design matters. In this note, we're assuming the data are a random sample from the population. That is, the prevalence of classes observed in the data is reflective of the population or data generating process. We won't be concerned with issues like population drift, and we're not considering study designs that sample disproportionately, such as case-control studies.
:::
```{r}
set.seed(12282022)
gen_data = function(n, sigma = 0.1) {
  x = rnorm(n) # Predictor associated with the outcome.
  z = rnorm(n) # Noise predictor, independent of x and y.
  # True model: logit(p) = 7 + 5 * x, plus a small amount of noise on the logit scale.
  # The large intercept makes y = 1 the majority class (~90% of observations).
  p = 1 / (1 + exp(-(7 + 5 * x) + rnorm(n, 0, sigma)))
  y = rbinom(n, 1, p)
  return(data.frame(y, x, z))
}
n = 5000
df = gen_data(n)
table(df$y)
```
# Strategies
We consider three strategies for model fitting:
1. No adjustment: Just fit the logistic regression model on the original data.
2. Resampling: use a mix of majority under-sampling and minority over-sampling to obtain a dataset with an exact 50:50 split of observations between the classes. This produces a sample size equal to the original $n = 5000$.
3. SMOTE: use synthetic minority oversampling to create additional observations from the minority class. We use `smotefamily::SMOTE()` with $K=5$ nearest neighbors (the package default) and `dup_size = 8`, so eight synthetic observations are generated for each original minority observation, giving approximately the same number of samples as in the majority class. This produces a sample size larger than the original dataset.
```{r}
apply_strategies = function(df) {
  # SMOTE: oversample the minority class (y = 0) with synthetic observations.
  # The returned `class` column is character, so convert it back to numeric 0/1.
  df.smote = SMOTE(df[, c("x", "z")], target = df$y, K = 5, dup_size = 8)$data
  df.smote$y = as.numeric(df.smote$class)
  # Resampling: draw n/2 observations (with replacement) from each class,
  # where n is the global sample size defined above.
  index_1 <- sample(which(df$y == 1), n / 2, replace = TRUE)
  index_0 <- sample(which(df$y == 0), n / 2, replace = TRUE)
  df.sample = df[c(index_0, index_1), ]
  df.list = list(
    "no-adjustment" = df,
    "resampling" = df.sample,
    "SMOTE" = df.smote
  )
  return(df.list)
}
df.list = apply_strategies(df)
lapply(df.list, function(df) table(df$y))
```
# Fitting the models
We fit a logistic regression model to each of the three datasets. No other adjustments are made to the data or models. For testing, we'll use separate datasets of size $5000$ simulated from the same data generating process.
```{r}
fits = lapply(df.list, function(df) glm(y ~ x + z, data = df, family = binomial))
lapply(fits, function(fit) tidy(fit))
```
A few observations from the coefficient estimates:
1. With no adjustments, the estimates of the intercept and both coefficients are within 1 SE of their true values (7, 5, and 0).
2. With both the resampling and SMOTE strategies, the estimate of the intercept is much lower than the true value. This is because the intercept reflects the baseline rate, and both adjustments push the baseline rate to roughly 0.5 (a quick check of the size of this shift is shown just after this list).
3. The SE's for the coefficient estimates are smaller with the resampling and SMOTE strategies. This might seem like a good thing, but it turns out they are biased--the reported values are lower than the true standard deviation of the estimates. This can be confirmed analytically or through a simulation study.
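As a rough check on (2), the size of the intercept shift can be predicted from the usual offset argument for outcome-dependent sampling: resampling to a 50:50 split moves the intercept by approximately the difference in log-odds between the resampled and original prevalence of $y = 1$ (about $-2.2$ here), while leaving the slopes unchanged.
```{r}
# Approximate intercept shift implied by resampling to a 50:50 split:
# the difference in log-odds between the resampled and original prevalence of y = 1.
offset = qlogis(0.5) - qlogis(mean(df$y))
offset
# Shifting the no-adjustment intercept by this amount should land close to the
# intercept estimated under the resampling strategy.
coef(fits[["no-adjustment"]])["(Intercept)"] + offset
coef(fits[["resampling"]])["(Intercept)"]
```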
Here's a quick simulation for (3), which suggests that not only are the SE's biased under the resampling strategy, but the actual SD's of the estimates are also larger than under the no-adjustment method (which makes sense, since we have the variation due to sampling from the population *plus* the variation due to the resampling strategy).
```{r}
res = replicate(100, {
  df = gen_data(n) # Note, important to get a new random sample each time.
  index_1 <- sample(which(df$y == 1), n / 2, replace = TRUE)
  index_0 <- sample(which(df$y == 0), n / 2, replace = TRUE)
  df.sample = df[c(index_0, index_1), ]
  fit = glm(y ~ x + z, data = df.sample, family = binomial)
  c(tidy(fit)$estimate, tidy(fit)$std.error) # Rows 1-3: estimates; rows 4-6: SE's.
})
gt(data.frame(param = c("Intercept", "x", "z"),
              estimated_SD = apply(res[1:3, ], 1, sd),
              average_SE = apply(res[4:6, ], 1, mean)))
```
::: {.callout-tip}
## Take-away
With biased SE's, the usual p-values and CI's become invalid if we're using a resampling strategy. Either a different method for computing p-values and CI's needs to be used, or the model fitting procedure needs to be wrapped in a bootstrap to get valid inferences.
:::
Here's another quick simulation showing that bootstrapping the entire procedure (including the resampling step) gives SE estimates much closer to the true SD's.
```{r}
res.boot = replicate(100, {
  # Bootstrap the full procedure: resample the original data, apply the 50:50
  # resampling strategy, then refit the model.
  df.boot = df[sample(1:n, n, replace = TRUE), ]
  index_1 <- sample(which(df.boot$y == 1), n / 2, replace = TRUE)
  index_0 <- sample(which(df.boot$y == 0), n / 2, replace = TRUE)
  df.sample = df.boot[c(index_0, index_1), ]
  fit = glm(y ~ x + z, data = df.sample, family = binomial)
  tidy(fit)$estimate
})
gt(data.frame(param = c("Intercept", "x", "z"),
              estimated_SD = apply(res[1:3, ], 1, sd),
              bootstrap_SE = apply(res.boot, 1, sd)))
```
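The same bootstrap replicates can also give percentile confidence intervals. Here's a minimal sketch (keeping in mind that the intercept interval is centered on the resampling-shifted value, not the original-scale intercept).
```{r}
# Percentile 95% intervals from the bootstrap replicates above.
ci = t(apply(res.boot, 1, quantile, probs = c(0.025, 0.975)))
rownames(ci) = c("Intercept", "x", "z")
ci
```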
# Calibration
Before looking at classification performance, let's first take a look at calibration. If we understand how the intercept term in the regression model reflects the baseline rate, then it should be no surprise that the two resampling strategies produce miscalibrated probabilities. These could always be re-calibrated (a sketch is given at the end of this section), but it should be kept in mind when moving on to the next step of selecting a threshold for classification.
```{r}
testing = gen_data(n) # Generate another dataset of the same size for testing.
lapply(fits, function(fit) {
  # Add predicted probabilities to the testing dataset.
  testing$`.pred_1` = predict(fit, newdata = testing, type = "response")
  testing$`.pred_0` = 1 - testing$`.pred_1`
  testing$y = factor(testing$y)
  cal_plot_breaks(testing, y, .pred_0)
})
```
::: {.callout-tip}
## Take-away
The model will not be calibrated when using resampling approaches.
:::
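If we do want to keep one of the resampling-based fits, the probabilities can be re-calibrated after the fact. Here's a minimal sketch of Platt-style recalibration for the resampling model, using a fresh calibration sample from the same data generating process (this recalibrated model isn't used in the rest of the note).
```{r}
# Refit intercept and slope on the model's logit-scale predictions using a
# separate calibration sample.
calibration = gen_data(n)
calibration$lp = predict(fits[["resampling"]], newdata = calibration) # Logit scale.
recal = glm(y ~ lp, data = calibration, family = binomial)

# Recalibrated probabilities on the test set, plotted with the same calibration curve.
testing$lp = predict(fits[["resampling"]], newdata = testing)
testing$`.pred_1` = predict(recal, newdata = testing, type = "response")
testing$`.pred_0` = 1 - testing$`.pred_1`
testing$y = factor(testing$y)
cal_plot_breaks(testing, y, .pred_0)
```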
# Performance measures
Finally, we turn our fitted model into a classifier. This involves applying a threshold to the predicted probabilities from the model to obtain the predicted class.
First, let's look at what happens if we naively use a threshold of 0.5.
```{r}
testing = gen_data(n) # Generate another dataset of the same size for testing.
res = lapply(fits, function(fit) {
threshold = 0.5
pred = predict(fit, newdata = testing, type = "response")
caret::confusionMatrix(factor((pred > threshold) * 1),
reference = factor(testing$y))$byClass
})
do.call(cbind, res)
```
While it looks like the no-adjustment method performs much worse than the other two approaches, this is only because of the 0.5 threshold.
::: {.callout-note}
Don't just apply a 0.5 threshold to get predicted classes. This naive approach makes a well-calibrated model look like it performs poorly for classification. Instead, we should either (a) use the baseline rate as the default threshold for calibrated models, or (b) choose a threshold that maximizes some metric reflecting the balance of type-I and type-II errors we're willing to make (for example, the F1 score; a quick sketch of this appears just after this callout).
:::
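Option (b) can be sketched with a small grid search over thresholds, shown here for the no-adjustment model. In practice the search should be done on a validation set rather than the final test set, and we'll stick with option (a) for the comparisons below.
```{r}
# Grid search for the threshold that maximizes F1 for the minority class
# (class 0 is the default "positive" class in confusionMatrix() here).
thresholds = seq(0.05, 0.95, by = 0.01)
pred = predict(fits[["no-adjustment"]], newdata = testing, type = "response")
f1 = sapply(thresholds, function(t) {
  caret::confusionMatrix(factor((pred > t) * 1, levels = c(0, 1)),
                         reference = factor(testing$y, levels = c(0, 1)))$byClass["F1"]
})
thresholds[which.max(f1)]
```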
Let's run this again, but instead of using 0.5 we'll use the prevalence of $y = 1$ in the data used to fit the model. (Note, for both resampling methods this will be close to 0.5 by design, but for the no-adjustment model it will be close to 0.9, which is equivalent to a threshold of about 0.1 on the predicted probability of the minority class.)
```{r}
testing = gen_data(n) # Generate another dataset of the same size for testing.
res = lapply(fits, function(fit) {
  threshold = mean(fit$data$y) # Prevalence of y = 1 in the data used to fit the model.
  pred = predict(fit, newdata = testing, type = "response")
  caret::confusionMatrix(factor((pred > threshold) * 1),
                         reference = factor(testing$y))$byClass
})
do.call(cbind, res)
```
Now we see that... well, all three models perform almost identically. This is due to a few key reasons:
1. All three are using the same underlying model (logistic regression) and the same predictors.
2. All three classifiers use a threshold set to the baseline rate the model saw during training, which removes any performance differences due to differences in calibration.
3. We generated a large dataset with a strong signal, so the variation introduced by the resampling methods doesn't affect performance much.
::: {.callout-tip}
## Take-away
If we think about model calibration and choose the threshold for the classifier appropriately, we find no advantage to the resampling or SMOTE strategies (at least for this simple model).
:::
That last part--"at least for this simple model"--feels like it needs attention. Will these results still hold with a noisier dataset? Or a dataset with more variables, or a more complex underlying model? What if we need to use a random forest or neural network? Does the simple solution of calibration and choosing an appropriate threshold fix all the problems caused by class imbalance in those scenarios?
Well, that's too much to get into for this note, but we'll at least run a few more simulations with a smaller sample size and a noisier data generating process. Beyond that, I'll leave it here for now.
```{r warning = FALSE}
set.seed(12345)
res = lapply(1:20, function(sim) {
  n = 100   # Smaller sample size than before.
  sigma = 1 # Noisier data generating process.
  df = gen_data(n, sigma = sigma)
  df.list = apply_strategies(df)
  fits = lapply(df.list, function(df) glm(y ~ x + z, data = df, family = binomial))
  testing = gen_data(n, sigma = sigma) # Generate another dataset of the same size for testing.
  res = lapply(1:3, function(i) {
    fit = fits[[i]]
    threshold = mean(fit$data$y) # Prevalence of y = 1 in the data used to fit the model.
    pred = predict(fit, newdata = testing, type = "response")
    # Specify both factor levels so confusionMatrix() still works if a small test set
    # produces predictions of only one class. Keep the first seven per-class metrics.
    val = caret::confusionMatrix(factor((pred > threshold) * 1, levels = c(0, 1)),
                                 reference = factor(testing$y, levels = c(0, 1)))$byClass[-c(8:11)]
    dat = data.frame(metric = names(val), value = val, model = names(fits)[i])
    dat$metric = factor(dat$metric, levels = names(val))
    return(dat)
  })
  data.frame(sim = sim, do.call(rbind, res))
})
dat = do.call(rbind, res)
```
```{r}
#| label: fig-box
#| fig-cap: Boxplots summarizing the marginal performance of each model across 20 simulations, using a threshold set to the baseline rate in the training set.
dat %>%
ggplot(aes(x = model, y = value)) +
facet_wrap(. ~ metric, scales = "free_y") +
geom_point(position = position_jitter(width = 0.3)) +
geom_boxplot(alpha = 0.6) +
theme_bw()
```
```{r}
#| label: fig-lines
#| fig-cap: Line plot showing the relative performance of each approach in each of the 20 simulations.
dat %>%
ggplot(aes(x = model, y = value, color = factor(sim), group = model)) +
facet_wrap(. ~ metric, scales = "free_y") +
# geom_point(position = position_jitter(width = 0.3)) +
geom_point() +
geom_line(aes(group = factor(sim))) +
# geom_boxplot(alpha = 0.6) +
theme_bw() +
theme(legend.position = "none")
```
::: {.callout-tip}
## Take-away
With a fairly small, noisy dataset, there is minimal difference between the three approaches.
:::