---
title: "Resampling and SMOTE to handle class imbalance"
author: Tyler Grimes
date: '2022-12-28'
abstract: 'Findings: (1) Predicted probabilities are miscalibrated when using resampling techniques. (2) Typical inference methods are invalid. (3) Using the naive threshold of 0.5 leads to misleading performance estimates.'
categories:
- Predictive modeling
- Classification
- Class imbalance
- Calibration
- SMOTE
tags: []
format:
html: default
---
An important consideration in classification problems is class imbalance. Suppose there are two groups, where 99% of observations come from the majority group. A classifier that simply predicts the majority group every time will have 99% accuracy. But clearly this isn't a good model (it's not actually predicting anything!).
Concerned by class imbalance, some modelers might consider strategies to address it, such as majority undersampling, minority oversampling, SMOTE, or many others. But how concerned should we really be? And, if we employ such methods, will they cause any unwanted side effects?
In this note we look at a simple case: a binary outcome with two predictors fit using a logistic regression model. We'll simulate data and see how three approaches to class imbalance compare.
```{r message=FALSE}
library(caret)
library(probably) # For calibration curves.
library(smotefamily) # For SMOTE().
library(gt)
library(broom)
library(tidyverse)
options(digits = 2)
```
# Generating data
We generate a large dataset of $n=5000$ observations, where $\sim10\%$ of observations are from the minority group. There are two predictors, $x$ and $z$, where $x$ has a strong association with the outcome and $z$ is independent of both $x$ and $y$.
::: {.callout-note}
In practice, the study/experimental design matters. In this note, we're assuming the data are a random sample from the population. That is, the prevalence of classes observed in the data is reflective of the population or data generating process. We won't be concerned with issues like population drift, and we're not considering study designs that sample disproportionately, such as case-control studies.
:::
```{r}
set.seed(12282022)
gen_data = function(n, sigma = 0.1) {
  x = rnorm(n) # Predictor associated with the outcome.
  z = rnorm(n) # Noise predictor, independent of x and y.
  # True model: logit(p) = 7 + 5 * x, plus a small amount of noise on the logit scale.
  # The large intercept makes y = 1 the majority class (~90% of observations).
  p = 1 / (1 + exp(-(7 + 5 * x) + rnorm(n, 0, sigma)))
  y = rbinom(n, 1, p)
  return(data.frame(y, x, z))
}
n = 5000
df = gen_data(n)
table(df$y)
```
# Strategies
We consider three strategies for model fitting:
1. No adjustment: Just fit the logistic regression model on the original data.
2. Resampling: use a mix of majority under-sampling and minority over-sampling to obtain a dataset with an exact 50:50 split of observations between the classes. This produces a sample size equal to the original $n = 5000$.
3. SMOTE: use synthetic minority oversampling to create additional observations from the minority class. We use `smotefamily::SMOTE()` with $K=5$ nearest neighbors (the package default) and `dup_size = 8`, so eight synthetic observations are generated for each original minority observation, giving approximately the same number of samples as in the majority class. This produces a sample size larger than the original dataset.
```{r}
apply_strategies = function(df) {
  # SMOTE: oversample the minority class (y = 0) with synthetic observations.
  # The returned `class` column is character, so convert it back to numeric 0/1.
  df.smote = SMOTE(df[, c("x", "z")], target = df$y, K = 5, dup_size = 8)$data
  df.smote$y = as.numeric(df.smote$class)
  # Resampling: draw n/2 observations (with replacement) from each class,
  # where n is the global sample size defined above.
  index_1 <- sample(which(df$y == 1), n / 2, replace = TRUE)
  index_0 <- sample(which(df$y == 0), n / 2, replace = TRUE)
  df.sample = df[c(index_0, index_1), ]
  df.list = list(
    "no-adjustment" = df,
    "resampling" = df.sample,
    "SMOTE" = df.smote
  )
  return(df.list)
}
df.list = apply_strategies(df)
lapply(df.list, function(df) table(df$y))
```
# Fitting the models
We fit a logistic regression model to each of the three datasets. No other adjustments are made to the data or models. For testing, we'll use separate datasets of size $5000$ simulated from the same data generating process.
```{r}
fits = lapply(df.list, function(df) glm(y ~ x + z, data = df, family = binomial))
lapply(fits, function(fit) tidy(fit))
```
A few observations from the coefficient estimates:
1. With no adjustments, the estimates of the intercept and both coefficients are within 1 SE of their true values (7, 5, and 0).
2. With both the resampling and SMOTE strategies, the estimate of the intercept is much lower than the true value. This is because the intercept reflects the baseline rate, and both adjustments push the baseline rate to roughly 0.5 (a quick check of the size of this shift is shown just after this list).
3. The SE's for the coefficient estimates are smaller with the resampling and SMOTE strategies. This might seem like a good thing, but it turns out they are biased--the reported values are lower than the true standard deviation of the estimates. This can be confirmed analytically or through a simulation study.
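As a rough check on (2), the size of the intercept shift can be predicted from the usual offset argument for outcome-dependent sampling: resampling to a 50:50 split moves the intercept by approximately the difference in log-odds between the resampled and original prevalence of $y = 1$ (about $-2.2$ here), while leaving the slopes unchanged.
```{r}
# Approximate intercept shift implied by resampling to a 50:50 split:
# the difference in log-odds between the resampled and original prevalence of y = 1.
offset = qlogis(0.5) - qlogis(mean(df$y))
offset
# Shifting the no-adjustment intercept by this amount should land close to the
# intercept estimated under the resampling strategy.
coef(fits[["no-adjustment"]])["(Intercept)"] + offset
coef(fits[["resampling"]])["(Intercept)"]
```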
Here's a quick simulation for (3), which suggests that not only are the SE's biased under the resampling strategy, but the actual SD's of the estimates are also larger than under the no-adjustment method (which makes sense, since we have the variation due to sampling from the population *plus* the variation due to the resampling strategy).
```{r}
res = replicate(100, {
  df = gen_data(n) # Note, important to get a new random sample each time.
  index_1 <- sample(which(df$y == 1), n / 2, replace = TRUE)
  index_0 <- sample(which(df$y == 0), n / 2, replace = TRUE)
  df.sample = df[c(index_0, index_1), ]
  fit = glm(y ~ x + z, data = df.sample, family = binomial)
  c(tidy(fit)$estimate, tidy(fit)$std.error) # Rows 1-3: estimates; rows 4-6: SE's.
})
gt(data.frame(param = c("Intercept", "x", "z"),
              estimated_SD = apply(res[1:3, ], 1, sd),
              average_SE = apply(res[4:6, ], 1, mean)))
```
::: {.callout-tip}
## Take-away
With biased SE's, the usual p-values and CI's become invalid if we're using a resampling strategy. Either a different method for computing p-values and CI's needs to be used, or the model fitting procedure needs to be wrapped in a bootstrap to get valid inferences.
:::
Here's another quick simulation showing that bootstrapping the entire procedure (including the resampling step) gives SE estimates much closer to the true SD's.
```{r}
res.boot = replicate(100, {
  # Bootstrap the full procedure: resample the original data, apply the 50:50
  # resampling strategy, then refit the model.
  df.boot = df[sample(1:n, n, replace = TRUE), ]
  index_1 <- sample(which(df.boot$y == 1), n / 2, replace = TRUE)
  index_0 <- sample(which(df.boot$y == 0), n / 2, replace = TRUE)
  df.sample = df.boot[c(index_0, index_1), ]
  fit = glm(y ~ x + z, data = df.sample, family = binomial)
  tidy(fit)$estimate
})
gt(data.frame(param = c("Intercept", "x", "z"),
              estimated_SD = apply(res[1:3, ], 1, sd),
              bootstrap_SE = apply(res.boot, 1, sd)))
```
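The same bootstrap replicates can also give percentile confidence intervals. Here's a minimal sketch (keeping in mind that the intercept interval is centered on the resampling-shifted value, not the original-scale intercept).
```{r}
# Percentile 95% intervals from the bootstrap replicates above.
ci = t(apply(res.boot, 1, quantile, probs = c(0.025, 0.975)))
rownames(ci) = c("Intercept", "x", "z")
ci
```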
# Calibration
Before looking at classification performance, let's first take a look at calibration. If we understand how the intercept term in the regression model reflects the baseline rate, then it should be no surprise that the two resampling strategies produce miscalibrated probabilities. These could always be re-calibrated (a sketch is given at the end of this section), but it should be kept in mind when moving on to the next step of selecting a threshold for classification.
```{r}
testing = gen_data(n) # Generate another dataset of the same size for testing.
lapply(fits, function(fit) {
  # Add predicted probabilities to the testing dataset.
  testing$`.pred_1` = predict(fit, newdata = testing, type = "response")
  testing$`.pred_0` = 1 - testing$`.pred_1`
  testing$y = factor(testing$y)
  cal_plot_breaks(testing, y, .pred_0)
})
```
::: {.callout-tip}
## Take-away
The model will not be calibrated when using resampling approaches.
:::
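If we do want to keep one of the resampling-based fits, the probabilities can be re-calibrated after the fact. Here's a minimal sketch of Platt-style recalibration for the resampling model, using a fresh calibration sample from the same data generating process (this recalibrated model isn't used in the rest of the note).
```{r}
# Refit intercept and slope on the model's logit-scale predictions using a
# separate calibration sample.
calibration = gen_data(n)
calibration$lp = predict(fits[["resampling"]], newdata = calibration) # Logit scale.
recal = glm(y ~ lp, data = calibration, family = binomial)

# Recalibrated probabilities on the test set, plotted with the same calibration curve.
testing$lp = predict(fits[["resampling"]], newdata = testing)
testing$`.pred_1` = predict(recal, newdata = testing, type = "response")
testing$`.pred_0` = 1 - testing$`.pred_1`
testing$y = factor(testing$y)
cal_plot_breaks(testing, y, .pred_0)
```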
# Performance measures
Finally, we turn our fitted model into a classifier. This involves applying a threshold to the predicted probabilities from the model to obtain the predicted class.
First, let's look at what happens if we naively use a threshold of 0.5.
```{r}
testing = gen_data(n) # Generate another dataset of the same size for testing.
res = lapply(fits, function(fit) {
threshold = 0.5
pred = predict(fit, newdata = testing, type = "response")
caret::confusionMatrix(factor((pred > threshold) * 1),
reference = factor(testing$y))$byClass
})
do.call(cbind, res)
```
While it looks like the no-adjustment method performs much worse than the other two approaches, this is only because of the 0.5 threshold.
::: {.callout-note}
Don't just apply a 0.5 threshold to get predicted classes. This naive approach makes a well-calibrated model look like it performs poorly for classification. Instead, we should either (a) use the baseline rate as the default threshold for calibrated models, or (b) choose a threshold that maximizes some metric reflecting the balance of type-I and type-II errors we're willing to make (for example, the F1 score; a quick sketch of this appears just after this callout).
:::
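Option (b) can be sketched with a small grid search over thresholds, shown here for the no-adjustment model. In practice the search should be done on a validation set rather than the final test set, and we'll stick with option (a) for the comparisons below.
```{r}
# Grid search for the threshold that maximizes F1 for the minority class
# (class 0 is the default "positive" class in confusionMatrix() here).
thresholds = seq(0.05, 0.95, by = 0.01)
pred = predict(fits[["no-adjustment"]], newdata = testing, type = "response")
f1 = sapply(thresholds, function(t) {
  caret::confusionMatrix(factor((pred > t) * 1, levels = c(0, 1)),
                         reference = factor(testing$y, levels = c(0, 1)))$byClass["F1"]
})
thresholds[which.max(f1)]
```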
Let's run this again, but instead of using 0.5 we'll use the prevalence of $y = 1$ in the data used to fit the model. (Note, for both resampling methods this will be close to 0.5 by design, but for the no-adjustment model it will be close to 0.9, which is equivalent to a threshold of about 0.1 on the predicted probability of the minority class.)
```{r}
testing = gen_data(n) # Generate another dataset of the same size for testing.
res = lapply(fits, function(fit) {
  threshold = mean(fit$data$y) # Prevalence of y = 1 in the data used to fit the model.
  pred = predict(fit, newdata = testing, type = "response")
  caret::confusionMatrix(factor((pred > threshold) * 1),
                         reference = factor(testing$y))$byClass
})
do.call(cbind, res)
```
Now we see that... well, all three models perform almost identically. This is due to a few key reasons:
1. All three are using the same underlying model (logistic regression) and the same predictors.
2. All three classifiers use a threshold set to the baseline rate the model saw during training, which removes any performance differences due to differences in calibration.
3. We generated a large dataset with a strong signal, so the variation introduced by the resampling methods doesn't affect performance much.
::: {.callout-tip}
## Take-away
If we think about model calibration and choose the threshold for the classifier appropriately, we find no advantage to the resampling or SMOTE strategies (at least for this simple model).
:::
That last part--"at least for this simple model"--feels like it needs attention. Will these results still hold with a noisier dataset? Or a dataset with more variables, or a more complex underlying model? What if we need to use a random forest or neural network? Does the simple solution of calibration and choosing an appropriate threshold fix all the problems caused by class imbalance in those scenarios?
Well, that's too much to get into for this note, but we'll at least run a few more simulations with a smaller sample size and a noisier data generating process. Beyond that, I'll leave it here for now.
```{r warning = FALSE}
set.seed(12345)
res = lapply(1:20, function(sim) {
  n = 100   # Smaller sample size than before.
  sigma = 1 # Noisier data generating process.
  df = gen_data(n, sigma = sigma)
  df.list = apply_strategies(df)
  fits = lapply(df.list, function(df) glm(y ~ x + z, data = df, family = binomial))
  testing = gen_data(n, sigma = sigma) # Generate another dataset of the same size for testing.
  res = lapply(1:3, function(i) {
    fit = fits[[i]]
    threshold = mean(fit$data$y) # Prevalence of y = 1 in the data used to fit the model.
    pred = predict(fit, newdata = testing, type = "response")
    # Specify both factor levels so confusionMatrix() still works if a small test set
    # produces predictions of only one class. Keep the first seven per-class metrics.
    val = caret::confusionMatrix(factor((pred > threshold) * 1, levels = c(0, 1)),
                                 reference = factor(testing$y, levels = c(0, 1)))$byClass[-c(8:11)]
    dat = data.frame(metric = names(val), value = val, model = names(fits)[i])
    dat$metric = factor(dat$metric, levels = names(val))
    return(dat)
  })
  data.frame(sim = sim, do.call(rbind, res))
})
dat = do.call(rbind, res)
```
```{r}
#| label: fig-box
#| fig-cap: Boxplots summarizing the marginal performance of each model across 20 simulations, using a threshold set to the baseline rate in the training set.
dat %>%
ggplot(aes(x = model, y = value)) +
facet_wrap(. ~ metric, scales = "free_y") +
geom_point(position = position_jitter(width = 0.3)) +
geom_boxplot(alpha = 0.6) +
theme_bw()
```
```{r}
#| label: fig-lines
#| fig-cap: Line plot showing the relative performance of each approach in each of the 20 simulations.
dat %>%
ggplot(aes(x = model, y = value, color = factor(sim), group = model)) +
facet_wrap(. ~ metric, scales = "free_y") +
# geom_point(position = position_jitter(width = 0.3)) +
geom_point() +
geom_line(aes(group = factor(sim))) +
# geom_boxplot(alpha = 0.6) +
theme_bw() +
theme(legend.position = "none")
```
::: {.callout-tip}
## Take-away
With a fairly small, noisy dataset, there is minimal difference between the three approaches.
:::