Adjusting classification threshold by strata

Predictive modeling
Classification
Author

Tyler Grimes

Published

July 3, 2024

When predicting binary outcomes, if we want to threshold our estimated probabilities to obtain predicted classes, we need a strategy for choosing that threshold. The choice almost always involves a trade-off between sensitivity and specificity, or between false positives and false negatives. But if we know that the prevalence of the positive class varies across some known strata, should we choose a different threshold within each stratum? And how does that affect the final performance?
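
To make this trade-off concrete, here is a tiny toy sketch (made-up probabilities and labels, not part of the simulation below): raising the threshold turns more predictions negative, which lowers sensitivity and raises specificity.

Code
# Toy example: sensitivity and specificity at two different thresholds
p.toy = c(0.1, 0.3, 0.45, 0.55, 0.7, 0.9)
y.toy = c(0, 0, 1, 0, 1, 1)
sapply(c(0.4, 0.6), function(thr) {
  pred = 1 * (p.toy > thr)
  c(threshold = thr,
    sens = mean(pred[y.toy == 1] == 1),
    spec = mean(pred[y.toy == 0] == 0))
})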

Code
library(tidyverse)

Let’s simulate data from three groups where the true prevalence ranges from 0.6 to 0.9 across the groups. Our dataset will be composed of samples from these three groups with unequal sizes.

Code
set.seed(0)
n = 300

# Continuous predictor
x = rnorm(n)

# Strata variable with unequal group sizes
groups = c("a", "b", "c")
z = sample(groups, n, replace = TRUE, prob = c(0.3, 0.2, 0.5))
z = factor(z, levels = groups)

# Group-specific intercept shifts produce different prevalences across strata
beta.x = 1
beta.z = c("a" = 0, "b" = 1, "c" = 2)
epsilon = rnorm(n, 0, 0.2)

# True probability of the positive class and simulated binary outcome
prob = 1 / (1 + exp(-(0.5 + beta.x * x + beta.z[z] + epsilon)))
y = rbinom(n, 1, prob)
df = data.frame(x = x, z = z, y = y, prob = prob)

table(z)
z
  a   b   c 
 87  61 152 
Code
tapply(y, z, mean)
        a         b         c 
0.6321839 0.7868852 0.8947368 
Code
mean(y)
[1] 0.7966667
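
As a quick sanity check (not shown in the original output), the group-level prevalence implied by the simulated probabilities can be compared with the observed rates above; the two should be close.

Code
# Model-implied prevalence within each group
tapply(prob, z, mean)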

We’ll fit a logistic regression model with a single predictor and the strata variable. Two strategies for thresholding will be compared:

  1. A single threshold based on the overall prevalence
  2. A varying threshold based on the prevalence within each stratum
Code
fit = glm(y ~ x + z, family = "binomial", data = df)
coef(fit)
(Intercept)           x          zb          zc 
  0.6803849   0.6825582   0.6736314   1.5837307 
Code
df$p = predict(fit, newdata = df, type = "response")

df = df %>%
  # Single threshold: the overall prevalence
  mutate(thr = mean(y)) %>%
  group_by(z) %>%
  # Group-specific threshold: the group's prevalence, shrunk 20% toward the overall prevalence
  mutate(thr.z = mean(y) - 0.2 * (mean(y) - thr)) %>%
  ungroup() %>%
  mutate(y.hat = 1 * (p > thr),
         y.hat.z = 1 * (p > thr.z))

# One threshold value per group (used for plotting later)
thr.group = sapply(groups, function(group) df$thr.z[df$z == group][1])
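
For reference, here is a small sketch (output not rendered) to inspect the single threshold, the group-specific thresholds, and the overall accuracy under each strategy.

Code
unique(df$thr)   # single threshold = overall prevalence
thr.group        # group-specific thresholds
c(single  = mean(df$y == df$y.hat),
  varying = mean(df$y == df$y.hat.z))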

Looking only at the overall accuracy, we see below that the single threshold gives better performance (69.3% vs 67.7%). But how does the performance vary across strata? Let’s take a look.

Here’s the overall and within-strata accuracy for the single-threshold strategy:

Code
df %>%
  group_by(z) %>%
  summarize(n = n(), 
            acc = mean(y == y.hat)) %>%
  mutate(overall_acc = sum(n * acc) / sum(n))
# A tibble: 3 × 4
  z         n   acc overall_acc
  <fct> <int> <dbl>       <dbl>
1 a        87 0.460       0.693
2 b        61 0.607       0.693
3 c       152 0.862       0.693

and for the varying-threshold strategy:

Code
df %>%
  group_by(z) %>%
  summarize(n = n(), 
            acc = mean(y == y.hat.z)) %>%
  mutate(overall_acc = sum(n * acc) / sum(n)) 
# A tibble: 3 × 4
  z         n   acc overall_acc
  <fct> <int> <dbl>       <dbl>
1 a        87 0.632       0.677
2 b        61 0.607       0.677
3 c       152 0.730       0.677

We see that performance varies widely across strata with the single-threshold strategy. By using an adaptive threshold, the classifier accounts for the varying prevalence across the groups, and the within-strata accuracy is much more consistent.
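
One way to quantify this consistency (a small sketch, not in the original post) is the gap between the best and worst within-group accuracy under each strategy; from the tables above, this is roughly 0.40 for the single threshold versus about 0.12 for the varying thresholds.

Code
df %>%
  group_by(z) %>%
  summarize(acc = mean(y == y.hat), acc.z = mean(y == y.hat.z)) %>%
  summarize(spread.single  = max(acc) - min(acc),
            spread.varying = max(acc.z) - min(acc.z))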

Here’s another summary table showing more performance measures. We see the same pattern: the varying-threshold strategy provides more consistent performance across strata.

Code
df %>%
  group_by(z) %>%
  # Unsuffixed metrics use the single threshold; ".z" metrics use the group-specific thresholds.
  # p = observed prevalence, pp = proportion of predicted positives.
  summarize(n = n(),
            p      = round(mean(y), 2), 
            pp     = round(mean(y.hat), 2),
            pp.z   = round(mean(y.hat.z), 2),
            acc    = round(mean(y == y.hat), 2),
            sens   = round(mean(y[y==1] == y.hat[y==1]), 2),
            spec   = round(mean(y[y==0] == y.hat[y==0]), 2),
            PPV    = round(mean(y[y.hat==1] == y.hat[y.hat==1]), 2),
            NPV    = round(mean(y[y.hat==0] == y.hat[y.hat==0]), 2),
            acc.z  = round(mean(y == y.hat.z), 2),
            sens.z = round(mean(y[y==1] == y.hat.z[y==1]), 2),
            spec.z = round(mean(y[y==0] == y.hat.z[y==0]), 2),
            PPV.z  = round(mean(y[y.hat.z==1] == y.hat.z[y.hat.z==1]), 2),
            NPV.z  = round(mean(y[y.hat.z==0] == y.hat.z[y.hat.z==0]), 2)) %>%
  as.data.frame()
  z   n    p   pp pp.z  acc sens spec  PPV  NPV acc.z sens.z spec.z PPV.z NPV.z
1 a  87 0.63 0.14 0.40 0.46 0.18 0.94 0.83 0.40  0.63   0.53   0.81  0.83  0.50
2 b  61 0.79 0.46 0.46 0.61 0.54 0.85 0.93 0.33  0.61   0.54   0.85  0.93  0.33
3 c 152 0.89 0.94 0.72 0.86 0.95 0.12 0.90 0.22  0.73   0.75   0.56  0.94  0.21

Finally, here’s a visualization of the thresholds and the distribution of predicted probabilities within each group.

Code
df %>%
  ggplot(aes(x = p, fill = z, group = z)) +
  geom_histogram(position = position_identity(), alpha = 0.6, bins = 30) +
  # Black line: single (overall) threshold; colored lines: group-specific thresholds
  geom_vline(xintercept = unique(df$thr), col = "black") +
  geom_vline(xintercept = thr.group, col = scales::hue_pal()(length(thr.group))) +
  theme_bw()

Bias in ML models

Imagine that the strata in this simulation were gender, race, or some other demographic variable. Although the model included the strata variable, and therefore correctly captured the underlying prevalence in each group, the additional step of converting probabilities into classes can introduce bias if the threshold does not account for prevalence differences. Using a single threshold may give the best overall performance, but we should check whether that performance is consistent across important groups. If it is not, using a varying threshold can improve consistency while maintaining similar overall performance.
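
As a minimal sketch of how this might look in practice (a hypothetical helper, not the exact scheme used above, which also shrinks each group's prevalence toward the overall rate), one could estimate a threshold per group from labeled data and look it up at prediction time.

Code
# Hypothetical helper: one threshold per group, set at that group's observed
# prevalence in a labeled dataset (e.g., a validation set)
group_thresholds = function(y, z) {
  tapply(y, z, mean)
}

thr.by.group = group_thresholds(df$y, df$z)

# Classify each observation with its own group's threshold
df$y.hat.group = 1 * (df$p > thr.by.group[as.character(df$z)])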
