Adjusting classification threshold by strata

Predictive modeling
Classification
Author

Tyler Grimes

Published

July 3, 2024

When predicting binary outcomes, if we want to threshold our estimated probabilities to obtain predicted classes, we need a strategy for choosing that threshold. The choice almost always involves a trade-off between sensitivity and specificity, or between false positives and false negatives. But if we know that the prevalence of the positive class varies across some known strata, should we choose a different threshold within each stratum? And how does that affect the final performance?
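
To make this trade-off concrete, here is a tiny toy sketch (made-up probabilities and labels, not part of the simulation below): raising the threshold turns more predictions negative, which lowers sensitivity and raises specificity.

Code
# Toy example: sensitivity and specificity at two different thresholds
p.toy = c(0.1, 0.3, 0.45, 0.55, 0.7, 0.9)
y.toy = c(0, 0, 1, 0, 1, 1)
sapply(c(0.4, 0.6), function(thr) {
  pred = 1 * (p.toy > thr)
  c(threshold = thr,
    sens = mean(pred[y.toy == 1] == 1),
    spec = mean(pred[y.toy == 0] == 0))
})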

Code
library(tidyverse)

Let’s simulate data from three groups where the true prevalence ranges from 0.6 to 0.9 across the groups. Our dataset will be composed of samples from these three groups with unequal sizes.

Code
set.seed(0)
n = 300

# Continuous predictor
x = rnorm(n)

# Strata variable with unequal group sizes
groups = c("a", "b", "c")
z = sample(groups, n, replace = TRUE, prob = c(0.3, 0.2, 0.5))
z = factor(z, levels = groups)

# Group-specific intercept shifts produce different prevalences across strata
beta.x = 1
beta.z = c("a" = 0, "b" = 1, "c" = 2)
epsilon = rnorm(n, 0, 0.2)

# True probability of the positive class and simulated binary outcome
prob = 1 / (1 + exp(-(0.5 + beta.x * x + beta.z[z] + epsilon)))
y = rbinom(n, 1, prob)
df = data.frame(x = x, z = z, y = y, prob = prob)

table(z)
z
  a   b   c 
 87  61 152 
Code
tapply(y, z, mean)
        a         b         c 
0.6321839 0.7868852 0.8947368 
Code
mean(y)
[1] 0.7966667
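
As a quick sanity check (not shown in the original output), the group-level prevalence implied by the simulated probabilities can be compared with the observed rates above; the two should be close.

Code
# Model-implied prevalence within each group
tapply(prob, z, mean)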

We’ll fit a logistic regression model with a single predictor and the strata variable. Two strategies for thresholding will be compared:

  1. A single threshold based on the overall prevalence
  2. A varying threshold based on the prevalence within each stratum
Code
fit = glm(y ~ x + z, family = "binomial", data = df)
coef(fit)
(Intercept)           x          zb          zc 
  0.6803849   0.6825582   0.6736314   1.5837307 
Code
df$p = predict(fit, newdata = df, type = "response")

df = df %>%
  # Single threshold: the overall prevalence
  mutate(thr = mean(y)) %>%
  group_by(z) %>%
  # Group-specific threshold: the group's prevalence, shrunk 20% toward the overall prevalence
  mutate(thr.z = mean(y) - 0.2 * (mean(y) - thr)) %>%
  ungroup() %>%
  mutate(y.hat = 1 * (p > thr),
         y.hat.z = 1 * (p > thr.z))

# One threshold value per group (used for plotting later)
thr.group = sapply(groups, function(group) df$thr.z[df$z == group][1])
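
For reference, here is a small sketch (output not rendered) to inspect the single threshold, the group-specific thresholds, and the overall accuracy under each strategy.

Code
unique(df$thr)   # single threshold = overall prevalence
thr.group        # group-specific thresholds
c(single  = mean(df$y == df$y.hat),
  varying = mean(df$y == df$y.hat.z))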

Looking only at the overall accuracy, we see below that the single threshold gives better performance (69.3% vs 67.7%). But how does the performance vary across strata? Let’s take a look.

Here’s the overall and within-strata accuracy for the single-threshold strategy:

Code
df %>%
  group_by(z) %>%
  summarize(n = n(), 
            acc = mean(y == y.hat)) %>%
  mutate(overall_acc = sum(n * acc) / sum(n))
# A tibble: 3 × 4
  z         n   acc overall_acc
  <fct> <int> <dbl>       <dbl>
1 a        87 0.460       0.693
2 b        61 0.607       0.693
3 c       152 0.862       0.693

and for the varying-threshold strategy:

Code
df %>%
  group_by(z) %>%
  summarize(n = n(), 
            acc = mean(y == y.hat.z)) %>%
  mutate(overall_acc = sum(n * acc) / sum(n)) 
# A tibble: 3 × 4
  z         n   acc overall_acc
  <fct> <int> <dbl>       <dbl>
1 a        87 0.632       0.677
2 b        61 0.607       0.677
3 c       152 0.730       0.677

We see that performance varies widely across strata with the single-threshold strategy. By using an adaptive threshold, the classifier accounts for the varying prevalence across the groups, and the within-strata accuracy is much more consistent.
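
One way to quantify this consistency (a small sketch, not in the original post) is the gap between the best and worst within-group accuracy under each strategy; from the tables above, this is roughly 0.40 for the single threshold versus about 0.12 for the varying thresholds.

Code
df %>%
  group_by(z) %>%
  summarize(acc = mean(y == y.hat), acc.z = mean(y == y.hat.z)) %>%
  summarize(spread.single  = max(acc) - min(acc),
            spread.varying = max(acc.z) - min(acc.z))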

Here’s another summary table showing more performance measures. We see the same pattern: the varying-threshold strategy provides more consistent performance across strata.

Code
df %>%
  group_by(z) %>%
  # Unsuffixed metrics use the single threshold; ".z" metrics use the group-specific thresholds.
  # p = observed prevalence, pp = proportion of predicted positives.
  summarize(n = n(),
            p      = round(mean(y), 2), 
            pp     = round(mean(y.hat), 2),
            pp.z   = round(mean(y.hat.z), 2),
            acc    = round(mean(y == y.hat), 2),
            sens   = round(mean(y[y==1] == y.hat[y==1]), 2),
            spec   = round(mean(y[y==0] == y.hat[y==0]), 2),
            PPV    = round(mean(y[y.hat==1] == y.hat[y.hat==1]), 2),
            NPV    = round(mean(y[y.hat==0] == y.hat[y.hat==0]), 2),
            acc.z  = round(mean(y == y.hat.z), 2),
            sens.z = round(mean(y[y==1] == y.hat.z[y==1]), 2),
            spec.z = round(mean(y[y==0] == y.hat.z[y==0]), 2),
            PPV.z  = round(mean(y[y.hat.z==1] == y.hat.z[y.hat.z==1]), 2),
            NPV.z  = round(mean(y[y.hat.z==0] == y.hat.z[y.hat.z==0]), 2)) %>%
  as.data.frame()
  z   n    p   pp pp.z  acc sens spec  PPV  NPV acc.z sens.z spec.z PPV.z NPV.z
1 a  87 0.63 0.14 0.40 0.46 0.18 0.94 0.83 0.40  0.63   0.53   0.81  0.83  0.50
2 b  61 0.79 0.46 0.46 0.61 0.54 0.85 0.93 0.33  0.61   0.54   0.85  0.93  0.33
3 c 152 0.89 0.94 0.72 0.86 0.95 0.12 0.90 0.22  0.73   0.75   0.56  0.94  0.21

Finally, here’s a visualization of the thresholds and the distribution of predicted probabilities within each group.

Code
df %>%
  ggplot(aes(x = p, fill = z, group = z)) +
  geom_histogram(position = position_identity(), alpha = 0.6, bins = 30) +
  # Black line: single (overall) threshold; colored lines: group-specific thresholds
  geom_vline(xintercept = unique(df$thr), col = "black") +
  geom_vline(xintercept = thr.group, col = scales::hue_pal()(length(thr.group))) +
  theme_bw()

Bias in ML models

Imagine that the strata in this simulation were gender, race, or some other demographic variable. Although the model included the strata variable, and therefore correctly captured the underlying prevalence in each group, the additional step of converting probabilities into classes can introduce bias if the threshold does not account for prevalence differences. Using a single threshold may give the best overall performance, but we should check whether that performance is consistent across important groups. If it is not, using a varying threshold can improve consistency while maintaining similar overall performance.
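
As a minimal sketch of how this might look in practice (a hypothetical helper, not the exact scheme used above, which also shrinks each group's prevalence toward the overall rate), one could estimate a threshold per group from labeled data and look it up at prediction time.

Code
# Hypothetical helper: one threshold per group, set at that group's observed
# prevalence in a labeled dataset (e.g., a validation set)
group_thresholds = function(y, z) {
  tapply(y, z, mean)
}

thr.by.group = group_thresholds(df$y, df$z)

# Classify each observation with its own group's threshold
df$y.hat.group = 1 * (df$p > thr.by.group[as.character(df$z)])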
