flowchart TD z --> x z --> y
Consider three variables x, y, and z. The correlation between x and y is the linear association between them (a straight line in a scatter plot). But what if we think z is a confounder and want to condition on it? The partial correlation between x and y given z is a way to account for the value of z when looking at the linear association between x and y, but what does it look like? Furthermore, what does it mean if the partial correlation is zero?
Let’s simulate some data and take a look. In these data, x and y are conditionally independent given z.
Code
library(tidyverse)
library(ppcor)
Code
= 200
n = 0.05
sigma
set.seed(0)
= rnorm(n)
z = z + rnorm(n, sigma)
x = z + rnorm(n, sigma)
y = cut(x, breaks = 9)
x.groups = cut(z, breaks = 9)
z.groups levels(x.groups) = paste0("x in ", levels(x.groups))
levels(z.groups) = paste0("z in ", levels(z.groups))
= tibble(x, y, z, x.groups, z.groups) df
The correlation between x and y and partial correlation given z is computed below. We see the correlation is fairly strong but the partial correlation is near zero.
Code
cor(df$x, df$y)
[1] 0.5094733
Code
::pcor(df[, 1:3])$estimate[1, 2] ppcor
[1] 0.03041614
Code
# Note: can also calculate partial correlation from the pairwise correlations:
= sqrt(summary(lm(y ~ x))$r.sq) # or use cor(x, y).
r_yx = sqrt(summary(lm(y ~ z))$r.sq)
r_yz = sqrt(summary(lm(x ~ z))$r.sq)
r_xz = (r_yx - r_yz * r_xz) / sqrt((1 - r_yz^2) * (1 - r_xz^2))
r_yx.z r_yx.z
[1] 0.03041614
To see the partial correlation, we need to look at x and y within a small window of values for z. The second plot below shows 9 windows, sliding from smaller values to larger. This is how to think of conditional associations–and the sliding windows is how to think of “controlling” for the confounder.
Code
%>%
df ggplot(aes(x = x, y = y, color = z)) +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Code
%>%
df ggplot(aes(x = x, y = y)) +
facet_wrap(. ~ z.groups, scales = "free") +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Let’s take a look at the relationship between z and y. Note how the partial correlation given x still shows a strong association, and we see the linear association persist across windows of x.
Code
cor(df$z, df$y)
[1] 0.7199976
Code
::pcor(df[, 1:3])$estimate[1, 3] ppcor
[1] 0.5349701
Code
%>%
df ggplot(aes(x = z, y = y)) +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Code
%>%
df ggplot(aes(x = z, y = y)) +
facet_wrap(. ~ x.groups, scales = "free") +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Warning in qt((1 - level)/2, df): NaNs produced
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
Code
%>%
df group_by(x.groups) %>%
summarize(n = n(),
`corr_zy|x` = cor(z, y),
corr_zy = cor(df$z, df$y)) %>%
as.data.frame()
x.groups n corr_zy|x corr_zy
1 x in (-4.03,-3.21] 2 -1.0000000 0.7199976
2 x in (-3.21,-2.39] 5 0.6153158 0.7199976
3 x in (-2.39,-1.58] 14 0.2950542 0.7199976
4 x in (-1.58,-0.768] 38 0.6371105 0.7199976
5 x in (-0.768,0.0458] 39 0.5852865 0.7199976
6 x in (0.0458,0.859] 47 0.5224778 0.7199976
7 x in (0.859,1.67] 29 0.6070441 0.7199976
8 x in (1.67,2.49] 15 0.7324336 0.7199976
9 x in (2.49,3.31] 11 0.6928078 0.7199976
When does partial correlation fail?
In the above example, the correlation between z and y was positive and linear for every value of x. What if, instead, the association changed depending on x? An interaction effect. In this simulated dataset, x and y have a negative correlation if z < 0 and a positive correlation if z \geq 0.
Given the relationship described above, what would we expect the scatterplot of x and y to look like? What about the scatterplots across windows of z? What value do we expect the partial correlation of x and y given z to be?
Code
# July 02, 2024
set.seed(0)
= rnorm(n)
x = rnorm(n)
z = (-1)^(z < 0) * x + rnorm(n, sigma)
y = cut(x, breaks = 9)
x.groups = cut(z, breaks = 9)
z.groups levels(x.groups) = paste0("x in ", levels(x.groups))
levels(z.groups) = paste0("z in ", levels(z.groups))
= tibble(x, y, z, x.groups, z.groups) df
Code
cor(df$x, df$y)
[1] 0.09657089
Code
::pcor(df[, 1:3])$estimate[1, 3] ppcor
[1] 0.01930732
Code
%>%
df group_by(z.groups) %>%
summarize(n = n(),
`corr_xy|z` = cor(x, y),
corr_xy = cor(df$x, df$y)) %>%
as.data.frame()
z.groups n corr_xy|z corr_xy
1 z in (-2.91,-2.22] 1 NA 0.09657089
2 z in (-2.22,-1.53] 12 -0.9033113 0.09657089
3 z in (-1.53,-0.848] 29 -0.6202951 0.09657089
4 z in (-0.848,-0.162] 40 -0.7262352 0.09657089
5 z in (-0.162,0.524] 58 0.4046969 0.09657089
6 z in (0.524,1.21] 34 0.7398319 0.09657089
7 z in (1.21,1.9] 21 0.9325690 0.09657089
8 z in (1.9,2.58] 3 0.2854700 0.09657089
9 z in (2.58,3.27] 2 -1.0000000 0.09657089
Code
%>%
df ggplot(aes(x = z, y = y)) +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Code
%>%
df ggplot(aes(x = z, y = y)) +
facet_wrap(. ~ x.groups, scales = "free") +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Notice also the linear model suggests the same–no association between x and y
Code
summary(lm(y ~ x + z, data = df))
Call:
lm(formula = y ~ x + z, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.8590 -1.0467 -0.0356 1.0217 3.9886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00504 0.10536 -0.048 0.962
x 0.15595 0.11433 1.364 0.174
z -0.01335 0.10574 -0.126 0.900
Residual standard error: 1.489 on 197 degrees of freedom
Multiple R-squared: 0.009406, Adjusted R-squared: -0.0006507
F-statistic: 0.9353 on 2 and 197 DF, p-value: 0.3942
However, if we model the interaction, then we see the association.
Code
summary(lm(y ~ x*z, data = df))
Call:
lm(formula = y ~ x * z, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.6673 -0.6147 0.0309 0.7276 3.6902
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02679 0.07817 -0.343 0.73217
x 0.23817 0.08506 2.800 0.00562 **
z 0.11263 0.07906 1.425 0.15588
x:z 1.10060 0.08647 12.728 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.104 on 196 degrees of freedom
Multiple R-squared: 0.4577, Adjusted R-squared: 0.4494
F-statistic: 55.14 on 3 and 196 DF, p-value: < 2.2e-16