##Code used for manuscript "Survival bias in Mendelian randomization studies: a threat to causal inference" ##Roelof Smit and Saskia le Cessie ##R version 3.4.1 ##loading in relevant libraries library(sem) library(psychometric) ############################################################################################################## ## Simulations which generated the results presented in main text, Figure 2, and eFigures 2-3 ## This code calculates i) the explained variance (Rsq) of X by the instrument G and the prevalence of G, as function of the age of selection ## and ii) performs internal/external estimation of Wald ratio for a range of age of selection. ## Please note that exposure R is equal to the variable u2 throughout this code. ## In these simulations R is a binary exposure variable, with prevalence 0.25 ################################################################################################################## # In function sim1, the data are generated # Function parameters of sim1: #number of observations (n), contribution of G to X (b), of X to Y (c), # of R(u2) to Y (d), of X to age of death (e), of R(u2) to age of death (f) sim1<-function(n,b,c,d,e,f,agerange) { set.seed(345) #generate instrument g g<- c(rep(0,n/2), rep(1,n/2)) #generate second exposure u2 u2 <- rep(c(0,0,0,1),n/4) #generate continuous exposure of interest x x <- rnorm(n,0 + b*(g-0.5),1) # generate outcome y y <- rnorm(n, c*x + d*(u2-0.25), 1) #generate survival time based on Gompertz model gomp.a<-0.0000459053 gomp.b<-0.0876978320 U<- runif(n) RR <- exp(log(e)*x+log(f)*u2) survtime <- log(1-(gomp.b*log(U)/(gomp.a*RR)))/gomp.b #create dataset for use in sim2 data<- data.frame(x,g,u2,y,survtime) sapply(agerange, sim2, data) } # The function sim2 uses the data generated through function sim1 # Function parameters of sim2: age_sel is age at study inclusion sim2<-function(agesel, data) { s <- (data$survtime > agesel) #survival status lm.xg<-lm(x~g, data=data) lm.xg.s<-lm(x~g, data=data[s==1,]) n2 <- nrow(data[s==1,]) #number of observations xg <- coef(lm.xg)[2] xg.s <- coef(lm.xg.s)[2] lm.yg<-lm(y~g, data=data) lm.yg.s<-lm(y~g, data=data[s==1,]) yg<-coef(lm.yg)[2] yg.s<-coef(lm.yg.s)[2] m<-summary(lm.yg.s) yg.s_se<-m$coefficients[2,2] iv<-yg/xg #iv-estimator for whole (i.e. unselected) sample iv.internal<-yg.s/xg.s #iv-estimator when both the numerator and denominator are taken from same selected dataset iv.external=yg.s/xg #iv-estimator when only numerator is taken from selected dataset model_iv1<- tsls(y ~ x, instruments=~ g, data=data[s==1,]) m2<-summary(model_iv1) wald_b<-m2$coefficient[2,1] wald_se<-m2$coefficients[2,2] corgx<-cor(data$x, data$g) corgx.s <- cor(data$x[s==1], data$g[s==1]) r_sq_CI95<-CI.Rsq(corgx.s^2, n2, 1, level = .95) r_sq<-corgx^2 r_sq.s<-corgx.s^2 #explained variance in selected population r_sq_low<-r_sq_CI95[,3] r_sq_up<-r_sq_CI95[,4] prev_g<-mean(data$g) prev_g.s<-mean(data$g[s==1]) #prevalance G in selected population # tmp <- c(agesel, n2, xg, xg.s, yg, yg.s, yg.s_se, iv, iv.internal, iv.external, wald_b, wald_se, r_sq, r_sq.s, r_sq_low, r_sq_up, prev_g, prev_g.s) names(tmp) <- c("age_sel", "n2", "xg", "xg.s", "yg", "yg.s", "yg.s_se", "iv", "iv.internal", "iv.external", "wald_b", "wald_se", "r_sq", "r_sq.s", "r_sq_low", "r_sq_up", "prev_g", "prev_g.s") tmp } # Calls which generate data presented in main text, Figure 2, and eFigures 2-3 results1 <- sim1(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, agerange = seq(75, 95, 0.1)) results2 <- sim1(n=10000000, b=0.458831468, c=0.5, d=0.5, e=1.25, f=1.5, agerange = seq(75, 95, 0.1)) results3 <- sim1(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, agerange = seq(75, 95, 0.1)) ############################################################################################################## ## Additional simulations which generated the results presented in eFigures 2-3 ## This code performs internal/external estimation of Wald ratio for a range of age of selection. ## In this part R is normally distributed ################################################################################################################## sim1<-function(n,b,c,d,e,f,agerange) { set.seed(345) g<- c(rep(0,n/2), rep(1,n/2)) u2 <- rnorm(n,0,1) #note difference with previous sim1 x <- rnorm(n,0 + b*(g-0.5),1) y <- rnorm(n, c*x + d*u2, 1) #note difference with previous sim1 gomp.a<-0.0000459053 gomp.b<-0.0876978320 U<- runif(n) RR <- exp(log(e)*x+log(f)*u2) survtime <- log(1-(gomp.b*log(U)/(gomp.a*RR)))/gomp.b data<- data.frame(x,g,u2,y,survtime) sapply(agerange, sim2, data) } sim2<-function(agesel, data) { s <- (data$survtime > agesel) lm.xg<-lm(x~g, data=data) lm.xg.s<-lm(x~g, data=data[s==1,]) n2 <- nrow(data[s==1,]) xg <- coef(lm.xg)[2] xg.s <- coef(lm.xg.s)[2] lm.yg<-lm(y~g, data=data) lm.yg.s<-lm(y~g, data=data[s==1,]) yg<-coef(lm.yg)[2] yg.s<-coef(lm.yg.s)[2] m<-summary(lm.yg.s) yg.s_se<-m$coefficients[2,2] iv<-yg/xg iv.internal<-yg.s/xg.s iv.external=yg.s/xg model_iv1<- tsls(y ~ x, instruments=~ g, data=data[s==1,]) m2<-summary(model_iv1) wald_b<-m2$coefficient[2,1] wald_se<-m2$coefficients[2,2] # tmp <- c(agesel, n2, xg, xg.s, yg, yg.s, yg.s_se, iv, iv.internal, iv.external, wald_b, wald_se) names(tmp) <- c("age_sel", "n2", "xg", "xg.s", "yg", "yg.s", "yg.s_se", "iv", "iv.internal", "iv.external", "wald_b", "wald_se") tmp } results4 <- sim1(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, agerange = seq(75, 95, 0.1)) results5 <- sim1(n=10000000, b=0.458831468, c=0.5, d=0.5, e=1.25, f=1.5, agerange = seq(75, 95, 0.1)) results6 <- sim1(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, agerange = seq(75, 95, 0.1)) ############################################################################################################## ## Additional simulations which generated the results presented in eFigure 4 ## This code performs internal/external estimation of Wald ratio for a range of age of selection. ## In this part we simulated data from the DAG in Figure 2B. Here instead of a second exposure, ## there is a confounder U which both affects X and Y ################################################################################################################## # Note introduction of confounder U with the same effect (h) on X and Y, and absence of second exposure R(u2) sim1<-function(n,b,c,e,h,agerange) { set.seed(345) g<- c(rep(0,n/2), rep(1,n/2)) u<-rnorm(n,0,1) x <- rnorm(n,0 + b*(g-0.5) + h*u, sqrt(1-h**2)) y <- rnorm(n, c*x + h*u, sqrt(1-h**2)) gomp.a<-0.0000459053 gomp.b<-0.0876978320 U<- runif(n) RR <- exp(log(e)*x) survtime <- log(1-(gomp.b*log(U)/(gomp.a*RR)))/gomp.b data<- data.frame(x,g,u,y,survtime) sapply(agerange, sim2, data) } sim2<-function(agesel, data) { s <- (data$survtime > agesel) lm.xg<-lm(x~g, data=data) lm.xg.s<-lm(x~g, data=data[s==1,]) n2 <- nrow(data[s==1,]) xg <- coef(lm.xg)[2] xg.s <- coef(lm.xg.s)[2] lm.yg<-lm(y~g, data=data) lm.yg.s<-lm(y~g, data=data[s==1,]) yg<-coef(lm.yg)[2] yg.s<-coef(lm.yg.s)[2] m<-summary(lm.yg.s) yg.s_se<-m$coefficients[2,2] iv<-yg/xg iv.internal<-yg.s/xg.s iv.external=yg.s/xg model_iv1<- tsls(y ~ x, instruments=~ g, data=data[s==1,]) m2<-summary(model_iv1) wald_b<-m2$coefficient[2,1] wald_se<-m2$coefficients[2,2] # tmp <- c(agesel, n2, xg, xg.s, yg, yg.s, yg.s_se, iv, iv.internal, iv.external, wald_b, wald_se) names(tmp) <- c("age_sel", "n2", "xg", "xg.s", "yg", "yg.s", "yg.s_se", "iv", "iv.internal", "iv.external", "wald_b", "wald_se") tmp } results7 <- sim1(n=10000000, b=0.458831468, c=0, e=1.25, h=0.5, agerange = seq(75, 95, 0.1)) results8 <- sim1(n=10000000, b=0.458831468, c=0.5, e=1.25, h=0.5, agerange = seq(75, 95, 0.1)) results9 <- sim1(n=10000000, b=0.458831468, c=2, e=1.25, h=0.5, agerange = seq(75, 95, 0.1)) ############################################################################################################## ## Additional simulations which generated the results presented in eFigures 5 and 6 ## This code performs internal/external estimation of Wald ratio for a range of age of selection. ## In this part we simulated data from the combined DAGs in Figure 2B 2B. Both a second exposure R(u2) and ## a continuous confounder U which affects both X and Y were generated ################################################################################################################## # Binary R sim1<-function(n,b,c,d,e,f,h,agerange) { set.seed(345) g<- c(rep(0,n/2), rep(1,n/2)) u2 <- rep(c(0,0,0,1),n/4) u<-rnorm(n,0,1) x <- rnorm(n,0 + b*(g-0.5) + h*u, sqrt(1-h**2)) y <- rnorm(n, c*x + d*(u2-0.25) + h*u, sqrt(1-h**2)) gomp.a<-0.0000459053 gomp.b<-0.0876978320 U<- runif(n) RR <- exp(log(e)*x+log(f)*u2) survtime <- log(1-(gomp.b*log(U)/(gomp.a*RR)))/gomp.b data<- data.frame(x,g,u2,y,survtime) sapply(agerange, sim2, data) } sim2<-function(agesel, data) { s <- (data$survtime > agesel) lm.xg<-lm(x~g, data=data) lm.xg.s<-lm(x~g, data=data[s==1,]) n2 <- nrow(data[s==1,]) xg <- coef(lm.xg)[2] xg.s <- coef(lm.xg.s)[2] lm.yg<-lm(y~g, data=data) lm.yg.s<-lm(y~g, data=data[s==1,]) yg<-coef(lm.yg)[2] yg.s<-coef(lm.yg.s)[2] m<-summary(lm.yg.s) yg.s_se<-m$coefficients[2,2] iv<-yg/xg iv.internal<-yg.s/xg.s iv.external=yg.s/xg model_iv1<- tsls(y ~ x, instruments=~ g, data=data[s==1,]) m2<-summary(model_iv1) wald_b<-m2$coefficient[2,1] wald_se<-m2$coefficients[2,2] # tmp <- c(agesel, n2, xg, xg.s, yg, yg.s, yg.s_se, iv, iv.internal, iv.external, wald_b, wald_se) names(tmp) <- c("age_sel", "n2", "xg", "xg.s", "yg", "yg.s", "yg.s_se", "iv", "iv.internal", "iv.external", "wald_b", "wald_se") tmp } results10 <- sim1(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, h=0.5, agerange = seq(75, 95, 0.1)) results11 <- sim1(n=10000000, b=0.458831468, c=0.5, d=0.5, e=1.25, f=1.5, h=0.5, agerange = seq(75, 95, 0.1)) results12 <- sim1(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, h=0.5, agerange = seq(75, 95, 0.1)) # Continuous R sim1<-function(n,b,c,d,e,f,h,agerange) { set.seed(345) g<- c(rep(0,n/2), rep(1,n/2)) u2 <- rnorm(n,0,1) u<-rnorm(n,0,1) x <- rnorm(n,0 + b*(g-0.5) + h*u, sqrt(1-h**2)) y <- rnorm(n, c*x + d*u2 + h*u, sqrt(1-h**2)) gomp.a<-0.0000459053 gomp.b<-0.0876978320 U<- runif(n) RR <- exp(log(e)*x+log(f)*u2) survtime <- log(1-(gomp.b*log(U)/(gomp.a*RR)))/gomp.b data<- data.frame(x,g,u2,y,survtime) sapply(agerange, sim2, data) } sim2<-function(agesel, data) { s <- (data$survtime > agesel) lm.xg<-lm(x~g, data=data) lm.xg.s<-lm(x~g, data=data[s==1,]) n2 <- nrow(data[s==1,]) xg <- coef(lm.xg)[2] xg.s <- coef(lm.xg.s)[2] lm.yg<-lm(y~g, data=data) lm.yg.s<-lm(y~g, data=data[s==1,]) yg<-coef(lm.yg)[2] yg.s<-coef(lm.yg.s)[2] m<-summary(lm.yg.s) yg.s_se<-m$coefficients[2,2] iv<-yg/xg iv.internal<-yg.s/xg.s iv.external=yg.s/xg model_iv1<- tsls(y ~ x, instruments=~ g, data=data[s==1,]) m2<-summary(model_iv1) wald_b<-m2$coefficient[2,1] wald_se<-m2$coefficients[2,2] # tmp <- c(agesel, n2, xg, xg.s, yg, yg.s, yg.s_se, iv, iv.internal, iv.external, wald_b, wald_se) names(tmp) <- c("age_sel", "n2", "xg", "xg.s", "yg", "yg.s", "yg.s_se", "iv", "iv.internal", "iv.external", "wald_b", "wald_se") tmp } results13 <- sim1(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, h=0.5, agerange = seq(75, 95, 0.1)) results14 <- sim1(n=10000000, b=0.458831468, c=0.5, d=0.5, e=1.25, f=1.5, h=0.5, agerange = seq(75, 95, 0.1)) results15 <- sim1(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, h=0.5, agerange = seq(75, 95, 0.1)) ############################################################################################################## ## Additional simulations which generated the results presented in eFigure 7 ## This code performs internal/external estimation of Wald ratio for a range of age of selection. ## In this part we simulated interaction between X and R on survival time ################################################################################################################## # Code which generated the data with interaction term (inter) between X and R on survival time sim.inter<-function(n,b,c,d,e,f,inter,agerange) { set.seed(345) g<- c(rep(0,n/2), rep(1,n/2)) u2 <- rep(c(0,0,0,1),n/4) x <- rnorm(n,0 + b*(g-0.5), 1) y <- rnorm(n, c*x + d*(u2-0.25), 1) gomp.a<-0.0000459053 gomp.b<-0.0876978320 U<- runif(n) RR <- exp(log(e)*x+log(f)*u2 + inter*x*u2) survtime <- log(1-(gomp.b*log(U)/(gomp.a*RR)))/gomp.b data<- data.frame(x,g,u2,y,survtime) sapply(agerange, sim2, data) } sim2<-function(age_sel, data) { s <- (data$survtime > age_sel) lm.xg<-lm(x~g, data=data) lm.xg.s<-lm(x~g, data=data[s==1,]) n2 <- nrow(data[s==1,]) xg <- coef(lm.xg)[2] xg.s <- coef(lm.xg.s)[2] lm.yg<-lm(y~g, data=data) lm.yg.s<-lm(y~g, data=data[s==1,]) yg<-coef(lm.yg)[2] yg.s<-coef(lm.yg.s)[2] m<-summary(lm.yg.s) yg.s_se<-m$coefficients[2,2] iv<-yg/xg iv.internal<-yg.s/xg.s iv.external=yg.s/xg model_iv1<- tsls(y ~ x, instruments=~ g, data=data[s==1,]) m2<-summary(model_iv1) wald_b<-m2$coefficient[2,1] wald_se<-m2$coefficients[2,2] # tmp <- c(age_sel, n2, xg, xg.s, yg, yg.s, yg.s_se, iv, iv.internal, iv.external, wald_b, wald_se) names(tmp) <- c("age_sel", "n2", "xg", "xg.s", "yg", "yg.s", "yg.s_se", "iv", "iv.internal", "iv.external", "wald_b", "wald_se") tmp } # Calls for true effect X on Y=0 results16<- sim.inter(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, inter=log(2), agerange = seq(75, 95, 0.1)) results17<- sim.inter(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, inter=log(1.5), agerange = seq(75, 95, 0.1)) results18<- sim.inter(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, inter=0, agerange = seq(75, 95, 0.1)) results19<- sim.inter(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, inter=-log(1.5), agerange = seq(75, 95, 0.1)) results20<- sim.inter(n=10000000, b=0.458831468, c=0, d=0.5, e=1.25, f=1.5, inter=-log(2), agerange = seq(75, 95, 0.1)) # Calls for true effect X on Y=2 results21<- sim.inter(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, inter=log(2), agerange = seq(75, 95, 0.1)) results22<- sim.inter(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, inter=log(1.5), agerange = seq(75, 95, 0.1)) results23<- sim.inter(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, inter=0, agerange = seq(75, 95, 0.1)) results24<- sim.inter(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, inter=-log(1.5), agerange = seq(75, 95, 0.1)) results25<- sim.inter(n=10000000, b=0.458831468, c=2, d=0.5, e=1.25, f=1.5, inter=-log(2), agerange = seq(75, 95, 0.1))