2017/11/06

statistics / R

WE ARE LATE FOR HALLOWEEN, but let’s get out our broom and purrr as we tidy some statistical results.

Today I had occasion to be reminded of competing risks and a handy statistical result on competing risks from A.P. Basu and J.K. Ghosh published in the Journal of Multivariate Analysis in 1978. The paper Identifiability of the multinormal and other distributions under competing risks model showed an analytical result on the distribution of a variable Z which is the minimum of two Gaussian (Normal) random variables. Let’s simulate the results with R, building on the R snippet available here: Exact Distribution of the Max/Min of Two Gaussian Random Variables.

Draw some random variables

Let’s draw from a bivariate nomral distribution and construct the minimum of the two variables:

#####################################################################################
## Load libraries ##
#####################################################################################
library(MASS)
library(stats4)
library(tidyverse)
library(broom)
library(scales)
library(viridis)
library(forcats)

# Set seed
set.seed(11062017)
m1<-  0                  # mean of y1
m2<-  1                  # mean of y2
sigma1=1                 # sigma y1
sigma2=0.5               # sigma y2
r1=0                     # sigma_y1_y2
rho<-r1/(sigma1*sigma2)  # correlation
Sigma=matrix(c(sigma1^2,r1,r1,sigma2^2),2,2)
N=1000
y=mvrnorm(N,c(m1,m2),Sigma)
z=apply(y,1,FUN=min)
df<-data.frame(id=1:N,z=z,y1=y[,1],y2=y[,2])

# Function for the distribtuoin of Z = min( y1, y2)
fmin<-function(x,mu1,mu2,sigma1,sigma2,rho){
  f1<- dnorm(x,mu1,sigma1)*pnorm(rho*(x-mu1) / (sigma1*sqrt(1-rho*rho))-
                                   (x-mu2) / (sigma2*sqrt(1-rho*rho))) 
  f2<- dnorm(x,mu2,sigma2)*pnorm(rho*(x-mu2) / (sigma2*sqrt(1-rho*rho))-
                                   (x-mu1) / (sigma1*sqrt(1-rho*rho))) 
  return(f1+f2)
}


# plot a draw
# use gather so the plots can be overlaid
ggplot(data=df %>% gather(variable,value,-id), aes(x=value,color=variable,fill=variable))+
  geom_density(alpha=0.1)+
  stat_function(fun=fmin, args=list(m1,m2,sigma1,sigma2,rho),color="black",linetype=2,size=0.75)+
  theme_minimal()+
  scale_color_viridis(discrete=T,option="C",end=0.8)+
  scale_fill_viridis(discrete=T,option="C",end=0.8)+
    labs(title="Distribution of bivariate normal variables (y1,y2) and z=min(y1,y2)",
         subtitle="shaded areas kernel fit to draw of N=1,000,\nblack line theoretical distribution of z = min(y1,y2)")

We might be interested in seeing how well we can recover the means \(\mu_1\) and \(\mu_2\). We can do this via maximum likelihood.

#log likelihood function
ll<- function(params) {
  params <- as.list(params)
  R<-fmin(z,params$mu1,params$mu2,params$s1,params$s2,params$rho1)
  return(-sum(log(R)))
}

# maximum likelihood
out<-optim(par= c(mu1 = 0,mu2=1, s1=1, s2=0.5,rho1=0),
            fn=ll,
            hessian=T,
            control=list(fnscale=1) )

# combine estimated results with true parameter values
# stack up with broom::tidy
out2<-cbind(tidy(out),se.param=diag(solve(out$hessian)),true=c(m1,m2,sigma1,sigma2,rho))

knitr::kable(out2,row.names=FALSE)

parameter	value	se.param	true
mu1	0.1268490	0.0041521	0.0
mu2	0.6410181	10.1543614	1.0
s1	1.0829111	0.0029388	1.0
s2	0.4785279	0.0267281	0.5
rho1	0.3702692	28.9813605	0.0

Finally, we can use purrr to run some simulations for different sample sizes.

# function for returning a data frame of size N
draw_samp<-function(N,m1,m2,Sigma){
  y=mvrnorm(N,c(m1,m2),Sigma)
  z=apply(y,1,FUN=min)
  return(data.frame(id=1:N,z=z,y1=y[,1],y2=y[,2]))
}

# draw samples of 10, 100, 1,000 and 10,000
df0<-data.frame(N=c(10,100,1000,10000)) %>% 
  mutate(df=map(N,draw_samp,m1,m2,Sigma)) %>% unnest(df)

# make a plot
ggplot(data=df0 %>% gather(variable,value,-id,-N), aes(x=value,color=variable,fill=variable))+
  geom_density(alpha=0.1)+
  stat_function(fun=fmin, args=list(m1,m2,sigma1,sigma2,rho),color="black",linetype=2,size=0.75)+
  theme_minimal()+
  scale_color_viridis(discrete=T,option="C",end=0.8)+
  scale_fill_viridis(discrete=T,option="C",end=0.8)+
  facet_wrap(~forcats::fct_reorder(paste("N =",comma(N)),N))+
  labs(title="Distribution of bivariate normal variables (y1,y2) and z=min(y1,y2)",
       subtitle="shaded areas kernel fit to draw of N=1,000,\nblack line theoretical distribution of z = min(y1,y2)")

Ground-breaking? No, considering these results were published before I was born. But still, could be handy.

A note on competing risks

A small note on a statistical result. Some simulations and maximum likelihood estimation with R, Broom and Purrr.

Draw some random variables

Share!