Estimate Rat Heritability and Sparsity

Last updated: 2022-08-26

Checks: 6 1

Knit directory: RatXcan_Training/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20220711)

The command set.seed(20220711) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ea180ba

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ea180ba. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.Rhistory
    Ignored:    output/
    Ignored:    scripts/.Rhistory

Untracked files:
    Untracked:  .DS_Store
    Untracked:  .gitignore
    Untracked:  code/helpers.R

Unstaged changes:
    Modified:   analysis/.DS_Store
    Modified:   analysis/GEMMA_BSLMM.Rmd
    Modified:   analysis/Plot_Predictability_Heritability.Rmd
    Modified:   code/.DS_Store

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/GEMMA_BSLMM.Rmd) and HTML (docs/GEMMA_BSLMM.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	28edc16	sabrina-mi	2022-08-22	add link and change titles
html	28edc16	sabrina-mi	2022-08-22	add link and change titles
html	4f1a0e2	sabrina-mi	2022-08-19	forgot to add html
Rmd	362ebde	sabrina-mi	2022-08-19	add comments
Rmd	a04720a	sabrina-mi	2022-08-19	add links to website
html	a04720a	sabrina-mi	2022-08-19	add links to website
Rmd	2b4373a	sabrina-mi	2022-08-18	completed documentation but could use some reorganizing
html	2b4373a	sabrina-mi	2022-08-18	completed documentation but could use some reorganizing
Rmd	f65b5a1	sabrina-mi	2022-08-10	wflow_rename("analysis/Rat_Heritability.Rmd", "analysis/GEMMA_BSLMM.Rmd")

GEMMA generates heritability and predictability values for a given phenotype. In our analysis, we use gene expression values in place of a phenotype to estimate the heritability. It takes genotype, gene expression, and an estimated relatedness matrix as input.

library(tidyverse)
"%&%" = function(a,b) paste(a,b,sep="")

# rat tissues
tissues = c("Ac", "Il", "Lh", "Pl", "Vo")

# Path to directory with genotype, gene expression, and phyMap files
data.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/data/"
# Create new directory for GEMMA inputs and outputs
base.dir <- "gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/"
# path to write bimbam files
bim.dir <- "gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/bimbam"

Format Genotype Data

The “bimbam” file we are writing does not exactly follow the BIMBAM mean genotype file format, but it contains all the necessary information to create subseted BIMBAM files for estimating the genetic relatedness matrix (GRM).

The function below expects a snp annotation, genotype, and gene expression table. Ours were loaded directly from Data-From-Abe-Palmer-Lab/Rdata/genoGex.Rdata.

The snp annotation table, phyMap, should be a dataframe SNP, Chr, Pos, Ref, Alt as columns.

            SNP Chr     Pos Ref Alt
1    chr1:55365   1   55365   A   T
8   chr1:759319   1  759319   T   C
9  chr1:1134030   1 1134030   A   G

The genotypes, geno, should be a matrix with individuals as column names.

     0007A0008B 0007A00024 0007A000DB 0007A001C5 0007A0059F 0007A00263
[1,]          2          2          2          2          2          2
[2,]          1          0          1          0          1          1
[3,]          2          2          2          1          2          2

The gene expression file should be a dataframe with ensemble IDs in the first column and individuals as the remaining column names.

       EnsemblGeneID 00077E67B5 00077E8336 00077EA7E6 000789FF7D 00078A0041
2 ENSRNOG00000000007     207.78     208.71     240.37     244.36     241.34
5 ENSRNOG00000000010       0.13       0.83       0.52       0.27       0.14
6 ENSRNOG00000000012       0.91       2.19       1.17       0.28       1.01

fn_write_bimbam = function(tis){
  bimfile <- bim.dir %&% tis %&% "_bimbam"

# allows us to call gene expression table for each tissue without hard coding
  gex = eval(as.name("gex" %&% tis))

  geno_tis = geno[,match(colnames(gex)[-1], colnames(geno))]
  snps <- paste(phyMap$Chr, phyMap$Pos, phyMap$Ref, phyMap$Alt, sep = "_")
  bimbam <- cbind(phyMap$Chr, phyMap$Pos, snps, phyMap$Ref, phyMap$Alt, geno_tis)
  write.table(bimbam, file = bimfile,quote=F,col.names=F,row.names=F)
}

for tis in tissues{
  fn_write_bimbam(tis)
}

Generate GRMs

For each gene, we subset the initial “bimbam” file to cis eQTL variants for the genotype input and use gene expression values for the phenotype input. The output is a GRM estimating relatedness between the rats. BSLMM takes the GRM as input to control for individaul relatedness in association tests between genetic variance and gene expression.

fn_run_gemma_grm = function(tis){
  #Read in bimbam file
  bimfile <- bim.dir %&% tis %&% "_bimbam" ###get SNP position information###

  pheno.dir <- base.dir %&% tis %&% "/phenotype_files/"
  ge.dir <- base.dir %&% tis %&% "/genotype_files/"
  grm.dir <- base.dir %&% tis %&% "/output/"

  bim <- read.table(bimfile)

  gex = eval(as.name("gex" %&% tis))
  ensidlist <- gex$EnsemblGeneID
  setwd(base.dir %&% tis)
  for(i in 1:length(ensidlist)){
    cat(i,"/",length(ensidlist),"\n")
    gene <- ensidlist[i]
    geneinfo <- gtf[match(gene, gtf$Gene),]
    chr <- geneinfo[1]
    c <- chr$Chr
    start <- geneinfo$Start - 1e6 ### 1Mb lower bound for cis-eQTLS
    end <- geneinfo$End + 1e6 ### 1Mb upper bound for cis-eQTLs
    chrsnps <- subset(bim, bim[,1]==c) ### pull snps on same chr
    cissnps <- subset(chrsnps,chrsnps[,2]>=start & chrsnps[,2]<=end) ### pull cis-SNP info
    snplist <- cissnps[,3:ncol(cissnps)]
    write.table(snplist, file= ge.dir %&% "tmp." %&% tis %&% ".geno" %&% gene, quote=F,col.names=F,row.names=F)
    geneexp <- cbind(as.numeric(gex[i,-1]))
    write.table(geneexp, file= pheno.dir %&% "tmp.pheno." %&% gene, col.names=F, row.names = F, quote=F) #output pheno for gemma
    runGEMMAgrm <- "gemma -g " %&%  ge.dir %&% "tmp." %&% tis %&% ".geno" %&% gene %&% " -p " %&% pheno.dir %&% "tmp.pheno." %&%  gene  %&%  " -gk -o /grm_" %&% tis %&% "_" %&% gene
    system(runGEMMAgrm)
  }
}

This function takes about an hour to run, so instead of looping, I ran it separately for each tissue.

fn_run_gemma_grm("Ac")
fn_run_gemma_grm("Il")
fn_run_gemma_grm("Lh")
fn_run_gemma_grm("Pl")
fn_run_gemma_grm("Vo")

GEMMA creates a new folder, output, in the working directory the GRMs. They should have file extension cXX.txt. When we run BSLMM, GEMMA writes to the same output folder, so we rename to separate the two.

DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
mv $DIR/output $DIR/GRMs

Run BSLMM

Running BSLMM is more computationally expensive than estimating GRMs, but parallelizable if we run it in CRI. We use Alvaro’s Badger script to create a PBS batch job for each gene.

I put all the bash commands for submitting BSLMM jobs into a shell script:

cd gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code
chmod +x Ac_GEMMA_BSLMM.sh
chmod +x Il_GEMMA_BSLMM.sh
chmod +x Lh_GEMMA_BSLMM.sh
chmod +x Pl_GEMMA_BSLMM.sh
chmod +x Vo_GEMMA_BSLMM.sh

./Ac_GEMMA_BSLMM.sh

Because of the CRI job limit, I waited for all the jobs to clear before repeating for the next tissue.

The details of the scripts are documented below:

DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
cd /gpfs/data/im-lab/nas40t2/Github/badger
/gpfs/data/im-lab/nas40t2/lab_software/miniconda/envs/ukb_env/bin/python src/badger.py \
-yaml_configuration_file $DIR/GEMMA_submission.yaml \
-parsimony 9

There should be a folder /gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/jobs that contains all the job scripts. CRI has a job limit of 10,000, but we have about 15,000 genes. We can work around it by repeating the chunk below.

cd $DIR
for job in jobs/gemma*; do
if  (($(qstat | wc -l) > 10000)); then
break
else
qsub $job
mv $job completed_jobs/
fi
done

Read Heritability and Sparsity Estimates

The BSLMM output is a table with columns: h, pve, rho, pge, pi, n_gamma. PVE is the proportion variance explained by causal variants and PGE is the proportion of genetic variance explained by sparse effects. We interpret them as estimates for heritability and sparsity, respectively.

For some bizarre reason, the output table contains an extra tab at the end of each row, so we need to trim it in order for R to be able to parse it. We use a simple shell script, RatXcan_Training/code/remove_trailing_tab.sh.

CODE=/gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code
cd $CODE
chmod +x remove_trailing_tab.sh
DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
mkdir $DIR/hyp
for file in $DIR/output/*.hyp.txt; do
  ./remove_trailing_tab.sh $file
done

Now we should have a new directory, hyp, that contains all of the .hyp.txt files in the correct format.

base.dir <- "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/"

Each gene results in whole columns of PVE values, so we summarize the distribution in a new dataframe. (BSLMM employs the Metropolis-Hastings algorithm, a way to estimate the probability distributions of PVE and PGE given the observed data. Under a very, very loose interpretation, BSLMM simulates different values of the parameters, computes the probability they result in our observed data, and repeats until it settles on the most likely sampled value of the parameters. Each row in the .hyp.txt file represents an iteration of the M-H algorithm.)

library(stringr)
library(LearnBayes)

fn_PVE_estimates <- function(tis){
  files <- list.files(path = base.dir %&% tis %&% "/hyp", pattern = ".hyp.txt", full.names = TRUE)
  PVE_df <- as.data.frame(matrix(NA, 0, 4)) 
  
  for(i in 1:length(files)){
    gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
    df <- read.delim(files[i])
  
    q1 <- quantile(df$pve, 0.1)
    q2 <- quantile(df$pve, 0.9)
    quantile1=list(p=.1 ,x=q1)
    quantile2=list(p=.9, x=q2)
    if(quantile1$x != quantile2$x) {
    prior <- beta.select(quantile1, quantile2)
    credible_set <- list(qbeta(0.025,prior[1],  prior[2]), qbeta(0.975,prior[1],  prior[2]))
  
    PVE_df[i, 1] <- gene
    PVE_df[i, 2] <- qbeta(0.5, prior[1],  prior[2])
    PVE_df[i, 3] <- credible_set[1]
    PVE_df[i, 4] <- credible_set[2]
    }
    else 
      i = i+1
  }

  write_tsv(PVE_df, base.dir %&% tis %&% "/PVE_estimates.txt", col_names = FALSE )
}

tissues = c("Ac", "Il", "Lh", "Pl", "Vo")
for tis in tissues{
  fn_PVE_estimates(tis)
}

And repeat for PGE.

fn_PVE_estimates <- function(tis){
  files <- list.files(path = base.dir %&% tis %&% "/hyp", pattern = ".hyp.txt", full.names = TRUE)
  PGE_df <- as.data.frame(matrix(NA, 0, 4))
  for(i in 1:length(files)){
    gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
    df <- read.delim(files[i])
  
    q1 <- quantile(df$pge, 0.5)
    q2 <- quantile(df$pge, 0.9)
    quantile1=list(p=.5 ,x=q1)
    quantile2=list(p=.9, x=q2)
    if(quantile1$x != quantile2$x) {
    prior <- beta.select(quantile1, quantile2)
    credible_set <- list(qbeta(0.025,prior[1],  prior[2]), qbeta(0.975,prior[1],  prior[2]))
  
    PGE_df[i, 1] <- gene
    PGE_df[i, 2] <- qbeta(0.5, prior[1],  prior[2])
    PGE_df[i, 3] <- credible_set[1]
    PGE_df[i, 4] <- credible_set[2]
    }
    else 
      i = i+1
  }
  write_tsv(PGE_df, base.dir %&% tis %&% "/PGE_estimates.txt", col_names = FALSE )
}

for tis in tissues{
  fn_PGE_estimates(tis)
}

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     rstudioapi_0.11  whisker_0.4      knitr_1.39      
 [5] magrittr_1.5     workflowr_1.6.2  R6_2.4.1         rlang_1.0.2     
 [9] fastmap_1.1.0    stringr_1.4.0    tools_4.0.3      xfun_0.31       
[13] cli_3.3.0        git2r_0.27.1     jquerylib_0.1.4  htmltools_0.5.2 
[17] ellipsis_0.3.2   rprojroot_1.3-2  yaml_2.2.1       digest_0.6.27   
[21] tibble_3.0.4     lifecycle_0.2.0  crayon_1.3.4     later_1.1.0.1   
[25] sass_0.4.1       vctrs_0.4.1      fs_1.5.0         promises_1.1.1  
[29] glue_1.6.2       evaluate_0.15    rmarkdown_2.14   stringi_1.5.3   
[33] compiler_4.0.3   bslib_0.3.1      pillar_1.4.6     backports_1.1.10
[37] jsonlite_1.7.1   httpuv_1.5.4     pkgconfig_2.0.3

Estimate Rat Heritability and Sparsity

sabrina-mi

2022-08-09

Format Genotype Data

Generate GRMs

Run BSLMM

Read Heritability and Sparsity Estimates