Last updated: 2022-08-26
Checks: 6 1
Knit directory: RatXcan_Training/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220711)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ea180ba. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.Rhistory
Ignored: output/
Ignored: scripts/.Rhistory
Untracked files:
Untracked: .DS_Store
Untracked: .gitignore
Untracked: code/helpers.R
Unstaged changes:
Modified: analysis/.DS_Store
Modified: analysis/GEMMA_BSLMM.Rmd
Modified: analysis/Plot_Predictability_Heritability.Rmd
Modified: code/.DS_Store
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/GEMMA_BSLMM.Rmd
) and HTML
(docs/GEMMA_BSLMM.html
) files. If you’ve configured a
remote Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 28edc16 | sabrina-mi | 2022-08-22 | add link and change titles |
html | 28edc16 | sabrina-mi | 2022-08-22 | add link and change titles |
html | 4f1a0e2 | sabrina-mi | 2022-08-19 | forgot to add html |
Rmd | 362ebde | sabrina-mi | 2022-08-19 | add comments |
Rmd | a04720a | sabrina-mi | 2022-08-19 | add links to website |
html | a04720a | sabrina-mi | 2022-08-19 | add links to website |
Rmd | 2b4373a | sabrina-mi | 2022-08-18 | completed documentation but could use some reorganizing |
html | 2b4373a | sabrina-mi | 2022-08-18 | completed documentation but could use some reorganizing |
Rmd | f65b5a1 | sabrina-mi | 2022-08-10 | wflow_rename("analysis/Rat_Heritability.Rmd", "analysis/GEMMA_BSLMM.Rmd") |
GEMMA generates heritability and predictability values for a given phenotype. In our analysis, we use gene expression values in place of a phenotype to estimate the heritability. It takes genotype, gene expression, and an estimated relatedness matrix as input.
library(tidyverse)
"%&%" = function(a,b) paste(a,b,sep="")
# rat tissues
tissues = c("Ac", "Il", "Lh", "Pl", "Vo")
# Path to directory with genotype, gene expression, and phyMap files
data.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/data/"
# Create new directory for GEMMA inputs and outputs
base.dir <- "gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/"
# path to write bimbam files
bim.dir <- "gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/bimbam"
The “bimbam” file we are writing does not exactly follow the BIMBAM mean genotype file format, but it contains all the necessary information to create subseted BIMBAM files for estimating the genetic relatedness matrix (GRM).
The function below expects a snp annotation, genotype, and gene
expression table. Ours were loaded directly from
Data-From-Abe-Palmer-Lab/Rdata/genoGex.Rdata
.
The snp annotation table, phyMap
, should be a dataframe
SNP
, Chr
, Pos
, Ref
,
Alt
as columns.
SNP Chr Pos Ref Alt
1 chr1:55365 1 55365 A T
8 chr1:759319 1 759319 T C
9 chr1:1134030 1 1134030 A G
The genotypes, geno
, should be a matrix with individuals
as column names.
0007A0008B 0007A00024 0007A000DB 0007A001C5 0007A0059F 0007A00263
[1,] 2 2 2 2 2 2
[2,] 1 0 1 0 1 1
[3,] 2 2 2 1 2 2
The gene expression file should be a dataframe with ensemble IDs in the first column and individuals as the remaining column names.
EnsemblGeneID 00077E67B5 00077E8336 00077EA7E6 000789FF7D 00078A0041
2 ENSRNOG00000000007 207.78 208.71 240.37 244.36 241.34
5 ENSRNOG00000000010 0.13 0.83 0.52 0.27 0.14
6 ENSRNOG00000000012 0.91 2.19 1.17 0.28 1.01
fn_write_bimbam = function(tis){
bimfile <- bim.dir %&% tis %&% "_bimbam"
# allows us to call gene expression table for each tissue without hard coding
gex = eval(as.name("gex" %&% tis))
geno_tis = geno[,match(colnames(gex)[-1], colnames(geno))]
snps <- paste(phyMap$Chr, phyMap$Pos, phyMap$Ref, phyMap$Alt, sep = "_")
bimbam <- cbind(phyMap$Chr, phyMap$Pos, snps, phyMap$Ref, phyMap$Alt, geno_tis)
write.table(bimbam, file = bimfile,quote=F,col.names=F,row.names=F)
}
for tis in tissues{
fn_write_bimbam(tis)
}
For each gene, we subset the initial “bimbam” file to cis eQTL variants for the genotype input and use gene expression values for the phenotype input. The output is a GRM estimating relatedness between the rats. BSLMM takes the GRM as input to control for individaul relatedness in association tests between genetic variance and gene expression.
fn_run_gemma_grm = function(tis){
#Read in bimbam file
bimfile <- bim.dir %&% tis %&% "_bimbam" ###get SNP position information###
pheno.dir <- base.dir %&% tis %&% "/phenotype_files/"
ge.dir <- base.dir %&% tis %&% "/genotype_files/"
grm.dir <- base.dir %&% tis %&% "/output/"
bim <- read.table(bimfile)
gex = eval(as.name("gex" %&% tis))
ensidlist <- gex$EnsemblGeneID
setwd(base.dir %&% tis)
for(i in 1:length(ensidlist)){
cat(i,"/",length(ensidlist),"\n")
gene <- ensidlist[i]
geneinfo <- gtf[match(gene, gtf$Gene),]
chr <- geneinfo[1]
c <- chr$Chr
start <- geneinfo$Start - 1e6 ### 1Mb lower bound for cis-eQTLS
end <- geneinfo$End + 1e6 ### 1Mb upper bound for cis-eQTLs
chrsnps <- subset(bim, bim[,1]==c) ### pull snps on same chr
cissnps <- subset(chrsnps,chrsnps[,2]>=start & chrsnps[,2]<=end) ### pull cis-SNP info
snplist <- cissnps[,3:ncol(cissnps)]
write.table(snplist, file= ge.dir %&% "tmp." %&% tis %&% ".geno" %&% gene, quote=F,col.names=F,row.names=F)
geneexp <- cbind(as.numeric(gex[i,-1]))
write.table(geneexp, file= pheno.dir %&% "tmp.pheno." %&% gene, col.names=F, row.names = F, quote=F) #output pheno for gemma
runGEMMAgrm <- "gemma -g " %&% ge.dir %&% "tmp." %&% tis %&% ".geno" %&% gene %&% " -p " %&% pheno.dir %&% "tmp.pheno." %&% gene %&% " -gk -o /grm_" %&% tis %&% "_" %&% gene
system(runGEMMAgrm)
}
}
This function takes about an hour to run, so instead of looping, I ran it separately for each tissue.
fn_run_gemma_grm("Ac")
fn_run_gemma_grm("Il")
fn_run_gemma_grm("Lh")
fn_run_gemma_grm("Pl")
fn_run_gemma_grm("Vo")
GEMMA creates a new folder, output
, in the working
directory the GRMs. They should have file extension
cXX.txt
. When we run BSLMM, GEMMA writes to the same
output
folder, so we rename to separate the two.
DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
mv $DIR/output $DIR/GRMs
Running BSLMM is more computationally expensive than estimating GRMs, but parallelizable if we run it in CRI. We use Alvaro’s Badger script to create a PBS batch job for each gene.
I put all the bash commands for submitting BSLMM jobs into a shell script:
cd gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code
chmod +x Ac_GEMMA_BSLMM.sh
chmod +x Il_GEMMA_BSLMM.sh
chmod +x Lh_GEMMA_BSLMM.sh
chmod +x Pl_GEMMA_BSLMM.sh
chmod +x Vo_GEMMA_BSLMM.sh
./Ac_GEMMA_BSLMM.sh
Because of the CRI job limit, I waited for all the jobs to clear before repeating for the next tissue.
The details of the scripts are documented below:
DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
cd /gpfs/data/im-lab/nas40t2/Github/badger
/gpfs/data/im-lab/nas40t2/lab_software/miniconda/envs/ukb_env/bin/python src/badger.py \
-yaml_configuration_file $DIR/GEMMA_submission.yaml \
-parsimony 9
There should be a folder
/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/jobs
that contains all the job scripts. CRI has a job limit of 10,000, but we
have about 15,000 genes. We can work around it by repeating the chunk
below.
cd $DIR
for job in jobs/gemma*; do
if (($(qstat | wc -l) > 10000)); then
break
else
qsub $job
mv $job completed_jobs/
fi
done
The BSLMM output is a table with columns: h
,
pve
, rho
, pge
, pi
,
n_gamma
. PVE is the proportion variance explained by causal
variants and PGE is the proportion of genetic variance explained by
sparse effects. We interpret them as estimates for heritability and
sparsity, respectively.
For some bizarre reason, the output table contains an extra tab at
the end of each row, so we need to trim it in order for R to be able to
parse it. We use a simple shell script, RatXcan_Training/code/remove_trailing_tab.sh
.
CODE=/gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code
cd $CODE
chmod +x remove_trailing_tab.sh
DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
mkdir $DIR/hyp
for file in $DIR/output/*.hyp.txt; do
./remove_trailing_tab.sh $file
done
Now we should have a new directory, hyp
, that contains
all of the .hyp.txt
files in the correct format.
base.dir <- "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/"
Each gene results in whole columns of PVE values, so we summarize the
distribution in a new dataframe. (BSLMM employs the Metropolis-Hastings
algorithm, a way to estimate the probability distributions of PVE and
PGE given the observed data. Under a very, very loose interpretation,
BSLMM simulates different values of the parameters, computes the
probability they result in our observed data, and repeats until it
settles on the most likely sampled value of the parameters. Each row in
the .hyp.txt
file represents an iteration of the M-H
algorithm.)
library(stringr)
library(LearnBayes)
fn_PVE_estimates <- function(tis){
files <- list.files(path = base.dir %&% tis %&% "/hyp", pattern = ".hyp.txt", full.names = TRUE)
PVE_df <- as.data.frame(matrix(NA, 0, 4))
for(i in 1:length(files)){
gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
df <- read.delim(files[i])
q1 <- quantile(df$pve, 0.1)
q2 <- quantile(df$pve, 0.9)
quantile1=list(p=.1 ,x=q1)
quantile2=list(p=.9, x=q2)
if(quantile1$x != quantile2$x) {
prior <- beta.select(quantile1, quantile2)
credible_set <- list(qbeta(0.025,prior[1], prior[2]), qbeta(0.975,prior[1], prior[2]))
PVE_df[i, 1] <- gene
PVE_df[i, 2] <- qbeta(0.5, prior[1], prior[2])
PVE_df[i, 3] <- credible_set[1]
PVE_df[i, 4] <- credible_set[2]
}
else
i = i+1
}
write_tsv(PVE_df, base.dir %&% tis %&% "/PVE_estimates.txt", col_names = FALSE )
}
tissues = c("Ac", "Il", "Lh", "Pl", "Vo")
for tis in tissues{
fn_PVE_estimates(tis)
}
And repeat for PGE.
fn_PVE_estimates <- function(tis){
files <- list.files(path = base.dir %&% tis %&% "/hyp", pattern = ".hyp.txt", full.names = TRUE)
PGE_df <- as.data.frame(matrix(NA, 0, 4))
for(i in 1:length(files)){
gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
df <- read.delim(files[i])
q1 <- quantile(df$pge, 0.5)
q2 <- quantile(df$pge, 0.9)
quantile1=list(p=.5 ,x=q1)
quantile2=list(p=.9, x=q2)
if(quantile1$x != quantile2$x) {
prior <- beta.select(quantile1, quantile2)
credible_set <- list(qbeta(0.025,prior[1], prior[2]), qbeta(0.975,prior[1], prior[2]))
PGE_df[i, 1] <- gene
PGE_df[i, 2] <- qbeta(0.5, prior[1], prior[2])
PGE_df[i, 3] <- credible_set[1]
PGE_df[i, 4] <- credible_set[2]
}
else
i = i+1
}
write_tsv(PGE_df, base.dir %&% tis %&% "/PGE_estimates.txt", col_names = FALSE )
}
for tis in tissues{
fn_PGE_estimates(tis)
}
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 rstudioapi_0.11 whisker_0.4 knitr_1.39
[5] magrittr_1.5 workflowr_1.6.2 R6_2.4.1 rlang_1.0.2
[9] fastmap_1.1.0 stringr_1.4.0 tools_4.0.3 xfun_0.31
[13] cli_3.3.0 git2r_0.27.1 jquerylib_0.1.4 htmltools_0.5.2
[17] ellipsis_0.3.2 rprojroot_1.3-2 yaml_2.2.1 digest_0.6.27
[21] tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4 later_1.1.0.1
[25] sass_0.4.1 vctrs_0.4.1 fs_1.5.0 promises_1.1.1
[29] glue_1.6.2 evaluate_0.15 rmarkdown_2.14 stringi_1.5.3
[33] compiler_4.0.3 bslib_0.3.1 pillar_1.4.6 backports_1.1.10
[37] jsonlite_1.7.1 httpuv_1.5.4 pkgconfig_2.0.3