E+G+Methyl

Integration of mQTL and enhancer-target gene maps with GWAS summary results

Most trait-associated genetic variants identified in genome-wide association studies (GWAS) are located in non-coding regions of the genome and thought to act through their regulatory roles. To account for enriched association signals in DNA regulatory elements, we propose a novel and general gene-based association testing strategy called “E + G + Methyl” that integrates enhancer-target gene pairs and methylation quantitative trait locus (mQTL) data with GWAS summary results; it aims to both boost statistical power for new discoveries and enhance mechanistic interpretability of any new discovery. This online tutorial aims to facilitate data analyses via “E + G + Methyl” in the wider community. Please cite the following manuscript for using “E + G + Methyl”:

Wu, C., and Pan, W. (2019+). Integration of mQTL and enhancer-target gene maps with GWAS summary results. In revision.

For questions or comments, contact Wei Pan (weip@biostat.umn.edu) or Chong Wu (cwu3@fsu.edu).

Installation

install.packages('optparse','RColorBrewer','data.table','matlib','Rcpp','RcppArmadillo','bigmemory','mvtnorm','MASS')
install.packages('plink2R-master/plink2R/',repos=NULL)
if (!require("devtools"))
install.packages("devtools")
devtools::install_github("ChongWu-Biostat/aSPU2")
devtools::install_github("gabraham/plink2R/plink2R")

Data availability

We share the some processed datasets as follows. Please cite the original paper and our paper for using these datasets.

For using Hippo information: Cao, Q., Anyansi, C., Hu, X., Xu, L., Xiong, L., Tang, W., Mok, M. T., Cheng, C., Fan, X., Gerstein, M., et al. (2017). Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines. Nature Genetics, 49(10), 1428–1436.

For using MCF7 information: Li, G., Ruan, X., Auerbach, R. K., Sandhu, K. S., Zheng, M., Wang, P., Poh, H. M., Goh, Y., Lim, J., Zhang, J., et al. (2012). Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell, 148(1), 84–98.

Typical analysis and output

We will use the schizophrenia GWAS summary data (scz1) as an example to illustrate how to use “E + G + Methyl”. This example assumes you have set up the required environment and data as illustrated in the previous section. All analyses are based on R/3.3.1.

Input: GWAS summary statistics

At a minimum, we need a summary text file with a header row containing the following fields:

  1. SNP_map – SNP identifier (rs id)

  2. A1 – first allele (effect allele, should be capitalized)

  3. A2 – second allele (another allele, should be capitalized)

  4. Z – Z-scores, sign with respect to A1.

Additional columns are allowed but will be ignored. We recommend removing the additional columns before analysis to save space.

Note: We highly recommend running TWAS-aSPU with raw summary-level data. Pre-process step such as pruning and restricting to top SNPs may harm the performance.

Performing “E + G + Methyl”

To let users easier to use the software, we included all the required data in the software. Once we downloaded necessary packages and plink, we can run “E + G + Methyl” via the following single line.

Rscript egmethyl.R \
--sumstats scz1.txt \
--out ./out \
--out_name res \
--gene_list gene_list.txt \
--tissue Hippo \
--test_method asy

Due to some limits in GitHub, we have to put all the mQTL information in Google Drive. You need to download the mQTL informationn to conduct a genome-wide E + G + Methyl analysis. This example code should take less than one minute, and you will see some intermediate steps on the screen. If everything works, you will see a file under the .out.

Through egmethyl.R, we perform the following steps: (1) combine the GWAS and reference SNPs and remove ambiguous SNPs; (2) prioritize the SNPs that are mQTL in either enhancer regions or a gene body region; (3) impute GWAS Z-scores for any reference SNPs with missing GWAS Z-scores via the IMPG algorithm ([https:academic.oup.combioinformaticsarticle302029062422225/Fast-and-accurate-imputation-of-summary-statistics Pasaniuc et al. 2014); (4) perform E + G + Methyl analysis with different gene-based testing methods; (5) report results and store them in the working directory.

Output: Gene-disease association

The results are stored in a user-defined output file. For illustration, we explain the meaning of each entry in the first two lines of the output.

Col. num. Column name Value Explanations
1 gene A4GALT Feature/gene identifier, taken from gene_list file
2 CHR 22 Chromosome
3 P0 41088126 Gene start (from hg19 list from plink website)
4 P1 43116876 Gene end (from hg19 list from plink website)
5 nSNPs 5 Number of mQTLs in either enhancers or a gene body region
6 SPU(1) 0.69 SPU(1) p-value. The p-value is based on asymptotic distribution.
7 SPU(2) 0.61 SPU(2) p-value The p-value is based on asymptotic distribution.
8-23 SPU 0.69 SPU or aSPU p-values. Note: The results here are only for illustration.

Command-line parameters

TWAS_aSPU.R

Flag Usage Default
--sumstats ummary statistics (rds file and must have SNP and Z column headers) Required
--out Path to output file Required
--out_name output file name Required
--gene_list Gene list to be analyzed Required
--tissue Tissue to use. Either “MCF7” or “Hippo”. Required
--test_method Gene-based methods to be used. “aSPU” for aSPU test.. Required

Note: A single layer/loop of Monte Carlo simulation is used to obtain the \(p\)-values of all the SPU and aSPU tests simultaneously if test.method = “aSPU”. To save computational time, We use an adaptive way to select the number of simulations and calculate \(p\)-values efficiently.

FAQ

What QC is performed internally when using “E + G + Methyl”?

“E + G + Methyl” performs similar quality control steps as its competitors did. Specifically, we automatically match up SNPs, remove all ambiguous SNPs (A/C or G/T) and non-biallelic SNPs, and flip alleles to match the reference data.

Can I run “E + G + Methyl” with aSPU test?

Yes. aSPU generally provides more significant results and identifies more significant genes. You can use the following code to run aSPU test. However, you need to make sure you have installed the necessary supporting software. For example, in MacOS, you should install XCode and XCode select (by xcode-select –install).

Rscript egmethyl.R \
--sumstats scz1.txt \
--out ./out \
--out_name res \
--gene_list gene_list.txt \
--tissue Hippo \
--test_method aSPU

Can you provide the mQTL and enhancer information via ped format? I am not familiar with R and want to use Python to conduct some data analysis.

Yes. You can download the file through the bed file format link. Each file represents the mQTL information. The files with prefix “Hippo_” stands for the mQTL information of Hippo tissue, while the files with prefix “MCF7” stands for the mQTL information of MCF7 tissue. Note: the file is large, about 1GB.

Acknowledgements

This research was supported by NIH grants R21AG057038, R01HL116720, R01GM113250, and R01HL105397.

License

Maintainer: Chong Wu (cwu3@fsu.edu)

MIT

Copyright (c) 2013-present, Chong Wu (cwu3@fsu.edu) & Wei Pan(weip@biostat.umn.edu).