aSPUpath2

Integrating eQTL data with GWAS summary statistics in pathway-based analysis

Many genetic variants affect complex traits through gene expression, which can be exploited to boost statistical power and enhance interpretation in genome-wide association studies (GWASs) as demonstrated by the transcriptome-wide association study (TWAS) approach. Furthermore, due to polygenic architecture, a complex trait may be affected by multiple genes with similar function as annotated in gene pathways. Here we extend TWAS from gene-based analysis to pathway-based analysis: we integrate public pathway collections, gene expression data and GWAS summary association statistics to identify gene pathways associated with complex traits. The basic idea is to impute the genetically regulated component of gene expression for each gene in a pathway, then adaptively test for association between imputed expression levels of the genes in the pathways and a GWAS trait by effectively aggregating possibly weak association signals across the genes in the pathway. Please cite the following manuscript for using our proposed method, aSPUpath2:

Wu, C. and Pan, W. (2018). Integrating eQTL data with GWAS summary statistics in pathwaybased analysis. Accepted by Genetic Epidemiology, early online.

For questions or comments regarding methods, contact Wei Pan (weip@biostat.umn.edu); For questions or comments regarding data & codes, contact Chong Wu (wuxx0845@umn.edu).

Installation

Note: To maximize the compatibility of aSPUpath2 and reduce the learning curve for using aSPUpath2, some steps are exactly the same and taken from the TWAS website.

wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2
tar xjvf LDREF.tar.bz2
git clone https://github.com/ChongWu-Biostat/aSPUpath2.git
install.packages('optparse','RColorBrewer','data.table','matlib','Rcpp','RcppArmadillo','bigmemory','mvtnorm','MASS','magic')
if (!require("devtools"))
install.packages("devtools")
devtools::install_github("ChongWu-Biostat/aSPU2")
devtools::install_github("gabraham/plink2R/plink2R")

Typical analysis and output

aSPUpath2 integrates gene expression reference weights, GWAS summary data, SNP linkage disequilibrium (LD) information, and candidate pathways to identify pathways whose expression is associated with complex traits directly (Figure 1). We will use the PGC schizophrenia GWAS summary data (Ripke et al. 2013) as an example to illustrate how to use aSPUpath2. This example assumes you have setup the required environment and data as illustrated in the previous section. All analyses are based on R/3.3.1.

Portrait of Chong Wu 

Input: GWAS summary statistics

At a minimum, we need a summary rds file with a header row containing the following fields:

  1. SNP_map – SNP identifier (CHR: BP)

  2. A1 – first allele (effect allele, should be capitalized)

  3. A2 – second allele (the other allele, should be capitalized)

  4. Z – Z-scores, sign with respect to A1.

Additional columns are allowed but will be ignored. We recommend removing the additional columns before analysis to save space.

Note: The performance of aSPUpath2 depends on the density of GWAS summary data. We highly recommend running aSPUpath2 with raw GWAS summary data. Pre-process step such as pruning and restricting to top SNPs may harm the performance.

Input: external weights

The pre-computed external weights can be downloaded from TWAS (Gusev et al. 2016) or PrediXcan websites.

Performing the TWAS-aSPU

After we prepared the data, we can run aSPUpath2 via the following single line.

Rscript aSPUpath2.R \
--sumstats ./Example/example.stat.rds \
--out ./Example/example_res.rds \
--weights ./WEIGHTS/CMC.BRAIN.RNASEQ.pos \
--weights_dir ./WEIGHTS/ \
--ref_ld ./LDREF/1000G.EUR. \
--pathway_list ./Example/example_GOCC.txt

This should take around one or two minutes, and you will see some intermediate steps on the screen. If everything works, you will see a file example_res.rds under the Example and output.txt in the working directory.

Through aSPUpath2, we perform the following steps: (1) combine the GWAS and reference SNPs and remove ambiguous SNPs. (2) impute GWAS Z-scores for any reference SNPs with missing GWAS Z-scores via the IMPG algorithm (Pasaniuc et al. 2014); (3) perform aSPUpath2; (4) report results and store them in the working directory.

Output: pathway-disease association

The results are stored in a user-defined output file. For illustration, we explain the meaning of each entry in the first two lines of the output.

Col. num. Column name Value Explanations
1 pathway GO_FILOPODIUM Pathway identifier, taken from pathway_list file
2 # genes 6 Number of genes with gene expression reference weights.
3 #nonzero_SNPs 1714 Number of non-zero weight SNPs
4 PathSPU(1) 0.006 p value of PathSPU(1). The p-value is based on asymptotic distribution.
5 PathSPU(2) 0.003 p value of PathSPU(2). The p-value is based on asymptotic distribution.
6 aSPUpath2 0.006 p value of aSPUpath2. The p-value is based on asymptotic distribution.
7 time 10.24 running time (s) for aSPUpath2.

Note: For a given pathway, we exclude the genes without gene expression reference weights.

Further Analyses

Testing with the individual level GWAS data

aSPUpath2 can be applied to GWAS individual data as well. The main idea is the same. Specifically, we can calculate the score and its corresponding covariance matrix for SNPs in the gene regions. Then we can apply aSPU2 package with pre-computed imaging endophenotype based external weights provided on this website. Since there some restrictions sharing GWAS individual data, we do not provide any example regarding applying aSPUpath2 to GWAS individual data.

Command-line parameters

aSPUpath2.R

Flag Usage Default
--sumstats summary statistics (rds file and must have SNP and Z column headers) Required
--out Path to output file Required
--weights File listing molecular weight (rds files and must have columns ID, CHR, P0, P1, and Weights) Required
--ref_ld Reference LD files in binary PLINK format Required
--pathway_list Pathways we want to analyze Required

note: We use an asymptotics-based way to calculate p-values of aSPUpath2, which is fast in general.

FAQ

aSPUpath2 has some conceptual similarities with two gene-based methods: TWAS and PrediXcan. What's the difference between them?

Yes, aSPUpath2 has some conceptual similarities with two gene-based methods: TWAS (Gusev, et al 2016) and PrediXcan (Gamazon et al 2015) that aim to impute the genetic regulated component of gene expression and then test the ‘imputed’ gene expression with the phenotype directly. However, these methods are focused on identifying significant genes instead of significant pathways. Importantly, unlike TWAS and PrediXcan, which use the weighted linear combination of genetic variants to construct test statistics, our approach aggregates information based on the underlying association patterns adaptively, thus increasing discovery power.

What related software is available?

To our knowledge, aSPUpath2 is the first software for pathway-based analysis with gene expression reference weights. However, there are many software for gene-based analysis with gene expression reference weights.

Two methods are highly correlated with two well-known groups. MetaXcan and PrediXcan (Gamazon et al 2015) developed by the Im lab perform gene-based association tests with and without summary data. TWAS by Gusev performs gene-based association tests with individual-level or summary statistics. MetaXcan and TWAS are similar. The main difference between TWAS and MetaXcan is that they use a slightly different way and different reference data to construct weights. We recommend you to check their websites to download other gene-expression based weights as well.

TWAS-aSPU was proposed to boost statistical powerful over TWAS and PrediXcan. Instead of using gene-expression based external weight, we may construct weights based on other endophenotypes. See our IWAS (Xu et al 2017) for more details.

What QC is performed internally when using aSPUpath2?

aSPUpath2 performs similar quality control steps as TWAS did. We automatically match up SNPs, remove ambiguous markers (A/C or G/T) and flip alleles to match the reference data.

Acknowledgements

This research was supported by NIH grants R21AG057038, R01HL116720, R01GM113250 and R01HL105397, and by the Minnesota Supercomputing Institute.

References

License

Maintainer: Chong Wu (wuxx0845@umn.edu)

MIT

Copyright (c) 2013-present, Chong Wu (wuxx0845@umn.edu) & Wei Pan(weip@biostat.umn.edu).