Many genetic variants affect complex traits through gene expression, which can be exploited to boost statistical power and enhance interpretation in genome-wide association studies (GWASs) as demonstrated by the transcriptome-wide association study (TWAS) approach. Furthermore, due to polygenic architecture, a complex trait may be affected by multiple genes with similar function as annotated in gene pathways. Here we extend TWAS from gene-based analysis to pathway-based analysis: we integrate public pathway collections, gene expression data and GWAS summary association statistics to identify gene pathways associated with complex traits. The basic idea is to impute the genetically regulated component of gene expression for each gene in a pathway, then adaptively test for association between imputed expression levels of the genes in the pathways and a GWAS trait by effectively aggregating possibly weak association signals across the genes in the pathway. Please cite the following manuscript for using our proposed method, aSPUpath2:
Wu, C. and Pan, W. (2018). Integrating eQTL data with GWAS summary statistics in pathwaybased analysis. Accepted by Genetic Epidemiology, early online.
For questions or comments regarding methods, contact Wei Pan (weip@biostat.umn.edu); For questions or comments regarding data & codes, contact Chong Wu (wuxx0845@umn.edu).
Note: To maximize the compatibility of aSPUpath2 and reduce the learning curve for using aSPUpath2, some steps are exactly the same and taken from the TWAS website.
Download and unpack the (1000 Genomes) LD reference data:
wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2 tar xjvf LDREF.tar.bz2
Download the weights from the TWAS website or PrediXcan website. Please put the weights into the WEIGHTS subfolder.
Download and unpack the Scripts and examples (Using clone or download option at GitHub).
git clone https://github.com/ChongWu-Biostat/aSPUpath2.git
Launch R and install required libraries:
install.packages('optparse','RColorBrewer','data.table','matlib','Rcpp','RcppArmadillo','bigmemory','mvtnorm','MASS','magic') if (!require("devtools")) install.packages("devtools") devtools::install_github("ChongWu-Biostat/aSPU2") devtools::install_github("gabraham/plink2R/plink2R")
aSPUpath2 integrates gene expression reference weights, GWAS summary data, SNP linkage disequilibrium (LD) information, and candidate pathways to identify pathways whose expression is associated with complex traits directly (Figure 1). We will use the PGC schizophrenia GWAS summary data (Ripke et al. 2013) as an example to illustrate how to use aSPUpath2. This example assumes you have setup the required environment and data as illustrated in the previous section. All analyses are based on R/3.3.1.
At a minimum, we need a summary rds file with a header row containing the following fields:
SNP_map – SNP identifier (CHR: BP)
A1 – first allele (effect allele, should be capitalized)
A2 – second allele (the other allele, should be capitalized)
Z – Z-scores, sign with respect to A1.
Additional columns are allowed but will be ignored. We recommend removing the additional columns before analysis to save space.
Note: The performance of aSPUpath2 depends on the density of GWAS summary data. We highly recommend running aSPUpath2 with raw GWAS summary data. Pre-process step such as pruning and restricting to top SNPs may harm the performance.
The pre-computed external weights can be downloaded from TWAS (Gusev et al. 2016) or PrediXcan websites.
After we prepared the data, we can run aSPUpath2 via the following single line.
Rscript aSPUpath2.R \ --sumstats ./Example/example.stat.rds \ --out ./Example/example_res.rds \ --weights ./WEIGHTS/CMC.BRAIN.RNASEQ.pos \ --weights_dir ./WEIGHTS/ \ --ref_ld ./LDREF/1000G.EUR. \ --pathway_list ./Example/example_GOCC.txt
This should take around one or two minutes, and you will see some intermediate steps on the screen. If everything works, you will see a file example_res.rds under the Example and output.txt in the working directory.
Through aSPUpath2, we perform the following steps: (1) combine the GWAS and reference SNPs and remove ambiguous SNPs. (2) impute GWAS Z-scores for any reference SNPs with missing GWAS Z-scores via the IMPG algorithm (Pasaniuc et al. 2014); (3) perform aSPUpath2; (4) report results and store them in the working directory.
The results are stored in a user-defined output file. For illustration, we explain the meaning of each entry in the first two lines of the output.
Col. num. | Column name | Value | Explanations |
1 | pathway | GO_FILOPODIUM | Pathway identifier, taken from pathway_list file |
2 | # genes | 6 | Number of genes with gene expression reference weights. |
3 | #nonzero_SNPs | 1714 | Number of non-zero weight SNPs |
4 | PathSPU(1) | 0.006 | p value of PathSPU(1). The p-value is based on asymptotic distribution. |
5 | PathSPU(2) | 0.003 | p value of PathSPU(2). The p-value is based on asymptotic distribution. |
6 | aSPUpath2 | 0.006 | p value of aSPUpath2. The p-value is based on asymptotic distribution. |
7 | time | 10.24 | running time (s) for aSPUpath2. |
Note: For a given pathway, we exclude the genes without gene expression reference weights.
aSPUpath2 can be applied to GWAS individual data as well. The main idea is the same. Specifically, we can calculate the score and its corresponding covariance matrix for SNPs in the gene regions. Then we can apply aSPU2 package with pre-computed imaging endophenotype based external weights provided on this website. Since there some restrictions sharing GWAS individual data, we do not provide any example regarding applying aSPUpath2 to GWAS individual data.
Flag | Usage | Default |
--sumstats | summary statistics (rds file and must have SNP and Z column headers) | Required |
--out | Path to output file | Required |
--weights | File listing molecular weight (rds files and must have columns ID, CHR, P0, P1, and Weights) | Required |
--ref_ld | Reference LD files in binary PLINK format | Required |
--pathway_list | Pathways we want to analyze | Required |
note: We use an asymptotics-based way to calculate p-values of aSPUpath2, which is fast in general.
aSPUpath2 has some conceptual similarities with two gene-based methods: TWAS and PrediXcan. What's the difference between them?
Yes, aSPUpath2 has some conceptual similarities with two gene-based methods: TWAS (Gusev, et al 2016) and PrediXcan (Gamazon et al 2015) that aim to impute the genetic regulated component of gene expression and then test the ‘imputed’ gene expression with the phenotype directly. However, these methods are focused on identifying significant genes instead of significant pathways. Importantly, unlike TWAS and PrediXcan, which use the weighted linear combination of genetic variants to construct test statistics, our approach aggregates information based on the underlying association patterns adaptively, thus increasing discovery power.
What related software is available?
To our knowledge, aSPUpath2 is the first software for pathway-based analysis with gene expression reference weights. However, there are many software for gene-based analysis with gene expression reference weights.
Two methods are highly correlated with two well-known groups. MetaXcan and PrediXcan (Gamazon et al 2015) developed by the Im lab perform gene-based association tests with and without summary data. TWAS by Gusev performs gene-based association tests with individual-level or summary statistics. MetaXcan and TWAS are similar. The main difference between TWAS and MetaXcan is that they use a slightly different way and different reference data to construct weights. We recommend you to check their websites to download other gene-expression based weights as well.
TWAS-aSPU was proposed to boost statistical powerful over TWAS and PrediXcan. Instead of using gene-expression based external weight, we may construct weights based on other endophenotypes. See our IWAS (Xu et al 2017) for more details.
What QC is performed internally when using aSPUpath2?
aSPUpath2 performs similar quality control steps as TWAS did. We automatically match up SNPs, remove ambiguous markers (A/C or G/T) and flip alleles to match the reference data.
This research was supported by NIH grants R21AG057038, R01HL116720, R01GM113250 and R01HL105397, and by the Minnesota Supercomputing Institute.
Ripke, S., C. O’Dushlaine, K. Chambert, J. L. Moran, A. K. K ̈ahler, S. Akterin, S. E. Bergen, A. L. Collins, J. J. Crowley, M. Fromer, et al. (2013). Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nature Genetics 45 (10), 1150–1159.
Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., Hirschhorn, J., Strachan, D.P., Patterson, N. and Price, A.L. (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics 30, 2906- 2914.
Gamazon, E.R. et al. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091-1098.
Gusev, A. et al. (2016) Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 48, 245-252.
Alvaro Barbeira, Kaanan P Shah, Jason M Torres, Heather E Wheeler, Eric S Torstenson, Todd Edwards, Tzintzuni Garcia, Graeme I Bell, Dan Nicolae, Nancy J Cox, Hae Kyung Im. (2016) MetaXcan: Summary Statistics Based Gene-Level Association Method Infers Accurate PrediXcan Results.
Xu, Z., Wu, C., Pan, W., & Alzheimer's Disease Neuroimaging Initiative. (2017). Imaging-wide association study: Integrating imaging endophenotypes in GWAS. NeuroImage, 159, 159-169.
Maintainer: Chong Wu (wuxx0845@umn.edu)
Copyright (c) 2013-present, Chong Wu (wuxx0845@umn.edu) & Wei Pan(weip@biostat.umn.edu).