Quick tutorials
We provide four quick tutorials, each of which demonstrates one function of KGGSEE. Each tutorial gives the command line and a brief explanation of the flags and output files. Please refer to Detailed Document and Options for details. Run the first tutorial, GATES and ECS (gene-based association tests), before the others: it saves the parsed reference genotypes in VCFRefhg19/, which the later tutorials read. After that, the remaining tutorials can be run in any order.
Make sure that the KGGSEE Java archive kggsee.jar, the running resource data folder resources/, and the tutorial data folder tutorials/ are in the same directory. The commands below assume that tutorials/ is the working directory.
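A small pre-flight check can catch a misplaced file before a long run. The sketch below is only an illustration: the helper name `missing_items` is hypothetical and not part of KGGSEE. It reports which of the three expected items are absent from a given directory.

```python
from pathlib import Path

# Items that the tutorials expect to sit side by side in one directory.
EXPECTED = ("kggsee.jar", "resources", "tutorials")


def missing_items(base_dir):
    """Return the expected items that are absent from base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [name for name in EXPECTED if not (base / name).exists()]


# When run from tutorials/, the parent directory should hold all three items.
missing = missing_items(Path.cwd().parent)
print("Layout OK" if not missing else f"Missing: {', '.join(missing)}")
```

Running it from tutorials/ prints either a confirmation or the names that still need to be put in place.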
GATES and ECS (gene-based association tests)
GATES and ECS are two statistical methods that combine the p-values of a group of SNPs into one p-value. This analysis takes p-values of SNPs as input and outputs p-values of genes. The tutorial command is:
java -Xmx4g -jar ../kggsee.jar \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
--vcf-ref ./1kg_hg19_eur_chr1.vcf.gz \
--keep-ref ./VCFRefhg19/ \
--gene-assoc \
--out t1
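To give a feel for what combining SNP p-values into one gene p-value means, the sketch below implements the plain Simes procedure, which GATES extends. It is not KGGSEE's implementation: it assumes independent SNPs, whereas GATES replaces the SNP counts with effective numbers of independent tests estimated from LD. The function name and the example p-values are illustrative.

```python
def simes_p(pvalues):
    """Combine p-values with the plain Simes procedure (assumes independent tests)."""
    ordered = sorted(pvalues)
    m = len(ordered)
    # The j-th smallest p-value is scaled by m / j; the minimum is the combined p-value.
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


# Made-up p-values for four SNPs in one hypothetical gene.
snp_p = [0.001, 0.02, 0.3, 0.8]
print(simes_p(snp_p))
```

With correlated SNPs, the plain procedure over-counts the number of tests, which is why an LD reference such as the one given by --vcf-ref is needed in the real analysis.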
Options and input files
| Flag | Description |
|---|---|
| --sum-file | Specifies a whitespace-delimited file of GWAS summary statistics. In this analysis, columns of SNP coordinates and p-values (CHR, BP, and P by default) are needed. |
| --vcf-ref | Specifies a VCF file of genotypes sampled from a reference population. These genotypes are used to estimate LD correlation coefficients among SNPs. |
| --keep-ref | Keeps the parsed VCF file in KGGSEE object format in the specified directory. KGGSEE reads these files in the following tutorials, which is faster than parsing VCF files. |
| --gene-assoc | Triggers gene-based association tests. |
| --out | Specifies the prefix of output files. |
Output files
The numeric results of gene-based association tests are saved in t1.gene.pvalue.txt. There are seven columns in the file:
| Header | Description |
|---|---|
| Gene | Gene symbol |
| #Var | Number of variants within a gene |
| ECSP | p-value of ECS |
| GATESP | p-value of GATES |
| Chrom | Chromosome of the gene |
| Pos | Coordinate of the variant with the lowest p-value within the gene |
| GWAS_Var_P | p-value of the variant with the lowest p-value within the gene |
The columns of t1.gene.var.pvalue.txt.gz are the same as those of t1.gene.pvalue.txt. The difference is that t1.gene.pvalue.txt reports only the variant with the lowest p-value for each gene, whereas t1.gene.var.pvalue.txt.gz reports all variants.
The file t1.qq.pdf contains the Q-Q plots of the p-values from the input GWAS file (for variants inside or outside of genes) and of the gene-based p-values from GATES and ECS.
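Once the run finishes, a common next step is to pick the genes that pass a multiple-testing threshold. The sketch below assumes that t1.gene.pvalue.txt is tab-delimited with the header names listed above. The Bonferroni threshold and the helper name `significant_genes` are choices made for this illustration, not KGGSEE defaults. The demo parses an in-memory table with made-up values so that it runs without the output file.

```python
import csv
import io


def significant_genes(lines, p_col="ECSP", alpha=0.05):
    """Return (gene, p) pairs passing a Bonferroni threshold of alpha / number of genes."""
    tested = []
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            tested.append((row["Gene"], float(row[p_col])))
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric p-value
    if not tested:
        return []
    threshold = alpha / len(tested)
    hits = [(gene, p) for gene, p in tested if p < threshold]
    return sorted(hits, key=lambda pair: pair[1])


# Made-up rows in the assumed layout; for real use, pass open("t1.gene.pvalue.txt").
demo = io.StringIO(
    "Gene\t#Var\tECSP\tGATESP\tChrom\tPos\tGWAS_Var_P\n"
    "GENE_A\t12\t1e-8\t3e-8\t1\t1000\t2e-9\n"
    "GENE_B\t5\t0.2\t0.3\t1\t2000\t0.04\n"
)
print(significant_genes(demo))
```

Passing `p_col="GATESP"` applies the same filter to the GATES p-values.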
DESE (driver-tissue inference)
DESE performs phenotype-tissue association tests and conditional gene-based association tests at the same time. This analysis takes the p-values of a GWAS and the expression profiles of multiple tissues as input, and outputs p-values of phenotype-tissue associations and conditional p-values of genes. The tutorial command is:
java -Xmx4g -jar ../kggsee.jar \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
--saved-ref ./VCFRefhg19/ \
--expression-file ../resources/GTEx_v8_TMM_all.gene.meanSE.txt.gz \
--gene-assoc-condi \
--out t2
Options and input files
| Flag | Description |
|---|---|
| --sum-file | Specifies a whitespace-delimited file of GWAS summary statistics. In this analysis, columns of SNP coordinates and p-values are needed. |
| --saved-ref | Specifies the directory of the genotypes of the reference population in KGGSEE object format, which is saved by the --keep-ref flag in the first tutorial. |
| --expression-file | Specifies a gene expression file that contains means and standard errors of gene expressions for tissues/cell types. Here, the tutorial uses GTEx_v8_TMM_all.gene.meanSE.txt.gz from the resources/ folder. |
| --gene-assoc-condi | Triggers the DESE analysis. |
| --out | Specifies the prefix of the output files. |
Output files
The files t2.gene.pvalue.txt, t2.gene.var.pvalue.txt.gz, and t2.qq.pdf have the same format as their counterparts in the first tutorial. In addition, the results of the conditional gene-based association tests are saved in t2.gene.assoc.condi.txt, which contains nine columns:
| Header | Description |
|---|---|
| Gene | Gene symbol |
| Chrom | Chromosome of the gene |
| StartPos | Start coordinate of the gene |
| EndPos | End coordinate of the gene |
| #Var | Number of variants within the gene |
| Group | LD group number. Conditional ECS tests were performed for genes within the same LD group. |
| ECSP | p-value of ECS |
| CondiECSP | p-value of the conditional gene-based association tests by conditional ECS |
| GeneScore | Gene’s selective-expression score. A gene with a high score will be given higher priority to enter the conditioning procedure. |
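One way to read this table is to ask which genes lose their signal after conditioning, since such genes may look associated only through LD with a neighboring gene. The sketch below assumes a tab-delimited file with the headers listed above. The helper name and the threshold `alpha` are illustrative (a real analysis would use a multiple-testing-corrected threshold), and the rows are made up.

```python
import csv
import io


def explained_by_neighbors(lines, alpha=0.05):
    """List (LD group, gene) pairs with ECSP < alpha but CondiECSP >= alpha."""
    hits = []
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            ecs, condi = float(row["ECSP"]), float(row["CondiECSP"])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric p-values
        if ecs < alpha and condi >= alpha:
            hits.append((row["Group"], row["Gene"]))
    return hits


# Made-up rows; for real use, pass open("t2.gene.assoc.condi.txt").
demo = io.StringIO(
    "Gene\tChrom\tStartPos\tEndPos\t#Var\tGroup\tECSP\tCondiECSP\tGeneScore\n"
    "GENE_A\t1\t100\t200\t9\t1\t1e-6\t1e-6\t2.0\n"
    "GENE_B\t1\t300\t400\t4\t1\t1e-4\t0.4\t1.0\n"
)
print(explained_by_neighbors(demo))
```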
Results of driver-tissue prioritization are saved in t2.celltype.txt. For each interrogated tissue, a Wilcoxon rank-sum test assesses whether the median selective expression of the phenotype-associated genes is significantly higher than that of the other genes. The file contains four columns:
| Header | Description |
|---|---|
| TissueName | Name of the tissue being tested |
| Unadjusted(p) | Unadjusted p-values for the tissue-phenotype associations |
| Adjusted(p) | Adjusted p-values calculated by adjusting both selection bias and multiple testing |
| Median(IQR)SigVsAll | Median (interquartile range) expression of the conditionally significant genes and all the background genes |
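The idea behind the unadjusted test can be illustrated with a minimal one-sided rank-sum computation. This is a sketch under stated assumptions, not KGGSEE's code: it uses the normal approximation without tie or continuity correction, it does not reproduce the adjustment for selection bias, and the scores are made up.

```python
import math


def rank_sum_greater(sig, background):
    """One-sided Wilcoxon rank-sum (Mann-Whitney U) test via the normal approximation.

    Tests whether values in `sig` tend to be larger than those in `background`.
    Returns (U, p). No tie or continuity correction is applied.
    """
    n, m = len(sig), len(background)
    # U counts pairs where the sig value exceeds the background value (ties count 0.5).
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in sig for y in background)
    mean = n * m / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean) / sd
    return u, 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


# Made-up selective-expression scores in one tissue.
associated = [2.1, 1.8, 2.6, 1.9]
others = [0.3, 0.9, 1.1, 0.5, 1.2, 0.7]
print(rank_sum_greater(associated, others))
```

A small p-value here corresponds to a tissue in which the phenotype-associated genes are more selectively expressed than the background genes.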
EMIC (gene-expression causal-effect inference)
EMIC infers the causal effect of a gene’s expression on a complex phenotype from dependent expression quantitative trait loci (eQTLs) by a robust median-based Mendelian randomization. SNPs with effects on both the phenotype and a gene are considered as instrumental variables (IVs) of the gene, which can be used to infer the effect of the gene’s expression on the phenotype. This analysis takes the effect sizes of SNPs on the phenotype and on gene expression as input, and outputs effect sizes and p-values of the expression effects on the phenotype. The tutorial command is:
java -Xmx4g -jar ../kggsee.jar \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
--saved-ref ./VCFRefhg19/ \
--eqtl-file ./GTEx_v8_gene_BrainBA9.eqtl.txt.gz \
--emic-plot-p 0.01 \
--beta-col OR \
--beta-type 2 \
--emic \
--out t3
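The core idea of a median-based Mendelian randomization can be sketched in a few lines. For each IV, the Wald ratio divides the SNP effect on the phenotype by the SNP effect on expression, and the median of these ratios serves as the causal-effect estimate. The sketch below is a simplification and not the EMIC algorithm: it treats the IVs as independent and computes no standard error or p-value. It takes the natural logarithm of each odds ratio to obtain an effect on the log-odds scale, and all values are made up.

```python
import math
import statistics


def median_wald_ratio(odds_ratios, eqtl_betas):
    """Median of per-IV Wald ratios: log(OR) on the phenotype / eQTL effect on expression."""
    ratios = [
        math.log(odds_ratio) / eqtl_beta
        for odds_ratio, eqtl_beta in zip(odds_ratios, eqtl_betas)
        if eqtl_beta != 0  # a zero eQTL effect gives no usable ratio
    ]
    return statistics.median(ratios)


# Made-up effects for three IVs of one hypothetical gene.
print(median_wald_ratio([1.10, 1.05, 1.30], [0.40, 0.25, 0.35]))
```

Using the median rather than the mean keeps a single outlying IV from dominating the estimate.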
Options and input files
| Flag | Description |
|---|---|
| --sum-file | Specifies a whitespace-delimited file of GWAS summary statistics. In this analysis, in addition to the columns of SNP coordinates and p-values, two columns of SNP alleles (named A1 and A2 by default), a column of the effect allele (A1) frequency (named FRQ_U by default), a column of SNP effect sizes (set by --beta-col), and a column of their standard errors (named SE by default) are also needed. |
| --saved-ref | Specifies the directory of the genotypes of the reference population in KGGSEE object format, which is saved by the --keep-ref flag in the first tutorial. |
| --eqtl-file | Specifies a FASTA-style file of SNPs’ effects on gene expressions. Here, the tutorial uses GTEx_v8_gene_BrainBA9.eqtl.txt.gz. |
| --emic-plot-p | Specifies the p-value threshold for plotting a scatter plot. |
| --beta-col | Specifies the column name of effect sizes in the GWAS file. |
| --beta-type | Specifies the type of the effect sizes; here, 2 is used because the OR column holds odds ratios. |
| --emic | Triggers the EMIC analysis. |
| --out | Specifies the prefix of the output files. |
Output files
The numeric results of EMIC are saved in t3.emic.gene.txt. There are nine columns in the file:
| Header | Description |
|---|---|
| Gene | The gene symbol |
| #Var | Number of IVs within the gene |
| minP_EMIC | p-value of EMIC. When a transcript-level EMIC is performed, this is the minimum p-value among all transcripts of the gene. |
| Details_EMIC | Each detailed result has four components in brackets: the number of IVs, the causal effect estimate and its standard error, and the p-value. When a transcript-level EMIC is performed, results for each transcript are listed. |
| Chrom | Chromosome of a gene |
| Pos | The coordinate of the IV with the lowest GWAS p-value |
| GWAS_Var_P | GWAS p-value of an IV |
| GWAS_Var_Beta | The phenotype association effect size of an IV |
| GWAS_Var_SE | Standard error of an effect size |
The columns of t3.emic.gene.var.tsv.gz are the same as those of t3.emic.gene.txt. The difference is that t3.emic.gene.txt reports only the eQTL with the lowest GWAS p-value for each gene, whereas t3.emic.gene.var.tsv.gz reports all eQTLs. In this tutorial, the file t3.emic.gene.PleiotropyFinemapping.txt is empty, so we ignore it here.
File t3.qq.pdf saves the Q-Q plot for the GWAS p-values of IVs. File t3.emic.qq.pdf saves the Q-Q plot for the EMIC p-values.
File t3.scatterplots.emic.pdf saves the scatter plots of the genetic association with gene expression. Each gene with an EMIC p-value lower than the threshold set by --emic-plot-p (0.01 in this tutorial; 2.5E-3 by default) is plotted on a separate page of the PDF. A filled rectangle on a plot denotes an IV, and the red rectangle denotes the most significant GWAS variant among all the IVs of a gene. The slope of the line represents the estimated causal effect. The color of an IV denotes the degree of LD between that IV and the most significant GWAS variant. The error bar in a rectangle denotes the standard error of the coefficient estimate. File t3.scatterplots.emic.txt saves the numeric results of the scatter plots in t3.scatterplots.emic.pdf.
EHE (gene-based heritability estimation)
Heritability is a measure of how well differences in people’s genes account for differences in their phenotypes. This tutorial estimates the heritability of each gene using GWAS summary statistics. The tutorial command is:
java -Xmx4g -jar ../kggsee.jar \
--sum-file ./scz_gwas_eur_chr1.tsv.gz \
--saved-ref ./VCFRefhg19/ \
--case-col Nca \
--control-col Nco \
--gene-herit \
--out t4
Options and input files
| Flag | Description |
|---|---|
| --sum-file | Specifies a whitespace-delimited file of GWAS summary statistics. In this analysis, in addition to the columns of SNP coordinates and p-values, two columns of case and control sample sizes are also needed. |
| --saved-ref | Specifies the directory of the genotypes of the reference population in KGGSEE object format, which is saved by the --keep-ref flag in the first tutorial. |
| --case-col | Specifies the column name of the case sample size. |
| --control-col | Specifies the column name of the control sample size. |
| --gene-herit | Triggers gene-based association tests and estimations of gene heritability. |
| --out | Specifies the prefix of the output files. |
Output files
The output files are generally the same as those of the first tutorial, except that t4.gene.pvalue.txt and t4.gene.var.pvalue.txt.gz each contain two more columns, Herit and HeritSE, which are the estimate of a gene’s heritability and its standard error.
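The two columns can be turned into an approximate interval for each gene. The sketch below assumes that the estimate is roughly normally distributed, so that the interval is the estimate plus or minus 1.96 standard errors. Clipping to the range 0 to 1 is a choice made for this illustration, and the input values are made up.

```python
def herit_interval(herit, herit_se, z=1.96):
    """Return an approximate 95% interval, estimate +/- z * SE, clipped to [0, 1]."""
    low = max(0.0, herit - z * herit_se)
    high = min(1.0, herit + z * herit_se)
    return low, high


# Made-up Herit and HeritSE values for one hypothetical gene.
print(herit_interval(0.0012, 0.0004))
```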