Quick tutorials

We provide four quick tutorials; each shows one function of KGGSEE. In each tutorial, we provide the command line and a brief explanation of flags and output files. Please refer to Detailed Document and Options for details. The first tutorial (GATES and ECS (gene-based association tests)) should be done first, then you can run any of the following tutorials.

Make sure the KGGSEE Java archive kggsee.jar, the running resource data folder resources/, and the tutorial data folder tutorials/ are under the same directory. Then, we suppose you are under the directory tutorials/.

GATES and ECS (gene-based association tests)

GATES and ECS are two statistical methods combining the p-values of a group of SNPs into one p-value. This analysis inputs p-values of SNPs and outputs p-values of genes. The tutorial command is:

java -Xmx4g -jar ../kggsee.jar \
  --sum-file ./scz_gwas_eur_chr1.tsv.gz \
  --vcf-ref ./1kg_hg19_eur_chr1.vcf.gz \
  --keep-ref ./VCFRefhg19/ \
  --gene-assoc \
  --out t1

Options and input files

Flag

Description

--sum-file

Specifies a whitespace delimitated file of GWAS summary statistics. In this analysis, columns of SNP coordinates and p-values (CHR, BP, and P by default) are needed.

--vcf-ref

Specifies a VCF file of genotypes sampled from a reference population. These genotypes are used to estimate LD correlation coefficients among SNPs.

--keep-ref

Keep the parsed VCF file in KGGSEE object format in the specified directory. KGGSEE will read these files in the following tutorials, which will be faster than parsing VCF files.

--gene-assoc

Triggers gene-based association tests.

--out

Specifies the prefix of output files.

Output files

The numeric results of gene-based association tests are saved in t1.gene.pvalue.txt. There are seven columns in the file:

Header

Description

Gene

Gene symbol

#Var

Number of variants within a gene

ECSP

p-value of ECS

GATESP

p-value of GATES

Chrom

Chromosome of the gene

Pos

Coordinate of the variant with the lowest p-value within the gene

GWAS_Var_P

p-value of the variant with the lowest p-value within the gene

The columns of t1.gene.var.pvalue.txt.gz are the same as t1.gene.pvalue.txt. The difference is that, for each gene, in t1.gene.pvalue.txt, only the variant with the lowest p-value is output, while in t1.gene.var.pvalue.txt.gz, all variants are output.

The Q-Q plots for p-values of input GWAS file (inside or outside of each gene) and gene-based association tests by GATES or ECS are saved in t1.qq.pdf.

DESE (driver-tissue inference)

DESE performs phenotype-tissue association tests and conditional gene-based association tests at the same time. This analysis inputs p-values of a GWAS and expression profile of multiple tissues and outputs p-values of phenotype-tissue associations and conditional p-values of genes. The tutorial command is:

java -Xmx4g -jar ../kggsee.jar \
  --sum-file ./scz_gwas_eur_chr1.tsv.gz \
  --saved-ref ./VCFRefhg19/ \
  --expression-file ../resources/GTEx_v8_TMM_all.gene.meanSE.txt.gz \
  --gene-assoc-condi \
  --out t2

Options and input files

Flag

Description

--sum-file

Specifies a whitespace delimitated file of GWAS summary statistics. In this analysis, columns of SNP coordinates and p-values are needed.

--saved-ref

Specifies the directory of the genotypes of the reference population in KGGSEE object format, which is saved by the --keep-ref flag in the first tutorial.

--expression-file

Specifies a gene expression file that contains means and standard errors of gene expressions for tissues/cell types. Here GTEx_v8_TMM_all.gene.meanSE.txt is for gene-level DESE. Try GTEx_v8_TMM_all.transcript.meanSE.txt for transcript-level DESE.

--gene-assoc-condi

Triggers the DESE analysis.

--out

Specifies the prefix of the output files.

Output files

The three files of t2.gene.pvalue.txt, t2.gene.var.pvalue.txt.gz, and t2.qq.pdf are the same as their counterparts with the same suffixes of the first tutorial. In addition, the results of conditional gene-based association tests are in t2.gene.assoc.condi.txt which contains nine columns:

Header

Description

Gene

Gene symbol

Chrom

Chromosome of the gene

StartPos

Start coordinate of the gene

EndPos

End coordinate of the gene

#Var

Number of variants within the gene

Group

LD group number. Conditional ECS tests were performed for genes within the same LD group.

ECSP

p-value of ECS

CondiECSP

p-value of the conditional gene-based association tests by conditional ECS

GeneScore

Gene’s selective-expression score. A gene with a high score will be given higher priority to enter the conditioning procedure.

Results of driver-tissue prioritizations are in t2.celltype.txt. This is a Wilcoxon rank-sum test which tests whether the selective expression median of the phenotype-associated genes is significantly higher than that of the other genes in the interrogated tissue. The file contains three columns:

Header

Description

TissueName

Name of the tissue being tested

Unadjusted(p)

Unadjusted p-values for the tissue-phenotype associations

Adjusted(p)

Adjusted p-values calculated by adjusting both selection bias and multiple testing

Median(IQR)SigVsAll

Median (interquartile range) expression of the conditionally significant genes and all the background genes

EMIC (gene-expression causal-effect inference)

EMIC inferences gene expressions’ causal effect on a complex phenotype with dependent expression quantitative loci by a robust median-based Mendelian randomization. SNPs with effects on both the phenotype and a gene are considered as instrumental variables (IVs) of the gene, which can be used to infer the gene’s expression effect on the phenotype. This analysis uses effect sizes of SNPs on the phenotype and genes’ expressions and outputs effect sizes and p-values of the expression effects on the phenotype. The tutorial command is:

java -Xmx4g -jar ../kggsee.jar \
  --sum-file ./scz_gwas_eur_chr1.tsv.gz \
  --saved-ref ./VCFRefhg19/ \
  --eqtl-file ./GTEx_v8_gene_BrainBA9.eqtl.txt.gz \
  --emic-plot-p 0.01 \
  --beta-col OR \
  --beta-type 2 \
  --emic \
  --out t3

Options and input files

Flag

Description

--sum-file

Specifies a whitespace delimitated file of GWAS summary statistics. In this analysis, in addition to the columns of SNP coordinates and p-values, two columns of SNP alleles (named A1 and A2 by default), a column of the effect allele (A1) frequency (named FRQ_U by default), and two columns of SNP effect sizes and their standard errors (named SE by default) are also needed.

--saved-ref

Specifies the directory of genotypes of reference population in KGGSEE object format, which is saved by the --keep-ref flag in the first tutorial.

--eqtl-file

Specifies a fasta-styled file of SNPs’ effects on gene expressions. Here GTEx_v8_gene_BrainBA9.eqtl.txt.gz is for gene-level EMIC. You can try GTEx_v8_transcript_BrainBA9.eqtl.txt.gz for a transcript-level EMIC.

--emic-plot-p

Specifies the p-value threshold for plotting a scatter plot.

--beta-col

Specifies the column name of effect sizes in the GWAS file.

--beta-type

Specifies the type of the effect sizes; here 2 means that it is the odds ratio for a qualitative phenotype.

--emic

Triggers the EMIC analysis.

--out

Specifies the prefix of the output files.

Output files

The numeric results of EMIC are saved in t3.emic.gene.txt. There are nine columns in the file:

Header

Description

Gene

The gene symbol

#Var

Number of IVs within the gene

minP_EMIC

p-value of EMIC. When a transcript-level EMIC is performed, this is the minimum p-value among all transcripts of the gene.

Details_EMIC

Each detailed result has four components in brackets: the number of IVs, the causal effect estimate and its standard error, and the p-value. When a transcript-level EMIC is performed, results for each transcript are listed.

Chrom

Chromosome of a gene

Pos

The coordinate of the IV with the lowest GWAS p-value

GWAS_Var_P

GWAS p-value of an IV

GWAS_Var_Beta

The phenotype association effect size of an IV

GWAS_Var_SE

Standard error of an effect size

The columns of t3.emic.gene.var.tsv.gz are the same as t3.emic.gene.txt. The difference is that, for each gene, in t3.emic.gene.txt, only the eQTL with the lowest GWAS p-value is output, while in turorial_3.emic.gene.var.tsv.gz, all eQTLs are output. In this tutorial, the file t3.emic.gene.PleiotropyFinemapping.txt is empty, we ignore it here.

File t3.qq.pdf saves the Q-Q plot for the GWAS p-values of IVs. File t3.emic.qq.pdf saves the Q-Q plot for the EMIC p-values.

File t3.scatterplots.emic.pdf saves the scatter plots of the genetic association with gene expression. Each gene with an EMIC p-value lower than 2.5E-3 (default threshold) is saved on a separate page of the PDF. A filled rectangle on the plots denotes an IV. The red rectangle denotes the most significant GWAS variant among all the IVs of a gene. The slope of the line represents the estimated causal effect. The color of an IV denotes the degree of the LD between the IVs and the most significant GWAS variant. The error bar in a rectangle denotes the standard error of the coefficient estimate. File t3.scatterplots.emic.txt saves the numeric results of the scatter plots in t3.scatterplots.emic.pdf.

EHE (gene-based heritability estimation)

Heritability is a measure of how well differences in people’s genes account for differences in their phenotypes. This tutorial estimates the heritability of each gene using GWAS summary statistics. The tutorial command is:

java -Xmx4g -jar ../kggsee.jar \
  --sum-file ./scz_gwas_eur_chr1.tsv.gz \
  --saved-ref ./VCFRefhg19/ \
  --case-col Nca \
  --control-col Nco \
  --gene-herit \
  --out t4

Options and input files

Flag

Description

--sum-file

Specifies a whitespace delimitated file of GWAS summary statistics. In this analysis, in addition to the columns of SNP coordinates and p-values, two columns of case and control sample sizes are also needed.

--saved-ref

Specifies the directory of the genotypes of the reference population in KGGSEE object format, which is saved by the --keep-ref flag in the first tutorial.

--case-col

Specifies the column name of the case sample size.

--control-col

Specifies the column name of the control sample size.

--gene-herit

Triggers gene-based association tests and estimations of gene heritability.

--out

Specifies the prefix of the output files.

Output files

The output files are generally the same as the first tutorial, except that, in t4.gene.pvalue.txt, t4.gene.var.pvalue.txt.gz, there are two more columns named Herit and HeritSE, which are the estimate and its standard error of a gene’s heritability.