Thesis:GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

1. GWAS可视化学习

原文链接:gwaslab/examples/7_visualization/visualization_regional.ipynb at main · Cloufield/gwaslab

# 创建新环境
# conda env create -n gwaslab -c conda-forge python=3.9
# pip install gwaslab
# pip install wget
import wget
import inspect
import sys
import gwaslab as gl

########### 0. 下载(加载)(示例)数据 ###########
# download
wget.download("http://jenger.riken.jp/14/", out = "/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/t2d_bbj.txt.gz")

# load
mysumstats = gl.Sumstats("/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/t2d_bbj.txt.gz",
snpid = "SNP",
chrom = "CHR",
pos = "POS",
ea = "ALT",
nea = "REF",
neaf = "Frq",
beta = "BETA",
se = "SE",
p = "P",
direction = "Dir",
build = "19", # hg19/38, 更换为38可行
chrom_pat = "7", # 仅加载7号染色体
n = "N")
mysumstats.basic_check(verbose = False)

# print(inspect.getsource(mysumstats.basic_check)) # 函数源代码
# 检查SNP格式;检查CHR列是否输入正确;检查POS是否准确;等位基因是否标记正确;效应值、标准误、P值之间数量关系的检查;

1.0 可用文档

gl.check_available_ref()

1.1 曼哈顿图

########### 1. 使用Summary data绘制曼哈顿图(一条染色体) ###########
# print(inspect.getsource(mysumstats.plot_mqq))
mqq_fig, mqq_ax = mysumstats.plot_mqq(
mode="m",
anno="GENENAME",
anno_source="ensembl"
)
mqq_fig.savefig("/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/manhattan_plot.png")

基础曼哈顿图

1.2 区域图

# https://cloufield.github.io/gwaslab/RegionalPlot/

########### 2. 识别lead snp ###########
# 滑动窗口500Kb识别lead snp,阈值5e-8
mysumstats.get_lead()

########### 3. 绘制指定区域Regional plot ###########
region_fig, region_ax = mysumstats.plot_mqq(region=(7,156538803,157538803))

region_fig.savefig("/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/region_base_plot.png")

基础区域图

左图是简单的区域图,右图是Q-Q图,用于评估p值分布的合理性。图中$\lambda_{GC}$(基因组膨胀因子)用于衡量统计量的潜在膨胀或收缩。此处Q-Q图绘制的目标是观察到的p值分布与理论期望分布(假设无真实关联时的随机分布)。

  • X轴:理论期望的p值,即均匀分布下的p值;
  • Y轴:实际分布得到的p值;
  • 理想情况:所有点沿着对角线分布;
  • 偏离情况:1、整体高于对角线:此时存在真实关联信号或系统性误差(群体分层、技术偏差);2、整体低于对角线:可能存在过校正或统计效力不足的问题。

$\lambda_{GC}=\frac{\text{观察到的卡方统计量中位数}}{\text{理论卡方分布中位数(}0.456\mathrm{)}}$,当$\lambda_{GC}>1$时存在系统性膨胀;$\lambda_{GC}<1$时表示统计效力不足;$\lambda_{GC} \approx 1$时表示符合期望。

整体观察左右两图:

  1. 若区域图中有显著峰,并且Q-Q图中存在点明显偏离对角线,说明信息真实;
  2. 若Q-Q图整体膨胀,区域图中的峰可能包含假阳性。
########### 4. 绘制带有基因轨迹的Regional plot ###########
# https://github.com/Cloufield/gwaslab/blob/d2e4e4b88ae8b75a3416f622d1de657cef966ed1/src/gwaslab/viz_plot_mqqplot.py#L80C1-L84C26
# 指定Recombination的路径
# print(inspect.getsource(mysumstats.plot_mqq))
# 设置数据路径
region_withTRACE_fig, region_withTRACE_ax = mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True, rr_path="/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/recombination/hg19/genetic_map_GRCh37_chr7.txt",
rr_header_dict=None,
rr_chr_dict = None,
rr_lim=(0,100),
rr_ylabel=True)

region_withTRACE_fig.savefig("/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/region_withTRACE_plot.png")

基因轨迹区域图

########### 5. 绘制带有基因轨迹和LD information的Regional plot ###########
local_vcf_path = "/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/1k_eas_hg19/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz"

region_withTRACE_LD_fig, region_withTRACE_LD_ax = mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True, rr_path="/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/recombination/hg19/genetic_map_GRCh37_chr7.txt",
vcf_path = local_vcf_path)

region_withTRACE_LD_fig.savefig("/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/region_withTRACE_LD_plot.png")

进阶基因区域图

########### 6. Stacked Regional plot ###########
local_vcf_path_eas = "/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/1k_eas_hg19/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz"
local_vcf_path_eur = "/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/1k_eur_hg19/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz"

stacked_region_fig, stacked_region_ax = gl.plot_stacked_mqq(objects=[mysumstats,mysumstats],
vcfs=[local_vcf_path_eas, local_vcf_path_eur],
region=(7,156538803,157538803),
mode="r",
build="19", rr_path="/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/recombination/hg19/genetic_map_GRCh37_chr7.txt",
anno=True,
anno_style="right",anno_xshift=0.02,
anno_set=["7:156793450_G_GA","7:157038803_A_G"],
titles=["EAS","EUR"],
title_args={"size":20},
anno_args={"rotation":0})
stacked_region_fig.savefig("/data2/projects/bioinfo/tyliu/1_proteomics/MR/ckb_mr/coloc/plot/exam/stacked_plot.png")

stack

注:GWAS数据本身是EAS的数据,所以这里只是函数功能的展示,并不可以使用EUR的Reference进行LD信息的参考。

1.3 一些细节化参数

原网址:Manhattan and QQ plot - GWASLab

############ 常用参数 ############

region_ref="your_snpid" # 设置指定的Highlight的SNP位点,自此所有的regional plot围绕这个位点绘制
fontsize = 16 # 整体尺寸
title_fontsize = 18 # X Y轴坐标和字体的尺寸
fig_args={"figsize":(10,10) # 画布的尺寸

若遇到P极小的情况:

GWASLab包提供P值自动填补的功能:

# 例如:
mysumstats = gl.Sumstats("./500_cis_pqtl.csv",
sep = ",",
snpid = "id",
chrom = "CHROM",
pos = "GENPOS",
ea = "ALLELE1",
nea = "ALLELE0",
neaf = "A1FREQ",
beta = "BETA",
se = "SE",
p = "nom_p",
# direction = "Dir",
build = "38",
chrom_pat = "1"
)
mysumstats.basic_check(verbose = False)
mysumstats.fill_data(to_fill=["MLOG10P"], extreme=True) # 因为top位点的-log10p为500+, 所以通过-log10P绘制,而MLOG10P会自动根据数据中的beta se z值等自动填补
# 然后在绘图函数中设置scaled = TRUE即可
scaled=True