Step1: Genotype quality control (QC)
Tools: PLINK1 (http://zzz.bwh.harvard.edu/plink/)
1. Pre-imputation QC:
The variant-level QC:
Variants call rate < 95%
Minor allele frequency (MAF) < 0.001
Hardy-Weinberg equilibrium (HWE) P < 1 × 10-6
plink --bfile ${genotype_data} --geno 0.05 --maf 0.001 --hwe 1e-6 --make-bed --out ${output}
The sample-level QC:
Sex concordance check
plink --bfile ${genotype_data} --check-sex --out ${sexcheck}
Identity check (IBD > 0.1875)
plink --bfile ${genotype_data} --indep-pairwise 50 5 0.2 --out ${relatedness}
plink --bfile ${genotype_data} --extract ${relatedness.prune.in} --min 0.2 --genome --genome-full --out ${relatedness}
Excess heterozygosity (> mean 5SD)
plink --bfile ${genotype_data} --het --out ${homozygosity}
Calculate the observed heterozygosity rate per individual using the formula (N(NM) O(Hom))/N(NM).
Missing genotypes > 3%
plink --bfile ${genotype_data} --missing --out ${missingness}
Principal components analysis (PCA)
We removed the genomic regions with long-range LD (e.g. MHC region)2,3, which was listed on the website (https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)).
plink --bfile ${genotype_data} --exclude high-LD-regions.txt --range --make-bed --out ${genotype_data_rm_high-LD}
plink --bfile ${genotype_data_rm_high-LD} --indep-pairwise 1000 80 0.1 --out ${genotype_data_rm_high-LD}
plink --bfile ${genotype_data_rm_high-LD} --extract genotype_data_rm_high-LD.prune.in --make-bed --out ${genotype_data_prune}
plink --bfile ${genotype_data_prune} --pca --out ${genotype_data_prune_pca}
2. Imputation
Tools:
SHAPEIT24 (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html)
IMPUTE25 (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html)
Reference panel: 1000 Genomes (1KG)6 and SG10K7 projects
Shapeit
Shapeit --input-bed chr${chromosome}.bed chr${chromosome}.bim chr${chromosome}.fam --input-ref ${hapFile} ${legendFile} ${sampleFile} --exclude-snp ${excludeFile} --input-map ${mapFile} -O chr${chromosome}.phased --thread 4 --force
Imputation
impute2 -use_prephased_g -known_haps_g chr${chromosome}.phased.haps -m ${mapFile} -h ${hapFile} -l ${legendFile} -int $chunkStart $chunkEnd -Ne 20000 -o chr${chromosome}-${chunkStart}-${chunkEnd}.imputed
3. After-imputation QC
Sorting SNPs with following criterias from the snp-stats file generated by IMPUTE2:
MAF 0.01
Information score (INFO) ≥ 0.9
Step2: MRI data processing
1. Hippocampal and subfield volumes segmentation
Tools: FreeSurfer v7.08 (https://surfer.nmr.mgh.harvard.edu/)
recon-all -all -s ${SUBJECT}
segmentHA_T1.sh ${SUBJECT} ${SUBJECTS_DIR}
2. Median absolute deviation was calculated with Python (robust.mad)
3. Harmonization
Tools: ComBat harmonization9
function HAdata=harmonization_ROI(data,batch,covariates,method)
fprintf('--------HAdata=harmonize multi-batches(centers) ROI data based on Combat---------\n');
fprintf('Format: HAdata=harmonization_ROI(data,batch,covariates,method)\n');
fprintf('--data:\tdata need harmonization (mandatory),canbe mat variable, or csv,xls or text file\n');
fprintf('--batch:\tbatch (center) effects that need be removed(mandatory),vector or text file[Ndatas]\n');
fprintf('--covariates:\tcovariates of Biological information that need preserved (optional),matrix or text file[Ncovariates,Nsample]\n');
fprintf('--method;\tharmonization method (optional),''parametric''(default) or ''non-parametric''\n');
fprintf('-------Written by QinWen 20200707--------\n');
fprintf('This script depend on ''ComBatHarmonization''(by Jfortin1) at github:\nhttps://github.com/Jfortin1/ComBatHarmonization/tree/master/Matlab\n');
if nargin<1
data=spm_select(1,'any','Select a text(csv,xls,mat) data file needs harmonization:');
end
if nargin<2
batch=spm_select(1,'any','Select a text(mat) file defining the batch (center) effect:');
end
if nargin<3
covariates=spm_select(1,'any','Select a text(mat) file defining the Boilogical covariates:');
end
if nargin<4
method='parametric';
end
%check data
if ischar(data)
if exist(data,'file')
data_is_file=1;
[outdir,filename,ext]=fileparts(data);
switch ext
case '.txt'
data=load(data);
case '.csv'
data=csvread(data);
case '.xls'
data=xlsread(data);
case '.xlsx'
data=xlsread(data);
otherwise
data=load(data);
end
else
error('no exist data file %n',data);
end
end
%check batch
if exist(batch,'file')
batch=load(batch);
if isstruct(batch)
batch=struct2array(batch);
end
elseif exist(batch,'1')
batch=batch;
else
error('Bad format of batch \n');
end
%check covariates
if exist(covariates,'file')
covariates=load(covariates);
if isstruct(covariates)
covariates=struct2array(covariates);
end
end
if isempty(covariates)
covariates=[];
end
% maincode
HAdata=harmonization_run(data,batch,covariates,'',method);
if ~data_is_file
outpath=[pwd,filesep,'harmonized_data.mat'];
else
outpath=[outdir,filesep,'Ha_',filename,'.mat'];
end
save(outpath,'HAdata','batch','covariates');
4. Gaussian transformation
function outdata=gaussian_resmaple_Qin(data,outpath)
if nargin<2
outdir=pwd;
outpath=[outdir,filesep,'gauss_data.mat'];
end
if nargin<1
fprintf('=========resample data into Gaussain distribution=========\nFormat: outdata=gaussian_resmaple(data,outpath)');
fprintf('\nInput\n --data: input data [nSample nFeature]\n');
fprintf(' --outpath: output filepath(*.mat)\nOutput\n --outdata: output data\n');
fprintf('-------Written by Qin Wen at 20190828------\n');
return;
end
msize=size(data);
GS=randn(msize(1),1);
zGS=sort(zscore(GS));
resamp_data=[];
for f =1:msize(2)
fdata=data(:,f);
[ofdata,ind]=sort(fdata);
GS_fdata(ind)=zGS;
resamp_data(:,f)=GS_fdata;
end
outdata.resamp_data=resamp_data;
outdata.gauss_ref=zGS;
save(outpath,'-struct','outdata')
end
Step3: GWAS of hippocampal and subfield volumes
1. For autosomes
Tools: BGENIE v1.310,11 (https://jmarchini.org/bgenie/)
bgenie_v1.3_static2 --bgen ${bgen_file} --pheno ${pheno_file} --covar ${covar_file} --pvals --out ${output_file}
2. For X chromosome
Tools: PLINK(v.2.00)1 (http://zzz.bwh.harvard.edu/plink/)
plink2 --bfile ${genotype_data} --split-par 2699520 154931044 --make-bed --out ${output_file}
plink2 --bfile ${genotype_data} --pheno ${pheno_file} --glm hide-covar sex --covar ${cov_file} --out ${output_file}
Step4: Trans-ancestry meta-analysis
Tools: METASOFT v.212 (http://genetics.cs.ucla.edu/meta/)
python plink2metasoft.py ${output_file} ${GWAS summary statistics 1} ${GWAS summary statistics 2}
java -jar Metasoft.jar -input ${input_file} -mvalue -output ${output_file}
Step5: Plink clumping
Tools: PLINK1 (http://zzz.bwh.harvard.edu/plink/)
plink \
--bfile ${genotype_data}\
--clump-p1 1.13e-9 \
--clump-p2 1.13e-9 \
--clump-r2 0.1 \
--clump-kb 3000 \
--clump ${GWAS summary statistics} \
--clump-snp-field RSID \
--clump-field P \
--out ${output_file}
Step6: Genomic control inflation factor (Lambda GC) and linkage disequilibrium score regression (LDSC) intercept
Tools: LDSC13 (https://github.com/bulik/ldsc)
Covariate-adjusted LD score regression14 (https://github.com/immunogenomics/cov-ldsc)
python munge_sumstats.py --sumstats ${GWAS summary statistics} --out ${cleaned GWAS summary statistics}
python ldsc.py --h2 ${cleaned GWAS summary statistics} --ref-ld-chr ${LDscore} --w-ld-chr ${LDscore} --out ${output_file}
Step7: Statistical fine mapping
Tools: PAINTOR15 (https://github.com/gkichaev/PAINTOR_V3.0/wiki)
Calculate LD matrix:
plink --bfile ${genotype_data} --extract ${locus_SNP} --a1-allele ${effect_allele} --r square --out ${output_file}
Finemapping with mcmc model:
PAINTOR -input {finemapping_list} -in ${input_dir} -Zhead Z -LDname ld -out ${output_dir} -mcmc -annotations Coding
Finemapping with one causal assumption:
PAINTOR -input ${finemapping_list} -in ${input_dir} -Zhead Z -LDname ld -out ${output_dir} -enumerate 1 -annotations Coding
Step8: Polygenic score (PGS)
Tools: PRSice v.2.3.5 software16 (https://www.prsice.info/)
Rscript PRSice.R \
--dir ${PRSice_dir} \
--prsice PRSice_linux \
--base ${base_data} \
--ld ${genotype_data} \
--type bgen \
--target ${target_data} \
--thread 1 \
--stat BETA \
--binary-target F \
--pheno ${phenotype} \
--pheno-col pheno \
--cov ${cov_file}
--out ${output_file}
Step9: Colocalization analysis
Tools: Coloc 5.1.017 (https://chr1swallace.github.io/coloc/index.html)
d1=list(beta=in_file$BETA1,varbeta=in_file$SE1*in_file$SE1,snp=in_file$RSID,position=in_file$POS, type="quant", N=sample_size1, MAF=in_file$MAF1)
d2=list(beta=in_file$BETA2,varbeta=in_file$SE2*in_file$SE2,snp=in_file$RSID,position=in_file$POS, type="quant", N=sample_size2, MAF=in_file$MAF2)
my.res <- coloc.abf(dataset1=d1, dataset2=d2)