User:S&T Bioinformatics

Background:
Illumina chips use the TOP/BOT strand design, so the reference and alternative alleles defined on the chip differ from those in the sequencing data. The difference is not limited to strand flips: other types of discordance also occur, such as those we observed in the 20131124_92PanALL chip data.

Table1. Types and frequencies of discordance between the reference and alternative alleles in the chip data and in the sequencing data (dbSNP).

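<math>percentage = \frac{N_{InconcordanceType}}{N_{total}}</math>

where <math>N_{total}</math> is the total number of SNPs and <math>N_{InconcordanceType}</math> is the number of SNPs showing a given inconcordance type.
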

However, we often need to combine the chip data and the sequencing data in one analysis, for example when comparing the concordance of the chip data and the sequencing data, or when co-analyzing the 1KGP data and the GenomeDK data.

The flip method in plink fixes only the "StrandFalse" problem mentioned above, so with that approach most of the chip data would be thrown out.
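To illustrate why a plain strand flip is not enough, here is a minimal Python sketch that sorts a biallelic SNP into discordance classes and computes the per-class fraction used in the formula above. The label names (`Same`, `RefAltSwapped`, `StrandFlipped`, and so on) and the function names are hypothetical and are not the categories of Table1; only a class like `StrandFlipped` would be repaired by flipping alone.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(chip_ref, chip_alt, seq_ref, seq_alt):
    """Return a hypothetical discordance label for one biallelic SNP.

    All alleles must be single bases (A, C, G, T); anything else
    raises KeyError in this sketch.
    """
    chip = (chip_ref, chip_alt)
    seq = (seq_ref, seq_alt)
    flipped = (COMPLEMENT[chip_ref], COMPLEMENT[chip_alt])
    # A/T and C/G SNPs read the same on both strands,
    # so the strand cannot be inferred from the alleles.
    if COMPLEMENT[chip_ref] == chip_alt:
        return "Ambiguous"
    if chip == seq:
        return "Same"
    if chip == seq[::-1]:
        return "RefAltSwapped"
    if flipped == seq:
        return "StrandFlipped"
    if flipped == seq[::-1]:
        return "StrandFlippedAndSwapped"
    return "Other"


def percentages(labels):
    """Fraction of SNPs per label: N_label / N_total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}
```

For example, `classify("C", "T", "A", "G")` returns `"StrandFlippedAndSwapped"`: the chip alleles are on the opposite strand and the ref/alt roles are also exchanged, so flipping alone would still leave ref and alt the wrong way round.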

The strand problem was discovered when running an admixture analysis on the 20131124_92PanALL data together with the 1KGP data: all of the resulting patterns were wrong.

Aim:
A pipeline that converts the raw bed, bim, fam files of the chip into clean bed, bim, fam files that are compatible with the sequencing data.

Pip:
/project/DanishPanGenome/20131112_SampleEthnicity/20131210_TechnicalWork

or

/home/siyang/bin/commbin/20131210SampleEthnicity/CleanPip.sh

Example:
/project/DanishPanGenome/20131112_SampleEthnicity/20131210_TechnicalWork

or

/home/siyang/bin/commbin/20131210SampleEthnicity/example.sh

Preliminary requirements:
Install the following packages:

1) plinkseq-0.08-x86_64

2) fastaFromBed from BedTools

3) GATK

All of the above software is already installed on the GenomeDK cluster.

Note: This shell script generates five sub-scripts, one per step, and for now they have to be run one step at a time. Apologies for the hard-coding; we will try to find time to make it more user-friendly.

How does the CleanPip.sh work?
Step1: Plink to VCF

Convert the raw chip plink files (bed, bim, fam) into VCF files.

Step2: Clean the data

i. Rename the chromosome IDs in the chip data (the chip uses chr{1..25}, which is not compatible with the reference naming {1..22}, X, Y, M, ...).

ii. Skip all SNPs with ambiguous alleles (A/T, C/G), because their strand cannot be determined and so they cannot be strand-flipped reliably.

iii. Remove the INDELs, the multi-allelic SNPs, the SNPs located on the sex chromosomes and the non-polymorphic SNPs from the chip data.

iv. Update the SNP IDs in the chip data based on dbSNP137.


 * 1) Note: if we don't do this, plink will report errors when merging the chip data with the sequencing data.
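The Step2 filters can be sketched in Python roughly as follows. This is not the code in CleanPip.sh; the function names are hypothetical, genotypes are assumed to be tuples of allele indices (0 = ref, 1 = alt, `None` = missing), and the numeric chromosome mapping assumes the chip follows plink's coding (23 = X, 24 = Y, 25 = XY, 26 = mitochondrial), which should be checked against the actual bim file.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Assumption: plink-style numeric codes on the chip side,
# mapped to the X, Y, M naming used by the reference.
CHROM_NAMES = {"23": "X", "24": "Y", "25": "XY", "26": "M"}
SEX_CHROMS = {"X", "Y", "XY"}


def rename_chrom(chrom):
    """Strip an optional 'chr' prefix and map numeric sex/mito codes."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return CHROM_NAMES.get(chrom, chrom)


def keep_site(chrom, ref, alts, genotypes):
    """Return True if a site passes all Step2 filters in this sketch."""
    if rename_chrom(chrom) in SEX_CHROMS:
        return False  # sex chromosome
    if len(alts) != 1:
        return False  # multi-allelic, or no alternative allele
    alt = alts[0]
    if ref not in COMPLEMENT or alt not in COMPLEMENT:
        return False  # INDEL or non-ACGT allele
    if COMPLEMENT[ref] == alt:
        return False  # ambiguous A/T or C/G SNP
    observed = {a for gt in genotypes for a in gt if a is not None}
    if len(observed) < 2:
        return False  # non-polymorphic in this sample set
    return True
```

Updating the SNP IDs from dbSNP137 is left out here because it needs a lookup by chromosome and position against the dbSNP file.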

Step3: Re-arrange the ref/alt alleles in the chip data to match the sequencing data and, at the same time, recode the chip genotypes accordingly.
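The idea behind Step3 can be sketched in Python as below. It is an illustration under stated assumptions, not the pipeline's actual code: `harmonize` is a hypothetical name, genotypes are tuples of allele indices (0 = ref, 1 = alt, `None` = missing), and ambiguous A/T and C/G SNPs are assumed to have been removed already in Step2, since for them a strand flip and a ref/alt swap cannot be told apart.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize(chip_ref, chip_alt, genotype, seq_ref, seq_alt):
    """Re-express a chip genotype against the sequencing ref/alt alleles.

    Returns the recoded genotype, or None if the chip alleles cannot be
    matched to the sequencing alleles on either strand.
    """
    chip = [chip_ref, chip_alt]
    target = {seq_ref, seq_alt}
    if set(chip) != target:
        # Try the opposite strand.
        chip = [COMPLEMENT[a] for a in chip]
        if set(chip) != target:
            return None
    seq = [seq_ref, seq_alt]
    # Translate each allele index via its base: chip index -> base -> seq index.
    return tuple(None if a is None else seq.index(chip[a]) for a in genotype)
```

For example, `harmonize("G", "A", (0, 0), "A", "G")` returns `(1, 1)`: the sample is homozygous for G, which is the reference on the chip but the alternative allele in the sequencing data.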

Step4: Clean again (same as step2)


 * 1) Note: if we don't do this, plink will report errors again.

Step5: VCF to plink files

Proof that the method is correct:
1. After applying the above pipeline to the panAllGeno02Mind02HWE0001* plink bed, bim, fam files (prepared by Jette), the concordance rates between the sequencing data and the chip data for samples 1006-01 and 1006-02 are 0.998146086581797 and 0.998559064873503, respectively.

2. Admixture analysis using the post-treated data and the 1KGP data produces correct patterns.
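A genotype concordance rate like the one quoted in point 1 could be computed along the following lines. This is a sketch, not the script that produced the numbers above: the function name, the dictionary layout (site key mapped to a tuple of allele indices, `None` for a missing call) and the handling of missing calls are all assumptions. Whether missing genotypes are skipped or counted as discordant changes the result considerably, so the choice is exposed as a flag.

```python
def concordance(chip_gts, seq_gts, missing_is_discordant=False):
    """Fraction of shared sites whose unphased genotypes agree.

    chip_gts, seq_gts: dicts mapping a site key, e.g. (chrom, pos),
    to a genotype tuple such as (0, 1) or (None, None).
    """
    compared = matched = 0
    for site in chip_gts.keys() & seq_gts.keys():
        chip_gt, seq_gt = chip_gts[site], seq_gts[site]
        missing = None in chip_gt or None in seq_gt
        if missing and not missing_is_discordant:
            continue  # skip sites that are not called in both datasets
        compared += 1
        # Sort so that 0/1 and 1/0 count as the same genotype.
        if not missing and sorted(chip_gt) == sorted(seq_gt):
            matched += 1
    return matched / compared if compared else float("nan")
```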

Other stuff
Strangely, the following two sets of files differ.

These are the raw bed, bim, fam uploaded to GenomeDK cluster by Simon.

/project/DanishPanGenome/PilotChip/panAll_update.* (bed, bim, fam)

Jette's clean data was uploaded to the GenomeDK cluster by Siyang on Nov 24.

/project/DanishPanGenome/20131112_SampleEthnicity/20131124_92PanALL/panAllGeno02Mind02HWE0001.*

The genotypes differ between these two sets of files for every individual. panAll_update has only 50% concordance with the sequencing data (the discordant sites are all ./. in the chip data but 0/0, 0/1 or 1/1 in the sequencing data), whereas panAllGeno02Mind02HWE0001 has more than 99% concordance.

Table2. An overview of the chip data and the sequencing data (using 1006-01 and 1006-02 as examples).

RefAltSame: Concordance of the reference allele and alternative allele annotation between the Illumina chip and the sequencing data

RefAltGTSame: Concordance rate of the genotypes between the sequencing data and the chip data

For now, we can proceed using panAllGeno02Mind02HWE0001, but we will find time to come back to this problem later.