Help:Formatting


Pretreatment of chip data
Background:

Illumina chips use the TOP/BOT strand design, so the reference and alternative alleles defined on the chip differ from those in the sequencing data. Strand flips are not the only problem: other types of discordance also occur, such as those we observed in the 20131124_92PanALL chip data.

Table 1. Types and frequencies of discordance in the reference and alternative alleles between the chip data and the sequencing data (dbSNP).

$$\text{percentage} = \frac{N_{inconcordance_{type}}}{N_{total}}$$

$$N_{total}$$: number of variants in the chip data

$$N_{inconcordance_{type}}$$: number of variants that are discordant between the chip and dbSNP137 with regard to the definition of the reference and alternative alleles
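As a minimal sketch of the percentage defined above, the snippet below counts variants per discordance type and divides by the total number of chip variants. The function name and the label "RefAltSwapped" are hypothetical; only "StrandFalse" is a type name used on this page, and the counts are made up for illustration.

```python
from collections import Counter


def discordance_percentages(types, n_total):
    """Return N_type / N_total for every discordance type label.

    types   -- one label per discordant variant (hypothetical labels)
    n_total -- total number of variants in the chip data
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    counts = Counter(types)
    return {label: n / n_total for label, n in counts.items()}


# Illustrative input only: 3 discordant variants out of 10 chip variants.
labels = ["StrandFalse", "StrandFalse", "RefAltSwapped"]
print(discordance_percentages(labels, n_total=10))
```

Note that, as in the formula, the result is a fraction of the total; multiply by 100 if a percent value is wanted.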

However, we routinely need to combine the chip data with the sequencing data, for example to compare the concordance between the chip data and the sequencing data, or to co-analyze the 1KGP data and the GenomeDK data.

The flip method in plink can only fix the "StrandFalse" problem mentioned above; if flipping is the only correction applied, most of the chip data will be thrown out.

The strand problem was discovered when trying to do admixture analysis using the 20131124_92PanALL data and the 1KGP data: all patterns were wrong.

Aim:

A pipeline that converts the raw bed, bim, fam files of the chip into clean bed, bim, fam files that are compatible with the sequencing data.

Method:

Pipeline:

/project/DanishPanGenome/20131112_SampleEthnicity/20131210_TechnicalWork

or

/home/siyang/bin/commbin/20131210SampleEthnicity/CleanPip.sh

Example: /project/DanishPanGenome/20131112_SampleEthnicity/20131210_TechnicalWork

or

/home/siyang/bin/commbin/20131210SampleEthnicity/example.sh

Preliminary requirements:

Install the following packages

1) plinkseq-0.08-x86_64

2) fastaFromBed from BedTools

3) GATK

All of the above software is already installed on the GenomeDK cluster.

Note: This shell script generates five sub-scripts, one per step, and so far they have to be run one by one. We apologize for the hard-coding and will try to find time to make it more user-friendly.

Step1: Plink to VCF

Convert the raw chip plink files (bed, bim, fam) into VCF files.

Step2: Clean the data

i. rearrange the chromosome IDs in the chip data (the chip uses chr{1..25}, which is not compatible with the reference naming {1..22}, X, Y, M...)

ii. skip all SNPs with ambiguous alleles (A/T, C/G), which cannot be strand-flipped

iii. remove the INDELs, the multi-allelic SNPs, the SNPs located on the sex chromosomes, and the non-polymorphic SNPs from the chip data

iv. update the SNP IDs in the chip data based on dbSNP137

Note: if we don't do this, plink will complain when merging the chip data with the sequencing data.
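The cleaning rules of Step2 can be sketched roughly as follows. This is not the code in CleanPip.sh; the function names are hypothetical, and the chromosome mapping assumes plink's usual numeric codes (23 = X, 24 = Y, 25 = XY, 26 = MT), which may need adjusting to the naming of the actual reference.

```python
# Assumed plink numeric codes for non-autosomal chromosomes.
PLINK_CODES = {"23": "X", "24": "Y", "25": "XY", "26": "MT"}
AUTOSOMES = {str(i) for i in range(1, 23)}
BASES = set("ACGT")
# Allele pairs that look identical on both strands.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def normalize_chrom(chrom):
    """Strip a leading 'chr' and translate plink numeric codes."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return PLINK_CODES.get(name, name)


def keep_snp(chrom, ref, alt):
    """Return True for autosomal, biallelic, polymorphic, non-ambiguous SNPs."""
    if ref not in BASES or alt not in BASES:
        return False  # INDEL, multi-allelic ("A,C") or non-polymorphic (".")
    if ref == alt:
        return False  # non-polymorphic
    if frozenset((ref, alt)) in AMBIGUOUS:
        return False  # A/T or C/G: strand cannot be resolved
    return normalize_chrom(chrom) in AUTOSOMES


print(keep_snp("chr1", "A", "G"), keep_snp("chr1", "A", "T"))
```

Updating the SNP IDs from dbSNP137 is left out here because it needs a position-to-rsID lookup table that depends on the local dbSNP file.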

Step3: Re-arrange the ref/alt alleles in the chip data based on the sequencing data and, at the same time, update the chip genotypes accordingly.
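A minimal sketch of the idea behind Step3, under stated assumptions: genotypes are tuples of allele indices (0 = ref, 1 = alt, None = missing), ambiguous A/T and C/G SNPs have already been removed in Step2, and the function names are hypothetical rather than taken from the pipeline.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def swap_genotype(gt):
    """Exchange allele indices 0 and 1, leaving missing calls untouched."""
    return tuple(None if a is None else 1 - a for a in gt)


def harmonize(chip_ref, chip_alt, gt, seq_ref, seq_alt):
    """Express a chip genotype against the sequencing ref/alt alleles.

    Returns the (possibly swapped) genotype, or None when the alleles
    cannot be reconciled even after a strand flip.
    """
    candidates = (
        (chip_ref, chip_alt),                          # same strand
        (COMPLEMENT[chip_ref], COMPLEMENT[chip_alt]),  # strand-flipped
    )
    for ref, alt in candidates:
        if (ref, alt) == (seq_ref, seq_alt):
            return gt                 # definitions already agree
        if (ref, alt) == (seq_alt, seq_ref):
            return swap_genotype(gt)  # ref and alt are exchanged
    return None


# Chip says C/T, sequencing says A/G: flipped strand and swapped alleles.
print(harmonize("C", "T", (1, 1), "A", "G"))
```

Sites for which `harmonize` returns None would be dropped, which is one reason a second cleaning pass is useful afterwards.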

Step4: Clean again (same as Step2)

Note: if we don't do this, plink will complain again.

Step5: VCF to plink files

Proof that the method is correct:

1. After applying the above pipeline to the panAllGeno02Mind02HWE0001* plink bed, bim, fam files (by Jette), the concordance rates between the sequencing and chip data for samples 1006-01 and 1006-02 are 0.998146086581797 and 0.998559064873503, respectively.
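For reference, a concordance rate of this kind can be computed along the following lines. This is only a sketch: the exact definition used for the numbers above is not stated on this page, so whether missing chip calls (./.) are skipped or counted as discordant is exposed as a parameter, and the site IDs and genotypes below are invented.

```python
def concordance_rate(chip, seq, skip_missing=True):
    """Fraction of shared sites where chip and sequencing genotypes agree.

    chip, seq    -- dicts mapping site ID to a tuple of allele indices,
                    or None for a missing call (./.)
    skip_missing -- if False, a missing chip call counts as discordant
    """
    matches = 0
    compared = 0
    for site, seq_gt in seq.items():
        if site not in chip or seq_gt is None:
            continue
        chip_gt = chip[site]
        if chip_gt is None:
            if not skip_missing:
                compared += 1
            continue
        compared += 1
        # Genotypes are unphased, so 0/1 and 1/0 are the same call.
        if sorted(chip_gt) == sorted(seq_gt):
            matches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return matches / compared


chip = {"rs1": (0, 1), "rs2": (1, 1), "rs3": None}
seq = {"rs1": (1, 0), "rs2": (0, 1), "rs3": (0, 0)}
print(concordance_rate(chip, seq))
```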

2. Admixture analysis using the post-treated data and the 1KGP data produces correct patterns.

Other stuff

It's weird that these two sets of files are different:

These are the raw bed, bim, fam files uploaded to the GenomeDK cluster by Simon.

/project/DanishPanGenome/PilotChip/panAll_update.* (bed, bim, fam)

Jette's clean data were uploaded by Siyang to the GenomeDK cluster on Nov 24.

/project/DanishPanGenome/20131112_SampleEthnicity/20131124_92PanALL/panAllGeno02Mind02HWE0001.*

The genotypes of every individual differ between these two sets of files. panAll_update has only 50% concordance with the sequencing data (the discordant sites are all ./. in the chip data but 0/0, 0/1 or 1/1 in the sequencing data), whereas panAllGeno02Mind02HWE0001 has more than 99% concordance.

For now we can proceed with panAllGeno02Mind02HWE0001, but we will find time to come back to this problem later.

You can format your text by using wiki markup. This consists of normal characters like asterisks, apostrophes or equal signs which have a special function in the wiki, sometimes depending on their position. For example, to format a word in italic, you include it in two pairs of apostrophes like ''this''.

Text formatting markup
{| class="wikitable"
! Description
! width=40% | You type
! width=40% | You get
|-
! colspan="3" style="background: #ABE" | Character (inline) formatting – applies anywhere
|-
| Italic text
| <pre>''italic''</pre>
| ''italic''
|-
| Bold text
| <pre>'''bold'''</pre>
| '''bold'''
|-
| Bold and italic
| <pre>'''''bold & italic'''''</pre>
| '''''bold & italic'''''
|-
| Strike text
| <pre>&lt;strike>strike text&lt;/strike></pre>
| <strike>strike text</strike>
|-
| Escape wiki markup
| <pre>&lt;nowiki>no ''markup''&lt;/nowiki></pre>
| <nowiki>no ''markup''</nowiki>
|-
| Escape wiki markup once
| <pre>[[API]]&lt;nowiki/>extension</pre>
| [[API]]<nowiki/>extension
|-
! colspan="3" style="background: #ABE" | Section formatting – only at the beginning of the line
|-
| Headings of different levels
|
<pre>
== Level 2 ==
=== Level 3 ===
==== Level 4 ====
===== Level 5 =====
====== Level 6 ======
</pre>
|
== Level 2 ==
=== Level 3 ===
==== Level 4 ====
===== Level 5 =====
====== Level 6 ======
|-
| Horizontal rule
|
<pre>
Text before
----
Text after
</pre>
|
Text before
----
Text after
|-
| Bullet list
|
<pre>
* Start each line
* with an asterisk (*).
** More asterisks give deeper
*** and deeper levels.
* Line breaks <br />don't break levels.
*** But jumping levels creates empty space.
Any other start ends the list.
</pre>
|
* Start each line
* with an asterisk (*).
** More asterisks give deeper
*** and deeper levels.
* Line breaks <br />don't break levels.
*** But jumping levels creates empty space.
Any other start ends the list.
|-
| Numbered list
|
<pre>
# Start each line
# with a number sign (#).
## More number signs give deeper
### and deeper
### levels.
# Line breaks <br />don't break levels.
### But jumping levels creates empty space.
# Blank lines

# end the list and start another.
Any other start also
ends the list.
</pre>
|
# Start each line
# with a number sign (#).
## More number signs give deeper
### and deeper
### levels.
# Line breaks <br />don't break levels.
### But jumping levels creates empty space.
# Blank lines

# end the list and start another.
Any other start also
ends the list.
|-
| Definition list
|
<pre>
;item 1
: definition 1
;item 2
: definition 2-1
: definition 2-2
</pre>
|
;item 1
: definition 1
;item 2
: definition 2-1
: definition 2-2
|-
| Indent text
|
<pre>
: Single indent
:: Double indent
::::: Multiple indent
</pre>
|
: Single indent
:: Double indent
::::: Multiple indent
|-
| Mixture of different types of list
|
<pre>
# one
# two
#* two point one
#* two point two
# three
#; three item one
#: three def one
# four
#: four def one
#: this looks like a continuation
#: and is often used
#: instead <br />of &lt;nowiki><br />&lt;/nowiki>
# five
## five sub 1
### five sub 1 sub 1
## five sub 2
</pre>
|
# one
# two
#* two point one
#* two point two
# three
#; three item one
#: three def one
# four
#: four def one
#: this looks like a continuation
#: and is often used
#: instead <br />of <nowiki><br /></nowiki>
# five
## five sub 1
### five sub 1 sub 1
## five sub 2
|-
| Preformatted text
|
<pre>
 Start each line with a space.
 Text is '''preformatted''' and
 ''markups'' '''''can''''' be done.
</pre>
|
 Start each line with a space.
 Text is '''preformatted''' and
 ''markups'' '''''can''''' be done.
|-
| Preformatted text blocks
|
<pre>
 &lt;nowiki>Start with a space in the first column,
(before the &lt;nowiki>).

Then your block format will be
    maintained.

This is good for copying in code blocks:

def function():
    """documentation string"""

    if True:
        print(True)
    else:
        print(False)&lt;/nowiki>
</pre>
|
 <nowiki>Start with a space in the first column,
(before the &lt;nowiki>).

Then your block format will be
    maintained.

This is good for copying in code blocks:

def function():
    """documentation string"""

    if True:
        print(True)
    else:
        print(False)</nowiki>
|}

Paragraphs
MediaWiki ignores single line breaks. To start a new paragraph, leave an empty line. You can force a line break within a paragraph with the HTML tag &lt;br />.

HTML tags
Some HTML tags are allowed in MediaWiki, for example &lt;code>, &lt;div> and &lt;span>. These apply anywhere you insert them.


Inserting symbols
Symbols and other special characters not available on your keyboard can be inserted through a special sequence of characters. Those sequences are called HTML entities. For example, the following sequence (entity) &amp;rarr; when inserted will be shown as right arrow HTML symbol &rarr; and &amp;mdash; when inserted will be shown as an em dash HTML symbol &mdash;.
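To see what the two entities mentioned above stand for outside the wiki, here is a small sketch using Python's standard `html` module. It only illustrates the entity-to-character mapping; it says nothing about how MediaWiki itself processes entities.

```python
import html

# Decode the named entities into the characters they represent.
arrow = html.unescape("&rarr;")
dash = html.unescape("&mdash;")
print(arrow, dash)

# Escaping the ampersand gives the form you type to show an entity as text.
print(html.escape("&rarr;"))
```

The last line prints `&amp;rarr;`, which is the sequence to type when the entity itself, rather than the arrow, should appear on the page.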

See the list of all HTML entities on the Wikipedia article List of HTML entities. Additionally, MediaWiki supports two non-standard entity reference sequences: &amp;רלמ; and &amp;رلم;, which are both considered equivalent to &amp;rlm;, a right-to-left mark. (Used when combining right-to-left languages with left-to-right languages in the same page.)

HTML tags and symbol entities displayed themselves (with and without interpreting them)

 * &amp;amp;euro; &rarr; &amp;euro;


 * &lt;span style="color: red; text-decoration: line-through;">Typo to be corrected&lt;/span> &rarr;  Typo to be corrected 


 * &amp;lt;span style="color: red; text-decoration: line-through;">Typo to be corrected&amp;lt;/span> &rarr; &lt;span style="color: red; text-decoration: line-through;">Typo to be corrected&lt;/span>

Nowiki for HTML
&lt;nowiki /> can prohibit (HTML) tags:
 * &lt;&lt;nowiki />pre> &rarr; &lt;pre>

But not &amp; symbol escapes:
 * &amp;&lt;nowiki />amp; &rarr; &amp;

To print &amp; symbol escapes as text, use "&amp;amp;" to replace the "&" character (e.g. type "&amp;amp;nbsp;", which results in "&amp;nbsp;").

Other formatting
Beyond the text formatting markup shown above, here are some other formatting references:


 * Links
 * Lists
 * Images
 * References - see Extension:Cite/Cite.php
 * Tables

You can find more references at Help:Contents.