Two step process :
Samtools mpileup
is used to deetct variants in all samples with sufficient coverage.
The resulting bcf files is then post-treated :
vcffilter
is used to remove variants with more than 2 alleles and some uncovered variantsbcftools
and written in a tsv file for easier parsing in RCommands for each genome :
#!/bin/bash
source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
# Pileup
samtools mpileup -t DP,DPR -g -f buchnera.fasta /omaha-beach/cguyomar/genomecov/buchnera/*.buchnera.bam > buchnera.bcf
# Filtering
bcftools call -mv -Ov buchnera.bcf > buchnera_called.vcf
vcftools --vcf buchnera_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out buchnera_vcftools --recode-INFO-all --recode
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' buchnera_vcftools.recode.vcf > buchnera.DPRtab
sed -i -e 's/\#//' buchnera.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/buchnera\///g' buchnera.DPRtab
sed -i -e 's/.buchnera.bam:DPR//g' buchnera.DPRtab
#!/bin/bash
source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
# Pileup
samtools mpileup -t DP,DPR -v -f hamiltonella.fasta /omaha-beach/cguyomar/genomecov/hamiltonella/*.hamiltonella.bam > hamiltonella.bcf
# Filtering
bcftools call -mv -Ov hamiltonella.bcf > hamiltonella_called.vcf
vcftools --vcf hamiltonella_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out hamiltonella_vcftools --recode-INFO-all --recode
#Adding outgroup
bgzip -f hamiltonella_bemisia.vcf
tabix -f hamiltonella_bemisia.vcf.gz
bgzip -f hamiltonella_vcftools.recode.vcf
tabix -f hamiltonella_vcftools.recode.vcf.gz
vcf-merge hamiltonella_vcftools.recode.vcf.gz hamiltonella_bemisia.vcf.gz > merged.vcf
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > hamiltonella.DPRtab
sed -i -e 's/\#//' hamiltonella.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/hamiltonella\///g' hamiltonella.DPRtab
sed -i -e 's/.hamiltonella.bam:DPR//g' hamiltonella.DPRtab
sed -i -e 's/ \./ 10,0/g' hamiltonella.DPRtab
sed -i -e 's/\. /10,0 /g' hamiltonella.DPRtab
sed -i -e 's/\[.\{1,2\}\]sample1:DPR/Hbemisia/g' hamiltonella.DPRtab
#!/bin/bash
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
source /local/env/envsamtools.sh
# Pileup
samtools mpileup -t DP,DPR -g -f regiella.fasta /omaha-beach/cguyomar/genomecov/regiella/*.regiella.bam > regiella.bcf
# Filtering
bcftools call -mv -Ov regiella.bcf > regiella_called.vcf
vcftools --vcf regiella_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out regiella_vcftools --recode-INFO-all --recode
# Merge with vcf from outgroup mummer alignment
bgzip regiella_vcftools.recode.vcf
tabix regiella_vcftools.recode.vcf.gz
bgzip regiella_hamiltonella.vcf
tabix regiella_hamiltonella.vcf.gz
vcf-merge regiella_vcftools.recode.vcf.gz regiella_hamiltonella.vcf.gz > merged.vcf
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > regiella.DPRtab
sed -i -e 's/\#//' regiella.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/regiella\///g' regiella.DPRtab
sed -i -e 's/.regiella.bam:DPR//g' regiella.DPRtab
sed -i -e 's/ \./ 10,0/g' regiella.DPRtab
sed -i -e 's/\. /10,0 /g' regiella.DPRtab
sed -i -e 's/\[.\{1,2\}\]sample1:DPR/Hdefensa/g' regiella.DPRtab
#!/bin/bash
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
source /local/env/envsamtools.sh
# Pileup
samtools mpileup -t DP,DPR -g -f serratia.fasta /omaha-beach/cguyomar/genomecov/serratia/*.serratia.bam > serratia.bcf
# Filtering
bcftools call -mv -Ov serratia.bcf > serratia_called.vcf
vcftools --vcf serratia_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out serratia_vcftools --recode-INFO-all --recode
# Merge with vcf from outgroup mummer alignment
bgzip serratia_vcftools.recode.vcf
tabix serratia_vcftools.recode.vcf.gz
bgzip serratia_Sct.vcf
tabix serratia_Sct.vcf.gz
vcf-merge serratia_vcftools.recode.vcf.gz serratia_Sct.vcf.gz > merged.vcf
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > serratia.DPRtab
sed -i -e 's/\#//' serratia.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/serratia\///g' serratia.DPRtab
sed -i -e 's/.serratia.bam:DPR//g' serratia.DPRtab
sed -i -e 's/ \./ 10,0/g' serratia.DPRtab
sed -i -e 's/\. /10,0 /g' serratia.DPRtab
sed -i -e 's/\[.\{1,2\}\]sample1:DPR/Sct/g' serratia.DPRtab
#!/bin/bash
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
source /local/env/envsamtools.sh
# Pileup
samtools mpileup -t DP,DPR -g -f fukatsuia.fasta /omaha-beach/cguyomar/genomecov/fukatsuia/*.fukatsuia.bam > fukatsuia.bcf
# Filtering
bcftools call -mv -Ov fukatsuia.bcf > fukatsuia_called.vcf
vcftools --vcf fukatsuia_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out fukatsuia_vcftools --recode-INFO-all --recode
# Merge with vcf from outgroup mummer alignment
bgzip -f fukatsuia_vcftools.recode.vcf
tabix -f fukatsuia_vcftools.recode.vcf.gz
bgzip -f fukatsuia_regiella.vcf
tabix -f fukatsuia_regiella.vcf.gz
vcf-merge fukatsuia_vcftools.recode.vcf.gz fukatsuia_regiella.vcf.gz > merged.vcf
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > fukatsuia.DPRtab
sed -i -e 's/\#//' fukatsuia.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/fukatsuia\///g' fukatsuia.DPRtab
sed -i -e 's/.fukatsuia.bam:DPR//g' fukatsuia.DPRtab
sed -i -e 's/ \./ 10,0/g' fukatsuia.DPRtab
sed -i -e 's/\. /10,0 /g' fukatsuia.DPRtab
sed -i -e 's/\[.\{1,2\}\]sample1:DPR/Regiella/g' fukatsuia.DPRtab
#!/bin/bash
source /local/env/envsamtools-1.6.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
samtools mpileup -t DP,DPR -v -f spiroplasma2.fasta /omaha-beach/cguyomar/genomecov/spiroplasma/*.spiroplasma.bam > spiroplasma.bcf
# Filtering
bcftools call -mv -Ov spiroplasma.bcf > spiroplasma_called.vcf
vcftools --vcf spiroplasma_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out spiroplasma_vcftools --recode-INFO-all --recode
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' spiroplasma_vcftools.recode.vcf > spiroplasma.DPRtab
sed -i -e 's/\#//' spiroplasma.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/spiroplasma\///g' spiroplasma.DPRtab
sed -i -e 's/.spiroplasma.bam:DPR//g' spiroplasma.DPRtab
#!/bin/bash
source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
samtools mpileup -t DP,DPR -g -f rickettsia.fasta /omaha-beach/cguyomar/genomecov/rickettsia/*.rickettsia.bam > rickettsia.bcf
# Filtering
bcftools call -mv -Ov rickettsia.bcf > rickettsia_called.vcf
vcftools --vcf rickettsia_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out rickettsia_vcftools --recode-INFO-all --recode
# Merge with vcf from outgroup mummer alignment
bgzip -f rickettsia_vcftools.recode.vcf
tabix -f rickettsia_vcftools.recode.vcf.gz
bgzip -f rickettsia_bemisia.vcf
tabix -f rickettsia_bemisia.vcf.gz
vcf-merge rickettsia_vcftools.recode.vcf.gz rickettsia_bemisia.vcf.gz > merged.vcf
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > rickettsia.DPRtab
sed -i -e 's/\#//' rickettsia.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/rickettsia\///g' rickettsia.DPRtab
sed -i -e 's/.rickettsia.bam:DPR//g' rickettsia.DPRtab
sed -i -e 's/ \./ 10,0/g' rickettsia.DPRtab
sed -i -e 's/\. /10,0 /g' rickettsia.DPRtab
sed -i -e 's/\[.\{1,2\}\]sample1:DPR/Rbemisia/g' rickettsia.DPRtab
#!/bin/bash
source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
# Pileup
samtools mpileup -t DP,DPR -g -f rickettsiella.fasta /omaha-beach/cguyomar/genomecov/rickettsiella/*.rickettsiella.bam > rickettsiella.bcf
# Filtering
bcftools call -mv -Ov rickettsiella.bcf > rickettsiella_called.vcf
vcftools --vcf rickettsiella_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out rickettsiella_vcftools --recode-INFO-all --recode
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' rickettsiella_vcftools.recode.vcf > rickettsiella.DPRtab
sed -i -e 's/\#//' rickettsiella.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/rickettsiella\///g' rickettsiella.DPRtab
sed -i -e 's/.rickettsiella.bam:DPR//g' rickettsiella.DPRtab
#!/bin/bash
source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
# Pileup
samtools mpileup -t DP,DPR -v -f mitochondrion.fasta /omaha-beach/cguyomar/genomecov/mitochondrie/*.mito.bam > mitochondrie.bcf
# Filtering
bcftools call -mv -Ov mitochondrie.bcf > mitochondrie_called.vcf
vcftools --vcf mitochondrie_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out mitochondrie_vcftools --recode-INFO-all --recode
# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' mitochondrie_vcftools.recode.vcf > mitochondrie.DPRtab
sed -i -e 's/\#//' mitochondrie.DPRtab
sed -i -e 's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/mitochondrie\///g' mitochondrie.DPRtab
sed -i -e 's/.mitochondrie.bam:DPR//g' mitochondrie.DPRtab
In order to insert an outgroup reference genome as a sample in the vcf file, the outgroup genome is aligned with the reference genome with mummer
.
The resulting mummer alignment is converted to a vcf file, which is merged to the main vcf.
Commands for outgroup vcf creation :
#!/bin/bash
# $1 :ref genome
# $2 : aligned genome
# $3 :prefix
nucmer -p $3 $1 $2
show-snps -Clr -x 1 -T $3.delta > $3.snps
python ~/bin/MUMmerSNPs2VCF_DPR.py $3.snps $3.vcf
MUMmerSNPs2VCF_DPR.py
is a version of MUMmerSNPs2VCF_DPR.py
(https://github.com/liangjiaoxue/PythonNGSTools) modified to add arbitrary read coverage information (DPR tag) to fit the pipeline.
Allele coverage file are parsed using the script cleaning.R
(script)
The following filters are applied for each sample :
Performed using nj_phylogeny.R
(script), can be run with command line : Rscript nj_phylogeny.R symbiont
Main steps :
This second approach is used when an annotated reference genome is available.
esearch
and efetch
utilities to the ipg format (identical protein groups), which contains the gene coordinates on the genome.This step is performed by the gene_phylogeny.R
script and associated functions
Main steps :
get_alt_seq
functionEach fasta file is aligned using MAFFT (version 7.310)
for file in *.fasta ; do linsi $file > $file.aligned ; done
Aligned fasta files are concatenated using ElConcatenero (https://github.com/ODiogoSilva/ElConcatenero)
ElConcatenero.py -if fasta -of fasta -in *.aligned -o concatenation
RaxML is then used to infer the phylogenetic tree
raxml -f a -# 1000 -x 123 -p 123 -m GTRGAMMA -s concatenation.fas.reduced -n symbiont_raxml