Phylogenetic analyses of symbiont genomes

Variant detection
Gene-set phylogeny using maximumum likelyhood based inference

Search of membrane specific proteins
Computing gene sequences for each sample
Phylogenetic analyses

Phylogenetic analyses of symbiont genomes

Two step process :

Variant calling on reference genomes (run on cluster)
Phylogenetic analysis, either by whole genome neighbor joining if no annotation is available, or by gene set selection and Maximum Likelyhood phylogenetic inference (run locally)

Variant detection

Running samtools mpileup

Samtools mpileup is used to deetct variants in all samples with sufficient coverage.

The resulting bcf files is then post-treated :

vcffilter is used to remove variants with more than 2 alleles and some uncovered variants
If necessary, an outgroup is generated by a mummer alignment converted to vcf, and merged to the main vcf file (see below).
DPR field (read coverage) is extracted using bcftools and written in a tsv file for easier parsing in R
Some cleaning is made, mainly in the file header

Commands for each genome :

#!/bin/bash

source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh

# Pileup
samtools mpileup -t DP,DPR -g -f buchnera.fasta /omaha-beach/cguyomar/genomecov/buchnera/*.buchnera.bam > buchnera.bcf

# Filtering
bcftools call -mv -Ov  buchnera.bcf > buchnera_called.vcf 
vcftools --vcf buchnera_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out buchnera_vcftools --recode-INFO-all --recode 

# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' buchnera_vcftools.recode.vcf > buchnera.DPRtab

sed -i -e 's/\#//' buchnera.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/buchnera\///g' buchnera.DPRtab
sed  -i -e 's/.buchnera.bam:DPR//g' buchnera.DPRtab

#!/bin/bash

source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh

# Pileup
samtools mpileup -t DP,DPR -v -f hamiltonella.fasta /omaha-beach/cguyomar/genomecov/hamiltonella/*.hamiltonella.bam > hamiltonella.bcf

# Filtering
bcftools call -mv -Ov  hamiltonella.bcf > hamiltonella_called.vcf 
vcftools --vcf hamiltonella_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out hamiltonella_vcftools --recode-INFO-all --recode 

#Adding outgroup
bgzip -f hamiltonella_bemisia.vcf
tabix -f hamiltonella_bemisia.vcf.gz
bgzip -f hamiltonella_vcftools.recode.vcf 
tabix -f hamiltonella_vcftools.recode.vcf.gz 
vcf-merge  hamiltonella_vcftools.recode.vcf.gz hamiltonella_bemisia.vcf.gz > merged.vcf

# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > hamiltonella.DPRtab

sed -i -e 's/\#//' hamiltonella.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/hamiltonella\///g' hamiltonella.DPRtab
sed  -i -e 's/.hamiltonella.bam:DPR//g' hamiltonella.DPRtab
sed -i -e 's/ \./ 10,0/g' hamiltonella.DPRtab
sed -i -e 's/\. /10,0 /g' hamiltonella.DPRtab
sed -i -e  's/\[.\{1,2\}\]sample1:DPR/Hbemisia/g' hamiltonella.DPRtab

#!/bin/bash
    
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
source /local/env/envsamtools.sh

# Pileup
samtools mpileup -t DP,DPR -g -f regiella.fasta /omaha-beach/cguyomar/genomecov/regiella/*.regiella.bam > regiella.bcf

# Filtering
bcftools call -mv -Ov  regiella.bcf > regiella_called.vcf 
vcftools --vcf regiella_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out regiella_vcftools --recode-INFO-all --recode 

# Merge with vcf from outgroup mummer alignment
bgzip regiella_vcftools.recode.vcf
tabix regiella_vcftools.recode.vcf.gz
bgzip regiella_hamiltonella.vcf
tabix regiella_hamiltonella.vcf.gz
vcf-merge regiella_vcftools.recode.vcf.gz regiella_hamiltonella.vcf.gz > merged.vcf

# Extracting DPR and cleaning

bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > regiella.DPRtab

sed -i -e 's/\#//' regiella.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/regiella\///g' regiella.DPRtab
sed  -i -e 's/.regiella.bam:DPR//g' regiella.DPRtab
sed -i -e 's/ \./ 10,0/g' regiella.DPRtab
sed -i -e 's/\. /10,0 /g' regiella.DPRtab
sed -i -e  's/\[.\{1,2\}\]sample1:DPR/Hdefensa/g' regiella.DPRtab

#!/bin/bash
    
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
source /local/env/envsamtools.sh

# Pileup
samtools mpileup -t DP,DPR -g -f serratia.fasta /omaha-beach/cguyomar/genomecov/serratia/*.serratia.bam > serratia.bcf

# Filtering
bcftools call -mv -Ov  serratia.bcf > serratia_called.vcf 
vcftools --vcf serratia_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out serratia_vcftools --recode-INFO-all --recode

# Merge with vcf from outgroup mummer alignment
bgzip serratia_vcftools.recode.vcf
tabix serratia_vcftools.recode.vcf.gz
bgzip serratia_Sct.vcf
tabix serratia_Sct.vcf.gz
vcf-merge serratia_vcftools.recode.vcf.gz serratia_Sct.vcf.gz > merged.vcf

# Extracting DPR and cleaning

bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > serratia.DPRtab

sed -i -e 's/\#//' serratia.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/serratia\///g' serratia.DPRtab
sed  -i -e 's/.serratia.bam:DPR//g' serratia.DPRtab

sed -i -e 's/ \./ 10,0/g' serratia.DPRtab
sed -i -e 's/\. /10,0 /g' serratia.DPRtab
sed -i -e  's/\[.\{1,2\}\]sample1:DPR/Sct/g' serratia.DPRtab

#!/bin/bash
    
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh
source /local/env/envsamtools.sh

# Pileup
samtools mpileup -t DP,DPR -g -f fukatsuia.fasta /omaha-beach/cguyomar/genomecov/fukatsuia/*.fukatsuia.bam > fukatsuia.bcf

# Filtering
bcftools call -mv -Ov  fukatsuia.bcf > fukatsuia_called.vcf 
vcftools --vcf fukatsuia_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out fukatsuia_vcftools --recode-INFO-all --recode 

# Merge with vcf from outgroup mummer alignment
bgzip -f fukatsuia_vcftools.recode.vcf
tabix -f fukatsuia_vcftools.recode.vcf.gz
bgzip -f fukatsuia_regiella.vcf
tabix -f fukatsuia_regiella.vcf.gz

vcf-merge fukatsuia_vcftools.recode.vcf.gz fukatsuia_regiella.vcf.gz > merged.vcf

# Extracting DPR and cleaning

bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > fukatsuia.DPRtab
sed -i -e 's/\#//' fukatsuia.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/fukatsuia\///g' fukatsuia.DPRtab
sed  -i -e 's/.fukatsuia.bam:DPR//g' fukatsuia.DPRtab

sed -i -e 's/ \./ 10,0/g' fukatsuia.DPRtab
sed -i -e 's/\. /10,0 /g' fukatsuia.DPRtab

sed -i -e  's/\[.\{1,2\}\]sample1:DPR/Regiella/g' fukatsuia.DPRtab

#!/bin/bash

source /local/env/envsamtools-1.6.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh

samtools mpileup -t DP,DPR -v -f spiroplasma2.fasta /omaha-beach/cguyomar/genomecov/spiroplasma/*.spiroplasma.bam > spiroplasma.bcf

# Filtering
bcftools call -mv -Ov  spiroplasma.bcf > spiroplasma_called.vcf 
vcftools --vcf spiroplasma_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out spiroplasma_vcftools --recode-INFO-all --recode 

# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' spiroplasma_vcftools.recode.vcf > spiroplasma.DPRtab
sed -i -e 's/\#//' spiroplasma.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/spiroplasma\///g' spiroplasma.DPRtab
sed  -i -e 's/.spiroplasma.bam:DPR//g' spiroplasma.DPRtab

#!/bin/bash

source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh

samtools mpileup -t DP,DPR -g -f rickettsia.fasta /omaha-beach/cguyomar/genomecov/rickettsia/*.rickettsia.bam > rickettsia.bcf

# Filtering
bcftools call -mv -Ov  rickettsia.bcf > rickettsia_called.vcf 
vcftools --vcf rickettsia_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out rickettsia_vcftools --recode-INFO-all --recode 

# Merge with vcf from outgroup mummer alignment
bgzip -f rickettsia_vcftools.recode.vcf
tabix -f rickettsia_vcftools.recode.vcf.gz
bgzip -f rickettsia_bemisia.vcf
tabix -f rickettsia_bemisia.vcf.gz
vcf-merge rickettsia_vcftools.recode.vcf.gz rickettsia_bemisia.vcf.gz > merged.vcf

# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' merged.vcf > rickettsia.DPRtab

sed -i -e 's/\#//' rickettsia.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/rickettsia\///g' rickettsia.DPRtab
sed  -i -e 's/.rickettsia.bam:DPR//g' rickettsia.DPRtab
sed -i -e 's/ \./ 10,0/g' rickettsia.DPRtab
sed -i -e 's/\. /10,0 /g' rickettsia.DPRtab
sed -i -e  's/\[.\{1,2\}\]sample1:DPR/Rbemisia/g' rickettsia.DPRtab

#!/bin/bash

source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh

# Pileup
samtools mpileup -t DP,DPR -g -f rickettsiella.fasta /omaha-beach/cguyomar/genomecov/rickettsiella/*.rickettsiella.bam > rickettsiella.bcf

# Filtering
bcftools call -mv -Ov  rickettsiella.bcf > rickettsiella_called.vcf 
vcftools --vcf rickettsiella_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out rickettsiella_vcftools --recode-INFO-all --recode 

# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' rickettsiella_vcftools.recode.vcf > rickettsiella.DPRtab
sed -i -e 's/\#//' rickettsiella.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/rickettsiella\///g' rickettsiella.DPRtab
sed  -i -e 's/.rickettsiella.bam:DPR//g' rickettsiella.DPRtab

#!/bin/bash

source /local/env/envsamtools.sh
source /local/env/envbcftools-1.3.sh
source /local/env/envvcftools-0.1.12b.sh

# Pileup
samtools mpileup -t DP,DPR -v -f mitochondrion.fasta /omaha-beach/cguyomar/genomecov/mitochondrie/*.mito.bam > mitochondrie.bcf

# Filtering
bcftools call -mv -Ov  mitochondrie.bcf > mitochondrie_called.vcf 
vcftools --vcf mitochondrie_called.vcf --max-alleles 2 --min-meanDP 10 --minDP 2 --max-missing 1 --out mitochondrie_vcftools --recode-INFO-all --recode 

# Extracting DPR and cleaning
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t[%DPR ]\n' mitochondrie_vcftools.recode.vcf > mitochondrie.DPRtab

sed -i -e 's/\#//' mitochondrie.DPRtab
sed   -i -e  's/\[.\{1,2\}\]\/omaha-beach\/cguyomar\/genomecov\/mitochondrie\///g' mitochondrie.DPRtab
sed  -i -e 's/.mitochondrie.bam:DPR//g' mitochondrie.DPRtab

Adding outgroup

In order to insert an outgroup reference genome as a sample in the vcf file, the outgroup genome is aligned with the reference genome with mummer.

The resulting mummer alignment is converted to a vcf file, which is merged to the main vcf.

Commands for outgroup vcf creation :

#!/bin/bash

# $1 :ref genome
# $2 : aligned genome
# $3 :prefix 

nucmer -p $3 $1 $2
show-snps -Clr -x 1  -T $3.delta  > $3.snps
python ~/bin/MUMmerSNPs2VCF_DPR.py  $3.snps $3.vcf

MUMmerSNPs2VCF_DPR.py is a version of MUMmerSNPs2VCF_DPR.py (https://github.com/liangjiaoxue/PythonNGSTools) modified to add arbitrary read coverage information (DPR tag) to fit the pipeline.

Cleaning output

Allele coverage file are parsed using the script cleaning.R (script)

The following filters are applied for each sample :

Alleles covered with less or equal than 3 reads are removed
Alleles with a frequency inferior to 10% are removed
For Hamiltonella, Regiella and Fukatsuia, variants located in a region of blast identity greater than 80% are removed
Samples for which more than 5% of variants are equally covered are removed. This concerns samples with low coverage and high polymorphism

Genome wide phylogeny using neighbor-joining

Performed using nj_phylogeny.R (script), can be run with command line : Rscript nj_phylogeny.R symbiont

Main steps :

Load filtered allele coverage file and reconstruct associated tables
Assign allele for each site and sample (most covered allele)
Compute distance matrix
Perform neighbor joining and return a phylogenetic tree
Plot and save data

Gene-set phylogeny using maximumum likelyhood based inference

This second approach is used when an annotated reference genome is available.

Search of membrane specific proteins

Query of membrane specific protein on UniProt, the ‘transcript’ field is saved.
Transcript IDs are converted with esearch and efetch utilities to the ipg format (identical protein groups), which contains the gene coordinates on the genome.

Computing gene sequences for each sample

This step is performed by the gene_phylogeny.R script and associated functions

Main steps :

Loading cleaned allele coverage
Merging overlaping indels, which cannot be applied simultaneously
Reading selected genes from a directory (one file per ipg), and removing duplicates
Computing gene sequences for each sample using get_alt_seq function
Output resulting fasta file