Wednesday 21 January 2015

GBrowse_syn data preparation and load into the database

Data preparation:

Here I’m going to use Satsuma and ColaAlignSatsuma programs to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref: http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both executables can be found here(http://satsuma.sourceforge.net/). Let’s assume if we have genome_A.fasta and genme_B.fasta files.

./Satsuma -q genome_A.fasta  –t genome_B.fasta –o satsuma_output_folder
./ColaAlignSatsuma -q genome_A.fasta  -t  genome_B.fasta –s satsuma_output_folder/satsuma_summary.out  –o cola_output.aln

This will generate output file (cola_output.aln) similar to following format.

####
genome_B_sequence_x +       1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG 1539
                         |||||||||| | || ||||||||||        || ||||||   |||| || | |  |||||
genome_A_sequence_y +        170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226
####
genome_B_sequence_x +       1760 AGAAAAGAAAATG 1772
                         |||  ||||||||
genome_A_sequence_y +        446 AGAGGAGAAAATG 458

Now I will use following lines to convert above file into GBrowse synteny format

#Add -sep- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF  > 2

#Remove octothorpes
awk '{gsub("####","");print}' 2 | awk NF > 3

#remove empty aligns
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#remove consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 4

#Add genome_B to target sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

#Add genome_A to query sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR;} if(NR==x+3){print "genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' 5 > 6

#Replace pipes with asterisks and switch last two lines
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

#Remove all tmp files
rm 2 3 4 5 6

Final output looks like following

CLUSTAL
genome_B-genome_B_sequence_x(+)/1476-1539       AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG
genome_A-(+)/170-226    AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG
                         ********** * ** **********        ** ******   **** ** * *  *****
CLUSTAL
genome_B-genome_B_sequence_x(+)/1760-1772       AGAAAAGAAAATG
genome_A-(+)/446-458    AGAGGAGAAAATG
                         ***  ********

**************************************************************************
Personal note: I had to translate sequence ids using separate files. Therefore I used following lines
#Add -CLUSTAL- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_tremula_201_populus_tremuloides.aligns | awk NF  > 2

##Remove octothorpes
awk '{gsub("####","");print}' 2 | awk NF > 3

##remove empty aligns
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#sed 'N;s/^\( *\)CLUSTAL\n\( *\)CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5

#Replace sequence ids using first lift over ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}' potra_lift 5 > 6

#Replace sequence ids using second lift over ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print "potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potrs_lift 6 > 7

#Replace pine with asterisk and switch second and thrird lines
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 7 | awk NF > 8

#Add CLUSTAL to first line
awk '{if(NR==1){print "CLUSTAL\n"$0}else{print}}' 8 > 9

#Remove consecutive CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 9 > 10

#delete last line if necessary
 sed -i '$ d' 10

OR

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align | awk NF  > 2
awk '{gsub("####","");print}' 2 | awk NF > 3
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4
awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potra_lift 5 > 6
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 6
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

**************************************************************************
We can verify and convert the above format into standard clustalw using following Perl script.

#!/usr/bin/perl -w
use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "7" ,
                         -format => 'clustalw', verbose => 0);
$out = Bio::AlignIO->new(-file => ">clustal_output.aln",
                         -format => 'clustalw', verbose => 0);
$in->verbose(0);
$out->verbose(0);
while ( my $aln = $in->next_aln ) {
  $aln->verbose(-1);
  $out->write_aln($aln);
}

Load into the database:
There are two options to load final output (clustal_output.aln) into GBrowse_syn database

1     1.)We can use clustal2hit.pl script and then load to DB using gbrowse_syn_load_alignment_database.pl
clustal2hit.pl clustal_output.aln > clustal_output.tsv
gbrowse_syn_load_alignment_database.pl –u username –p password –d synteny_database –c –v clustal_output.tsv

2     2.)Directly load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl
gbrowse_syn_load_alignments_msa.pl –u username –p password –d synteny_database –c –v clustal_output.aln

Load gff3 files into Database

##Create samtools index from fasta files
Samtools faidx genome_B.fasta
Samtools faidx genome_A.fasta

##convert to gff3
awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_B.fasta.fai > genome_B.gff3
awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_A.fasta.fai > genome_a.gff3

##Load into database
bp_seqfeature_load –u username –p password -d genome_B -c -f genome_B.fasta genome_B.gff3
bp_seqfeature_load –u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3

For more details about GBrowse_syn configuration, please follow this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.