Bioinformatics and Interaction Design: GBrowse_syn data preparation and load into the database

Data preparation:

Here I’m going to use Satsuma and ColaAlignSatsuma programs to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref: http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both executables can be found here(http://satsuma.sourceforge.net/). Let’s assume if we have genome_A.fasta and genme_B.fasta files.

./Satsuma -q genome_A.fasta –t genome_B.fasta –o satsuma_output_folder

./ColaAlignSatsuma -q genome_A.fasta -t genome_B.fasta –s satsuma_output_folder/satsuma_summary.out –o cola_output.aln

This will generate output file (cola_output.aln) similar to following format.

####

genome_B_sequence_x + 1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG 1539

|||||||||| | || |||||||||| || |||||| |||| || | | |||||

genome_A_sequence_y + 170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226

####

genome_B_sequence_x + 1760 AGAAAAGAAAATG 1772

||| ||||||||

genome_A_sequence_y + 446 AGAGGAGAAAATG 458

Now I will use following lines to convert above file into GBrowse synteny format

#Add -sep- for every 3rd consecutive lines

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF > 2

#Remove octothorpes

awk '{gsub("####","");print}' 2 | awk NF > 3

#remove empty aligns

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#remove consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates

sed -i '$!N; /^$.*$\n\1$/!P; D' 4

#Add genome_B to target sequence line

awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

#Add genome_A to query sequence line

awk '{if(substr($1,1,4)=="CLUS"){x=NR;} if(NR==x+3){print "genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' 5 > 6

#Replace pipes with asterisks and switch last two lines

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

#Remove all tmp files

rm 2 3 4 5 6

Final output looks like following

CLUSTAL

genome_B-genome_B_sequence_x(+)/1476-1539 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG

genome_A-(+)/170-226 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG

********** * ** ********** ** ****** **** ** * * *****

CLUSTAL

genome_B-genome_B_sequence_x(+)/1760-1772 AGAAAAGAAAATG

genome_A-(+)/446-458 AGAGGAGAAAATG

*** ********

**************************************************************************

Personal note: I had to translate sequence ids using separate files. Therefore I used following lines

#Add -CLUSTAL- for every 3rd consecutive lines

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_tremula_201_populus_tremuloides.aligns | awk NF > 2

##Remove octothorpes

awk '{gsub("####","");print}' 2 | awk NF > 3

##remove empty aligns

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#sed 'N;s/^$ *$CLUSTAL\n$ *$CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5

#Replace sequence ids using first lift over ids

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}' potra_lift 5 > 6

#Replace sequence ids using second lift over ids

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print "potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potrs_lift 6 > 7

#Replace pine with asterisk and switch second and thrird lines

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 7 | awk NF > 8

#Add CLUSTAL to first line

awk '{if(NR==1){print "CLUSTAL\n"$0}else{print}}' 8 > 9

#Remove consecutive CLUSTAL duplicates

sed -i '$!N; /^$.*$\n\1$/!P; D' 9 > 10

#delete last line if necessary

sed -i '$ d' 10

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align | awk NF > 2

awk '{gsub("####","");print}' 2 | awk NF > 3

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potra_lift 5 > 6

sed -i '$!N; /^$.*$\n\1$/!P; D' 6

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

**************************************************************************

We can verify and convert the above format into standard clustalw using following Perl script.

#!/usr/bin/perl -w

use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "7" ,

-format => 'clustalw', verbose => 0);

$out = Bio::AlignIO->new(-file => ">clustal_output.aln",

-format => 'clustalw', verbose => 0);

$in->verbose(0);

$out->verbose(0);

while ( my $aln = $in->next_aln ) {

$aln->verbose(-1);

$out->write_aln($aln);

}

Load into the database:

There are two options to load final output (clustal_output.aln) into GBrowse_syn database

1 1.)We can use clustal2hit.pl script and then load to DB using gbrowse_syn_load_alignment_database.pl

clustal2hit.pl clustal_output.aln > clustal_output.tsv

gbrowse_syn_load_alignment_database.pl –u username –p password –d synteny_database –c –v clustal_output.tsv

2 2.)Directly load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl

gbrowse_syn_load_alignments_msa.pl –u username –p password –d synteny_database –c –v clustal_output.aln

Load gff3 files into Database

##Create samtools index from fasta files

Samtools faidx genome_B.fasta

Samtools faidx genome_A.fasta

##convert to gff3

awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_B.fasta.fai > genome_B.gff3

awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_A.fasta.fai > genome_a.gff3

##Load into database

bp_seqfeature_load –u username –p password -d genome_B -c -f genome_B.fasta genome_B.gff3

bp_seqfeature_load –u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3

For more details about GBrowse_syn configuration, please follow this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.

Bioinformatics and Interaction Design

Wednesday, 21 January 2015

GBrowse_syn data preparation and load into the database

No comments:

Post a Comment