Data preparation:
Here I’m going to use Satsuma and ColaAlignSatsuma programs
to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref:
http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both
executables can be found here(http://satsuma.sourceforge.net/).
Let’s assume if we have genome_A.fasta and genme_B.fasta files.
./Satsuma -q genome_A.fasta –t genome_B.fasta –o satsuma_output_folder
./ColaAlignSatsuma -q genome_A.fasta -t
genome_B.fasta –s satsuma_output_folder/satsuma_summary.out –o cola_output.aln
This will generate output file (cola_output.aln) similar to
following format.
####
genome_B_sequence_x +
1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG
1539
|||||||||| | || ||||||||||
|| |||||| |||| || | | |||||
genome_A_sequence_y +
170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226
####
genome_B_sequence_x + 1760 AGAAAAGAAAATG 1772
||| ||||||||
genome_A_sequence_y +
446 AGAGGAGAAAATG 458
Now I will use following lines to convert above file into
GBrowse synteny format
#Add
-sep- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } {
gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF >
2
#Remove
octothorpes
awk
'{gsub("####","");print}' 2 | awk NF > 3
#remove
empty aligns
awk '$5!="" || $1=="CLUSTAL"
|| substr($1,1,1)=="|"' 3 > 4
#remove
consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates
sed -i '$!N;
/^\(.*\)\n\1$/!P; D' 4
#Add
genome_B to target sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR;
next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}'
4 > 5
#Add
genome_A to query sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR;}
if(NR==x+3){print
"genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}'
5 > 6
#Replace
pipes with asterisks and switch last two lines
awk
'{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print
$0"\n"x;x=""}}' 6 | awk NF > 7
#Remove
all tmp files
rm
2 3 4 5 6
Final output looks like following
CLUSTAL
genome_B-genome_B_sequence_x(+)/1476-1539
AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG
genome_A-(+)/170-226
AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG
********** * **
********** ** ****** **** ** * *
*****
CLUSTAL
genome_B-genome_B_sequence_x(+)/1760-1772 AGAAAAGAAAATG
genome_A-(+)/446-458 AGAGGAGAAAATG
*** ********
**************************************************************************
Personal
note: I had to translate sequence ids using separate files. Therefore I used following
lines
#Add -CLUSTAL- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } {
gsub(/\n{3,}/,"\nCLUSTAL\n")}1'
populus_tremula_201_populus_tremuloides.aligns | awk NF > 2
##Remove octothorpes
awk
'{gsub("####","");print}' 2 | awk NF > 3
##remove empty aligns
awk '$5!="" ||
$1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4
#sed 'N;s/^\( *\)CLUSTAL\n\(
*\)CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5
#Replace sequence ids using first lift over ids
awk
'{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2;
next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print
"potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}'
potra_lift 5 > 6
#Replace sequence ids using second lift over
ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR
== FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print
"potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}'
potrs_lift 6 > 7
#Replace pine with asterisk and switch second
and thrird lines
awk
'{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print
$0"\n"x;x=""}}' 7 | awk NF > 8
#Add CLUSTAL to first line
awk '{if(NR==1){print
"CLUSTAL\n"$0}else{print}}' 8 > 9
#Remove consecutive
CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 9 > 10
#delete last line if necessary
sed -i
'$ d' 10
OR
awk 'BEGIN {
RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align
| awk NF > 2
awk
'{gsub("####","");print}' 2 | awk NF > 3
awk '$5!="" ||
$1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4
awk
'{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print
"CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}'
4 > 5
awk
'{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2;
next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}'
potra_lift 5 > 6
sed -i '$!N;
/^\(.*\)\n\1$/!P; D' 6
awk
'{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print
$0"\n"x;x=""}}' 6 | awk NF > 7
**************************************************************************
We can verify and convert the above format into standard
clustalw using following Perl script.
#!/usr/bin/perl
-w
use
Bio::AlignIO;
$in =
Bio::AlignIO->new(-file => "7" ,
-format => 'clustalw', verbose
=> 0);
$out =
Bio::AlignIO->new(-file => ">clustal_output.aln",
-format =>
'clustalw', verbose => 0);
$in->verbose(0);
$out->verbose(0);
while ( my
$aln = $in->next_aln ) {
$aln->verbose(-1);
$out->write_aln($aln);
}
Load into the database:
There are
two options to load final output (clustal_output.aln) into GBrowse_syn database
1 1.)We
can use clustal2hit.pl script and then load to DB using
gbrowse_syn_load_alignment_database.pl
clustal2hit.pl clustal_output.aln > clustal_output.tsv
gbrowse_syn_load_alignment_database.pl
–u username –p password –d synteny_database –c –v clustal_output.tsv
2 2.)Directly
load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl
gbrowse_syn_load_alignments_msa.pl
–u username –p password –d synteny_database –c –v clustal_output.aln
Load gff3
files into Database
##Create samtools index from fasta files
Samtools
faidx genome_B.fasta
Samtools
faidx genome_A.fasta
##convert to gff3
awk
'{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}'
genome_B.fasta.fai > genome_B.gff3
awk
'{print
$1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}'
genome_A.fasta.fai > genome_a.gff3
##Load
into database
bp_seqfeature_load –u username –p password -d genome_B
-c -f genome_B.fasta genome_B.gff3
bp_seqfeature_load
–u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3
For more details about GBrowse_syn configuration, please follow
this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.
No comments:
Post a Comment