Bioinformatics and Interaction Design: 2015

Here are the steps:

$ head DoubleMCL.txt

GeneID GeneFamily TotalGenes

Artha|AT1G11290 F00001G001 91

Artha|AT1G15510 F00001G001 91

Artha|AT1G16480 F00001G001 91

Artha|AT1G18485 F00001G001 91

Artha|AT2G27610 F00001G001 91

Artha|AT3G02010 F00001G001 91

Artha|AT3G02330 F00001G001 91

Artha|AT3G03580 F00001G001 91

Artha|AT3G16610 F00001G001 91

#Reorganize columns

$ awk 'NR!=1{split($1,a,"[|]");print a[1]"-"$2"\t"a[1]"\t"a[2]}' DoubleMCL.txt > gene_family.tsv

Potrs-F20394G001 Potrs Potrs148330g33501.1

Potrs-F20395G001 Potrs Potrs148631g31943.1

Potrs-F20396G001 Potrs Potrs148776g37409.1

Potrs-F20397G001 Potrs Potrs149124g31224.1

Potrs-F20398G001 Potrs Potrs149639g33624.1

Potrs-F20399G001 Potrs Potrs149792g30885.1

Potrs-F20400G001 Potrs Potrs150685g28586.2

Potrs-F20401G001 Potrs Potrs160539g32979.1

Potrs-F20402G001 Potrs Potrs163163g28886.1

#Add Potri prefix, extract only the .1 transcripts and remove .1

$ awk '{split($1,a,"-");if($2=="Artha"){print $1"\t"$3}else{if(substr($3,length($3)-1,length($3))==".1"){split($3,b,".");if(a[1]=="Potri"){print $1"\tPotri"b[1]}else{print $1"\t"b[1]}}}}' gene_family.tsv > gene_family_results.tsv

Artha-F00001G001 AT1G11290

Artha-F00001G001 AT1G15510

Artha-F00001G001 AT1G16480

Artha-F00001G001 AT1G18485

Artha-F00001G001 AT2G27610

Potrs-F20393G001 Potrs148273g33392

Potrs-F20394G001 Potrs148330g33501

Potrs-F20395G001 Potrs148631g31943

Potrs-F20396G001 Potrs148776g37409

#concat ids by gene family ids*

$ awk '{a[$1]=a[$1]?a[$1]";"$2:$2;}END{for (i in a)print i"\t"a[i];}' gene_family_results.tsv > gene_family_load _db.tsv

Artha-F13614G001 AT1G44941

Potrs-F01031G005 Potrs008446g12584

Potri-F04001G002 Potri004G194300

Potri-F04898G001 Potri001G209800;Potri003G020500

Potrs-F05903G001 Potrs001577g02793;Potrs001577g35836

Artha-F02860G001 ATCG00120;ATMG01190

Artha-F08292G001 AT4G23885

Potri-F05987G001 Potri005G171000

#Add potri plus “.” and trailing character “;” (Error correction step)

awk '{split($1,a,"-");gsub("Potri","Potri.");print a[2]"\t"a[1]"\t"$2";"}' gene_family_load _db.tsv > db_out.txt

F13614G001 Artha AT1G44941;

F01031G005 Potrs Potrs008446g12584;

F04001G002 Potri Potri.004G194300;

F04898G001 Potri Potri.001G209800;Potri.003G020500;

F05903G001 Potrs Potrs001577g02793;Potrs001577g35836;

F02860G001 Artha ATCG00120;ATMG01190;

F08292G001 Artha AT4G23885;

F05987G001 Potri Potri.005G171000;

F03293G001 Potri Potri.010G158000;

F05987G002 Potri Potri.002G090100;

#Aspen genefamily sequences

awk '{gsub("Artha[|]||Potrs[|]||Potra[|]","");print}' aspen.pep | awk '{if(substr($1,1,1)==">"){split($1,a,".");print a[1]}else{print}}' | awk '{gsub("Potri[|]","Potri.");print}' > aspen_pep.fna

Earlier I used following simple Python script instead this* awk one-liner.

#!/usr/bin/python

def parse(file, store):

f = open(file, 'r')

dic = {}

for i in f:

i = i.strip("\n")

val = i.split("\t")

try:

dic[val[0]] = dic[val[0]] + ";"+ val[1]

except KeyError:

dic[val[0]] = val[1]

f.close()

f = open(store, 'w')

for i in dic.keys():

string = i+"\t"+dic[i]

#print string

f.write(string+"\n")

f.close

if __name__=="__main__":

import sys

if len(sys.argv) > 1:

file = sys.argv[1]

store = sys.argv[2]

parse(file, store)

else:

sys.exit("No input")

Data preparation:

Here I’m going to use Satsuma and ColaAlignSatsuma programs to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref: http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both executables can be found here(http://satsuma.sourceforge.net/). Let’s assume if we have genome_A.fasta and genme_B.fasta files.

./Satsuma -q genome_A.fasta –t genome_B.fasta –o satsuma_output_folder

./ColaAlignSatsuma -q genome_A.fasta -t genome_B.fasta –s satsuma_output_folder/satsuma_summary.out –o cola_output.aln

This will generate output file (cola_output.aln) similar to following format.

####

genome_B_sequence_x + 1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG 1539

|||||||||| | || |||||||||| || |||||| |||| || | | |||||

genome_A_sequence_y + 170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226

####

genome_B_sequence_x + 1760 AGAAAAGAAAATG 1772

||| ||||||||

genome_A_sequence_y + 446 AGAGGAGAAAATG 458

Now I will use following lines to convert above file into GBrowse synteny format

#Add -sep- for every 3rd consecutive lines

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF > 2

#Remove octothorpes

awk '{gsub("####","");print}' 2 | awk NF > 3

#remove empty aligns

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#remove consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates

sed -i '$!N; /^$.*$\n\1$/!P; D' 4

#Add genome_B to target sequence line

awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

#Add genome_A to query sequence line

awk '{if(substr($1,1,4)=="CLUS"){x=NR;} if(NR==x+3){print "genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' 5 > 6

#Replace pipes with asterisks and switch last two lines

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

#Remove all tmp files

rm 2 3 4 5 6

Final output looks like following

CLUSTAL

genome_B-genome_B_sequence_x(+)/1476-1539 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG

genome_A-(+)/170-226 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG

********** * ** ********** ** ****** **** ** * * *****

CLUSTAL

genome_B-genome_B_sequence_x(+)/1760-1772 AGAAAAGAAAATG

genome_A-(+)/446-458 AGAGGAGAAAATG

*** ********

**************************************************************************

Personal note: I had to translate sequence ids using separate files. Therefore I used following lines

#Add -CLUSTAL- for every 3rd consecutive lines

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_tremula_201_populus_tremuloides.aligns | awk NF > 2

##Remove octothorpes

awk '{gsub("####","");print}' 2 | awk NF > 3

##remove empty aligns

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#sed 'N;s/^$ *$CLUSTAL\n$ *$CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5

#Replace sequence ids using first lift over ids

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}' potra_lift 5 > 6

#Replace sequence ids using second lift over ids

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print "potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potrs_lift 6 > 7

#Replace pine with asterisk and switch second and thrird lines

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 7 | awk NF > 8

#Add CLUSTAL to first line

awk '{if(NR==1){print "CLUSTAL\n"$0}else{print}}' 8 > 9

#Remove consecutive CLUSTAL duplicates

sed -i '$!N; /^$.*$\n\1$/!P; D' 9 > 10

#delete last line if necessary

sed -i '$ d' 10

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align | awk NF > 2

awk '{gsub("####","");print}' 2 | awk NF > 3

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potra_lift 5 > 6

sed -i '$!N; /^$.*$\n\1$/!P; D' 6

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

**************************************************************************

We can verify and convert the above format into standard clustalw using following Perl script.

#!/usr/bin/perl -w

use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "7" ,

-format => 'clustalw', verbose => 0);

$out = Bio::AlignIO->new(-file => ">clustal_output.aln",

-format => 'clustalw', verbose => 0);

$in->verbose(0);

$out->verbose(0);

while ( my $aln = $in->next_aln ) {

$aln->verbose(-1);

$out->write_aln($aln);

}

Load into the database:

There are two options to load final output (clustal_output.aln) into GBrowse_syn database

1 1.)We can use clustal2hit.pl script and then load to DB using gbrowse_syn_load_alignment_database.pl

clustal2hit.pl clustal_output.aln > clustal_output.tsv

gbrowse_syn_load_alignment_database.pl –u username –p password –d synteny_database –c –v clustal_output.tsv

2 2.)Directly load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl

gbrowse_syn_load_alignments_msa.pl –u username –p password –d synteny_database –c –v clustal_output.aln

Load gff3 files into Database

##Create samtools index from fasta files

Samtools faidx genome_B.fasta

Samtools faidx genome_A.fasta

##convert to gff3

awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_B.fasta.fai > genome_B.gff3

awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_A.fasta.fai > genome_a.gff3

##Load into database

bp_seqfeature_load –u username –p password -d genome_B -c -f genome_B.fasta genome_B.gff3

bp_seqfeature_load –u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3

For more details about GBrowse_syn configuration, please follow this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.

Bioinformatics and Interaction Design

Friday, 12 June 2015

Aspen Data Preparation

Wednesday, 21 January 2015

GBrowse_syn data preparation and load into the database