Friday, 12 June 2015

Aspen Data Preparation

Here are the steps:
$ head DoubleMCL.txt
GeneID   GeneFamily    TotalGenes
Artha|AT1G11290    F00001G001    91
Artha|AT1G15510    F00001G001    91
Artha|AT1G16480    F00001G001    91
Artha|AT1G18485    F00001G001    91
Artha|AT2G27610    F00001G001    91
Artha|AT3G02010    F00001G001    91
Artha|AT3G02330    F00001G001    91
Artha|AT3G03580    F00001G001    91
Artha|AT3G16610    F00001G001    91

#Reorganize columns
$ awk 'NR!=1{split($1,a,"[|]");print a[1]"-"$2"\t"a[1]"\t"a[2]}' DoubleMCL.txt > gene_family.tsv
Potrs-F20394G001   Potrs    Potrs148330g33501.1
Potrs-F20395G001   Potrs    Potrs148631g31943.1
Potrs-F20396G001   Potrs    Potrs148776g37409.1
Potrs-F20397G001   Potrs    Potrs149124g31224.1
Potrs-F20398G001   Potrs    Potrs149639g33624.1
Potrs-F20399G001   Potrs    Potrs149792g30885.1
Potrs-F20400G001   Potrs    Potrs150685g28586.2
Potrs-F20401G001   Potrs    Potrs160539g32979.1
Potrs-F20402G001   Potrs    Potrs163163g28886.1

#Add Potri prefix, extract only the .1 transcripts and remove .1
$ awk '{split($1,a,"-");if($2=="Artha"){print $1"\t"$3}else{if(substr($3,length($3)-1,length($3))==".1"){split($3,b,".");if(a[1]=="Potri"){print $1"\tPotri"b[1]}else{print $1"\t"b[1]}}}}' gene_family.tsv > gene_family_results.tsv
Artha-F00001G001   AT1G11290
Artha-F00001G001   AT1G15510
Artha-F00001G001   AT1G16480
Artha-F00001G001   AT1G18485
Artha-F00001G001   AT2G27610
.
.
Potrs-F20393G001   Potrs148273g33392
Potrs-F20394G001   Potrs148330g33501
Potrs-F20395G001   Potrs148631g31943
Potrs-F20396G001   Potrs148776g37409

#concat ids by gene family ids*
$ awk '{a[$1]=a[$1]?a[$1]";"$2:$2;}END{for (i in a)print i"\t"a[i];}' gene_family_results.tsv > gene_family_load _db.tsv
Artha-F13614G001   AT1G44941
Potrs-F01031G005   Potrs008446g12584
Potri-F04001G002   Potri004G194300
Potri-F04898G001   Potri001G209800;Potri003G020500
Potrs-F05903G001   Potrs001577g02793;Potrs001577g35836
Artha-F02860G001   ATCG00120;ATMG01190
Artha-F08292G001   AT4G23885
Potri-F05987G001   Potri005G171000

#Add potri plus “.” and trailing character “;” (Error correction step)
awk '{split($1,a,"-");gsub("Potri","Potri.");print a[2]"\t"a[1]"\t"$2";"}' gene_family_load _db.tsv > db_out.txt
F13614G001    Artha    AT1G44941;
F01031G005    Potrs    Potrs008446g12584;
F04001G002    Potri    Potri.004G194300;
F04898G001    Potri    Potri.001G209800;Potri.003G020500;
F05903G001    Potrs    Potrs001577g02793;Potrs001577g35836;
F02860G001    Artha    ATCG00120;ATMG01190;
F08292G001    Artha    AT4G23885;
F05987G001    Potri    Potri.005G171000;
F03293G001    Potri    Potri.010G158000;
F05987G002    Potri    Potri.002G090100;

#Aspen genefamily sequences
awk '{gsub("Artha[|]||Potrs[|]||Potra[|]","");print}' aspen.pep | awk '{if(substr($1,1,1)==">"){split($1,a,".");print a[1]}else{print}}' | awk '{gsub("Potri[|]","Potri.");print}' > aspen_pep.fna


Earlier I used following simple Python script instead this* awk one-liner.
#!/usr/bin/python
def parse(file, store):
        f = open(file, 'r')
        dic = {}
        for i in f:
                i = i.strip("\n")
                val = i.split("\t")
                try:
                        dic[val[0]] = dic[val[0]] + ";"+ val[1]
                except KeyError:
                        dic[val[0]] = val[1]
        f.close()
        f = open(store, 'w')
        for i in dic.keys():
                string = i+"\t"+dic[i]
                #print string
                f.write(string+"\n")
        f.close

if __name__=="__main__":
        import sys
        if len(sys.argv) > 1:
                file = sys.argv[1]
                store = sys.argv[2]
                parse(file, store)
        else:
                sys.exit("No input")


Wednesday, 21 January 2015

GBrowse_syn data preparation and load into the database

Data preparation:

Here I’m going to use Satsuma and ColaAlignSatsuma programs to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref: http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both executables can be found here(http://satsuma.sourceforge.net/). Let’s assume if we have genome_A.fasta and genme_B.fasta files.

./Satsuma -q genome_A.fasta  –t genome_B.fasta –o satsuma_output_folder
./ColaAlignSatsuma -q genome_A.fasta  -t  genome_B.fasta –s satsuma_output_folder/satsuma_summary.out  –o cola_output.aln

This will generate output file (cola_output.aln) similar to following format.

####
genome_B_sequence_x +       1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG 1539
                         |||||||||| | || ||||||||||        || ||||||   |||| || | |  |||||
genome_A_sequence_y +        170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226
####
genome_B_sequence_x +       1760 AGAAAAGAAAATG 1772
                         |||  ||||||||
genome_A_sequence_y +        446 AGAGGAGAAAATG 458

Now I will use following lines to convert above file into GBrowse synteny format

#Add -sep- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF  > 2

#Remove octothorpes
awk '{gsub("####","");print}' 2 | awk NF > 3

#remove empty aligns
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#remove consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 4

#Add genome_B to target sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

#Add genome_A to query sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR;} if(NR==x+3){print "genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' 5 > 6

#Replace pipes with asterisks and switch last two lines
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

#Remove all tmp files
rm 2 3 4 5 6

Final output looks like following

CLUSTAL
genome_B-genome_B_sequence_x(+)/1476-1539       AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG
genome_A-(+)/170-226    AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG
                         ********** * ** **********        ** ******   **** ** * *  *****
CLUSTAL
genome_B-genome_B_sequence_x(+)/1760-1772       AGAAAAGAAAATG
genome_A-(+)/446-458    AGAGGAGAAAATG
                         ***  ********

**************************************************************************
Personal note: I had to translate sequence ids using separate files. Therefore I used following lines
#Add -CLUSTAL- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_tremula_201_populus_tremuloides.aligns | awk NF  > 2

##Remove octothorpes
awk '{gsub("####","");print}' 2 | awk NF > 3

##remove empty aligns
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#sed 'N;s/^\( *\)CLUSTAL\n\( *\)CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5

#Replace sequence ids using first lift over ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}' potra_lift 5 > 6

#Replace sequence ids using second lift over ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print "potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potrs_lift 6 > 7

#Replace pine with asterisk and switch second and thrird lines
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 7 | awk NF > 8

#Add CLUSTAL to first line
awk '{if(NR==1){print "CLUSTAL\n"$0}else{print}}' 8 > 9

#Remove consecutive CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 9 > 10

#delete last line if necessary
 sed -i '$ d' 10

OR

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align | awk NF  > 2
awk '{gsub("####","");print}' 2 | awk NF > 3
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4
awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potra_lift 5 > 6
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 6
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

**************************************************************************
We can verify and convert the above format into standard clustalw using following Perl script.

#!/usr/bin/perl -w
use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "7" ,
                         -format => 'clustalw', verbose => 0);
$out = Bio::AlignIO->new(-file => ">clustal_output.aln",
                         -format => 'clustalw', verbose => 0);
$in->verbose(0);
$out->verbose(0);
while ( my $aln = $in->next_aln ) {
  $aln->verbose(-1);
  $out->write_aln($aln);
}

Load into the database:
There are two options to load final output (clustal_output.aln) into GBrowse_syn database

1     1.)We can use clustal2hit.pl script and then load to DB using gbrowse_syn_load_alignment_database.pl
clustal2hit.pl clustal_output.aln > clustal_output.tsv
gbrowse_syn_load_alignment_database.pl –u username –p password –d synteny_database –c –v clustal_output.tsv

2     2.)Directly load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl
gbrowse_syn_load_alignments_msa.pl –u username –p password –d synteny_database –c –v clustal_output.aln

Load gff3 files into Database

##Create samtools index from fasta files
Samtools faidx genome_B.fasta
Samtools faidx genome_A.fasta

##convert to gff3
awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_B.fasta.fai > genome_B.gff3
awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_A.fasta.fai > genome_a.gff3

##Load into database
bp_seqfeature_load –u username –p password -d genome_B -c -f genome_B.fasta genome_B.gff3
bp_seqfeature_load –u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3

For more details about GBrowse_syn configuration, please follow this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.