Bioinformatics and Interaction Design

Wednesday 25 May 2016

Merge two files by coulm ids using AWK

Input file1

Potrx000002g00010.0 GO:0005576
Potrx000002g00010.0 GO:0005618
Potrx000002g00010.0 GO:0005634
Potrx000002g00010.0 GO:0006952
Potrx000002g00010.0 GO:0009698
Potrx000002g00010.0 GO:1901564
Potrx000002g00010.0 GO:1901576
Potrx000002g00020.0 GO:0005576
Potrx000002g00020.0 GO:0005618
Potrx000002g00020.0 GO:0005739

Input file2

GO:0000001 mitochondrion inheritance biological_process
GO:0000002 mitochondrial genome maintenance biological_process
GO:0000003 reproduction biological_process
GO:0000005 obsolete ribosomal chaperone activity molecular_function
GO:0000006 high-affinity zinc uptake transporter activity molecular_function
GO:0000007 low-affinity zinc ion transmembrane activity molecular_function
GO:0000008 obsolete thioredoxin molecular_function
GO:0000009 alpha-1,6-mannosyltransferase activity molecular_function
GO:0000010 trans-hexaprenyltranstransferase activity molecular_function
GO:0000011 vacuole inheritance biological_process

Both are tab separated files and number of rows in file1 is larger than number of rows file 2. Nowe we are going to merge both files by file1 column 2 id and file 2 column 1 id.

Solution

awk '
BEGIN{FS=OFS="\t"} # Field seperator and Output Fiels seerator equal to \t
NR==FNR # FNR is the current record number in the current file. FNR is #incremented each time a new record is read. NR is the number #of input records.
{
file_1_array[$1]=$0 OFS $2;next; # Add file1 into an array
}
{
for(x in file_1_array) # Loop through the file1 big array
{if(x==$2) print $1,$2,$3, file_1_array[x]} # print when it macthes
}'
file2 file1

All in one line
awk 'BEGIN{FS=OFS="\t"}NR==FNR{file_1_array[$1]=$0 OFS $2;next;}{for(x in file_1_array){if(x==$2) print $1, file_1_array[x]}}' file2 file1 |head

Output
Potrx000002g00010.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000002g00020.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000002g00030.0 GO:0005634 nucleus cellular_component nucleus
Potrx000003g00010.0 GO:0005634 nucleus cellular_component nucleus
Potrx000003g00020.0 GO:0005634 nucleus cellular_component nucleus
Potrx000006g00010.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000008g00010.0 GO:0005743 mitochondrial inner membrane cellular_component mitochondrial inner membrane
Potrx000013g00010.0 GO:0000325 plant-type vacuole cellular_component plant-type vacuole
Potrx000014g00010.0 GO:0005743 mitochondrial inner membrane cellular_component mitochondrial inner membrane

Potrx000015g00010.0 GO:0005783 endoplasmic reticulum cellular_component endoplasmic reticulum

Friday 12 June 2015

Aspen Data Preparation

Here are the steps:

$ head DoubleMCL.txt

GeneID GeneFamily TotalGenes

Artha|AT1G11290 F00001G001 91

Artha|AT1G15510 F00001G001 91

Artha|AT1G16480 F00001G001 91

Artha|AT1G18485 F00001G001 91

Artha|AT2G27610 F00001G001 91

Artha|AT3G02010 F00001G001 91

Artha|AT3G02330 F00001G001 91

Artha|AT3G03580 F00001G001 91

Artha|AT3G16610 F00001G001 91

#Reorganize columns

$ awk 'NR!=1{split($1,a,"[|]");print a[1]"-"$2"\t"a[1]"\t"a[2]}' DoubleMCL.txt > gene_family.tsv

Potrs-F20394G001 Potrs Potrs148330g33501.1

Potrs-F20395G001 Potrs Potrs148631g31943.1

Potrs-F20396G001 Potrs Potrs148776g37409.1

Potrs-F20397G001 Potrs Potrs149124g31224.1

Potrs-F20398G001 Potrs Potrs149639g33624.1

Potrs-F20399G001 Potrs Potrs149792g30885.1

Potrs-F20400G001 Potrs Potrs150685g28586.2

Potrs-F20401G001 Potrs Potrs160539g32979.1

Potrs-F20402G001 Potrs Potrs163163g28886.1

#Add Potri prefix, extract only the .1 transcripts and remove .1

$ awk '{split($1,a,"-");if($2=="Artha"){print $1"\t"$3}else{if(substr($3,length($3)-1,length($3))==".1"){split($3,b,".");if(a[1]=="Potri"){print $1"\tPotri"b[1]}else{print $1"\t"b[1]}}}}' gene_family.tsv > gene_family_results.tsv

Artha-F00001G001 AT1G11290

Artha-F00001G001 AT1G15510

Artha-F00001G001 AT1G16480

Artha-F00001G001 AT1G18485

Artha-F00001G001 AT2G27610

Potrs-F20393G001 Potrs148273g33392

Potrs-F20394G001 Potrs148330g33501

Potrs-F20395G001 Potrs148631g31943

Potrs-F20396G001 Potrs148776g37409

#concat ids by gene family ids*

$ awk '{a[$1]=a[$1]?a[$1]";"$2:$2;}END{for (i in a)print i"\t"a[i];}' gene_family_results.tsv > gene_family_load _db.tsv

Artha-F13614G001 AT1G44941

Potrs-F01031G005 Potrs008446g12584

Potri-F04001G002 Potri004G194300

Potri-F04898G001 Potri001G209800;Potri003G020500

Potrs-F05903G001 Potrs001577g02793;Potrs001577g35836

Artha-F02860G001 ATCG00120;ATMG01190

Artha-F08292G001 AT4G23885

Potri-F05987G001 Potri005G171000

#Add potri plus “.” and trailing character “;” (Error correction step)

awk '{split($1,a,"-");gsub("Potri","Potri.");print a[2]"\t"a[1]"\t"$2";"}' gene_family_load _db.tsv > db_out.txt

F13614G001 Artha AT1G44941;

F01031G005 Potrs Potrs008446g12584;

F04001G002 Potri Potri.004G194300;

F04898G001 Potri Potri.001G209800;Potri.003G020500;

F05903G001 Potrs Potrs001577g02793;Potrs001577g35836;

F02860G001 Artha ATCG00120;ATMG01190;

F08292G001 Artha AT4G23885;

F05987G001 Potri Potri.005G171000;

F03293G001 Potri Potri.010G158000;

F05987G002 Potri Potri.002G090100;

#Aspen genefamily sequences

awk '{gsub("Artha[|]||Potrs[|]||Potra[|]","");print}' aspen.pep | awk '{if(substr($1,1,1)==">"){split($1,a,".");print a[1]}else{print}}' | awk '{gsub("Potri[|]","Potri.");print}' > aspen_pep.fna

Earlier I used following simple Python script instead this* awk one-liner.

#!/usr/bin/python

def parse(file, store):

f = open(file, 'r')

dic = {}

for i in f:

i = i.strip("\n")

val = i.split("\t")

try:

dic[val[0]] = dic[val[0]] + ";"+ val[1]

except KeyError:

dic[val[0]] = val[1]

f.close()

f = open(store, 'w')

for i in dic.keys():

string = i+"\t"+dic[i]

#print string

f.write(string+"\n")

f.close

if __name__=="__main__":

import sys

if len(sys.argv) > 1:

file = sys.argv[1]

store = sys.argv[2]

parse(file, store)

else:

sys.exit("No input")

Wednesday 21 January 2015

GBrowse_syn data preparation and load into the database

Data preparation:

Here I’m going to use Satsuma and ColaAlignSatsuma programs to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref: http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both executables can be found here(http://satsuma.sourceforge.net/). Let’s assume if we have genome_A.fasta and genme_B.fasta files.

./Satsuma -q genome_A.fasta –t genome_B.fasta –o satsuma_output_folder

./ColaAlignSatsuma -q genome_A.fasta -t genome_B.fasta –s satsuma_output_folder/satsuma_summary.out –o cola_output.aln

This will generate output file (cola_output.aln) similar to following format.

####

genome_B_sequence_x + 1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG 1539

|||||||||| | || |||||||||| || |||||| |||| || | | |||||

genome_A_sequence_y + 170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226

####

genome_B_sequence_x + 1760 AGAAAAGAAAATG 1772

||| ||||||||

genome_A_sequence_y + 446 AGAGGAGAAAATG 458

Now I will use following lines to convert above file into GBrowse synteny format

#Add -sep- for every 3rd consecutive lines

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF > 2

#Remove octothorpes

awk '{gsub("####","");print}' 2 | awk NF > 3

#remove empty aligns

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#remove consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates

sed -i '$!N; /^$.*$\n\1$/!P; D' 4

#Add genome_B to target sequence line

awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

#Add genome_A to query sequence line

awk '{if(substr($1,1,4)=="CLUS"){x=NR;} if(NR==x+3){print "genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' 5 > 6

#Replace pipes with asterisks and switch last two lines

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

#Remove all tmp files

rm 2 3 4 5 6

Final output looks like following

CLUSTAL

genome_B-genome_B_sequence_x(+)/1476-1539 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG

genome_A-(+)/170-226 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG

********** * ** ********** ** ****** **** ** * * *****

CLUSTAL

genome_B-genome_B_sequence_x(+)/1760-1772 AGAAAAGAAAATG

genome_A-(+)/446-458 AGAGGAGAAAATG

*** ********

**************************************************************************

Personal note: I had to translate sequence ids using separate files. Therefore I used following lines

#Add -CLUSTAL- for every 3rd consecutive lines

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_tremula_201_populus_tremuloides.aligns | awk NF > 2

##Remove octothorpes

awk '{gsub("####","");print}' 2 | awk NF > 3

##remove empty aligns

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#sed 'N;s/^$ *$CLUSTAL\n$ *$CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5

#Replace sequence ids using first lift over ids

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}' potra_lift 5 > 6

#Replace sequence ids using second lift over ids

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print "potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potrs_lift 6 > 7

#Replace pine with asterisk and switch second and thrird lines

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 7 | awk NF > 8

#Add CLUSTAL to first line

awk '{if(NR==1){print "CLUSTAL\n"$0}else{print}}' 8 > 9

#Remove consecutive CLUSTAL duplicates

sed -i '$!N; /^$.*$\n\1$/!P; D' 9 > 10

#delete last line if necessary

sed -i '$ d' 10

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align | awk NF > 2

awk '{gsub("####","");print}' 2 | awk NF > 3

awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potra_lift 5 > 6

sed -i '$!N; /^$.*$\n\1$/!P; D' 6

awk '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

**************************************************************************

We can verify and convert the above format into standard clustalw using following Perl script.

#!/usr/bin/perl -w

use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "7" ,

-format => 'clustalw', verbose => 0);

$out = Bio::AlignIO->new(-file => ">clustal_output.aln",

-format => 'clustalw', verbose => 0);

$in->verbose(0);

$out->verbose(0);

while ( my $aln = $in->next_aln ) {

$aln->verbose(-1);

$out->write_aln($aln);

}

Load into the database:

There are two options to load final output (clustal_output.aln) into GBrowse_syn database

1 1.)We can use clustal2hit.pl script and then load to DB using gbrowse_syn_load_alignment_database.pl

clustal2hit.pl clustal_output.aln > clustal_output.tsv

gbrowse_syn_load_alignment_database.pl –u username –p password –d synteny_database –c –v clustal_output.tsv

2 2.)Directly load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl

gbrowse_syn_load_alignments_msa.pl –u username –p password –d synteny_database –c –v clustal_output.aln

Load gff3 files into Database

##Create samtools index from fasta files

Samtools faidx genome_B.fasta

Samtools faidx genome_A.fasta

##convert to gff3

awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_B.fasta.fai > genome_B.gff3

awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_A.fasta.fai > genome_a.gff3

##Load into database

bp_seqfeature_load –u username –p password -d genome_B -c -f genome_B.fasta genome_B.gff3

bp_seqfeature_load –u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3

For more details about GBrowse_syn configuration, please follow this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.

Wednesday 27 November 2013

GBrowse Bio::DB::BigFile




wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

unzip jksrc.zip

cd kent/src/lib

echo $MACHTYPE

x86_64

export MACHTYPE=x86_64

make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC

cd..

 export KENT_SRC=`pwd`

sudo perl -MCPAN -e 'install Bio::DB::BigFile'

Thursday 31 October 2013

Simple awk function to read file line by line and merge columns using unique id

Here is the input:




MA_9270965g0010 PF00010 Helix-loop-helix DNA-binding domain
MA_10437060g0010        PF00082;PF05922 Subtilase family;Peptidase inhibitor I9

Here is the function:


#Start reading file line by line using tab field seperator

awk 'BEGIN{FS="\t"}{

 #split second and third column by ";"

 second_field_array_length=split($2,second_field_array,";");

 third_field_array_length=split($3,third_field_array,";");

 concat_str="";

 #Loop the second column array and merge with 3rd column consecutive element

 for(i=1;i<=second_field_array_length;i++){

  #concat with ";" when count is greater than 0

  if(i>1){

   concat_str=concat_str";"second_field_array[i]"-"third_field_array[i]

  }else{

   concat_str=second_field_array[i]"-"third_field_array[i]

  }

 }

print $1,concat_str

}' Pabies1.0-Pfam-update.txt | head

Here is the output:




MA_9270965g0010 PF00010-Helix-loop-helix DNA-binding domain
MA_10437060g0010 PF00082-Subtilase family;PF05922-Peptidase inhibitor I9

Saturday 23 February 2013

Galaxy Bowttie/BWA/BLAST tools/dataset restricted access per user

Add following line to bowtie_wrapper.xml/bwa_wrapper.xml


      --email=$__user_email__

Add following lines to bowtie_wrapper.py/bwa_wrapper.py


    parser.add_option( '', '--email', dest='email', help='email' )

Above line should add before the options, args) = parser.parse_args() line After the stdout = '' add following lines.
if options.ref in "[dataset paths]" :
if options.email in "[email addresses]":
sys.stdout.write( 'Permission granted! ' )
else:
stop_err( 'You dont have permissions!' )
else:
sys.stdout.write( 'Permission granted! ' )

Retart the Galaxy! Now you can restrcit tool datasets per user without having two Galaxy instances.

/Chanaka

Tuesday 5 February 2013

Symlink all files from a base directory to a target directory

for f in $(ls -d /base/*); do ln -s $f /target; done && ls -al /target

About blog

It's nice to have a strong coexistence between Bioinformatics and Interaction design. Why? Most of the biologist are not computer experts but they need computer support, new technologies/tools to continue their research. therefore, interaction design can play major role in bioinformatics by designing user centric, simple, interactive tools.

In this blog, I am trying to share information, that allow to share with public. This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.