Wednesday 25 May 2016

Merge two files by coulm ids using AWK

Input file1

Potrx000002g00010.0 GO:0005576
Potrx000002g00010.0 GO:0005618
Potrx000002g00010.0 GO:0005634
Potrx000002g00010.0 GO:0006952
Potrx000002g00010.0 GO:0009698
Potrx000002g00010.0 GO:1901564
Potrx000002g00010.0 GO:1901576
Potrx000002g00020.0 GO:0005576
Potrx000002g00020.0 GO:0005618
Potrx000002g00020.0 GO:0005739

Input file2

GO:0000001 mitochondrion inheritance biological_process
GO:0000002 mitochondrial genome maintenance biological_process
GO:0000003 reproduction biological_process
GO:0000005 obsolete ribosomal chaperone activity molecular_function
GO:0000006 high-affinity zinc uptake transporter activity molecular_function
GO:0000007 low-affinity zinc ion transmembrane activity molecular_function
GO:0000008 obsolete thioredoxin molecular_function
GO:0000009 alpha-1,6-mannosyltransferase activity molecular_function
GO:0000010 trans-hexaprenyltranstransferase activity molecular_function
GO:0000011 vacuole inheritance biological_process

Both are tab separated files and number of rows in file1 is larger than  number of rows file 2. Nowe we are going to merge both files by file1 column 2 id and file 2 column 1 id.

Solution

awk '
BEGIN{FS=OFS="\t"}  # Field seperator and Output Fiels seerator equal to \t
NR==FNR             # FNR is the current record number in the current file. FNR is                     #incremented each time a new record is read. NR is the number                       #of input records.
 {
  file_1_array[$1]=$0 OFS $2;next; # Add file1 into an array
 }
  {
   for(x in file_1_array)          # Loop through the file1 big array
     {if(x==$2) print $1,$2,$3, file_1_array[x]} # print when it macthes
   }' 
file2 file1

All in one line
awk 'BEGIN{FS=OFS="\t"}NR==FNR{file_1_array[$1]=$0 OFS $2;next;}{for(x in file_1_array){if(x==$2) print $1, file_1_array[x]}}' file2 file1 |head 

Output
Potrx000002g00010.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000002g00020.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000002g00030.0 GO:0005634 nucleus cellular_component nucleus
Potrx000003g00010.0 GO:0005634 nucleus cellular_component nucleus
Potrx000003g00020.0 GO:0005634 nucleus cellular_component nucleus
Potrx000006g00010.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000008g00010.0 GO:0005743 mitochondrial inner membrane cellular_component mitochondrial inner membrane
Potrx000013g00010.0 GO:0000325 plant-type vacuole cellular_component plant-type vacuole
Potrx000014g00010.0 GO:0005743 mitochondrial inner membrane cellular_component mitochondrial inner membrane

Potrx000015g00010.0 GO:0005783 endoplasmic reticulum cellular_component endoplasmic reticulum

Friday 12 June 2015

Aspen Data Preparation

Here are the steps:
$ head DoubleMCL.txt
GeneID   GeneFamily    TotalGenes
Artha|AT1G11290    F00001G001    91
Artha|AT1G15510    F00001G001    91
Artha|AT1G16480    F00001G001    91
Artha|AT1G18485    F00001G001    91
Artha|AT2G27610    F00001G001    91
Artha|AT3G02010    F00001G001    91
Artha|AT3G02330    F00001G001    91
Artha|AT3G03580    F00001G001    91
Artha|AT3G16610    F00001G001    91

#Reorganize columns
$ awk 'NR!=1{split($1,a,"[|]");print a[1]"-"$2"\t"a[1]"\t"a[2]}' DoubleMCL.txt > gene_family.tsv
Potrs-F20394G001   Potrs    Potrs148330g33501.1
Potrs-F20395G001   Potrs    Potrs148631g31943.1
Potrs-F20396G001   Potrs    Potrs148776g37409.1
Potrs-F20397G001   Potrs    Potrs149124g31224.1
Potrs-F20398G001   Potrs    Potrs149639g33624.1
Potrs-F20399G001   Potrs    Potrs149792g30885.1
Potrs-F20400G001   Potrs    Potrs150685g28586.2
Potrs-F20401G001   Potrs    Potrs160539g32979.1
Potrs-F20402G001   Potrs    Potrs163163g28886.1

#Add Potri prefix, extract only the .1 transcripts and remove .1
$ awk '{split($1,a,"-");if($2=="Artha"){print $1"\t"$3}else{if(substr($3,length($3)-1,length($3))==".1"){split($3,b,".");if(a[1]=="Potri"){print $1"\tPotri"b[1]}else{print $1"\t"b[1]}}}}' gene_family.tsv > gene_family_results.tsv
Artha-F00001G001   AT1G11290
Artha-F00001G001   AT1G15510
Artha-F00001G001   AT1G16480
Artha-F00001G001   AT1G18485
Artha-F00001G001   AT2G27610
.
.
Potrs-F20393G001   Potrs148273g33392
Potrs-F20394G001   Potrs148330g33501
Potrs-F20395G001   Potrs148631g31943
Potrs-F20396G001   Potrs148776g37409

#concat ids by gene family ids*
$ awk '{a[$1]=a[$1]?a[$1]";"$2:$2;}END{for (i in a)print i"\t"a[i];}' gene_family_results.tsv > gene_family_load _db.tsv
Artha-F13614G001   AT1G44941
Potrs-F01031G005   Potrs008446g12584
Potri-F04001G002   Potri004G194300
Potri-F04898G001   Potri001G209800;Potri003G020500
Potrs-F05903G001   Potrs001577g02793;Potrs001577g35836
Artha-F02860G001   ATCG00120;ATMG01190
Artha-F08292G001   AT4G23885
Potri-F05987G001   Potri005G171000

#Add potri plus “.” and trailing character “;” (Error correction step)
awk '{split($1,a,"-");gsub("Potri","Potri.");print a[2]"\t"a[1]"\t"$2";"}' gene_family_load _db.tsv > db_out.txt
F13614G001    Artha    AT1G44941;
F01031G005    Potrs    Potrs008446g12584;
F04001G002    Potri    Potri.004G194300;
F04898G001    Potri    Potri.001G209800;Potri.003G020500;
F05903G001    Potrs    Potrs001577g02793;Potrs001577g35836;
F02860G001    Artha    ATCG00120;ATMG01190;
F08292G001    Artha    AT4G23885;
F05987G001    Potri    Potri.005G171000;
F03293G001    Potri    Potri.010G158000;
F05987G002    Potri    Potri.002G090100;

#Aspen genefamily sequences
awk '{gsub("Artha[|]||Potrs[|]||Potra[|]","");print}' aspen.pep | awk '{if(substr($1,1,1)==">"){split($1,a,".");print a[1]}else{print}}' | awk '{gsub("Potri[|]","Potri.");print}' > aspen_pep.fna


Earlier I used following simple Python script instead this* awk one-liner.
#!/usr/bin/python
def parse(file, store):
        f = open(file, 'r')
        dic = {}
        for i in f:
                i = i.strip("\n")
                val = i.split("\t")
                try:
                        dic[val[0]] = dic[val[0]] + ";"+ val[1]
                except KeyError:
                        dic[val[0]] = val[1]
        f.close()
        f = open(store, 'w')
        for i in dic.keys():
                string = i+"\t"+dic[i]
                #print string
                f.write(string+"\n")
        f.close

if __name__=="__main__":
        import sys
        if len(sys.argv) > 1:
                file = sys.argv[1]
                store = sys.argv[2]
                parse(file, store)
        else:
                sys.exit("No input")


Wednesday 21 January 2015

GBrowse_syn data preparation and load into the database

Data preparation:

Here I’m going to use Satsuma and ColaAlignSatsuma programs to align two genomes. Satsuma is quite faster than ClustalW or BLAST (ref: http://bioinformatics.oxfordjournals.org/content/26/9/1145.long). Both executables can be found here(http://satsuma.sourceforge.net/). Let’s assume if we have genome_A.fasta and genme_B.fasta files.

./Satsuma -q genome_A.fasta  –t genome_B.fasta –o satsuma_output_folder
./ColaAlignSatsuma -q genome_A.fasta  -t  genome_B.fasta –s satsuma_output_folder/satsuma_summary.out  –o cola_output.aln

This will generate output file (cola_output.aln) similar to following format.

####
genome_B_sequence_x +       1476 AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG 1539
                         |||||||||| | || ||||||||||        || ||||||   |||| || | |  |||||
genome_A_sequence_y +        170 AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG 226
####
genome_B_sequence_x +       1760 AGAAAAGAAAATG 1772
                         |||  ||||||||
genome_A_sequence_y +        446 AGAGGAGAAAATG 458

Now I will use following lines to convert above file into GBrowse synteny format

#Add -sep- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' cola_output.align | awk NF  > 2

#Remove octothorpes
awk '{gsub("####","");print}' 2 | awk NF > 3

#remove empty aligns
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#remove consecutive duplicate lines. Main goal is to remove CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 4

#Add genome_B to target sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\ngenome_B-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5

#Add genome_A to query sequence line
awk '{if(substr($1,1,4)=="CLUS"){x=NR;} if(NR==x+3){print "genome_A-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' 5 > 6

#Replace pipes with asterisks and switch last two lines
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

#Remove all tmp files
rm 2 3 4 5 6

Final output looks like following

CLUSTAL
genome_B-genome_B_sequence_x(+)/1476-1539       AGAATGGGCAAATTCTTAGAGACAGAAAGTAGAATAAAGGTTATTGTGAGATGGGAGGAAGGGG
genome_A-(+)/170-226    AGAATGGGCATAGTCGTAGAGACAGA-------GTAGAGGTTACCATGAGTTGAGGGTGAGGGG
                         ********** * ** **********        ** ******   **** ** * *  *****
CLUSTAL
genome_B-genome_B_sequence_x(+)/1760-1772       AGAAAAGAAAATG
genome_A-(+)/446-458    AGAGGAGAAAATG
                         ***  ********

**************************************************************************
Personal note: I had to translate sequence ids using separate files. Therefore I used following lines
#Add -CLUSTAL- for every 3rd consecutive lines
awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_tremula_201_populus_tremuloides.aligns | awk NF  > 2

##Remove octothorpes
awk '{gsub("####","");print}' 2 | awk NF > 3

##remove empty aligns
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4

#sed 'N;s/^\( *\)CLUSTAL\n\( *\)CLUSTAL$/\1CLUSTAL\2/;t;P;D' 4 > 5

#Replace sequence ids using first lift over ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+1 || NR==length(field1)+1){if($5!=""){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}}else{print}}' potra_lift 5 > 6

#Replace sequence ids using second lift over ids
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+3){print "potrs-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potrs_lift 6 > 7

#Replace pine with asterisk and switch second and thrird lines
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 7 | awk NF > 8

#Add CLUSTAL to first line
awk '{if(NR==1){print "CLUSTAL\n"$0}else{print}}' 8 > 9

#Remove consecutive CLUSTAL duplicates
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 9 > 10

#delete last line if necessary
 sed -i '$ d' 10

OR

awk 'BEGIN { RS="\0" } { gsub(/\n{3,}/,"\nCLUSTAL\n")}1' populus_trichocarpa_populus_tremula_201.align | awk NF  > 2
awk '{gsub("####","");print}' 2 | awk NF > 3
awk '$5!="" || $1=="CLUSTAL" || substr($1,1,1)=="|"' 3 > 4
awk '{if(substr($1,1,4)=="CLUS"){x=NR; next} if(NR==x+1){print "CLUSTAL\npotri-"$1"("$2")/"$3"-"$5"\t"$4}else{print}}' 4 > 5
awk '{if(substr($1,1,4)=="CLUS"){x=NR}}NR == FNR {field1[$1]=$2; next}{if(NR==x+3 || NR==length(field1)+4){print "potra-"field1[$1]"("$2")/"$3"-"$5"\t"$4}else{print}}' potra_lift 5 > 6
sed -i '$!N; /^\(.*\)\n\1$/!P; D' 6
awk  '{if(substr($NF,1,1)=="|"){x=$0;next;}else{gsub("[|]","*",x);print $0"\n"x;x=""}}' 6 | awk NF > 7

**************************************************************************
We can verify and convert the above format into standard clustalw using following Perl script.

#!/usr/bin/perl -w
use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "7" ,
                         -format => 'clustalw', verbose => 0);
$out = Bio::AlignIO->new(-file => ">clustal_output.aln",
                         -format => 'clustalw', verbose => 0);
$in->verbose(0);
$out->verbose(0);
while ( my $aln = $in->next_aln ) {
  $aln->verbose(-1);
  $out->write_aln($aln);
}

Load into the database:
There are two options to load final output (clustal_output.aln) into GBrowse_syn database

1     1.)We can use clustal2hit.pl script and then load to DB using gbrowse_syn_load_alignment_database.pl
clustal2hit.pl clustal_output.aln > clustal_output.tsv
gbrowse_syn_load_alignment_database.pl –u username –p password –d synteny_database –c –v clustal_output.tsv

2     2.)Directly load into GBrowse synteny database using gbrowse_syn_load_alignments_msa.pl
gbrowse_syn_load_alignments_msa.pl –u username –p password –d synteny_database –c –v clustal_output.aln

Load gff3 files into Database

##Create samtools index from fasta files
Samtools faidx genome_B.fasta
Samtools faidx genome_A.fasta

##convert to gff3
awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_B.fasta.fai > genome_B.gff3
awk '{print $1"\tplantgenie\tscaffold\t1\t"$2"\t.\t.\t.\tName="$1}' genome_A.fasta.fai > genome_a.gff3

##Load into database
bp_seqfeature_load –u username –p password -d genome_B -c -f genome_B.fasta genome_B.gff3
bp_seqfeature_load –u username –p password -d genome_A -c -f genome_A.fasta genome_A.gff3

For more details about GBrowse_syn configuration, please follow this(http://gmod.org/wiki/GBrowse_syn_Configuration) tutorial.



Wednesday 27 November 2013

GBrowse Bio::DB::BigFile


wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
unzip jksrc.zip
cd kent/src/lib
echo $MACHTYPE
x86_64
export MACHTYPE=x86_64
make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC
cd..
export KENT_SRC=`pwd`
sudo perl -MCPAN -e 'install Bio::DB::BigFile'

Thursday 31 October 2013

Simple awk function to read file line by line and merge columns using unique id

Here is the input:

MA_9270965g0010 PF00010 Helix-loop-helix DNA-binding domain
MA_10437060g0010        PF00082;PF05922 Subtilase family;Peptidase inhibitor I9



Here is the function:
#Start reading file line by line using tab field seperator
awk 'BEGIN{FS="\t"}{
 #split second and third column by ";"
 second_field_array_length=split($2,second_field_array,";");
 third_field_array_length=split($3,third_field_array,";");
 concat_str="";
 #Loop the second column array and merge with 3rd column consecutive element
 for(i=1;i<=second_field_array_length;i++){
  #concat with ";" when count is greater than 0
  if(i>1){
   concat_str=concat_str";"second_field_array[i]"-"third_field_array[i]
  }else{
   concat_str=second_field_array[i]"-"third_field_array[i]
  }
 }
print $1,concat_str
}' Pabies1.0-Pfam-update.txt | head


Here is the output:

MA_9270965g0010 PF00010-Helix-loop-helix DNA-binding domain
MA_10437060g0010 PF00082-Subtilase family;PF05922-Peptidase inhibitor I9


Saturday 23 February 2013

Galaxy Bowttie/BWA/BLAST tools/dataset restricted access per user

Add following line to bowtie_wrapper.xml/bwa_wrapper.xml 
--email=$__user_email__ 

Add following lines to bowtie_wrapper.py/bwa_wrapper.py 
parser.add_option( '', '--email', dest='email', help='email' ) 

Above line should add before the options, args) = parser.parse_args() line After the stdout = '' add following lines.
if options.ref in "[dataset paths]" :
  if options.email in "[email addresses]":
    sys.stdout.write( 'Permission granted! ' )
  else:
    stop_err( 'You dont have permissions!' ) 
else: 
 sys.stdout.write( 'Permission granted! ' ) 

Retart the Galaxy! Now you can restrcit tool datasets per user without having two Galaxy instances.

/Chanaka

Tuesday 5 February 2013

Symlink all files from a base directory to a target directory


for f in $(ls -d /base/*); do ln -s $f /target; done && ls -al /target