Bioinformatics and Interaction Design: October 2013

Thursday, 31 October 2013

Simple awk script to read a file line by line and merge columns by unique ID

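Here is the input (three tab-separated columns: unique ID, Pfam IDs, Pfam descriptions; multiple values within a column are separated by ";"):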




MA_9270965g0010 PF00010 Helix-loop-helix DNA-binding domain
MA_10437060g0010        PF00082;PF05922 Subtilase family;Peptidase inhibitor I9

Here is the function:


#Read the file line by line using tab as the field separator
awk 'BEGIN{FS="\t"}{
 #split the second and third columns on ";"
 second_field_array_length=split($2,second_field_array,";");
 third_field_array_length=split($3,third_field_array,";");
 concat_str="";
 #Loop over the second-column array and pair each element with the matching third-column element
 for(i=1;i<=second_field_array_length;i++){
  #prefix with ";" for every pair after the first
  if(i>1){
   concat_str=concat_str";"second_field_array[i]"-"third_field_array[i]
  }else{
   concat_str=second_field_array[i]"-"third_field_array[i]
  }
 }
 #print the ID and the merged pairs, separated by a space (the default OFS)
 print $1,concat_str
}' Pabies1.0-Pfam-update.txt | head

Here is the output:




MA_9270965g0010 PF00010-Helix-loop-helix DNA-binding domain
MA_10437060g0010 PF00082-Subtilase family;PF05922-Peptidase inhibitor I9
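
To try the same merge without the original file, here is a minimal sketch that wraps the awk program in a shell function and feeds it the two sample rows through printf. The function name merge_pfam and the shortened variable names are my own, not part of the script above. The warning on mismatched column lengths is also an added assumption: the original script silently pairs a missing description with an empty string.

```shell
# Hypothetical helper (name is illustrative): reads tab-separated rows on stdin
# and merges the ";"-separated IDs in column 2 with the descriptions in column 3.
merge_pfam() {
  awk 'BEGIN { FS = "\t" }
  {
    n_ids  = split($2, ids, ";")
    n_desc = split($3, descs, ";")
    # Added assumption: report rows where the two columns differ in length
    if (n_ids != n_desc)
      print "warning: line " NR ": " n_ids " IDs but " n_desc " descriptions" > "/dev/stderr"
    merged = ""
    for (i = 1; i <= n_ids; i++)
      merged = merged (i > 1 ? ";" : "") ids[i] "-" descs[i]
    # Default OFS is a single space, as in the original script
    print $1, merged
  }'
}

# The two sample rows from the post, with explicit tab separators
printf 'MA_9270965g0010\tPF00010\tHelix-loop-helix DNA-binding domain\n' | merge_pfam
printf 'MA_10437060g0010\tPF00082;PF05922\tSubtilase family;Peptidase inhibitor I9\n' | merge_pfam
```

Because the function reads standard input, it can also be used on the real file as merge_pfam < Pabies1.0-Pfam-update.txt | head.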

About blog

It's nice to have a strong coexistence between bioinformatics and interaction design. Why? Most biologists are not computer experts, but they need computer support and new technologies and tools to continue their research. Interaction design can therefore play a major role in bioinformatics by designing user-centric, simple, interactive tools.

In this blog, I try to share information that I am allowed to share with the public. This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.