Thursday 31 October 2013

Simple awk function to read a file line by line and merge columns using a unique ID

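Here is the input (three tab-separated columns: unique ID, Pfam accessions, Pfam descriptions; multiple values within a column are separated by ";"):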

MA_9270965g0010 PF00010 Helix-loop-helix DNA-binding domain
MA_10437060g0010        PF00082;PF05922 Subtilase family;Peptidase inhibitor I9



Here is the function:
#Start reading file line by line using tab as the field separator
awk 'BEGIN{FS="\t"}{
 #split second and third column by ";"
 second_field_array_length=split($2,second_field_array,";");
 third_field_array_length=split($3,third_field_array,";");
 concat_str="";
 #Loop over the second column array and pair each element with the 3rd column element at the same index
 for(i=1;i<=second_field_array_length;i++){
  #prepend ";" for every pair after the first one
  if(i>1){
   concat_str=concat_str";"second_field_array[i]"-"third_field_array[i]
  }else{
   concat_str=second_field_array[i]"-"third_field_array[i]
  }
 }
print $1,concat_str
}' Pabies1.0-Pfam-update.txt | head


Here is the output:

MA_9270965g0010 PF00010-Helix-loop-helix DNA-binding domain
MA_10437060g0010 PF00082-Subtilase family;PF05922-Peptidase inhibitor I9
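
The same pairing logic can be sketched a little more compactly. This is only an illustrative variant, not a replacement for the function above: the helper name merge_pfam and the array names ids and names are made up for this example, and it assumes, like the original, that columns 2 and 3 always hold the same number of ";"-separated values.

```shell
# Hypothetical helper: pairs the i-th value of column 2 with the i-th value of column 3.
# Reads the files given as arguments, or standard input when none are given.
merge_pfam() {
  awk 'BEGIN { FS = "\t" } {
    n = split($2, ids, ";")      # accessions from column 2
    split($3, names, ";")        # descriptions from column 3
    out = ""
    for (i = 1; i <= n; i++) {
      # the separator is empty for the first pair and ";" afterwards
      out = out (i > 1 ? ";" : "") ids[i] "-" names[i]
    }
    print $1, out                # default output separator is a single space
  }' "$@"
}

# Recreate the two sample records with real tab characters and run the helper on them
printf 'MA_9270965g0010\tPF00010\tHelix-loop-helix DNA-binding domain\n' | merge_pfam
printf 'MA_10437060g0010\tPF00082;PF05922\tSubtilase family;Peptidase inhibitor I9\n' | merge_pfam
```

Wrapping the awk program in a shell function makes it easy to try on a few hand-written lines before running it on the full file, for example with merge_pfam Pabies1.0-Pfam-update.txt | head.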