Bioinformatics and Interaction Design: Simple awk function to read file line by line and merge columns using unique id

Thursday, 31 October 2013

Simple awk function to read file line by line and merge columns using unique id

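Here is the input. It has three tab-separated columns: a unique id, one or more Pfam accessions, and the matching Pfam descriptions. Multiple values within a column are separated by ";":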




MA_9270965g0010 PF00010 Helix-loop-helix DNA-binding domain
MA_10437060g0010        PF00082;PF05922 Subtilase family;Peptidase inhibitor I9

Here is the function:


#Read the file line by line using tab as the field separator; head limits the output to the first 10 lines
awk 'BEGIN{FS="\t"}{
 #Split the second and third columns on ";"
 second_field_array_length=split($2,second_field_array,";");
 third_field_array_length=split($3,third_field_array,";");
 concat_str="";
 #Loop over the second column array and pair each element with the element at the same position in the third column array
 for(i=1;i<=second_field_array_length;i++){
  #Prepend ";" for every element after the first
  if(i>1){
   concat_str=concat_str";"second_field_array[i]"-"third_field_array[i]
  }else{
   concat_str=second_field_array[i]"-"third_field_array[i]
  }
 }
 #Print the id and the merged column, separated by a space (the default output field separator)
 print $1,concat_str
}' Pabies1.0-Pfam-update.txt | head

Here is the output:




MA_9270965g0010 PF00010-Helix-loop-helix DNA-binding domain
MA_10437060g0010 PF00082-Subtilase family;PF05922-Peptidase inhibitor I9
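
To try this without the original Pabies1.0-Pfam-update.txt file, here is a minimal sketch that writes the two input rows above to a sample file and runs a condensed version of the same logic. The file name sample.tsv, the function name merge_pfam and the variable names ids, names, sep and out are placeholders made up for this example, not part of the original command.

```shell
# Hypothetical sample file holding the two input rows, tab-separated.
printf 'MA_9270965g0010\tPF00010\tHelix-loop-helix DNA-binding domain\n' > sample.tsv
printf 'MA_10437060g0010\tPF00082;PF05922\tSubtilase family;Peptidase inhibitor I9\n' >> sample.tsv

# merge_pfam (hypothetical name): pair each ";"-separated value in column 2
# with the value at the same position in column 3.
merge_pfam() {
  awk 'BEGIN{FS="\t"}{
    n = split($2, ids, ";")
    split($3, names, ";")
    out = ""
    sep = ""
    for (i = 1; i <= n; i++) {
      out = out sep ids[i] "-" names[i]
      sep = ";"   # every pair after the first is preceded by ";"
    }
    print $1, out
  }' "$@"
}

merge_pfam sample.tsv
```

Using a separator variable that starts empty and becomes ";" after the first pair replaces the if/else inside the loop; the result for these two rows should match the output shown above.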
