Wednesday 25 May 2016

Merge two files by coulm ids using AWK

Input file1

Potrx000002g00010.0 GO:0005576
Potrx000002g00010.0 GO:0005618
Potrx000002g00010.0 GO:0005634
Potrx000002g00010.0 GO:0006952
Potrx000002g00010.0 GO:0009698
Potrx000002g00010.0 GO:1901564
Potrx000002g00010.0 GO:1901576
Potrx000002g00020.0 GO:0005576
Potrx000002g00020.0 GO:0005618
Potrx000002g00020.0 GO:0005739

Input file2

GO:0000001 mitochondrion inheritance biological_process
GO:0000002 mitochondrial genome maintenance biological_process
GO:0000003 reproduction biological_process
GO:0000005 obsolete ribosomal chaperone activity molecular_function
GO:0000006 high-affinity zinc uptake transporter activity molecular_function
GO:0000007 low-affinity zinc ion transmembrane activity molecular_function
GO:0000008 obsolete thioredoxin molecular_function
GO:0000009 alpha-1,6-mannosyltransferase activity molecular_function
GO:0000010 trans-hexaprenyltranstransferase activity molecular_function
GO:0000011 vacuole inheritance biological_process

Both are tab separated files and number of rows in file1 is larger than  number of rows file 2. Nowe we are going to merge both files by file1 column 2 id and file 2 column 1 id.

Solution

awk '
BEGIN{FS=OFS="\t"}  # Field seperator and Output Fiels seerator equal to \t
NR==FNR             # FNR is the current record number in the current file. FNR is                     #incremented each time a new record is read. NR is the number                       #of input records.
 {
  file_1_array[$1]=$0 OFS $2;next; # Add file1 into an array
 }
  {
   for(x in file_1_array)          # Loop through the file1 big array
     {if(x==$2) print $1,$2,$3, file_1_array[x]} # print when it macthes
   }' 
file2 file1

All in one line
awk 'BEGIN{FS=OFS="\t"}NR==FNR{file_1_array[$1]=$0 OFS $2;next;}{for(x in file_1_array){if(x==$2) print $1, file_1_array[x]}}' file2 file1 |head 

Output
Potrx000002g00010.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000002g00020.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000002g00030.0 GO:0005634 nucleus cellular_component nucleus
Potrx000003g00010.0 GO:0005634 nucleus cellular_component nucleus
Potrx000003g00020.0 GO:0005634 nucleus cellular_component nucleus
Potrx000006g00010.0 GO:0005576 extracellular region cellular_component extracellular region
Potrx000008g00010.0 GO:0005743 mitochondrial inner membrane cellular_component mitochondrial inner membrane
Potrx000013g00010.0 GO:0000325 plant-type vacuole cellular_component plant-type vacuole
Potrx000014g00010.0 GO:0005743 mitochondrial inner membrane cellular_component mitochondrial inner membrane

Potrx000015g00010.0 GO:0005783 endoplasmic reticulum cellular_component endoplasmic reticulum