Thursday, November 18, 2010

Awk made easy..

AWK SYNTAX:
-----------

awk 'BEGIN{xxx} {yyy} END{zzz}'

xxx - Do all the initializations
yyy - Do the processing/calculations on the input data
zzz - Do the final action, such as printing the result

Basically, BEGIN and END are special blocks.
The yyy block is the main part, which is executed once for each input line.
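As a minimal sketch of that flow (the sample input and the names `n` and `result` are made up for illustration), the following counts input lines. BEGIN prints a header once, the main block runs once per line, and END reports the count:

```shell
# BEGIN runs once before any input is read,
# the main block runs once per input line,
# END runs once after the last line.
result=$(printf 'a\nb\nc\n' | awk 'BEGIN { print "start" } { n++ } END { print n " lines" }')
echo "$result"
```

printf is used here instead of 'echo -e' only because it behaves the same in every POSIX shell.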

EXAMPLES:
----------

1) echo "5 10" | awk 'BEGIN{SUM=0} {SUM=SUM+$1+$2} END{print SUM}'    # One input line
- This will print 15.
- SUM is initialized to 0 once, at the beginning.
- $1 and $2 are the first and second fields of the input line.
- 'print SUM' prints the value once, in the END block.

2) echo -e "5 10 \n 15 20" | awk 'BEGIN{SUM=0} {SUM=SUM+$1+$2} END{print SUM}'    # Two input lines
- This will print 50.
- If there are multiple lines, awk runs the calculation part (the middle block) once for each input line.
- BEGIN is executed only once, at the beginning. That's why SUM is not reset to 0 for every line.
- In the same way, END is executed only once, at the end. That's why the result is printed only once.
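To watch this happen, one option is to print the running total inside the main block as well. This is only an illustrative sketch; the labels "running:" and "final:" and the variable `result` are not part of the original examples:

```shell
# Print the running total after every line; SUM is set to 0 once in BEGIN
# and is never reset between lines.
result=$(printf '5 10\n15 20\n' | awk 'BEGIN { SUM = 0 } { SUM = SUM + $1 + $2; print "running: " SUM } END { print "final: " SUM }')
echo "$result"
```

The two "running:" lines come from the main block (one per input line), and the single "final:" line comes from END.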

MORE EXAMPLES:
--------------

All three sections in awk are optional, as the following cases show.

1) If you don't have anything to initialize, then you can skip the BEGIN part.
E.g., echo -e "5 10 \n 15 20" | awk '{SUM=SUM+$1} END{print SUM}'
- This will print 20, the sum of the first column in the two input rows.

2) If you don't have anything to do at the end, then you can skip the END part.
E.g., echo -e "5 10 \n 15 20" | awk '{SUM=SUM+$1}'
- This adds up the first column but doesn't print anything. Kind of useless.
E.g., echo -e "5 10 \n 15 20" | awk '{print $1}'
- This is very useful. It just prints the first column of each input row.
Output:
5
15
- This is especially useful in scripting, when you want to extract a column or something like that.
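In a script, the extracted column is typically captured with command substitution and then looped over. A small sketch, assuming the values contain no spaces (the names `first_column` and `value` are illustrative):

```shell
# Collect the first column of some sample input into a shell variable.
first_column=$(printf '5 10\n15 20\n' | awk '{ print $1 }')

# Loop over the extracted values. $first_column is deliberately left
# unquoted so the shell splits it into one word per value.
for value in $first_column; do
    echo "got: $value"
done
```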

3) If you just want to do something at the end, then you can skip the calculation part as well.
E.g., echo -e "5 10 \n 15 20" | awk 'END {print $2}'
- This will print 20, which is the 2nd field of the last line. (POSIX awk and gawk keep the last line's fields available in END; some older awk implementations do not.) This is also pretty useless without the calculation part.


SOME REAL WORLD EXAMPLES:
--------------------------

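1) If you want to extract only the size of the files in 'ls -l' output and print it:
E.g., ls -l | awk '{print $5}'    # Very easy, right?
- The first line of 'ls -l' output is usually "total N", which has no 5th field, so it shows up as an empty line.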

2) If you want to extract the size of the files in 'ls -l' and print only the Total size:
E.g., ls -l | awk '{SUM=SUM+$5} END{ print "Total="SUM }'
Print the size in kilobytes (using C-like printf):
E.g., ls -l | awk '{SUM=SUM+$5} END{ printf("Total= %0.3f\n",SUM/1024) }'

3) If you want to print the size & name of the individual files and also the Total size:
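E.g., ls -l | awk '{SUM=SUM+$5 ; print $5,$9} END{print "Total="SUM}'
Note that here we are also printing in the calculation part, which is executed once for each input line.
So we get the individual file sizes and names.
In the END, we print the total size.
The file name is field 9 in the usual 'ls -l' layout, where the date takes three fields (month, day, time). If your ls prints the date in two fields, use $8 instead.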
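Because real 'ls -l' output differs from system to system, here is a reproducible sketch that runs the same kind of command on made-up text shaped like a typical listing. The file names, sizes, and the variables `listing` and `report` are assumptions, and so is the layout with the file name in field 9. The condition `NF >= 9` in front of the first block (NF is awk's count of fields on the line) is an addition that keeps the "total" line out of the per-file output:

```shell
# Sample text shaped like typical `ls -l` output (made-up files, name in field 9).
listing='total 12
-rw-r--r-- 1 user group 2048 Nov 18 10:00 notes.txt
-rw-r--r-- 1 user group 1024 Nov 18 10:05 todo.txt'

# Print size and name for lines with at least 9 fields, sum field 5 on
# every line (the "total 12" line has no field 5, so it adds 0),
# then print the total in kilobytes at the end.
report=$(printf '%s\n' "$listing" | awk 'NF >= 9 { print $5, $9 } { SUM = SUM + $5 } END { printf("Total= %0.3f\n", SUM/1024) }')
echo "$report"
```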

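Special variables in awk:
FS  - Field separator (default: runs of blanks, i.e. spaces and tabs)
NR  - Number of the current line, counted across all input files
FNR - Number of the current line within the current file
$0  - The entire current line (all of its fields)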
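The difference between NR and FNR only shows up with more than one input file. A quick sketch, assuming the mktemp command is available to create two throwaway files (`f1`, `f2`, and `result` are illustrative names):

```shell
# Two small temporary files (their names come from mktemp).
f1=$(mktemp)
f2=$(mktemp)
printf 'a\nb\n' > "$f1"
printf 'c\nd\n' > "$f2"

# NR keeps counting across files; FNR restarts at 1 for each file.
result=$(awk '{ print NR, FNR, $0 }' "$f1" "$f2")
echo "$result"

# Clean up the temporary files.
rm -f "$f1" "$f2"
```

Each output line shows NR, then FNR, then the line itself.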


4) In the above examples, 'ls -l' output is space separated. What if it is separated by a ":" or some other delimiter?
Then we have to specify the delimiter ourselves in the BEGIN block (just like the -d option for cut).
E.g., cat /etc/passwd | awk 'BEGIN {FS=":"} {print $1}'
- This will print all the usernames from the /etc/passwd file.
- Note the FS variable. It is the variable used to specify the field separator.
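The same separator can also be given on the command line with awk's -F option. The sketch below runs both forms on made-up records laid out like /etc/passwd (the users alice and bob and the variable names are invented for illustration):

```shell
# Made-up records in the same colon-separated layout as /etc/passwd.
sample='alice:x:1000:1000::/home/alice:/bin/sh
bob:x:1001:1001::/home/bob:/bin/sh'

# Setting FS in BEGIN and passing -F on the command line select the same separator.
# Field 1 is the user name, field 6 the home directory.
with_fs=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = ":" } { print $1, $6 }')
with_flag=$(printf '%s\n' "$sample" | awk -F: '{ print $1, $6 }')
echo "$with_fs"
```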


5) Preceding each line in a file with its line number:
E.g., awk '{ print FNR,$0 }' /etc/passwd
- This will output all the lines, each preceded by its line number. You can also specify multiple files; FNR restarts at 1 for each file.

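6) Pass variables from the shell to awk. Use the -v option and assign the shell variable to an awk variable.
E.g., ls | awk -v a="$PWD" '{ printf("%s/%s\n",a,$0) }'
- Quote the shell variable so that a directory name containing spaces is still passed as a single value.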
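A small sketch of why the quotes around the shell variable matter, using a made-up directory name that contains a space (`prefix` and `result` are illustrative names, and the file names are sample input rather than real 'ls' output):

```shell
# A shell variable containing a space; the double quotes keep it
# as one argument to -v instead of letting the shell split it in two.
prefix="/tmp/my dir"
result=$(printf 'a.txt\nb.txt\n' | awk -v a="$prefix" '{ printf("%s/%s\n", a, $0) }')
echo "$result"
```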



Note: These are some very basic uses of awk. awk is a very powerful tool with a lot of functionality.
Once you get comfortable with these basic use cases, you can easily extend your knowledge of awk further.