Getting Familiar with awk

While the shell provides most of the tools anyone could ever need to do just about anything that could ever need doing, there's always something else out there that goes beyond plain shell programming. One addition to UNIX/Linux is particularly fascinating since it is a programming language in itself while offering all the power of the shell. I'm talking about awk of course.
awk is best suited for building, searching and modifying UNIX/Linux style databases. A UNIX/Linux database is typically composed of records divided into fields. A record is usually a line terminated by a newline character. A field is a string of characters within a record that is separated from every other field by a "white space" (a space, tab or newline) or some other delimiter. A good example of a database is the /etc/passwd file. It is plain text, it's delimited into fields by the colon character and each line is a separate record.
bil:AbCdEfGhIjKl:106:1:Bill Conner:/usr/bill:/bin/ksh
The first field goes from the beginning of the line up to the delimiter character (here, a colon, :). There are 7 fields in this record. By using awk, you can view only the record and fields that interest you,
awk 'BEGIN {FS=":"} {print $1,$5,$7}' /etc/passwd
Here awk is simply called as a command followed by the single quote. Everything up to the last single quote is an argument to awk. The reason for the single quotes is that the shell does no interpreting of any characters within the quotes; everything is passed to awk just the way you typed it in. Without the single quotes, the shell would have interpreted the curly braces ({}) and accused you of a syntax error, it would have seen the dollar signs followed by numbers as positional parameter variables and it would have interpreted the word BEGIN and the characters {FS=":"} as separate arguments. There would have been massive confusion. Thus, the single quotes.
An awk command or program can consist of three pieces: that which is processed before any records or fields, the processing of the records and fields and finally, stuff you might want to add after all the other processing is finished. In the command above, the BEGIN statement tells awk to do everything between the curly braces before doing anything else. In this case we're telling awk to substitute the colon for the default white-space as the field separator. This forces awk to delimit fields in the records to be processed next with a colon. The right curly brace terminates the BEGIN statement. While the line can be continued, it's easier to begin a new line for each awk statement, it's also easier to read later when you've forgotten what the program is supposed to do.
The next line is also enclosed in the curly braces and is the body of the program where the record and field processing gets done.
{print $1,$5,$7}' /etc/passwd
Here we are just printing some fields from every record in the file /etc/passwd. The first field is the login name of a user, the fifth field is the user's full name and the seventh field is the shell that user is using (count 'em).
bil:ZHq23!rtEE:106:1:Bill Conner:/usr/bill:/bin/ksh
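The three pieces (BEGIN, the body and END) can be seen working together in a small sketch. The two sample records and the function name show_users are made up for illustration, and the records are fed in with printf so that nothing depends on what is in your real /etc/passwd.

```shell
# Hypothetical sample records in /etc/passwd format (not read from the real file).
show_users() {
    printf '%s\n' \
        'bil:x:106:1:Bill Conner:/usr/bill:/bin/ksh' \
        'bin:x:2:2:Admin:/bin:/bin/sh' |
    awk '
    BEGIN { FS=":"; print "Login Shell" }   # runs once, before any record is read
          { print $1, $7; n++ }             # runs once for every record
    END   { print n " users" }              # runs once, after the last record
    '
}
show_users
```

The heading comes out first, then one line per record, then the count from the END statement.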
Ok, this isn't very impressive, but it gets better. Suppose I only want to look at one record (line) in /etc/passwd instead of all of them. Regular expressions are supported so it's possible to restrict awk to just processing records that contain the regexp,
awk 'BEGIN {FS=":"} $1 ~ /bil/ {print $1,$5,$7}' /etc/passwd
The regexp is read as "If the value of the first field of the current record matches the regular expression, do the command inside the curly braces. Otherwise get the next record". The $1 identifies the first field (remember these constructs are not the same as the shell - they are talking to awk only). The tilde character (~) is shorthand for "check the operands on either side to see if they match". The regexp is enclosed by the forward slash characters (/). Since /etc/passwd has many records, we have used a regexp to restrict awk's processing to just the record that matches the regexp, this should return just one line. All the regexp notations are valid so you have access to the full suite of pattern matching capabilities to ensure that awk processes only what you want to see.
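Whether a regexp returns just one line depends on how tightly it is written. The sketch below is an illustration only: the sample records and the function name match_login are invented, and it uses awk's -v option (standard in POSIX awk, though not covered in this article) to hand the regexp to awk as a variable.

```shell
# Hypothetical helper: print the login names whose first field matches the regexp in $1.
match_login() {
    printf '%s\n' \
        'bil:x:106:1:Bill Conner:/usr/bill:/bin/ksh' \
        'mobil:x:107:1:Gas Station:/usr/mobil:/bin/sh' \
        'root:x:0:0:Super User:/:/bin/sh' |
    awk -v re="$1" 'BEGIN {FS=":"} $1 ~ re {print $1}'
}
match_login 'bil'      # unanchored: matches both bil and mobil
match_login '^bil$'    # anchored at both ends: matches bil only
```

Anchoring the regexp with ^ and $ is the easy way to make sure only the exact login name matches.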
This next command is more ambitious though it really doesn't do anything special (it's yet another way to list files in a directory),
ls -la "$1" | awk '
BEGIN { OFS="\t\t"
        print "\nFile\t\t\tSize\tType" }
{ sum=sum+$5 }
NR > 1 {printf "%-14s %12d %s\n",$9,$5,$1}
END { print "\nTotal bytes in "NR-1" Files: "sum "\nTotal Blocks: "sum/512}'
This time a command is piped into awk whereas the last time awk was given a file name to operate on. For readability the commands are on separate lines.
There is a BEGIN statement again but this time we don't need to change the field separator (FS) since the ls command uses white space to delimit its output. But we're doing things a little differently this time. The ls command output is separated by spaces so there's no good way to line up the output except to use tabs so we tell awk that the output field separator (OFS) should be the tab character. We're using two tabs (\t\t) to allow room for long file names in the output.
Next I decided to add a header for each column of output,
print "\nFile\t\t\tSize\tType"
This instructs awk to print a newline (\n) so that the header won't display right below the command line. Next some tabs (\t) are inserted to align the headers with the columns and interspersed throughout are the heading titles, File, Size and Type.
The next line defines an awk variable called sum as having the value of the current contents of sum plus the value of whatever is in field five ($5) of the current record. Obviously we're expecting field five to contain a number. If a variable is not initialized, awk assigns it zero if it's numeric or null ("") if it's a character string, so all we need to do here is name the variable to initialize it.
{ sum=sum+$5 }
Since the variable sum wasn't initialized it starts out as zero plus the value of whatever's in field five - it has the value of $5.
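A quick way to convince yourself of this is to print a variable before anything has been assigned to it. This is only a sketch; the function name uninit_demo and the variable name are arbitrary.

```shell
uninit_demo() {
    awk 'BEGIN {
        print sum + 0          # an unset variable used as a number is 0
        print "[" name "]"     # an unset variable used as a string is empty
        sum = sum + 5          # so this works without any setup
        print sum
    }'
}
uninit_demo
```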
Next we use one of awk's built-in variables to begin the print command at the second line of output. Since the first line is record number one (NR == 1), line two is record number two (NR == 2), and since we want to display every line -except- line one, NR > 1 will do all the work. The line reads, "If the number of the current record (NR) is greater than 1 do the following command on the current line (record)".
This time we're using printf instead of print because printf (an awk built-in) can format the output. Notice that the syntax is just like the C syntax for printf.
NR > 1 {printf "%-14s %12d %s\n",$9,$5,$1}
The string to print is enclosed in the double quotes. The percent sign (%) is used to define the format of the output. %s means to print a string, %d will print a decimal number. These formatting directives can have a width, so %-14s means to print the string in a column at least 14 characters wide (a longer string is printed in full, it is not cut off). The minus sign (-) before the width means to left-justify the formatted string; without the minus sign the column is right-justified (the form we'd want for decimal numbers).
So we tell awk to print three fields, the first will be a left-justified field of 14 characters, the next will be a right-justified field of 12 digits and the final field is a string of indeterminate length. Once the output format is defined, close the quotes ("), type a comma and then list the fields to format. The first format, %-14s affects field $9, the next is for $5 and so on.
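To see what the widths and the minus sign really do, it helps to put brackets around each column. In this sketch the file names, sizes and the function name format_demo are invented for illustration.

```shell
format_demo() {
    printf '%s\n' 'notes.txt 105' 'a_very_long_file_name.txt 7' |
    awk '{ printf "[%-14s] [%6d]\n", $1, $2 }'   # left-justified name, right-justified number
}
format_demo
```

The short name is padded out to 14 characters while the long one simply overflows its column, which is why a listing with long file names can look ragged.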
After all that we're ready to wrap up the program. This time we use the END statement to tell awk that, after it has finished processing everything, we want it to display the following,
END { print "\nTotal bytes in "NR-1" Files: "sum "\nTotal Blocks: "sum/512}'
We don't need printf this time so we go back to plain ol' print to display a newline (\n), a string of characters and then the value of the built-in variable NR minus 1 (we are ignoring line one). Next we display another string and the value of the variable sum that we defined earlier and which should contain the value of the sum of $5 for every record processed. And finally we display a block count which is the variable sum divided by 512.
Remember that the entire awk program is enclosed within the single quotes, each command line is enclosed within the curly braces and each print or printf statement requires the string to print to be in double quotes. If you get a syntax error, look for a missing ', ", { or }.
The program is based on the fixed output of the ls command,
ls -la
total 259
drwxr-xr-x 18 bil other 1136 Nov 4 22:51 .
drwxrwxrwx 33 root sys 560 Oct 7 10:14 ..
-rw-r--r-- 1 bil other 105 Mar 4 1993 .exrc
-r-------- 1 bil other 0 Nov 5 16:27 .lastlogin
drwxr-xr-x 2 bil other 64 Dec 3 1992 .mail_file
-rw-r--r-- 1 bil other 306 Mar 5 1993 .mailrc
. . .
Notice that there are 9, white-space separated, fields. Can you see why the first line (record) was excluded in the program? The output of the awk program uses only three of those fields to create this listing,

File                    Size    Type
ALIAS.1                1718 -rw-r--r--
awk.art                9069 -rwxr-xr-x
scripts                5008 drwxr-xr-x
set_pass                117 -rwxr-xr-x
systems                   0 -rw-r--r--
t.2                      32 -rw-r--r--
t.resp                   15 -rw-r--r--
tar                   39516 -rw-r--r--
tmp                      32 drwxr-xr-x

Total bytes in 37 Files: 95161
Total Blocks: 185.861
The heading defined in the BEGIN statement comes first, then the results of the printf command and finally the results of the print command in the END statement. Surprisingly, this little awk program runs pretty fast, proving that awk's processing capabilities are very efficient. While this may not be the best use of awk, it should be sufficient to show the basics and, more importantly, that it's worth experimenting with.
Once you've used awk to create some simple utilities, you've probably discovered that, while it is picky about syntax, it's not that hard to use. If all you used it for was quick database searches or command output formatting you're ahead of most UNIX/Linux users. But awk, like the shell, is full of features that most people never use and fewer still even know exist.
As mentioned before, awk recognizes regular expressions (regexp) in all their variety (see the manual entry for ed for a complete description of regexp). In typical UNIX/Linux fashion, a regular expression is defined within the forward slashes (/) and can be very complex,
/^[Bb].[a-z].*sh$/
Here's an example of a simple regexp. The caret (^) means that the regexp begins at the beginning of a line. If omitted, the regexp can be anywhere on the line. Next the square brackets ([]) enclose members of a set of characters, in this case -either- an upper case B or lower case b. Each set of square brackets represents a single character.
Next is the dot, which matches exactly one character of any kind, and then one character from the set of all lower case characters, [a-z]. The next dot and the asterisk (*) work as a pair: the asterisk means "zero or more of whatever comes just before it", so .* matches any run of characters, including none at all (how many characters are matched depends on how long the line is). Next the characters "sh" are matched as themselves (they have no special meaning so they are just what they appear to be). Finally comes the dollar sign ($) which means the end of the current line. Since the dollar sign immediately follows the characters, sh, the current line must end with those characters.
The regexp above then, reads, "Find a line that begins with either B or b followed by exactly one character of any kind (alphabetic, numeric or punctuation), followed by one lower case alphabetic character, followed by any number of other characters (possibly none) and ends with the characters, sh".
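One way to check a reading like this is to throw a few test lines at the regexp and see which ones survive. The test lines and the function name regexp_demo below are made up for the purpose.

```shell
regexp_demo() {
    printf '%s\n' \
        'bill uses ksh' \
        'B1ll uses csh' \
        'bi1l uses ksh' \
        'bill uses vi' \
        'bsh' \
        'bozsh' |
    awk '/^[Bb].[a-z].*sh$/ { print }'
}
regexp_demo
```

The line with a digit in the second position is printed (the dot takes anything), the line with a digit in the third position is not ([a-z] wants a lower case letter), and bsh is too short because the first three characters are used up before sh can be matched.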
You can of course be even more specific or much less, the point is that a regexp is a shortcut and awk can use them. To prove the point we'll search the /etc/passwd file using a regexp and print the lines (records) that contain the character string that matches. To make it interesting, we'll also print just a couple of fields within the record awk finds.
awk 'BEGIN {FS=":"} $1 ~ /^[Bb].[a-z]/ {print $1,$5}' /etc/passwd
bin 0000-Admin(0000)
bil Bill Conner
Notice that both bin and bil match the regexp search criteria so awk performed the print command on the lines containing the match. Recall that the tilde character (~) means to match the pattern in the regexp to the operand on the left. Other kinds of matches are also possible,
A > B     A greater than B
A < B     A less than B
A !~ B    A does not match B
A == B    A is equal to B
A != B    A is not equal to B
A >= B    A is greater than or equal to B
A <= B    A is less than or equal to B
These are called comparison operators since they compare the operands A and B to see if they match the criterion specified in the operator. These comparisons can work with numbers or with ASCII characters to see the sort order of something. For instance "A" > "B" is not true because the letter A has a lower ASCII code value than the letter B. But awk goes further still.
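The difference between comparing numbers and comparing characters is easy to trip over, so here is a small sketch. The sample data and the function name compare_demo are invented; the point is that a quoted constant makes awk compare character by character.

```shell
compare_demo() {
    printf '%s\n' 'apple 9' 'zebra 10' 'mango 100' |
    awk '
    $2 > 50   { print "number:", $1 }   # numeric test: only 100 passes
    $2 > "50" { print "string:", $1 }   # character test: "9" sorts after "50"
    $1 > "m"  { print "name:", $1 }     # names that sort after the letter m
    '
}
compare_demo
```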
If you want to search for words in field 4 that have a higher ASCII code value than say, the letter m, you could use, $4 > "m" (the double quotes are for safety's sake to ensure that the pattern is correctly interpreted by awk). To search a directory for files greater than 500 blocks,
ls -l | awk '$5/512 > 500 {print "File "$9" is too big"}'
Notice that you can perform math on the fly within the operands. When the order of precedence of the operations is in doubt or if you want to change the order, use the parentheses for grouping. The built-in variables NR (current record number) and the field variables can also be manipulated by mathematical operations. The math operations are the same as with the bc calculator discussed elsewhere.
The built-in math functions to awk are: atan2(y,x) cos(x) exp(x) int(x) log(x) rand() sin(x) sqrt(x) srand(x)
Rather than go into detail here, I refer you to the manual. It's enough to know that math is available and with the examples given, you shouldn't have any problems using math within awk.
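Just to show that the math really is there, here is a short sketch using a few of the functions and operators. The function name math_demo is arbitrary and the byte count is the total from the directory listing earlier.

```shell
math_demo() {
    awk 'BEGIN {
        print int(7 / 2), sqrt(144), 2 ^ 10, 17 % 5   # truncate, square root, power, remainder
        print atan2(0, -1)                            # pi, printed with 6 significant digits
        print int(95161 / 512) + 1                    # 185 full blocks plus one partial block
    }'
}
math_demo
```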
As an example of extracting numbers from a database and using them to do something useful,
awk 'BEGIN {FS=":"} {print $3}' /etc/passwd | sort -n
This will list the user IDs in the /etc/passwd file in ascending numeric order. If you are a system administrator and want to add a new user, you need to make sure that you don't use an existing user ID. The last line of the output above is the highest numbered user ID so all you have to do is assign the next highest number.
if [ $# -lt 4 ]
then
    echo "Usage: $0 login_name first_name last_name directory"
    exit 1
fi
a=`awk 'BEGIN {FS=":"} {print $3}' /etc/passwd | sort -n | tail -1`
mkdir "$4"
echo "$1::`expr $a + 1`:110:$2 $3:$4:/bin/sh" >> /etc/passwd
Of course you'd have to be root to run this and it lacks some features, but it works and it shows how to incorporate awk into a shell program. Notice that expr is used to increment the user ID number by one, here again a math operation is done on the fly demonstrating that there's always a solution to whatever problems you create for yourself.
There are times when you might want to test for more than one match in a record or possibly an alternate match. Suppose we have created a phone list that contains a name, their relationship or business and their phone number with the area code,
Plasbo Harledoo:Rich Uncle:405 376 8956
Fenemor Tudbury:Acme Novelties: 214 890 7567
Edna Freon:Librarian:405 870 9801
Exeter Tweed:Used Cars:405 321 4567
Alikar Mendacious:Banker:405 345 34501
Paul Hunbar:Owes me money:918 345 0001
Using awk we can construct searches in a variety of ways,
$1 ~ /^P.* H/ && $3 ~ /^405.*/ {print $2}
or
$3 ~ /^405.*/ || $2 ~ /[Oo]wes [Mm]e [Mm]oney/ {print $1}
In the first line the first field of the current record ($1) is compared to the regexp /^P.* H/. This would match both Plasbo Harledoo and Paul Hunbar. To further narrow the search we can AND it to another pattern, here, $3 is compared to 405 at the beginning of the field. Notice that both conditions must be true for there to be a match, this eliminates Paul Hunbar from the search. The double ampersand (&&) means that both conditions must be true.
In the second example there are again two patterns but this time they are separated by the double "or bar" (||) which means that either condition can be true. Notice that the regexp forces awk to ignore capitalization. Both examples above assume that the command, awk, has been entered and that the whole thing is enclosed in the single quotes.
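Here are both searches as complete commands, run against the phone list. This is a sketch: the function names are made up, the list is supplied by a here-document instead of a file, and the field separator is set to the colon since the list is colon-delimited.

```shell
phone_list() {
    cat <<'EOF'
Plasbo Harledoo:Rich Uncle:405 376 8956
Fenemor Tudbury:Acme Novelties: 214 890 7567
Edna Freon:Librarian:405 870 9801
Exeter Tweed:Used Cars:405 321 4567
Alikar Mendacious:Banker:405 345 34501
Paul Hunbar:Owes me money:918 345 0001
EOF
}
both_true() {      # AND: the name matches and the number starts with 405
    phone_list | awk 'BEGIN {FS=":"} $1 ~ /^P.* H/ && $3 ~ /^405/ {print $2}'
}
either_true() {    # OR: the number starts with 405 or the person owes me money
    phone_list | awk 'BEGIN {FS=":"} $3 ~ /^405/ || $2 ~ /[Oo]wes [Mm]e [Mm]oney/ {print $1}'
}
both_true
either_true
```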
You can also search a range of records within a file using regexp as the range criteria,
awk 'BEGIN {FS=":"} /^P.* H/, /^E/ {print $2}' phone_list
Here we again set the field separator to the colon character (:) and then search the file phone_list for every line between the regexp /^P.* H/ and the regexp /^E/. The two regexp are delimited by a comma. There is a built-in awk variable called FNR that contains the number of the current record within the current input file so we can also specify a range as,
FNR == 3, FNR == 12 {print $2}
This will print field 2 of every record between the range of record 3 to record 12 inclusive. If you want to include the current file name in the output, use the awk variable FILENAME,
for i in phone_list members_list starboard_list
do
    awk 'BEGIN {FS=":"} FNR == 1, /IRS [Aa]udit [Dd]ay/ {print FILENAME, $3}' $i
done
This time we used both the awk variable FNR and a regexp to find a range of records and then used the awk variable FILENAME to display the name of the file where the match was found. Once again awk is invoked from within a shell script (this could also be entered from the command line).
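If you hand awk several files in one command, FNR starts over at 1 for each file while NR keeps counting, which is what makes FNR handy for ranges. The sketch below assumes the mktemp -d command is available (it is on most systems, but it is an assumption) and uses two throwaway files whose names and contents are invented.

```shell
range_demo() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' 'one' 'two' 'three' 'four' > "$dir/a_list"
    printf '%s\n' 'uno' 'dos' 'tres' > "$dir/b_list"
    # Print records 2 through 3 of each file, with the file name and both counters.
    ( cd "$dir" && awk 'FNR == 2, FNR == 3 { print FILENAME, FNR, NR, $0 }' a_list b_list )
    rm -r "$dir"
}
range_demo
```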
The power of awk is that it is a complete programming language in its own right yet can be combined with the shell programming capabilities to create very sophisticated and very efficient programs. If the features of awk presented so far were all you ever used, you would have enough power to create the kinds of programs that others would find indispensable (you'd be an Ace). But there's more.
The printf command discussed earlier is even more capable, consider,
{printf "%-12s %4.2f\n", $1, $3*$2/1000}
Here the contents of field 1 ($1) are displayed as a string in a left-justified column at least 12 characters wide, followed by a floating point number (one with a fractional component) at least 4 characters wide with two digits after the decimal point, which takes its value from the math operation, $3*$2/1000 (the contents of field 3 times the contents of field 2 divided by 1000). From previous examples it's obvious that there are several ways to perform math. Where you put the math operation can affect the way a program operates; some arrangements may cause slower execution, be more prone to errors, or diddle variables in unexpected ways. You'll have to experiment to find the best structure for your program.
As with C, there are also shortcuts for some operations. If you are re-defining a variable using its current value as part of the definition, you can save some typing by using the following,
A += B  ->  A=A+B
A -= B  ->  A=A-B
A /= B  ->  A=A/B
A *= B  ->  A=A*B
A %= B  ->  A=A%B
A ^= B  ->  A=A^B
A++     ->  Do commands then A=A+1
++A     ->  A=A+1 Then do commands
The assignments here are pretty terse, but they save a lot of typing and can make your awk script more readable (if you're used to reading awk scripts). There is also the advantage that for C programmers, it seems very natural.
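A short sketch makes the shortcuts concrete, especially the difference between A++ and ++A. The function and variable names are arbitrary.

```shell
assign_demo() {
    awk 'BEGIN {
        a = 10
        a += 5; a -= 3; a *= 2; a /= 4   # a becomes 15, 12, 24, then 6
        a %= 4; a ^= 3                   # a becomes 2, then 8
        print a
        b = a++; print b, a              # the old value is used first, then a grows
        c = ++a; print c, a              # a grows first, then the new value is used
    }'
}
assign_demo
```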
As demonstrated earlier, any arithmetic operations can be performed on the fly as required and don't require the use of variables. Assuming that you will use awk within shell scripts, you will also have all the features of expr, dc, bc and test (though you may not use them much after tinkering with awk for a while). Though awk is touted as a text processing/formatting tool, it can do much more and should be viewed as a programming language.
Anyway, the things you can do with strings can add a lot of power to your shell scripts also so we'll stick with that for now. When using regular expressions, remember that they must be enclosed with the slash characters (/), string constants (a literal string) must be enclosed with the double quotes ("). To ensure that a string is interpreted as a string by awk, concatenate the null string, stringvar"". To ensure that a numeric variable is properly interpreted, add zero to it, numvar+0. These little idiosyncrasies may seem awkward (pun intended), but they reduce the time spent debugging the script.
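Here is a small sketch of why those two tricks matter. The same two values compare one way as numbers and the other way as strings; the function and variable names are invented for the example.

```shell
coerce_demo() {
    awk 'BEGIN {
        n = 10; m = 9
        as_numbers = (n < m)               # 10 < 9 is false, so 0
        as_strings = ((n "") < (m ""))     # "10" < "9" is true, since "1" sorts before "9"
        print as_numbers, as_strings
        s = "42abc"
        print s + 0                        # the leading digits are used, the rest is ignored
    }'
}
coerce_demo
```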
Strings can be concatenated by just putting them next to each other,
{print NR": "$0}
Here the record number of the current input line has a colon character (:) added to it followed by a space (notice the double quotes) and then the entire record ($0) is displayed. Since the three variables are concatenated, they will display as a single line. If you insert commas (,) between variables, they will be separated by a single space,
{print $2,$3}
Fields 2 and 3 will be separated by a space. Without the comma, they display as a single string (they are concatenated). You can also use the formatting characters, \n, \t, \b, etc. to insert newlines, tabs and so on. It's also possible to pre-format strings using the function, sprintf(). This function works just like printf except that it doesn't display. The value of this is that you can define a variable a certain way and then use that variable elsewhere.
out=sprintf("%-10s %6.2f", $2, $3)
This will create a left justified string at least 10 characters wide for field 2 and, for field 3, a floating point number at least 6 characters wide in total with 2 digits to the right of the decimal point, and put the formatted result into the variable, out.
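Because sprintf() hands back a string instead of displaying it, you can do things with the result before it is printed. In this sketch (sample record and function name made up) the formatted string is measured with length() as well as printed, and the bars are there only to make the column edges visible.

```shell
sprintf_demo() {
    printf '%s\n' 'id widget 3.14159' |
    awk '{
        out = sprintf("%-10s|%6.2f|", $2, $3)   # nothing is displayed yet
        print out
        print length(out)                       # 10 + 1 + 6 + 1 characters
    }'
}
sprintf_demo
```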
Flow control is also possible with awk using the same tools you've used in creating shell scripts and C programs,
if (exp) statement else statement
Here the exp is an expression to be evaluated and if it evaluates true (1==1 for instance) then the statement is processed. If the expression is not true (1==2), the else portion of the command is processed. Since the statement after else can itself be another if, an else if chain can be used to test many different expressions.
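A sketch of an if with an else if chain, sorting some numbers into size classes. The thresholds, the sample numbers and the function name size_demo are all made up.

```shell
size_demo() {
    printf '%s\n' 50 200 900 |
    awk '{
        if ($1 < 100)
            size = "small"
        else if ($1 < 500)
            size = "medium"
        else
            size = "large"
        print $1, size
    }'
}
size_demo
```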
The while and do loops are supported as is the for loop (using the C syntax). Since these are used within awk itself instead of the shell the syntax is somewhat different,
for ( ... ) statement
while ( ... ) statement
do statement while ( ... )
if ( ... ) statement
Notice that the condition to be evaluated is enclosed in the parentheses and that the whole thing is on the same line. The commands to be processed when the condition is satisfied are enclosed in the curly braces,
{ if ($1+0 < 200) { x=sprintf("%-20s %8.2d", $2, $3); print x, "Numbers" } }
This is a nonsense awk command line but the idea is to show the form of the command rather than agonize over its content. Most of these features are well known through the use of the shell so there's little to be gained by going into all the gory details. The only way to learn awk is to use it (ever heard that before?) and since you will make mistakes, it's to your benefit to experiment long and often until it starts making sense. The main difficulty is the syntax and the placement of curly braces, quotes, commas and similar stuff. Once that becomes comprehensible, the rest is really pretty straightforward.
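For something to experiment with, here is a sketch that uses all three kinds of loop on one input line. The input line and the function name loop_demo are invented, and the loops don't do anything profound.

```shell
loop_demo() {
    printf '%s\n' 'alpha beta gamma' |
    awk '{
        rev = $NF
        for (i = NF - 1; i >= 1; i--)      # for: walk the fields backwards
            rev = rev " " $i
        print rev
        i = 1; total = 0
        while (i <= NF) {                  # while: add up the field lengths
            total += length($i)
            i++
        }
        print total
        do { i-- } while (i > 1)           # do: the body always runs at least once
        print i
    }'
}
loop_demo
```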
There are also arrays available with awk in the usual form, x[i]. Where x is the array name and i is the element within the array,
{ x[NR]=$0 } END { for (i=1;i<=NR;i++) {print x[i]} }
Again a useless example whose only purpose is to show the form of the array assignments. I'm assuming here that you've become familiar with shell programming and have at least glanced at C source code since awk assumes as much. Not to belabor the point, but all of this requires practice and ideally, some kind of awk reference handy. To see an example of using awk to format a web page (a pretty good use of awk's capability), see this
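A slightly less useless array sketch: every record is stored in the array x under its record number, and the END statement reads the elements back in reverse order. The input lines and the function name array_demo are made up.

```shell
array_demo() {
    printf '%s\n' first second third |
    awk '
    { x[NR] = $0 }                                    # store each record under its record number
    END { for (i = NR; i >= 1; i--) print x[i] }      # read the elements back, last one first
    '
}
array_demo
```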
Other than the on-line manual entry for awk, which is somewhat scanty, it's not always apparent how to do things with awk. I suggest an excellent book about using awk from O'Reilly & Associates (who else but O'Reilly, of course) entitled "Sed & Awk" by Dale Dougherty. The book is easy to read and follow and there are plenty of examples. O'Reilly has a Web site where you can find books and stuff about all manner of interesting odds and ends. Go to http://www.ora.com
