Total bytes in 37 Files: 95161
Total Blocks: 185.861
The heading defined in the BEGIN statement comes first, then the results
of the printf command and finally the results of the print command in the
END statement. Surprisingly, this little awk program runs pretty fast, which
suggests that awk processes text quite efficiently. While this may not
be the best use of awk, it should be sufficient to show the basics and,
more importantly, that it's worth experimenting with.
Once you've used awk to create some simple utilities, you've probably discovered
that, while it is picky about syntax, it's not that hard to use. If all
you used it for was quick database searches or command output formatting,
you're ahead of most UNIX/Linux users. But awk, like the shell, is full of features
that most people never use and fewer still even know exist.
As mentioned before, awk recognizes regular expressions (regexp) in all
their variety (see the manual entry for ed for a complete description of
regexp). In typical UNIX/Linux fashion, a regular expression is defined within
the forward slashes (/) and can be very complex,
/^[Bb].[a-z].*sh$/
Here's an example of a simple regexp. The caret (^) means that the regexp
begins at the beginning of a line. If omitted, the regexp can be anywhere
on the line. Next the square brackets ([]) enclose members of a set of characters,
in this case either an upper case B or lower case b. Each set of square
brackets represents a single character.
Next is the dot, which matches exactly one character of any kind, and then
one character from the set of all lower case characters, [a-z]. The next
dot also matches any single character, but it is followed by the asterisk (*),
which means "zero or more of the preceding item". Together, .* matches any
run of characters, including none at all (how many characters are matched
depends on how long the line is). Next the characters "sh" are matched as
themselves (they have no special meaning so they are just what they appear
to be). Finally comes the dollar sign ($) which means the end of the current
line. Since the dollar sign immediately follows the characters, sh, the
current line must end with those characters.
The regexp above then reads, "Find a line that begins
with either B or b, followed by exactly one character (alphabetic, numeric
or punctuation), followed by one lower case alphabetic character, followed
by any number of other characters (possibly none), and ends
with the characters, sh".
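To check that reading of the regexp, here is a small sketch that feeds awk a handful of made-up lines (the sample words are hypothetical, not from any real file) and prints only those that match. A pattern with no action simply prints the matching line.

```shell
# Hypothetical sample lines; only the first two satisfy the regexp.
matches=$(printf '%s\n' 'Black bash' 'bourne-sh' 'bash' 'crash' 'B29sh' |
    awk '/^[Bb].[a-z].*sh$/')
echo "$matches"
```

Note that "bash" is rejected: the regexp needs three single characters before the final sh, so the shortest possible match is five characters long.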
You can of course be even more specific or much less; the point is that
a regexp is a shortcut and awk can use them. To prove the point we'll search
the /etc/passwd file using a regexp and print the lines (records) that contain
the character string that matches. To make it interesting, we'll also print
just a couple of fields within the record awk finds.
awk 'BEGIN {FS=":"}
$1 ~ /^[Bb].[a-z]/ {print $1,$5}' /etc/passwd
bin 0000-Admin(0000)
bil Bill Conner
Notice that both bin and bil match the regexp search criteria so awk performed
the print command on the lines containing the match. Recall that the tilde
character (~) means to match the pattern in the regexp to the operand on
the left. Other kinds of matches are also possible,
A > B     A is greater than B
A < B     A is less than B
A !~ B    A does not match B
A == B    A is equal to B
A != B    A is not equal to B
A >= B    A is greater than or equal to B
A <= B    A is less than or equal to B
These are called comparison operators since they compare the operands A and B to see if they satisfy the criterion
specified by the operator. These comparisons can work with numbers or with
strings, which are compared character by character according to their ASCII
sort order. For instance "A" > "B" is not true because A has a lower ASCII
code value than B. But awk goes further still.
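A short sketch of both kinds of comparison, using made-up input: the second field is compared as a number, while the first field is compared as a string against "m". The data and the wording of the messages are only for illustration.

```shell
# Field 2 is compared numerically, field 1 by ASCII sort order.
out=$(printf '%s\n' 'apple 5' 'pear 12' 'zebra 7' |
    awk '$2 > 6   {print $1, "has a number greater than 6"}
         $1 > "m" {print $1, "sorts after m"}')
echo "$out"
```

If 12 and 6 were compared as strings, 12 would sort first (the character 1 comes before 6), so the "pear" line shows that awk treated the field as a number.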
If you want to search for words in field 4 that have a higher ASCII code
value than say, the letter m, you could use, $4 > "m" (the double quotes
are for safety's sake to ensure that the pattern is correctly interpreted
by awk). To search a directory for files greater than 500 blocks (field 5
of the ls -l output is the size in bytes and a block is 512 bytes),
ls -l | awk '$5/512 > 500 {print "File " $9 " is too big"}'
Notice that you can perform math on the fly within the operands. When the
order of precedence of the operations is in doubt or if you want to change
the order, use the parentheses for grouping. The built-in variables NR (current
record number) and the field variables can also be manipulated by mathematical
operations. The math operations are the same as with the bc calculator discussed
elsewhere.
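Here is a minimal sketch of math on the fly, with invented numbers, showing how parentheses change the result and that NR can be used like any other value.

```shell
# Multiplication binds tighter than addition unless you add parentheses.
out=$(printf '%s\n' '2 3 4' '10 20 30' |
    awk '{print NR, $1 + $2 * $3, ($1 + $2) * $3}')
echo "$out"
```

For the first record, $1 + $2 * $3 is 2 + 12 = 14, while ($1 + $2) * $3 is 5 * 4 = 20.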
The built-in math functions to awk are:
atan2(y,x)
cos(x)
exp(x)
int(x)
log(x)
rand()
sin(x)
sqrt(x)
srand(x)
Rather than go into detail here, I refer you to the manual. It's enough
to know that math is available and with the examples given, you shouldn't
have any problems using math within awk.
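For a quick feel of the math functions, the following sketch calls a few of them from a BEGIN block, so no input file is needed. The values were chosen only because their results are easy to check by hand.

```shell
# A BEGIN block runs before any input is read, so awk works as a calculator.
out=$(awk 'BEGIN {
    print int(7 / 2)               # drops the fraction
    print sqrt(16)
    print exp(0), log(1)
    printf "%.4f\n", atan2(0, -1)  # pi to four places
}')
echo "$out"
```

rand() is left out here because its output differs from run to run; call srand() first if you want a different sequence each time.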
As an example of extracting numbers from a database and using them to do
something useful,
awk 'BEGIN {FS=":"}
{print $3}' /etc/passwd | sort -n
This will list the user IDs in the /etc/passwd file in ascending numerical
order. If you are a system administrator and want to add a new user, you
need to make sure that you don't use an existing user ID. The last line of
the output above is the highest numbered user ID, so all you have to do
is assign the next highest number.
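As an alternative sketch, awk can keep track of the highest user ID itself, without sort. The three passwd-style lines below are made up for the demonstration; on a real system you would point awk at /etc/passwd instead. The variable names max and next_uid are just illustrative.

```shell
# Hypothetical passwd-style data standing in for /etc/passwd.
next_uid=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' \
                         'bin:x:2:2:bin:/bin:/bin/sh' \
                         'bil:x:104:110:Bill Conner:/home/bil:/bin/sh' |
    awk 'BEGIN {FS=":"}
         $3 + 0 > max {max = $3 + 0}   # adding 0 forces a numeric comparison
         END {print max + 1}')
echo "$next_uid"
```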
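# Require a login name, a first and last name, and a home directory
if [ $# -lt 4 ]
then
echo "Usage: $0 login_name first_name last_name directory"
exit 1
fi
# Highest user ID currently in use
a=`awk 'BEGIN {FS=":"} {print $3}' /etc/passwd|sort -n|tail -1`
mkdir "$4"
# Fields: login, password, user ID, group ID, full name, home directory, shell
echo "$1::`expr $a + 1`:110:$2 $3:$4:/bin/sh">>/etc/passwd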
Of course you'd have to be root to run this and it lacks some features,
but it works and it shows how to incorporate awk into a shell program. Notice
that expr is used to increment the user ID number by one. Here again a math
operation is done on the fly, demonstrating that there's always a solution
to whatever problems you create for yourself.
There are times when you might want to test for more than one match in a
record or possibly an alternate match.
Suppose we have created a phone list that contains a name, their relationship
or business and their phone number with the area code,
Plasbo Harledoo:Rich Uncle:405 376 8956
Fenemor Tudbury:Acme Novelties:214 890 7567
Edna Freon:Librarian:405 870 9801
Exeter Tweed:Used Cars:405 321 4567
Alikar Mendacious:Banker:405 345 34501
Paul Hunbar:Owes me money:918 345 0001
Using awk we can construct searches in a variety of ways,
$1 ~ /^P.* H/ && $3 ~ /^405.*/ {print $2}
or
$3 ~ /^405.*/ || $2 ~ /[Oo]wes [Mm]e [Mm]oney/ {print $1}
In the first line the first field of the current record ($1) is compared
to the regexp /^P.* H/. This would match both Plasbo Harledoo and Paul Hunbar.
To further narrow the search we can AND it to another pattern; here, $3
is compared to 405 at the beginning of the field. Notice that both conditions
must be true for there to be a match, which eliminates Paul Hunbar from the
search. The double ampersand (&&) means that both conditions must be true.
In the second example there are again two patterns but this time they are
separated by the double "or bar" (||) which means that either condition
can be true. Notice that the bracketed sets in the regexp let it match either
an upper or lower case letter, so capitalization is effectively ignored.
Both examples above assume that the command, awk, has been entered, that
the field separator has been set to the colon and that the whole thing is
enclosed in the single quotes.
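To see the two searches run from start to finish, here is a sketch that keeps the phone list in a shell variable (one record per line) and pipes it into awk. The variable names phone_list, both and either are only for illustration.

```shell
phone_list='Plasbo Harledoo:Rich Uncle:405 376 8956
Fenemor Tudbury:Acme Novelties:214 890 7567
Edna Freon:Librarian:405 870 9801
Exeter Tweed:Used Cars:405 321 4567
Alikar Mendacious:Banker:405 345 34501
Paul Hunbar:Owes me money:918 345 0001'

# Both conditions must hold (AND).
both=$(printf '%s\n' "$phone_list" |
    awk 'BEGIN {FS=":"} $1 ~ /^P.* H/ && $3 ~ /^405/ {print $2}')

# Either condition may hold (OR).
either=$(printf '%s\n' "$phone_list" |
    awk 'BEGIN {FS=":"} $3 ~ /^405/ || $2 ~ /[Oo]wes [Mm]e [Mm]oney/ {print $1}')

echo "$both"
echo "$either"
```

The first search prints only Rich Uncle; the second prints everyone except Fenemor Tudbury, who is neither in the 405 area code nor in debt to you.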
You can also search a range of records within a file using regexp as the
range criteria,
awk 'BEGIN {FS=":"}
/^P.* H/, /^E/ {print $2}' phone_list
Here we again set the field separator to the colon character (:) and then
search the file phone_list for every line from one that matches the regexp
/^P.* H/ through the next one that matches the regexp /^E/. The two regexp
are delimited by a comma. There is a built-in awk variable called FNR that
contains the number of the current record within the current file so
we can also specify a range as,
FNR == 3, FNR == 12 {print $2}
This will print field 2 of every record between the range of record 3 to
record 12 inclusive. If you want to include the current file name in the
output, use the awk variable FILENAME,
for i in phone_list members_list starboard_list
do
awk 'BEGIN {FS=":"}
FNR == 1, /IRS [Aa]udit [Dd]ay/ {print FILENAME, $3}' "$i"
done
This time we used both the awk variable FNR and a regexp to find a range
of records and then used the awk variable FILENAME to display the name of
the file where the match was found. Once again awk is invoked from within
a shell script (this could also be entered from the command line).
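awk will also accept several file names in one invocation, which is where FNR and FILENAME earn their keep. The sketch below builds two small throwaway files (the names first_list and second_list and their contents are invented) and prints the first two records of each, along with NR for comparison.

```shell
# Build two hypothetical data files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'a:b:one' 'c:d:two' > "$dir/first_list"
printf '%s\n' 'e:f:three' 'g:h:four' 'i:j:five' > "$dir/second_list"

# FNR restarts at 1 for each file; NR keeps counting across files.
out=$(cd "$dir" && awk 'BEGIN {FS=":"}
    FNR == 1, FNR == 2 {print FILENAME, FNR, NR, $3}' first_list second_list)
echo "$out"
rm -rf "$dir"
```

Because FNR starts over with each file, the range matches again in the second file, something NR alone could not do.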
The power of awk is that it is a complete programming language in its own
right yet can be combined with the shell programming capabilities to create
very sophisticated and very efficient programs. If the features of awk presented
so far were all you ever used, you would have enough power to create the
kinds of programs that others would find indispensable (you'd be an Ace).
But there's more.
The printf command discussed earlier is even more capable. Consider,
{printf "%-12s %4.2f\n", $1, $3*$2/1000}
Here the contents of field 1 ($1) are displayed as a string, left-justified
and padded to a minimum width of 12 characters, followed by a floating point
number (one with a fractional component) with a minimum width of 4 and 2
digits after the decimal point, which takes its value from the math
operation, $3*$2/1000 (the contents of field 3 times the contents of field
2 divided by 1000). From previous examples
it's obvious that there are several ways to perform math. Where you put
the math operation can affect the way a program operates; some arrangements
may cause slower execution, be more prone to errors, or diddle variables
in unexpected ways. You'll have to experiment to find the best structure
for your program.
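Here is a small sketch of printf at work on invented data. The vertical bars are added only to make the padding visible, and the width of 8 is an arbitrary choice for the illustration.

```shell
# %-12s left-justifies in 12 columns; %8.2f right-justifies with 2 decimals.
out=$(printf '%s\n' 'widget 250 12' 'gearbox 1500 3' |
    awk '{printf "%-12s|%8.2f|\n", $1, $3 * $2 / 1000}')
echo "$out"
```

Unlike print, printf does not add a newline for you, which is why the format string ends with \n.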
As with C, there are also shortcuts for some operations. If you are re-defining
a variable using its current value as part of the definition, you can save
some typing by using the following,
A += B -> A=A+B
A -= B -> A=A-B
A /= B -> A=A/B
A *= B -> A=A*B
A %= B -> A=A%B
A ^= B -> A=A^B
A++ -> Do commands then A=A+1
++A -> A=A+1 Then do commands
The assignments here are pretty terse, but they save a lot of typing and
can make your awk script more readable (if you're used to reading awk scripts).
There is also the advantage that for C programmers, it seems very natural.
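A typical use of these shortcuts is keeping a running total and a count, as in this sketch with made-up numbers. The variable names total and count are arbitrary; awk starts them at zero without being told.

```shell
# += accumulates field 2; ++ counts the records.
out=$(printf '%s\n' 'a 10' 'b 20' 'c 12' |
    awk '{total += $2; count++} END {print count, total, total / count}')
echo "$out"
```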
As demonstrated earlier, any arithmetic operations can be performed on
the fly as required and don't require the use of variables. Assuming that
you will use awk within shell scripts, you will also have all the features
of expr, dc, bc and test (though you may not use them much after tinkering
with awk for a while). Though awk is touted as a text processing/formatting
tool, it can do much more and should be viewed as a programming language.
Anyway, the things you can do with strings can add a lot of power to your
shell scripts also so we'll stick with that for now. When using regular
expressions, remember that they must be enclosed with the slash characters
(/), string constants (a literal string) must be enclosed with the double
quotes ("). To ensure that a string is interpreted as a string by awk, concatenate
the null string, stringvar"". To ensure that a numeric variable is properly
interpreted, add zero to it, numvar+0. These little idiosyncrasies may seem
awkward (pun intended), but they reduce the time spent debugging the script.
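The difference between the two interpretations is easy to see in a sketch like the following, where the same values compare differently as strings and as numbers. The variable names are arbitrary.

```shell
out=$(awk 'BEGIN {
    a = "10"; b = "9"                       # string constants
    if (a < b)         print "as strings 10 sorts before 9"
    if (a + 0 > b + 0) print "as numbers 10 is greater than 9"
    n = 10                                  # a number this time
    if ((n "") < b)    print "forced to a string 10 sorts before 9"
}')
echo "$out"
```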
Strings can be concatenated by just putting them next to each other,
{print NR": "$0}
Here the record number of the current input line has a colon character (:)
added to it followed by a space (notice the double quotes) and then the
entire record ($0) is displayed. Since the three variables are concatenated,
they will display as a single line. If you insert commas (,) between variables,
they will be separated by a single space,
{print $2,$3}
Fields 2 and 3 will be separated by a space. Without the comma, they display
as a single string (they are concatenated). You can also use the formatting
characters, \n, \t, \b, etc. to insert newlines, tabs and so on. It's also
possible to pre-format strings using the function, sprintf(). This function
works just like printf except that it doesn't display. The value of this
is that you can define a variable a certain way and then use that variable
elsewhere.
out=sprintf("%-10s %6.2f", $2, $3)
This will create a left-justified string, padded to 10 characters, from
field 2, followed by field 3 as a floating point number at least 6 characters
wide with 2 digits to the right of the decimal point, and put the formatted
result into the variable, out.
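As a sketch of sprintf in use, assuming a floating point conversion (%f) is what's wanted for the number, the following formats two invented records into the variable out and then prints it inside square brackets so the padding shows.

```shell
# sprintf returns the formatted string instead of displaying it.
result=$(printf '%s\n' 'x Tweed 3.14159' 'x Freon 42' |
    awk '{out = sprintf("%-10s %6.2f", $2, $3); print "[" out "]"}')
echo "$result"
```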
Flow control is also possible with awk using the same tools you've used
in creating shell scripts and C programs,
if (exp) statement else statement
Here the exp is an expression to be evaluated and if it evaluates true (1==1
for instance) then the statement is processed. If the expression is not
true (1==2), the else portion of the command is processed. Because the
statement after else can itself be another if, you can chain else if tests
to check many different expressions.
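A minimal sketch of an if, else if, else chain, sorting some made-up numbers into three size classes (the cutoffs of 100 and 500 are arbitrary):

```shell
out=$(printf '%s\n' '50' '200' '999' |
    awk '{
        if ($1 < 100)       print $1, "small"
        else if ($1 < 500)  print $1, "medium"
        else                print $1, "large"
    }')
echo "$out"
```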
The while and do loops are supported as is the for loop (using the C syntax).
Since these are used within awk itself instead of the shell the syntax is
somewhat different,
for ( ... ) statement
while ( ... ) statement
do statement while ( ... )
if ( ... ) statement
Notice that the condition to be evaluated is enclosed in the parentheses
(for the do loop it comes at the end, after the statement). The commands
to be processed when the condition is satisfied are enclosed in the curly
braces,
{ if ($1+0 < 200) {
x=sprintf("%-20s %8.2f", $2, $3); print x, "Numbers"
}
}
This is a nonsense awk command line but the idea is to show the form of
the command rather than agonize over its content. Most of these features
are well known through the use of the shell so there's little to be gained
by going into all the gory details. The only way to learn awk is to use
it (ever heard that before?) and since you will make mistakes, it's to
your benefit to experiment long and often until it starts making sense.
The main difficulty is the syntax and the placement of curly braces, quotes,
commas and similar stuff. Once that becomes comprehensible, the rest is
really pretty straightforward.
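To get the placement of parentheses and braces into your fingers, here is a sketch that runs each of the three loops from a BEGIN block. The counters and messages are arbitrary.

```shell
out=$(awk 'BEGIN {
    for (i = 1; i <= 3; i++) printf "for %d\n", i
    n = 3
    while (n > 0) { printf "while %d\n", n; n-- }
    # The do loop tests its condition after the body, so it runs at least once.
    do { printf "do %d\n", n; n++ } while (n < 2)
}')
echo "$out"
```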
There are also arrays available with awk in the usual form, x[i], where
x is the array name and i is the element within the array,
{ x[NR]=$1 } END { for (i=1;i<=NR;i++) {print x[i]} }
Again a useless example whose only purpose is to show the form of the array
assignments. I'm assuming here that you've become familiar with shell programming
and have at least glanced at C source code since awk assumes as much. Not
to belabor the point, but all of this requires practice and ideally, some
kind of awk reference handy.
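For a slightly less useless sketch of arrays, the following saves every input line in an array indexed by record number and then prints the lines in reverse order from the END block. The array name line and the input words are arbitrary.

```shell
# Store each record by number, then walk the array backwards.
out=$(printf '%s\n' one two three |
    awk '{line[NR] = $0} END {for (i = NR; i >= 1; i--) print line[i]}')
echo "$out"
```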
Other than the on-line manual entry for awk, which is somewhat scanty, it's
not always apparent how to do things with awk. I suggest an excellent book
about using awk from O'Reilly & Associates (who else but O'Reilly, of course) entitled "Sed & Awk" by Dale Dougherty. The book is easy to read and follow and there are plenty
of examples. O'Reilly has a Web site where you can find books and stuff
about all manner of interesting odds and ends. Go to http://www.ora.com