There are quite a few general-purpose scripting languages that do a fine job of hunting down and acting on patterns. I’m conversant with Perl, and am learning Python; but when I need to bang something out in a hurry and XML is (mostly) not involved, Awk is how I hammer my nails. Some wags joke Awk is short for “awkward,” and it can be for those who are used to procedural programming. Anyone exposed to event-based programming—where the program or script reacts to incoming events—will find it much more familiar. Actually, “awk” is the initials of the three people who invented it: Aho, Weinberger, Kernighan (yes, that Brian Kernighan, he who also co-invented the C language and was a major player on the team that invented Unix).
Instead of events, Awk reacts to patterns. A pattern can be a plain string, a variable value, a regular expression, or combinations. Other cool things about Awk:
- Variables have whichever type is most appropriate to the current operation. For example, your script might read the string “12.345,” assign it to
x
, then you can use a statement likeprint x + 4
and you’ll get 16.345. - The language reference (at least for the original Awk) fits comfortably in a manpage, running just over 3 pages when printed. Even the 2nd edition “official” reference is only 7 pages long.
- It’s a required feature in most modern Unix specifications. That means you’ll always have some version of Awk on an operating system that has some pretensions to be “Unix-like” unless it’s a stripped, embedded system. On the other hand, even BusyBox-based systems include a version of Awk. Basically, that means Awk is everywhere except maybe your phone. Maybe.
Now what am I going to do with it? |
Okay. I told you all that to tell you this.
I’m working on something that extracts text from a PDF file, and formats it according to rules that use information such as margin, indent, and font. It requires an intermediate step that transforms the PDF into a simple (but very large) XML file, marking pages, blocks, lines, and individual characters.
“But wait a minute!” you say. “I thought Awk only worked on text files. How does it parse XML?”
Like many useful utilities first released in the 1970s, Awk has been enhanced, rewritten, re-implemented from scratch, extended, and yet it still resembles its ancestral beginnings. The GNU version of Awk (commonly referred to as gawk) has an extensions library and extensions for the most commonly-processed textual formats, including CSV (still beta) and XML. In fact, the XML extension is important enough that gawk has a special incantation called xmlgawk that automatically loads the XML extension.
The neat thing about xmlgawk, at least the default way of using it, is that it has a very Awk-like way of parsing XML files—it provides patterns for matching beginnings of elements, character data, and ends of elements (and a lot more). This is basically a SAX parser. If you don’t need to keep the entire XML file in memory, it’s a very efficient way to work with XML files.
So. In most cases, I only need the left margin of a block (paragraph). Sometimes, I need the lowest extent of that block as well, to throw out headers and footers. I need to check the difference between the first and second line (horizontally), and possibly act upon it.
In the document I used for testing, list items (like bullets) have a first line indent of –18 points. “Cool,” I said. “I can use that to flag list items.”
All well and good, except that it only worked about 10% of the time. I started inserting debugging strings, trying to figure out what was going on, and bloating the output beyond usefulness. Finally, I decided to print the actual difference between the first and second lines in a paragraph, which should have been zero. What I found told me what the problem was.
diff=1.24003e-18
In other words, the difference (between integer and floating point numbers) was so miniscule as to matter only to a computer. Thus, instead of doing a direct comparison, I took the difference and compared that to a number large enough to notice but small enough to ignore—1/10000 point.
And hey presto! The script behaved the way it should!
It’s a good thing I’ve been doing this at home—that means I can soon share it with you. Ironically, it turns out that we might need it at the workplace, which gives me a guilt-free opportunity to beta-test it.
No comments:
Post a Comment
Comments are welcome, and they don't have to be complimentary. I delete spam on sight, but that's pretty much it for moderation. Long off-topic rants or unconstructive flamage are also candidates for deletion but I haven’t seen any of that so far.
I have comment moderation on for posts over a week old, but that’s so I’ll see them.
Include your Twitter handle if you want a shout-out.