Pages

Tuesday, March 10, 2020

Adventures of a #techcomm Geek: Match Game, 2020

It’s been a while since I did one of these, and this one goes in deep.

We’ve been using DITA at work for a year or two now, but rarely is there time to go back and take advantage of the things it offers, retrofitting those things into the documentation we brought in. (Docs we’ve created since then seem to get more thorough treatment.)

One of those things is reuse. It’s easy to reuse an entire topic in a different book—even if it was duplicated. “Hey,” says a writer, “that’s the same thing. Let’s throw away topic B and use topic A.”

DITA also supports reusing common paragraphs in two or two dozen topics, but that’s a little harder. First, you have to recognize that paragraph. Then, you have to create a new topic (a collection file), copy the paragraph into the collection file, and assign it an ID. Then you have to replace the duplicated text (in topics) with a content reference (a/k/a conref). It’s a worthwhile thing to do, because you might say the same thing slightly differently otherwise. Still, who wants to go through an entire book (or worse, set of books), looking for reuse candidates?

Of course, you can always let a computer do the tedious work… if you know how to tell it what to do.

Preparing the (searching) grounds

A while back, I wrote my first useful Python scripts. One takes a particular JSON file and reformats it as a DITA reference topic, containing a table with the relevant data from the JSON file. Another walks through a CSV file, grabbing the columns I need, and producing topics documenting a TR-069 data model. Both scripts take advantage of a vast library of pre-written code to parse their input files.

It occurred to me that, if I were to find (or create) a way to export all the text from a DITA book into a CSV file, I could use a Python script to compare each paragraph to all the others. Using fuzzy matching would help me find “close enough” matches. That was a while ago, because I bogged down on trying to get properly-formatted text out of DITA.

Last week, I got bored. Someone on the DITA-OT forum mentioned a demo plugin that translated DITA to Morse code, and the lightbulb in my head went on. If I could modify that plugin to just give text instead of -.-. .-. .- .—. then maybe I’d have what I needed.

It was an abject failure. What I need is one line per block element (paragraph, list item, etc). What I got was one line for the entire topic, sometimes with missing spaces. I put that aside, but realized that DITA-OT can also spit out Markdown. If I could convert Markdown to plain text, I’d be ready to rock!

So you want to convert DITA to Markdown? It’s easy, at least with the newer toolkits:

dita --format=markdown_github --input=my.bookmap --args.rellinks=none

The DITA-OT output continues to be topic-oriented, writing each topic to its own file. That wasn’t quite what I wanted, or so I thought at the time. Anyway, we have Markdown. How do we get plain text out of it, with each line representing a block element?

Turns out that pandoc, the “Swiss Army knife for converting markup files,” can do it:

pandoc -t plain —wrap=none -o topic.txt topic.md

In the heat of problem-solving, I realized I didn’t need a CSV file… or Python. I could pick up Awk and hammer my nails the text into shape. My script simply inhaled whatever text files I threw at it, and put all the content into an array indexed by [FILENAME,FNR] (FNR is basically the line number of paragraphs inside the file). There was a little stray markup left, not to mention some blank lines, and a couple of Awk rules threw unneeded lines into the mythical bit bucket.

Got a (fuzzy) match?

A typical match is an all or nothing Boolean: you get true (1) if the strings are an exact match, or false (0) if they don’t.

Fuzzy matching uses the universe of floating-point numbers in between 0 and 1 to describe how close a match is. It’s up to you to decide what’s close enough, but you usually want to focus on values of 0.9 and higher. And yes, an exact match still gives you a score of 1.

Why do we want to do this? Unless content developers are really good about cutting and pasting in a pre-reuse environment, inconsistencies creep in. You might see common operations described in slightly different ways:

Click OK to close the dialog.
Click OK to close the window.

So along with flagging potential reuse candidates, a fuzzy match can help you be consistent.

Python and Perl have libraries devoted to fuzzy matching. There are several ways to do a fuzzy match, but one of the more popular is called the Levenshtein distance. There's a scary-looking formula at the link, but it boils down to single-character edits (addition, deletion, or replacement). The distance between “dialog” and “window” is 4 (d→w, a→n, l→d, g→w).

But this is an integer, not a floating-point number between 0 and 1! But that’s easy to fix. If l1 and l2 are the lengths of the two strings, and d is the calculated Levenshtein distance, then the final score is (l1+l2-d)/(l1+l2). In the above example, the score is 0.93—the strings are 93% identical.

There are websites with Levenshtein distance implementations in all sorts of different programming languages, although the ones written in Awk are not as common. But no problem. Awk is close enough to C that it’s simple to translate a short bit of code. I picked the second of these two. There was one already written in Awk, but it took a lot more time to grind through a large set of strings.

Save time, be lazy

The time it takes is important, because it adds up fast. Given n paragraphs, each paragraph has to be compared to all the rest, so you have n2 comparisons. A medium sized book, with 2400 paragraphs, means 5.76 million comparisons. Given that a fuzzy comparison takes a lot longer than a boolean one, you want to eliminate unnecessary comparisons. A few optimizations I came up with:

  • It’s easy to get to (n2-n) by not comparing a string to itself. We also do a boolean compare and skip the fuzzy match if the strings are identical. Every little bit helps. Time to analyze 2400 paragraphs: 2 hr 40 min. My late-2013 iMac averages about 600 fuzzy match comparisons per second.
  • By deleting an entry from the array after comparing it to the others, you eliminate duplicate comparisons (once you’ve compared A to B, doing B to A is a waste of time). That eliminates noise from the report, and cuts the number of comparisons required in half. Time to analyze 2400 paragraphs: 1 hr 20 min. Not bad, for something you can do with one more line of code.
  • Skip strings with big differences in length. Again, if l1 and l2 are the lengths of two strings, then the minimum Levenshtein distance is abs(l1-l2). If the best possible score doesn’t reach the “close enough” threshold, then you don't have to do the fuzzy match. Time to analyze 2400 paragraphs: 5 min 30 sec!!! Now that’s one heck of an optimization!

So we’ve gone to something you run overnight, or at least during a long lunch break, to something that can wrap up during a coffee break (eliminating 96.5% of the time needed is a win no matter how you look at it). Now if your book is all blocks of similar length, it will take longer to grind through them because there isn’t anything obvious to throw out.

Still, this is down to the realm where it's practical to build a “super book” (a book containing a collection of related books) and look for reuse across an entire product line. That might get the processing time back up into the multiple-hours realm, but you also have more reuse potential.

Going commercial

The commercial offerings have some niceties that my humble Awk script does not. For example, they claim to be able to build a collection file (a “library” of sorts, containing all the reusable paragraphs) and apply it to your documentation. That by itself might be worth the price of entry, if you end up with a lot of reuse.

They also offer a pretty Web-based interface, instead of dropping to the command line. And, they have likely implemented a computing cluster to grind through huge jobs even faster.

But hey, if you’re on a tight budget, the price is right. I’m going to make sure the employer doesn’t have a problem with me putting it up on Github before I do it. But maybe I’ve given you enough hints to get going on your own.
UPDATE 10 May 2020: The script is now available on Github.

No comments:

Post a Comment

Comments are welcome, and they don't have to be complimentary. I delete spam on sight, but that's pretty much it for moderation. Long off-topic rants or unconstructive flamage are also candidates for deletion but I haven’t seen any of that so far.

I have comment moderation on for posts over a week old, but that’s so I’ll see them.

Include your Twitter handle if you want a shout-out.