Wednesday, March 25, 2020

Life and Work in the Time of Pandemic (part 2, food)

Coronavirus (image credit: CDC, public domain)
Our “online learning” was extended through Spring Break, the first full week of April here, and they should probably close out the school year doing it. I suggested that to Daughter Dearest, and her response was “SHUT UP. SHUT UP.”

But whether we pull a Hong Kong and lift restrictions early (spoiler alert: it would be a Bad Move), or keep movement tamped down to prevent further spreading, the occasional grocery trip is a necessity. Maybe less necessary is occasional pickup from restaurants, although they might argue the “less necessary” part.

Restaurants are adjusting as well as they can, offering incentives like extra points for rewards programs or free delivery. Meanwhile, the wife and I have roughed out meal planning. We’re mostly digging meat out of our freezers, although we're short on ground beef and the hoarders (aka #covidiots) grabbed it all last weekend. Bread and milk are easier to find, now… both have a finite shelf life and hoarders might have a hard time using what they have before it spoils. Ground beef should soon be available as well, because even covidiots have only a finite amount of freezer space. (But they must be using all that toilet paper as mattresses.)

Meanwhile, the school system is still running the bus routes… except instead of dropping off kids in the afternoon, they drop off lunches in the late mornings. We don’t need the extra food, but they beg us to take it because we’re at the end of the route. Today, we got burgers. The kids eat whatever sandwiches are provided, but sometimes skip the veggies + ranch dip packages (add 1/4 tsp of onion powder to the ranch containers, instant chip dip). We’re going to cook the veggies for supper, if I keep my mitts out of them. Still, it’s starting to get overwhelming—we’re covered up with fruit, milk, juice, etc. We’ll need to make sure the neighbors get some of this if it continues.

Since the kids don’t drink all the milk, I have rediscovered the joy of drinking half-pints of chocolate milk from the carton. I have not yet tried my old trick of jabbing the side with a pencil, making a hole of the exact diameter of a straw; I could pressurize the carton and pump the milk into my mouth. One of my better memories of elementary school.

Fortunately, Charlie is expanding his protein sources, although he still strongly prefers his latest adoptions to be breaded and fried. Chicken nuggets (especially Chick-Fil-A) and fish sticks are winners. We thawed and baked a slab of salmon I had kicking around in the freezer earlier this week; the adults ate that, and Mason and Charlie gobbled several helpings of sticks. We got these corn dog bites, and Charlie ate half of one before he realized it wasn’t chicken, then ate the breading and left the mystery meat. Sometimes, it’s hard to tell the difference between picky and intelligent.

We have a few days of not-rain this week (yay!) so I’ll likely pull some ribs out of the freezer and smoke/grill them.

Monday, March 16, 2020

Life and Work in the Time of Pandemic (part 1)

Coronavirus (image credit: CDC, public domain)
Local schools are on “online learning” this week (and I’m sure that will be extended). I started working from home last week, and we all got a “recommendation” from a manager to work at home through March 27 (again, it’ll probably be extended). Oddly enough, Charlie’s daycare (a Petri dish if I ever saw one) is remaining open. His therapy office is also open, but they’ve moved the waiting room out to the parking lot… in other words, wait in your car until your therapist comes out. Our little church has moved its sermons online (unfortunately, to the Book of Face, which I don’t use) for the duration.

You’re probably seeing the same things in your locale, and I’m not here to provide dry statistics. I’m going to journal the mostly self-isolated life in a rural area, in case someone else finds it interesting now or later on.

Long-time readers might remember FAR Future, a long blog-novel I wrote starting all the way back in 2007. Although the chronic energy shortages that the whole story is built around have yet to materialize, some of the things I wrote about have had eerie parallels in real life. One episode (written in Sep 2008) discusses a serious flu pandemic, with a 3%-5% mortality rate, breaking out in… December 2019. We’re not laboring under a junta, but the administration in real-life 2020 is every bit as incompetent as was the junta in FAR Future. The difference is, the fictional flu was like the 1918 pandemic, hitting young and healthy adults the hardest. This one goes (mostly) after the elderly. We also have the Internet, a fine way to find information (and plenty of misinformation) about what’s happening.

Saturday was “run errands” day, so I combined the trips to limit time out and contacts. Charlie had horse therapy, and we needed both some groceries and a UPS battery. Other than that, we were in all weekend. Wife and I keep talking about meal planning, so we can order pickup from the local Kroger, but haven’t quite done it just yet. Today, she’s out with Charlie for therapy. I’m not sure he’ll go to daycare today. If he does, I might run to the office to grab a laptop dock and some notes, then pick him up on the way back.

These first few days of (mostly) shelter-in-place are very strange. It’s like a winter storm, except that all the utilities are working and the roads are even more clear than usual. Our crappy DSL is crappier than usual, what with all the school kids with Internet doing their work online (not to mention people like me, trying to work). Structure is going to be important… along those lines, Daughter Dearest forwarded me a suggested schedule for families. Modify as needed, but a little structure will make everyone's day go better:


Meanwhile, I have a minor cold. I’ve never been so glad to sneeze.

Tuesday, March 10, 2020

Adventures of a #techcomm Geek: Match Game, 2020

It’s been a while since I did one of these, and this one goes in deep.

We’ve been using DITA at work for a year or two now, but rarely is there time to go back and take advantage of the things it offers, retrofitting those things into the documentation we brought in. (Docs we’ve created since then seem to get more thorough treatment.)

One of those things is reuse. It’s easy to reuse an entire topic in a different book—even if it was duplicated. “Hey,” says a writer, “that’s the same thing. Let’s throw away topic B and use topic A.”

DITA also supports reusing common paragraphs in two or two dozen topics, but that’s a little harder. First, you have to recognize that paragraph. Then, you have to create a new topic (a collection file), copy the paragraph into the collection file, and assign it an ID. Then you have to replace the duplicated text (in topics) with a content reference (a/k/a conref). It’s a worthwhile thing to do, because you might say the same thing slightly differently otherwise. Still, who wants to go through an entire book (or worse, set of books), looking for reuse candidates?

Of course, you can always let a computer do the tedious work… if you know how to tell it what to do.

Preparing the (searching) grounds

A while back, I wrote my first useful Python scripts. One takes a particular JSON file and reformats it as a DITA reference topic, containing a table with the relevant data from the JSON file. Another walks through a CSV file, grabbing the columns I need, and producing topics documenting a TR-069 data model. Both scripts take advantage of a vast library of pre-written code to parse their input files.

It occurred to me that, if I were to find (or create) a way to export all the text from a DITA book into a CSV file, I could use a Python script to compare each paragraph to all the others. Using fuzzy matching would help me find “close enough” matches. That was a while ago, because I bogged down on trying to get properly-formatted text out of DITA.

Last week, I got bored. Someone on the DITA-OT forum mentioned a demo plugin that translated DITA to Morse code, and the lightbulb in my head went on. If I could modify that plugin to just give text instead of -.-. .-. .- .--. then maybe I’d have what I needed.

It was an abject failure. What I need is one line per block element (paragraph, list item, etc). What I got was one line for the entire topic, sometimes with missing spaces. I put that aside, but realized that DITA-OT can also spit out Markdown. If I could convert Markdown to plain text, I’d be ready to rock!

So you want to convert DITA to Markdown? It’s easy, at least with the newer toolkits:

dita --format=markdown_github --input=my.bookmap --args.rellinks=none

The DITA-OT output continues to be topic-oriented, writing each topic to its own file. That wasn’t quite what I wanted, or so I thought at the time. Anyway, we have Markdown. How do we get plain text out of it, with each line representing a block element?

Turns out that pandoc, the “Swiss Army knife for converting markup files,” can do it:

pandoc -t plain --wrap=none -o topic.txt topic.md

In the heat of problem-solving, I realized I didn’t need a CSV file… or Python. I could pick up Awk and hammer my nails (er, the text) into shape. My script simply inhaled whatever text files I threw at it, and put all the content into an array indexed by [FILENAME,FNR] (FNR is basically the line number of paragraphs inside the file). There was a little stray markup left, not to mention some blank lines, and a couple of Awk rules threw unneeded lines into the mythical bit bucket.
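For the Python-inclined, the loading step might look something like the sketch below. This is not the Awk script itself, just the same idea: read every line of every text file and key it by file name and line number, the way the [FILENAME,FNR] index does. The name load_paragraphs is made up for illustration, and the sketch only drops blank lines; whatever rules tossed the stray markup are not shown.

```python
def load_paragraphs(paths):
    """Map (filename, line_number) to the text of each non-blank line.

    Assumes each line of the input is one block element, which is what
    pandoc's plain output with wrapping turned off is meant to give you.
    """
    paragraphs = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            # Start at 1 so the numbers line up with Awk's FNR.
            for line_number, line in enumerate(handle, start=1):
                text = line.strip()
                if not text:
                    continue  # blank lines go to the bit bucket
                paragraphs[(str(path), line_number)] = text
    return paragraphs
```

Calling load_paragraphs(["topic1.txt", "topic2.txt"]) (hypothetical file names) returns one dictionary covering both files.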

Got a (fuzzy) match?

A typical match is an all-or-nothing Boolean: you get true (1) if the strings are an exact match, or false (0) if they aren’t.

Fuzzy matching uses the universe of floating-point numbers in between 0 and 1 to describe how close a match is. It’s up to you to decide what’s close enough, but you usually want to focus on values of 0.9 and higher. And yes, an exact match still gives you a score of 1.

Why do we want to do this? Unless content developers are really good about cutting and pasting in a pre-reuse environment, inconsistencies creep in. You might see common operations described in slightly different ways:

Click OK to close the dialog.
Click OK to close the window.

So along with flagging potential reuse candidates, a fuzzy match can help you be consistent.

Python and Perl have libraries devoted to fuzzy matching. There are several ways to do a fuzzy match, but one of the more popular is called the Levenshtein distance. There's a scary-looking formula at the link, but it boils down to single-character edits (addition, deletion, or replacement). The distance between “dialog” and “window” is 4 (d→w, a→n, l→d, g→w).
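If the formula looks scarier than it should, here is a minimal Python sketch of the usual dynamic-programming approach. It is an illustration only, not the Awk translation described in this post, and the function name levenshtein is just a convenient label.

```python
def levenshtein(a, b):
    """Return the number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] is the distance from the empty string to b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # addition
                               previous[j - 1] + cost))  # replacement
        previous = current
    return previous[-1]


print(levenshtein("dialog", "window"))  # 4
```

It only keeps two rows of the table in memory at a time, which matters when the strings are whole paragraphs.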

But the distance is an integer, not a floating-point number between 0 and 1! That’s easy to fix. If l1 and l2 are the lengths of the two strings, and d is the calculated Levenshtein distance, then the final score is (l1+l2-d)/(l1+l2). For the two “Click OK” sentences above (29 characters each, with a distance of 4), the score is 0.93: the strings are 93% identical.
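Here is that arithmetic as a tiny Python sketch. The function name similarity is invented for illustration, and the distance of 4 is plugged in by hand from the dialog/window example rather than calculated.

```python
def similarity(len1, len2, distance):
    """Turn an edit distance into a score between 0 and 1."""
    total = len1 + len2
    if total == 0:
        return 1.0  # two empty strings are identical
    return (total - distance) / total


a = "Click OK to close the dialog."
b = "Click OK to close the window."
print(round(similarity(len(a), len(b), 4), 2))  # 0.93
```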

There are websites with Levenshtein distance implementations in all sorts of different programming languages, although the ones written in Awk are not as common. But no problem. Awk is close enough to C that it’s simple to translate a short bit of code. I picked the second of these two. There was one already written in Awk, but it took a lot more time to grind through a large set of strings.

Save time, be lazy

The time it takes is important, because it adds up fast. Given n paragraphs, each paragraph has to be compared to all the rest, so you have n² comparisons. A medium-sized book, with 2400 paragraphs, means 5.76 million comparisons. Given that a fuzzy comparison takes a lot longer than a boolean one, you want to eliminate unnecessary comparisons. A few optimizations I came up with:

  • It’s easy to get to (n²-n) by not comparing a string to itself. We also do a boolean compare and skip the fuzzy match if the strings are identical. Every little bit helps. Time to analyze 2400 paragraphs: 2 hr 40 min. My late-2013 iMac averages about 600 fuzzy match comparisons per second.
  • By deleting an entry from the array after comparing it to the others, you eliminate duplicate comparisons (once you’ve compared A to B, doing B to A is a waste of time). That eliminates noise from the report, and cuts the number of comparisons required in half. Time to analyze 2400 paragraphs: 1 hr 20 min. Not bad, for something you can do with one more line of code.
  • Skip strings with big differences in length. Again, if l1 and l2 are the lengths of two strings, then the minimum Levenshtein distance is abs(l1-l2). If the best possible score doesn’t reach the “close enough” threshold, then you don't have to do the fuzzy match. Time to analyze 2400 paragraphs: 5 min 30 sec!!! Now that’s one heck of an optimization!
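
The three shortcuts above can be sketched in Python (again, an illustration and not the Awk script itself). The names best_possible_score and sort_pairs are invented here, and the 0.9 default is just the “close enough” value suggested earlier. The sketch only decides which pairs deserve a fuzzy match; it does not run one.

```python
from itertools import combinations


def best_possible_score(len1, len2):
    """Upper bound on the score, since the distance is at least abs(len1 - len2)."""
    total = len1 + len2
    if total == 0:
        return 1.0
    return (total - abs(len1 - len2)) / total


def sort_pairs(paragraphs, threshold=0.9):
    """Split paragraph pairs into exact matches and fuzzy-match candidates.

    paragraphs maps a key such as (filename, line_number) to its text.
    Pairs that cannot possibly reach the threshold are dropped.
    """
    exact, candidates = [], []
    # combinations() visits each unordered pair once and never pairs an
    # entry with itself, which covers the first two shortcuts.
    for (key_a, text_a), (key_b, text_b) in combinations(paragraphs.items(), 2):
        if text_a == text_b:
            exact.append((key_a, key_b))  # score is 1, no fuzzy match needed
        elif best_possible_score(len(text_a), len(text_b)) >= threshold:
            candidates.append((key_a, key_b))  # worth a real comparison
    return exact, candidates
```

Only the pairs in candidates need the expensive comparison; a two-character paragraph like “OK” never gets matched against a full sentence.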

So we’ve gone from something you run overnight, or at least during a long lunch break, to something that can wrap up during a coffee break (eliminating 96.5% of the time needed is a win no matter how you look at it). Now, if your book is all blocks of similar length, it will take longer to grind through them because there isn’t anything obvious to throw out.
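
A quick back-of-the-envelope check of those times, assuming the 600 comparisons per second quoted above holds steady for the whole run (the helper name estimated_minutes is made up):

```python
RATE = 600  # fuzzy comparisons per second, the figure quoted above


def estimated_minutes(comparisons, rate=RATE):
    """Rough run time in minutes at a steady comparison rate."""
    return comparisons / rate / 60


n = 2400
skip_self = n * n - n          # every paragraph against every other one
unique_pairs = skip_self // 2  # A-to-B only, never B-to-A as well

print(round(estimated_minutes(skip_self)))     # 160 minutes, or 2 hr 40 min
print(round(estimated_minutes(unique_pairs)))  # 80 minutes, or 1 hr 20 min

measured = 5.5  # minutes, the time reported with the length check added
saved = 1 - measured / estimated_minutes(skip_self)
print(f"{saved:.2%}")  # a little over 96.5%
```

The first two numbers fall straight out of the comparison counts. The last one depends on the text itself, so it can only be measured, not predicted.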

Still, this is down to the realm where it's practical to build a “super book” (a book containing a collection of related books) and look for reuse across an entire product line. That might get the processing time back up into the multiple-hours realm, but you also have more reuse potential.

Going commercial

The commercial offerings have some niceties that my humble Awk script does not. For example, they claim to be able to build a collection file (a “library” of sorts, containing all the reusable paragraphs) and apply it to your documentation. That by itself might be worth the price of entry, if you end up with a lot of reuse.

They also offer a pretty Web-based interface, instead of dropping to the command line. And, they have likely implemented a computing cluster to grind through huge jobs even faster.

But hey, if you’re on a tight budget, the price is right. I’m going to make sure the employer doesn’t have a problem with me putting it up on GitHub before I do it. But maybe I’ve given you enough hints to get going on your own.
UPDATE 10 May 2020: The script is now available on GitHub.

Sunday, March 08, 2020

Mooooving out

I called him Buncha (as in Buncha Bull). Mason called him Bully. The young woman who helps the wife out with farm stuff called him Carl (Carl?).

Whatever you call him, the time came for him to moooove back to the pasture. Wife put a halter on him, handed me a lead (basically a really heavy-duty leash) and told Mason and me to walk him down to the pasture.

So down the garden path we went. We brought his milk bottle along, just in case. The calf was both excited and nervous about this Really New Thing, and spent the entire walk alternately planting his hooves or trying to frisk ahead. But despite him weighing well over 100 pounds, he didn’t pull as hard as a 30-pound Aussie Shepherd.

The pasture is calling
and I must go.
We got to the pasture. I unclipped the lead and let him go, then slipped through the gate. He stood there, looked around… then stepped through the barbed-wire fence like it wasn’t even there. As Mik’s Aunt Morcati said, cattle are born knowing all profanity and will gladly teach it to anyone nearby.

So I waved the milk bottle at him, he ambled over and chowed down, and I clipped the lead back on him. Now what? I thought. I can’t stand here with him forever.

Finally, I decided to take him further into the pasture. Riding a milk high, he was glad to follow my lead (the one attached to his face via the halter). About 50 yards in, I unclipped him and backed away to see what happened next. He munched on a big clump of grass, then looked up and got this look about him. It was almost like it dawned on him: This is my place, and that’s my herd. The pasture is calling, and I must go. He walked up the hill, found some other calves around his age, and they included him right away.

It’s not like he’s completely gone… he still has the halter, so we can spot him in the herd and bring him an occasional bottle. Like this:


He still comes trotting over when he sees us… or at least sees his bottle. So we don’t have to miss him. Especially since another calf is now in the pen. It never ends at FAR Manor.