
Wednesday, November 02, 2022

Adventures of a #techcomm geek: Still I look to find a reason

A future project—I plan to start on it in earnest early next year—requires a specific set of “reset reason codes.” The codes, and their meanings, are described in MULPI (a standard specification in the industry I work in)… at least, so said the ticket the developers are using to track this piece of the melange that makes up modern communication devices. I had a PDF copy of MULPI on my work system already, and once I realized the spec said “initialization reason” where my co-workers used the more direct “reset reason,” I found the table in section C.1.3.6. Hey, it’s only 887 pages.

Needle in a haystack

Besides it being a table—I hate tables, in general—I found two glaring issues:

  • I wanted the number in the left column, not the right
  • The table did not describe the conditions that would cause the reset
The latter was a matter of searching the PDF for “initialization reason,” but first I wanted to reverse the columns in that table. Copying the table out of the PDF, and pasting into a terminal window, gave me one line per cell:
Initialization Reason
Initialization Code
POWER-ON
1
T17_LOST-SYNC
2
ALL_US_FAILED
3
BAD_DHCP_ACK
4
etc
Once again, I had a textual nail, and I reached for my hammer: awk. Redirecting the lines into a junk file (literally called junk), I decided to create tab-delimited output with the column order reversed:
awk '{getline num; print num "\t" $0; next}' junk
Mirabile dictu, it worked the first time! The getline function does a sort of look-ahead, grabbing the next line of input before the normal awk loop can get to it. So it’s pretty easy to grab two lines, print them in reverse order, and move on to the next pair.
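Run against the sample above, it produced the reversed, tab-delimited pairs:

Initialization Code	Initialization Reason
1	POWER-ON
2	T17_LOST-SYNC
3	ALL_US_FAILED
4	BAD_DHCP_ACK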

“Oh,” I then thought, “I should have made it a Markdown table.” Rather than start over, I just piped the output of the first awk script into another one:
awk '{getline num; print num "\t" $0; next}' junk | \
awk 'BEGIN {FS="\t"} NR==2 {print "|-----|-----|"} \
   {print "|", $1, "|", $2, "|"}'
Once again, first time was the charm! This “long pipeline of short awk scripts” approach does make debugging easier, especially if you don’t have to do any debugging. If you’re not familiar with awk, let me pretty up that second script to make it easier to follow:
BEGIN {
    FS="\t"
}

NR==2 {
    print "|-----|-----|"
}

{
    print "|", $1, "|", $2, "|"
}
The BEGIN block gets executed before the script reads its first line of input. In this case, it sets the field separator (the character or regular expression that breaks a line into fields) to the tab character.

The block beginning with NR==2 applies only to the second line of input (NR = “record number” or line number). In this case, it prints a Markdown separator (between table heading and body) before processing the second line… remember, awk takes each pattern/action pair in order. Since there is no next statement, it falls through to the default action.

The last block is the default action, since it has no pattern to trigger its use. It puts Markdown table cell separators on each end of the line, and between the two fields. The commas insert the output field separator (a space, by default). I could get the same result with:
print "| " $1 " | " $2 " |"
So I copied the Markdown-formatted table from the terminal window and pasted it into a text editor. From there, adding a third column was easy. Not so easy, or at least somewhat more tedious, was searching the PDF for “initialization reason” and adding the conditions triggering each reason code to the table. In some cases, there are multiple issues for a particular reason. In two cases, there was nothing in the spec at all about the reason code. Fortunately, I was familiar with one of the causes and the other was straightforward.

The Markdown has been converted to DITA, and is now waiting for the project to get started in earnest. And it won’t be bugging me to deal with it over the next couple months.

Monday, September 05, 2022

Adventures of a #techcomm Geek: Go API chapter 2, “No ReST for the weary” (edit#2)

Image source: openclipart.org

Last time I had to deal with an API, it was pulling a vendor’s documentation into our own system. Now, I have to document our own APIs.

OpenAPI, formerly known as Swagger, is quite popular in the ReST API universe these days. And why not? One source builds a website, the hooks, documentation, and everything. At least online. If you want to provide a PDF document describing the API, though, there’s a little more to it.

  • First, all those definition and summary strings need some attention. Where in the process the developers involve the technical writers makes a huge difference in effort (at least on the writer side).
  • Second, there’s more to documenting an API call than the definition and summary strings. There are path variables, query variables, examples, and the list goes on.

Fortunately, there are several utilities that extract documentation from an OpenAPI file. For my purposes, Widdershins works best—it produces a complete Markdown file—although it’s nowhere near ideal.

  • One issue was definitely not the fault of the tool. The developers told me of a dozen categories (or tags in OpenAPI parlance) that didn’t need to be documented for customers. Widdershins groups all API calls with the same tag under the same section, and that helps a lot.
  • The second issue could be either Widdershins or my personal preference. I didn’t like the order that Widdershins presented data for each method. There were some other minor issues as well.

I had a big wad of text that was my nail, and awk once again was my hammer. I started pounding. I did consider using a YAML parser for a brief time, but realized Widdershins did a lot of busy work for me. It actually does a pretty good job of building a Markdown document, describing all the method calls and schemas. If only there were a way to fix the presentation order, it would be perfect.

My first goal was to reshuffle the internal sections of each method to get them in the order I wanted. Deleting the unneeded groups, I reasoned, was a one-time thing that I could deal with myself.

My script worked the first time, but scrambled a bunch of things on the second attempt. Worse, doing the search-and-delete on those unneeded sections took more time and care than I’d anticipated. I needed a re-think.

Fortunately, a Computerphile interview with Brian Kernighan (the “K” in awk) came around, right when I needed it. It gave me… if not the key to my problem, a map to the key. In a nutshell, Dr. Kernighan advocates against large, monolithic awk scripts. His 1986 paper Tools for Printing Indexes describes his approach as:

…a long pipeline of short awk scripts. This structure makes the programs easy to adapt or augment to meet the special requirements that arise in many indexes.  

This approach can also be easier to debug, as you can replace the pipeline with temporary files and verify that the output of one stage is correct before feeding it to the next stage. Each stage refines the input further.

So I split the monolithic script into two medium-size scripts:

  • Stage 1a (weed) fixes headings, weeds out unneeded HTML markup (mostly <a name="x"/> tags), and gets rid of those unneeded sections. Having less cleanup already makes this approach worth the effort.
  • Stage 1b (shuffle) re-orders the remaining method descriptions. I learned that the input order is important for making this work; so if future versions of Widdershins move things around, it could break the script and I would need to fix it again.

It takes maybe a second to process the raw Markdown through both stages under Cygwin, which is noticeably slower than a shell under a native POSIX system. I expect my 8-year-old iMac would be nearly instantaneous.
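Chained together, the two stages look something like this (a sketch; the script and file names here are stand-ins, not my actual ones):

awk -f weed.awk raw.md | awk -f shuffle.awk > methods.md

# while debugging, swap the pipe for a temporary file, so each stage's
# output can be verified before it feeds the next:
awk -f weed.awk raw.md > stage1a.md
awk -f shuffle.awk stage1a.md > methods.md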

Now that I’ve cracked the code, so to speak, more stages are coming.

  • Stage 2 throws out the schema definitions that none of the remaining methods refer to. A pair of scripts identifies which schemas need to be kept, then weeds out the others.
  • Stage 3 fixes cross-references (mostly to schemas). The monolithic Markdown file uses a URL of the form #schemaxyz. Since the ultimate goal is to split the single Markdown file into topics, those URLs need to point to the eventual file name instead. A trio of scripts create file names that correspond to the URLs, replace the original #xyz name with the file name, then shuffle the schema’s description to the top of the topic.

These stages take another second to process… so going from 13,000 lines of YAML to a monolithic Markdown file takes about two seconds. The mdsplit script, which splits the methods and schemas into topics and builds a Lightweight DITA (LwDITA) bookmap, takes less than ten seconds to complete. So I’m now at the point where it’s easier to regenerate the entire document if I run into a scripting issue, instead of pushing through the problem. Uplifting the LwDITA to full DITA takes maybe a minute. After the uplift, another script fixes the extensions, changing .md to .dita, and fixing the cross-references.
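That extension fix-up amounts to a rename plus a search-and-replace. A rough sketch (the file layout is an assumption, and the sed syntax is GNU, which is what Cygwin provides):

# rename each topic file from .md to .dita
for f in topics/*.md; do
    mv "$f" "${f%.md}.dita"
done
# re-point the cross-references at the renamed files
sed -i 's/\.md"/.dita"/g' topics/*.dita *.ditamap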

At this point, I can focus on adding value: adding metadata, grouping related schema definitions, and the like. If I need to regenerate this thing again, I need only run the shell scripts that conduct the Geek Chorus.

Going forward, I’ll need to be able to compare versions, so I can replace topics with actual content changes, or add new topics. At that point, I could hand things off and intervene only when the input changes enough to make a difference. Or, we might decide to ditch the PDF entirely, and that would make things far easier on everyone.

Techcomm geeks never worry about automating themselves out of a job, by the way. There’s always a new presentation format, or new source document formats, or many new ways to streamline workflows. Handing off a system is a triumph; it means we have more time to focus on the next thing.

Edited 5 Sep: I didn’t realize I’d pasted this out of Logseq before going through it and fixing some things. Now we’re up to date.

Edited 7 Sep: All the tweaks have been made, and I now have a turnkey system.

Tuesday, December 07, 2021

Computer-Aided Weeding

A couple weeks ago, I finally decided to start pulling all the notes I’d saved up in Evernote and Google Keep into Logseq. I started with Evernote, just because.

First, I had to update the Evernote app on my iMac, so I could actually access my stuff. That should tell you how long it’s been since I actively used it.

After exporting, I used a utility called Yarle to convert the notes in each notebook to Markdown.

Now the hard part: deciding what I wanted to keep, and what to toss. The even harder part: cleaning up the sloppy mess that was most of those individual pages—there were over 400 of them. Cleaning them up in Logseq was do-able, but slow. Lots of repeated stuff. This wasn’t a job for an outliner, it was a job for a high-powered text editor like Vim or Atom.

Unlike Vim, Atom sports a sidebar that displays all the files in the directory, and its regular expression parser recognizes newlines. So I could find blank entries using the expression ^- *\n (which means, “look for a line starting with a dash, followed by zero or more spaces, then a newline”) and get rid of them.
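Outside Atom, the same cleanup can be done from the shell. A rough equivalent (GNU sed syntax; on a Mac, use sed -i '' instead, and make a backup first):

# delete lines that are just a dash and optional spaces
sed -i '/^- *$/d' pages/*.md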

But the even bigger time-saver: realizing a lot of those entries were long outdated (some dated back to 2013) and deleting them. By the time I was done with that pass, I had 109 “keepers” left. From there, it was a matter of applying search and replace to fix common issues.

So three-quarters of the pages were deleted, along with much of the boilerplate from the remaining pages (I just need the content, the source, and some info about the author). That left my assets folder with 4852 items, most of them no longer being linked to.

Now… am I going to make 4852 passes through my pages, by hand, to see if a pic can be deleted?

The shell (aka Terminal) is my machine gun for blasting a job like this.

# assume we're in the assets directory
mkdir -p ../assets_removed
for i in *; do
  grep -q "$i" ../pages/* || mv "$i" ../assets_removed
done

Let’s pick this apart, for those who need it.

The first line is just a comment. An important one, all the same. You need to be in your Logseq database’s assets directory for this to work correctly. BAD THINGS will happen otherwise! One of the nice things about using MacOS: if I eff something up, I can pull it out of the Time Machine backup and try again.

Next, we make a directory called assets_removed at the same level as the assets directory. Just in case we make a mistake, you know. The -p option is there to make the script shrug and move on if the directory already exists (if we’ve been here before, for example).

The third and fifth lines begin and end a loop, going through each of those >4800 graphic files.

Inside the loop, we search for the file name in the pages. The -q option is exactly what you want for a script like this: it suppresses output, and grep simply exits with success if it finds the string, failure otherwise. The || (two vertical bars) means “execute the next part if it fails” (in this case, fails to find the file name)… and the next part moves the unused file to the assets_removed directory.

And I ended up with 255 files (out of nearly 5000) that were actually being used. The other ones are out of the way, and can be safely deleted once I verify that none of them are needed.

[UPDATE: After stepping through the pages again, I found 18 “false negatives” that had to be dragged back into the assets folder. That’s why you move them out of the way, instead of just nuking them.]

It took about a minute to grind through the assets directory, and a couple of minutes to set up the script, but that beats the heck out of hours (or days) doing it by hand! I’m fond of saying, I’m lazy enough to get the computer to do my work for me. It doesn’t always pay off this big, but it does pay off.

Off to get the Google Keep notes…

Monday, June 21, 2021

The next generation of information management

Some 15 years ago, I was enjoying Journler. The thing I really liked about it back then was, it could post an entry straight to Blogger. Of course, Google likes to screw around with Blogger. They broke the posting mechanism that Journler used, and now they’re concentrating on keeping it deliberately broken for Safari users. Over the years, Journler slowly sank into the morass of apps that got left behind by advances in MacOS. I continued to use it to capture flash fiction, scenes, and chunks of longer stories, until it became unusable. The source code for Journler has been available on Github for a long time, but I just now found out about it.

But I digress. As Journler wheezed and died, I tried a variety of paper and app systems to capture stuff I wanted to come back to later. Evernote was okay, until they crippled the free version to support only two devices (previously five). Google Keep held my interest for a long time; but I’m trying to extricate myself as much as possible from Google these days—and if I could find something better that isn’t WordPress, I’d go through the hassle of moving everything. Perhaps my longest-running attempt has been using Tines to keep notes and to-do lists organized. And yet, I’m always keeping my nose in the air, sniffing for a better way of doing things.

Recently, I started looking at journaling apps. Day One came highly recommended, and had the huge advantage of both Mac and iOS apps that talked to each other. I gave it as honest a chance as I could—downloading both the MacOS and iOS apps—and even with the daily prompt on my phone, I never really warmed up to it.

Someone suggested Logseq last week, and it sounded interesting enough to give it a try. The developers describe it as “a privacy-first, open source knowledge base,” and the videos they link to from the home page (an enthusiastic user who describes how to make the most of it) sealed the deal. It can run either as a webapp, writing to your hard drive, or as a standalone desktop app. The developer says he was heavily influenced by Roam Research, Org Mode (Logseq supports Org Mode, although it defaults to Markdown), Tiddlywiki, and Workflowy.

The interface looks invitingly plain, at first. You’re presented with a journal page with today’s date on it, but otherwise blank. Start typing, and it supports Markdown (a big plus)… oh, wait. It’s also an outliner (and given my long-term relationship with Tines, that’s another big plus). Oh, type two square brackets and enter a title, like a Wiki link, and you get a new page that you can click to enter (yeah, I’ve always been fond of Wikis). Oh, type /TODO Download the desktop app on a line, and you get a to-do entry. Put hashtags on entries you think you’ll need to come back to later.

Clusters show groups of related entries
This seems like a chaotic way of doing things… until you display the graph. Then, Logseq gathers up all those scattered to-do entries, those hashtagged items, and other things, and pulls them all together. Some call this a digital garden or digital knowledge garden—me, I just call it software magic. Of course, you get out what you put in. You can create custom queries to pull things together based on how you want them. If you leave it running overnight, click the little “paw print” icon at the top left to open the new day’s journal page. Maybe this is what makes Logseq so much more approachable for me.

It took me a day or two to realize that this is the most natural approach for working with Logseq. There are a lot of layers to it, and this brief post isn’t doing it justice. I’m using it at home with the desktop app, and at work using the webapp (because Doze complained about how unsafe the desktop app was). Either approach works fine.

There’s a mobile app called Obsidian that can be set up to work with Logseq’s files, but it’s a private beta right now and I don’t need it just yet.

Now I have to figure out how to pour all the different entries in all the different paper and pixel systems I’ve accumulated over the years into Logseq. Someone wrote a script to convert Google Keep to Markdown, so that’s settled. I hope I can write a script to pull all my old Journler entries in.

Monday, February 01, 2021

Adventures of a #techcomm geek: Go API

Image source: openclipart.org
A couple weeks ago, I got an email from a product manager:

Can you convert these API documents to our format?

Attached were three Weird documents. I let my manager know about the request; he told me to make sure we had rights (the documents were from an OEM, we’re marketing a re-badged version of their product), and to loop in the other writer on this product.

I looped in the other writer, who used to sit right across from me when we had those quaint “office” and “commute” things. While both of us thought it might be best to do it the “right” way—that is, convert the docs to DITA and publish them through our CCMS—we both figured replacing logos and changing names would be good enough.

We both expected the other to pick it up, I suppose, and I was doing other things. The upshot was, I forgot to ping him about it. So Monday came around, and nothing had happened. I groaned at the prospect of using Weird for something more than a two-page HOWTO document, then thought about the scripts I wrote for pulling in documentation through Markdown. “If it becomes too much of a hairball to clean up,” I told myself, “I can always replace the logos and change the product names.”

As it turns out, Markdown is quite adequate for API documentation. There was some cleanup involved, but not as much as I had feared. Global search and replace took care of a lot of it. Most of the manual cleanup was the same kind of thing anyone does when bringing someone else’s documentation into their system—improving topic titles, adding boilerplate… you know the drill. It took about a day to knock the three documents (total of 600 pages, give or take) into shape, and another day to tweak things.

I went ahead and fed the bookmaps to our transform. It was only after I got a decent-looking PDF that I realized: all the topics were still in Markdown. In retrospect, that wasn’t too surprising: the toolkit converts those topics to DITA (temporarily) before processing them. Markdown is a lot easier to deal with when you’re doing cleanup stuff anyway, and I finished that before doing the uplift.

So by Wednesday evening, I was ready to upload the converted documents into the CCMS. The upload tool is more finicky than Morris the cat, and it uncovered a couple more cleanup issues. I resolved those Thursday morning, and we now have clean DITA in the system.

And yay, I didn’t have to touch Weird!

By the way, the conversion scripts are on Github. Just in case you need to do something like this.

Wednesday, October 21, 2020

Adventures of a #techcomm geek: Constants Aren't, Variables Won't

DITA-OT logo
One of the advantages of having a DITA-based workflow for technical writing is for translation. During the acquisition binge that ended with us being on the “bought” end, we picked up a product with a fairly strong retail presence. You’ve probably seen those products in Best Buy and similar places, and maybe even bought one to upgrade your home network. (No, I’m not going into details, because I don’t write documentation for that line… mostly.)

But, as usual, I digress. Retail products, or not-retail products that are supplied to the end-user, need to have localized documentation—that is, not just in the native language, but using country-specific idioms (although this might go a little too far). And, to help with consistency, things like notes or cautions use canned strings.

The DITA Open Toolkit (DITA-OT) PDF plugin provides a pretty good list of canned “variable” strings for a bunch of different languages, including languages with non-Latin glyphs. Of course, we added to that list… somewhat. I put quotes on “variables” because I don't know why they call them variables; they are basically language-specific constants. Local Idiom, I suppose.

Fast-forward a couple years, to the disease-ridden hellscape that most refer to as “2020.” A year ago, one of the point people for translations sat two aisles down from me, on those days we weren’t both working from home. We would have hashed half of this out in person, before roping in a bunch of other people in a long email chain. (Don't get me wrong, working remote is da bomb, and I hope they don’t expect me to do time in the office in the future… but it had the occasional upside.)

Anyway, this was the first Brazilian Portuguese translation we had done in a while, and weird things were happening. My initial guess—that we had provided updated strings for only a subset of languages (mostly French and German)—turned out to be correct, when I started poking around in the source. I remembered working on a script to parse the XML-based “variable” files to build a spreadsheet, so we could easily see what needed updating. Turns out, I had either given up or got pulled away after the script was less than a quarter-baked (let alone half). I beamed my brain power at the cursed XSLT file, and it finally turned brown and gave me the output I wanted: name[tab]value.

Now I was halfway there. I had tab-delimited files for each language, now I just needed to coalesce them into a single (again, tab-delimited) file. As I’m fond of saying, when I want to process a big wad of text, awk is how I hammer my nails… and I started pounding.

Since I had an anchor point—the “variable” names that were constant for each language—it was a Small Matter of Programming. Knowing that English (en) was the most complete language helped; I used it as a touchpoint for all the other languages. After a few fits and starts, the script produced the output I needed and I imported it into Excel. Blank cells that needed values, I highlighted in dark red. Things I needed to personally tweak here and there got yellow highlighting. I hid rows that didn’t need attention (some were complete across the board, others we don’t use), and sent it to the rest of the team.
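The coalescing script boiled down to something like this sketch (the per-language file names are made up; English goes first, since it’s the master list of names):

awk 'BEGIN { FS = OFS = "\t" }
FNR == 1 { lang++ }                      # each new file is another language column
{ val[lang, $1] = $2 }
lang == 1 { names[++n] = $1 }            # English supplies the master list of names
END {
    for (i = 1; i <= n; i++) {
        line = names[i]
        for (l = 1; l <= lang; l++)
            line = line OFS val[l, names[i]]   # missing values come out as blank cells
        print line
    }
}' en.tsv fr.tsv de.tsv pt-br.tsv > all-languages.tsv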

Just to be complete, I finished the day embedding the XSLT and awk scripts inside a shell script (and tested the results). If I need to do this again, and I probably will, I can do it in a matter of minutes instead of spending an entire day on it.

I deliberately formatted the spreadsheet so I can export changes to TSV (tab-separated values) and write another script to rebuild the language “variables” if I feel it’s necessary. It’s always good to anticipate future requests and be ready for them.

Tuesday, March 10, 2020

Adventures of a #techcomm Geek: Match Game, 2020

It’s been a while since I did one of these, and this one goes in deep.

We’ve been using DITA at work for a year or two now, but rarely is there time to go back and take advantage of the things it offers, retrofitting those things into the documentation we brought in. (Docs we’ve created since then seem to get more thorough treatment.)

One of those things is reuse. It’s easy to reuse an entire topic in a different book—even if it was duplicated. “Hey,” says a writer, “that’s the same thing. Let’s throw away topic B and use topic A.”

DITA also supports reusing common paragraphs in two or two dozen topics, but that’s a little harder. First, you have to recognize that paragraph. Then, you have to create a new topic (a collection file), copy the paragraph into the collection file, and assign it an ID. Then you have to replace the duplicated text (in topics) with a content reference (a/k/a conref). It’s a worthwhile thing to do, because you might say the same thing slightly differently otherwise. Still, who wants to go through an entire book (or worse, set of books), looking for reuse candidates?

Of course, you can always let a computer do the tedious work… if you know how to tell it what to do.

Preparing the (searching) grounds

A while back, I wrote my first useful Python scripts. One takes a particular JSON file and reformats it as a DITA reference topic, containing a table with the relevant data from the JSON file. Another walks through a CSV file, grabbing the columns I need, and producing topics documenting a TR-069 data model. Both scripts take advantage of a vast library of pre-written code to parse their input files.

It occurred to me that, if I were to find (or create) a way to export all the text from a DITA book into a CSV file, I could use a Python script to compare each paragraph to all the others. Using fuzzy matching would help me find “close enough” matches. That was a while ago, because I bogged down on trying to get properly-formatted text out of DITA.

Last week, I got bored. Someone on the DITA-OT forum mentioned a demo plugin that translated DITA to Morse code, and the lightbulb in my head went on. If I could modify that plugin to just give text instead of -.-. .-. .- .--. then maybe I’d have what I needed.

It was an abject failure. What I need is one line per block element (paragraph, list item, etc). What I got was one line for the entire topic, sometimes with missing spaces. I put that aside, but realized that DITA-OT can also spit out Markdown. If I could convert Markdown to plain text, I’d be ready to rock!

So you want to convert DITA to Markdown? It’s easy, at least with the newer toolkits:

dita --format=markdown_github --input=my.bookmap --args.rellinks=none

The DITA-OT output continues to be topic-oriented, writing each topic to its own file. That wasn’t quite what I wanted, or so I thought at the time. Anyway, we have Markdown. How do we get plain text out of it, with each line representing a block element?

Turns out that pandoc, the “Swiss Army knife for converting markup files,” can do it:

pandoc -t plain --wrap=none -o topic.txt topic.md

In the heat of problem-solving, I realized I didn’t need a CSV file… or Python. I could pick up Awk and hammer the text into shape. My script simply inhaled whatever text files I threw at it, and put all the content into an array indexed by [FILENAME,FNR] (FNR is basically the line number of paragraphs inside the file). There was a little stray markup left, not to mention some blank lines, and a couple of Awk rules threw unneeded lines into the mythical bit bucket.
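The slurp stage looks something like this (a sketch; the cleanup rules here are stand-ins for the real ones):

/^[[:space:]]*$/ { next }              # blank lines go to the bit bucket
/^</             { next }              # stray markup, likewise
{ para[FILENAME, FNR] = $0 }           # keep everything else, indexed by file and line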

Got a (fuzzy) match?

A typical match is an all-or-nothing Boolean: you get true (1) if the strings are an exact match, or false (0) if they don’t match.

Fuzzy matching uses the universe of floating-point numbers in between 0 and 1 to describe how close a match is. It’s up to you to decide what’s close enough, but you usually want to focus on values of 0.9 and higher. And yes, an exact match still gives you a score of 1.

Why do we want to do this? Unless content developers are really good about cutting and pasting in a pre-reuse environment, inconsistencies creep in. You might see common operations described in slightly different ways:

Click OK to close the dialog.
Click OK to close the window.

So along with flagging potential reuse candidates, a fuzzy match can help you be consistent.

Python and Perl have libraries devoted to fuzzy matching. There are several ways to do a fuzzy match, but one of the more popular is called the Levenshtein distance. There's a scary-looking formula at the link, but it boils down to single-character edits (addition, deletion, or replacement). The distance between “dialog” and “window” is 4 (d→w, a→n, l→d, g→w).

But this is an integer, not a floating-point number between 0 and 1! That’s easy to fix, though. If l1 and l2 are the lengths of the two strings, and d is the calculated Levenshtein distance, then the final score is (l1+l2-d)/(l1+l2). In the above example, the score is 0.93—the strings are 93% identical.

There are websites with Levenshtein distance implementations in all sorts of different programming languages, although the ones written in Awk are not as common. But no problem. Awk is close enough to C that it’s simple to translate a short bit of code. I picked the second of these two. There was one already written in Awk, but it took a lot more time to grind through a large set of strings.
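For the curious, here is roughly what that C-to-Awk translation looks like: a sketch of the standard dynamic-programming approach, not the exact code in my script.

function lev(s, t,    m, n, i, j, cost, d, best) {
    m = length(s); n = length(t)
    for (i = 0; i <= m; i++) d[i, 0] = i
    for (j = 0; j <= n; j++) d[0, j] = j
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
            best = d[i-1, j] + 1                                      # deletion
            if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1            # insertion
            if (d[i-1, j-1] + cost < best) best = d[i-1, j-1] + cost  # substitution
            d[i, j] = best
        }
    return d[m, n]
}

function score(s, t,    d) {
    d = lev(s, t)
    return (length(s) + length(t) - d) / (length(s) + length(t))
}

BEGIN {
    printf "%.2f\n", score("Click OK to close the dialog.",
                           "Click OK to close the window.")           # prints 0.93
}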

Save time, be lazy

The time it takes is important, because it adds up fast. Given n paragraphs, each paragraph has to be compared to all the rest, so you have n² comparisons. A medium-sized book, with 2400 paragraphs, means 5.76 million comparisons. Given that a fuzzy comparison takes a lot longer than a Boolean one, you want to eliminate unnecessary comparisons. A few optimizations I came up with:

  • It’s easy to get to (n²-n) by not comparing a string to itself. We also do a Boolean compare and skip the fuzzy match if the strings are identical. Every little bit helps. Time to analyze 2400 paragraphs: 2 hr 40 min. My late-2013 iMac averages about 600 fuzzy match comparisons per second.
  • By deleting an entry from the array after comparing it to the others, you eliminate duplicate comparisons (once you’ve compared A to B, doing B to A is a waste of time). That eliminates noise from the report, and cuts the number of comparisons required in half. Time to analyze 2400 paragraphs: 1 hr 20 min. Not bad, for something you can do with one more line of code.
  • Skip strings with big differences in length. Again, if l1 and l2 are the lengths of two strings, then the minimum Levenshtein distance is abs(l1-l2). If the best possible score can’t reach the “close enough” threshold, you don’t have to do the fuzzy match at all (see the sketch after this list). Time to analyze 2400 paragraphs: 5 min 30 sec!!! Now that’s one heck of an optimization!
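Here’s the guard itself, sketched against the para array from earlier, assuming a 0.9 threshold:

END {
    for (i in para)
        for (j in para) {
            if (i == j) continue                   # never compare a string to itself
            l1 = length(para[i]); l2 = length(para[j])
            if ((l1 + l2 - absdiff(l1, l2)) / (l1 + l2) < 0.9)
                continue                           # best case can't reach the threshold
            # ...only now do the expensive fuzzy match...
        }
}
function absdiff(a, b) { return (a < b) ? b - a : a - b }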

So we’ve gone from something you run overnight, or at least during a long lunch break, to something that can wrap up during a coffee break (eliminating 96.5% of the time needed is a win no matter how you look at it). Now if your book is all blocks of similar length, it will take longer to grind through them, because there isn’t anything obvious to throw out.

Still, this is down to the realm where it's practical to build a “super book” (a book containing a collection of related books) and look for reuse across an entire product line. That might get the processing time back up into the multiple-hours realm, but you also have more reuse potential.

Going commercial

The commercial offerings have some niceties that my humble Awk script does not. For example, they claim to be able to build a collection file (a “library” of sorts, containing all the reusable paragraphs) and apply it to your documentation. That by itself might be worth the price of entry, if you end up with a lot of reuse.

They also offer a pretty Web-based interface, instead of dropping to the command line. And, they have likely implemented a computing cluster to grind through huge jobs even faster.

But hey, if you’re on a tight budget, the price is right. I’m going to make sure the employer doesn’t have a problem with me putting it up on Github before I do it. But maybe I’ve given you enough hints to get going on your own.
UPDATE 10 May 2020: The script is now available on Github.

Monday, December 30, 2019

Upgrade marathon

Kahuna, my late-2013 iMac, has been really laggy as of late. Running EtreCheck suggested the hard drive might be failing, although it gave no evidence for that conclusion, and checking system logs turned up no obvious disk-related issues. Still, the computer is 6 years old, and has been running almost constantly that entire time. Activity Monitor showed the 8GB of RAM was getting used up, and that’s probably the biggest issue I’ve been having.

Back when my daily driver was a 2008 MacBook, I maxed out the RAM and put an SSD in it, and it felt twice as fast as before (I still use it for vacations and other out-and-about things, although now it’s running Xubuntu). An iMac is a somewhat more difficult beast to upgrade, because there’s no handy hatch to access the RAM and hard drive… but iFixit provides parts, tools, and detailed instructions for doing the deed. So I ordered the parts and tools.

And there they sat, on my dresser, for months.

I knew that it would take a fair amount of time and effort to dismantle my beloved iMac. Between it running all the time, and not having a big block of time to do the repair, it languished. I seriously considered paying someone to do the work for me, but things finally came together. Rainy afternoon, nothing pressing, wife was present to watch the kids. I made sure the Time Machine backup was current, shut Kahuna down, and disconnected all the wires. As I usually do with these things, I pulled up the iFixit manual on my iPad so I could see what to do next.

Turns out I’d made a mistake on the memory upgrade. I knew the iMac had two RAM slots, and it had 8GB total. I ASSumed it had 8GB in one slot, and had an empty slot for another 8GB, so that’s what I ordered. WRONG. It had two 4GB sticks instead. This I learned after pretty much taking the whole computer apart. Hey, 12GB is better than 8, so I swapped the new stick for one of the old ones. While I was at it, I replaced the 1TB spinny hard drive with a 2TB SSD. I also cleaned out 6 years’ worth of accumulated dust. I didn’t realize the iMac had a fan, and I’m not sure it could have pushed much air anyway, given the dust choking the airways.

There were a few glitches in the repair instructions, and that slowed me down quite a bit. Between several rounds of going back and fixing things both I and the manual had missed, and wiping up dust as I went, it was a good 5 hours before I put the display (“big, heavy, and glass” according to the instructions) back on. I had new adhesive strips (Apple uses a lot of double-sided tape to hold their products together), but the instructions wisely suggested checking your work before sticking the display back onto the frame.

Fortunately, the upgrade had gone according to plan. I restored my backup onto the new SSD, and it finished while saying it had an hour to go (I hadn’t emptied the trash in like forever). Once the system booted, launching apps was significantly faster, and everything is more responsive. Yeah, it was a hassle, but I’m going to get all that time back and then some with a snappier desktop. I’m writing this post on the upgraded system, and I haven’t seen a single beachball or lag.

Now the question is: do I go ahead and button it up, or order another 8GB stick and completely max it out? I’m kind of leaning toward the latter. It’s not going to kill me to have the display held on by tape for a while longer, after all.

Thursday, May 16, 2019

Adventures of a #techcomm Geek: Sharp Edges when Rounding

One of the advantages of using a text-based markup grammar for documentation—these days, often XHTML or some other XML, but it could be Markdown, Restructured Text, Asciidoc, or even old-sk00l typesetting languages like troff or TeX—is that they’re easy to manipulate with scripts.

There are quite a few general-purpose scripting languages that do a fine job of hunting down and acting on patterns. I’m conversant with Perl, and am learning Python; but when I need to bang something out in a hurry and XML is (mostly) not involved, Awk is how I hammer my nails. Some wags joke Awk is short for “awkward,” and it can be for those who are used to procedural programming. Anyone exposed to event-based programming—where the program or script reacts to incoming events—will find it much more familiar. Actually, “awk” comes from the initials of the three people who invented it: Aho, Weinberger, and Kernighan (yes, that Brian Kernighan, he who also co-invented the C language and was a major player on the team that invented Unix).

Instead of events, Awk reacts to patterns. A pattern can be a plain string, a variable value, a regular expression, or combinations. Other cool things about Awk:
  • Variables have whichever type is most appropriate to the current operation. For example, your script might read the string “12.345,” assign it to x, then you can use a statement like print x + 4 and you’ll get 16.345.
  • The language reference (at least for the original Awk) fits comfortably in a manpage, running just over 3 pages when printed. Even the 2nd edition “official” reference is only 7 pages long.
  • It’s a required feature in most modern Unix specifications. That means you’ll always have some version of Awk on an operating system that has some pretensions to be “Unix-like” unless it’s a stripped, embedded system. On the other hand, even BusyBox-based systems include a version of Awk. Basically, that means Awk is everywhere except maybe your phone. Maybe.
If your operating system is that Microsoft thing, you can download a version of Awk for it. If you install the ISH app, you can even have it on an iPhone.
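To see the variable typing from the first item in that list in action, try this one-liner in any shell with awk available:

echo "12.345" | awk '{ x = $1; print x + 4 }'    # prints 16.345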

Now what am I going to do with it?

Okay. I told you all that to tell you this.

I’m working on something that extracts text from a PDF file, and formats it according to rules that use information such as margin, indent, and font. It requires an intermediate step that transforms the PDF into a simple (but very large) XML file, marking pages, blocks, lines, and individual characters.

“But wait a minute!” you say. “I thought Awk only worked on text files. How does it parse XML?”

Like many useful utilities first released in the 1970s, Awk has been enhanced, rewritten, re-implemented from scratch, extended, and yet it still resembles its ancestral beginnings. The GNU version of Awk (commonly referred to as gawk) has an extensions library and extensions for the most commonly-processed textual formats, including CSV (still beta) and XML. In fact, the XML extension is important enough that gawk has a special incantation called xmlgawk that automatically loads the XML extension.

The neat thing about xmlgawk, at least the default way of using it, is that it has a very Awk-like way of parsing XML files—it provides patterns for matching beginnings of elements, character data, and ends of elements (and a lot more). This is basically a SAX parser. If you don’t need to keep the entire XML file in memory, it’s a very efficient way to work with XML files.
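The skeleton of such a parser looks something like this. (A sketch: the element and attribute names are my guesses at a PDF-to-XML intermediate format, not the actual schema I used.)

@load "xml"    # plain gawk needs this; the xmlgawk incantation loads it automatically

XMLSTARTELEM == "block" { left = XMLATTR["left"]; text = "" }  # a new paragraph block begins
XMLCHARDATA             { text = text XMLCHARDATA }            # accumulate character data
XMLENDELEM == "block"   { print left "\t" text }               # one output line per block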

So. In most cases, I only need the left margin of a block (paragraph). Sometimes, I need the lowest extent of that block as well, to throw out headers and footers. I need to check the difference between the first and second line (horizontally), and possibly act upon it.

In the document I used for testing, list items (like bullets) have a first line indent of –18 points. “Cool,” I said. “I can use that to flag list items.”

All well and good, except that it only worked about 10% of the time. I started inserting debugging strings, trying to figure out what was going on, and bloating the output beyond usefulness. Finally, I decided to print the actual difference between the first and second lines in a paragraph, which should have been zero. What I found told me what the problem was.

    diff=1.24003e-18

In other words, the difference (between integer and floating-point representations of what should have been the same position) was so minuscule as to matter only to a computer. Thus, instead of doing a direct comparison, I took the difference and compared that to a number large enough to notice but small enough to ignore—1/10000 point.
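Expressed as Awk (with names of my own invention, not the script’s), the fix looks like this:

# compare to 1/10000 point instead of testing for equality
absdiff(first_line_x, second_line_x) < 0.0001 {
    # the first and second lines start at (effectively) the same position
}

function absdiff(a, b) { return (a < b) ? b - a : a - b }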

And hey presto! The script behaved the way it should!

It’s a good thing I’ve been doing this at home—that means I can soon share it with you. Ironically, it turns out that we might need it at the workplace, which gives me a guilt-free opportunity to beta-test it.

Monday, October 16, 2017

Tines 1.11.1

… is out. This is a quick bug-fix release, the PageUp and PageDown keys should now work properly (on Macs, use Fn-up-arrow and Fn-down-arrow). I’ve also merged the dev branch into the master branch on GitHub.

To make things more convenient, I registered tines-outliner.org and set it up as a shortcut to the repository.

If you aren’t familiar with Tines, it’s a console-mode outliner. It runs in MacOS X, Linux, and the Microsoft thing (using Cygwin). It’s unique in that it supports a “text” tag for entries, and so can differentiate between headings and body text. It can export to Markdown, HTML, *roff, and plain text formats. You can export your outline to Markdown and pull it into Scrivener. See Getting Your Outline into Scrivener (pt 2) for details.

Tines, like a few other outliners, has support for to-do lists (basically a collection of entries with checkboxes). That means you can use it to keep outlines, goals, snippets of scenes, and notes about your stories in a single place.

Compile?

Yes, the word “compile” is composed of the Latin com (together) and the English pile (a random heap, or hemorrhoid). So, yes,  compile means either to throw things together, or a multifaceted pain in the @$$. Still, if you need an outliner, it’s available!

I want to get back to working on an install package, at least for MacOS X. I’ll probably have to leave packages for the other operating systems to their own experts (not that I’m anything like an expert in MacOS X packaging, mind you).

Tuesday, March 21, 2017

Tech Tuesday: Roll Your Own Writing System, part 6: Jekyll


The series rolls to an end…

In Part 1, we had a look at Markdown and the five or six formatting symbols that cover 97% of written fiction. Part 2 showed how you can use Markdown without leaving the comfort of Scrivener. Part 3 began exploring eBook publishing using files generated from both Scrivener and directly from MultiMarkdown. Part 4 provided a brief overview of a different tool called Pandoc that can convert your output to a wider variety of formats, and is one way to create print documents for beta readers or even production. Part 5 described how to use MultiMarkdown’s transclusion feature to include boilerplate information in an output-agnostic way, and how to use metadata variables to automatically set up front matter.

Scrivener is an excellent writing tool, and we have seen how using it with MultiMarkdown only makes it better. But there are conditions where abandoning the GUI for a completely text-based writing system just makes sense. For example, you might want to go to a minimalist, distraction-free environment. You may want to move to a completely open-source environment. Or you might need to collaborate with someone else on a project, and Scrivener really isn’t made for that.

Don’t Hyde from Jekyll


Jekyll is the most popular static site generator. You write in Markdown—Jekyll’s particular flavor, which is similar to MultiMarkdown in many ways—and if Jekyll is running, it automatically converts your pages to HTML as soon as you save. It even includes a built-in web server so you can see what the changes look like.

If you’re on a Mac, installation is almost too easy. Drop to a command line, enter gem install jekyll bundler, and watch a lot of weird stuff scroll by. It’s as easy on Linux, if you have Ruby 2.0 or newer installed. On the Microsoft thing, there are some specific instructions to follow (I installed it on my work PC, no problem).

Once it’s installed, get going by following the quick-start instructions.

Organizing


Unlike Scrivener, organizing your project is on you. But there are a couple things that might help:

Each story or project should live in its own folder. Within that folder, tag each chapter or scene with a number to put everything in its proper sequence. For example:

100_chapter_1.md
110_arrival.md
120_something_happens.md
200_chapter_2.md
210_more_stuff_happens.md

It’s a good idea to increment by 10 as you create new scenes, in case you need to insert a scene between two existing ones later. To move a scene, change its number. If you have more than nine chapters, use four-digit numbers for the sequence. (If you need five-digit numbers, you should seriously consider turning that epic into a series of novels.)

Differences from MultiMarkdown


Like MultiMarkdown, Jekyll’s flavor of Markdown supports variables and transclusion. But there are a couple differences. In Jekyll, variables look like MultiMarkdown’s transclusion:

{{ page.title }}

You can draw variables from the page’s metadata, or from the _config.yml configuration file (in which case you replace page with site).

Transclusion is a function of the Liquid templating language, built into Jekyll. To include a file:

{% include_relative file.md %}

You can also use include instead of include_relative to pull files from the _includes directory. By using Liquid, you can specify parameters to do different things, effectively creating your own extensions.

For example, here’s how you might do section breaks:

<p class="sectionbrk">
  {% if include.space %}&nbsp;{% else %}&bull; &bull; &bull;{% endif %}
</p>

So if you just enter {% include secbrk.html %}, you get three bullets. To get a blank line, enter {% include secbrk.html space="true" %} instead.

Also like MultiMarkdown, Jekyll supports a metadata block at the beginning of a file. While they look very similar, Jekyll uses YAML format for its metadata. The upshot is, a Jekyll file begins and ends its metadata with a line of three or more dashes, like this:

---
title: The Sordid Tale of Woe
author: Henrietta Jekyll
permalink: /sordid/sordid_tale.html
---

Certain metadata tags are special to Jekyll. For example, permalink specifies the name and location of the HTML file Jekyll creates from the Markdown source. Another important tag, layout, can be used to choose a template. You can set the default layout in the configuration file, then use a second configuration file to override it for doing things like publishing.

Git Out


Jekyll is also a blogging tool. Your posts go into a special directory, _posts, and have a specific naming convention. Two additional metadata tags are important:

date:   2017-03-21 07:00:00 -0500
categories: writing technology

The date entry specifies the date and time your post goes live on the generated site. The categories entry lets you tag each post for easier searches.

But all that’s just pixels on the screen unless you have a place to put your site. That’s where Github Pages comes in. You can upload your Jekyll files to Github Pages, and it automatically updates your site when it finds new or changed content. This is pretty useful, but it’s even more useful when you’re working with other people. Everyone has their own copy of the source files on their own computers, and they can each push (update) their changes as needed.

Now What?


I hope I’ve given you some ideas for new ways of looking at your writing, and how to make the publishing part more efficient and more collaborative.

The rest… is up to you. I’d love to see your own ideas in the comments.

Tuesday, March 14, 2017

Tech Tuesday: Roll Your Own Writing System, part 5: Reuse

The series rolls on…

In Part 1, we had a look at Markdown and the five or six formatting symbols that cover 97% of written fiction. Part 2 showed how you can use Markdown without leaving the comfort of Scrivener. Part 3 began exploring eBook publishing using files generated from both Scrivener and directly from MultiMarkdown. Part 4 provided a brief overview of a different tool called Pandoc that can convert your output to a wider variety of formats, and is one way to create print documents for beta readers or even production.

Way back in Part 2, we used Scrivener to embed HTML separators between scenes and for internal scene breaks. As we saw last week, that doesn’t work when you need to output to a different format. As it turns out, there’s a way to work around that by using MultiMarkdown’s transclusion mechanism. Transclusion and metadata variables provide the capability for reuse, pulling common boilerplate files from a library.

Inclusion… Transclusion?


Transclusion is a technical term, but it’s easy enough to explain. You use it to embed another Markdown file into your document, like you might include a graphics file. A function like this is essential when you’re maintaining a collection of technical documents, because you can reuse common sections or passages—write them once, store them in a library of common files, and then changing one of the source documents automatically updates all the documents that use it. For fiction writing, it’s a good way to pull in all those boilerplate files (about the author, front matter, etc.) that you need for each book.

To transclude a boilerplate file, put this on its own line:

{{myfile.md}}

When you run multimarkdown, it pulls in the contents of myfile.md and processes it.

Now here’s where it gets fun. Say you really need to be able to output to both HTML and OpenOffice. Instead of embedding HTML that gets ignored in the OpenOffice conversion, or vice versa, you can use a wildcard:

{{myfile.*}}

Now, when you output to HTML, multimarkdown transcludes the file myfile.html. When you want OpenOffice, it uses myfile.fodt. You just have to supply the files with the right extensions and content, and you’re off to the races! You can use this in the Separators in Scrivener to choose the right markup for your output.

A few caveats for fodt transclusion: You cannot use entities like &bull; or &#8026; to specify special characters. You have to enter them as characters. If you only have one line to add, you don’t need to put any OpenOffice markup in the fodt file—plain text is fine, but use the right extension so multimarkdown knows which file to use.

If you want to reuse transcluded files with other documents, you can add another line to the metadata:

Transclude Base: /path/to/your/files

You can use a relative path like ../boilerplate, but it’s safer to specify the entire path in case you move the file to some other location.

Does the Front Matter?


But transcluding boilerplate files is only the beginning. Especially for front matter, you need to change at least the title for each book. Fortunately, MultiMarkdown has that covered.

In Scrivener’s Compile window, the last entry is Meta-Data. Back in Part 3, you used this to specify a CSS file for HTML output. Scrivener pre-fills entries for the Title and Author, but you can add anything else you want here. All the metadata ends up at the beginning of the file, where MultiMarkdown can process it further.

So you might have a block that looks like this:

Title: Beyond All Recognition
Subtitle: The Foobar Chronicles, Book 1
Author: Marcus Downs
Copyright: 2017
Publisher: High Press UR

Create a title page that looks like this (for HTML output):

<div style="text-align:center" markdown="1">
**[%title]**

**[%subtitle]**

by  
[%author]

Copyright [%copyright] [%author]. All rights reserved.

Published by [%publisher]
</div>

![](logo.png)

{{TOC}}

Instant front matter! The {{TOC}} construct inserts a table of contents, another Multimarkdown feature.

Now What?


Now you know how to include boilerplate files in your book, and how to automatically put the right text in each output format.

Next week… it’s something completely different to wrap up the series.

Tuesday, March 07, 2017

Tech Tuesday: Roll Your Own Writing System, part 4: MultiMarkdown and Pandoc

The series rolls on…

In Part 1, we had a look at Markdown and the five or six formatting symbols that cover 97% of written fiction. Part 2 showed how you can use Markdown without leaving the comfort of Scrivener. Part 3 began exploring eBook publishing using files generated from both Scrivener and directly from MultiMarkdown.

Today, we’re going to take a brief look at a different tool you can use to publish MultiMarkdown files.

Pandoc describes itself as a Swiss Army knife for markup languages, but it goes farther than that. More than markup languages, it converts to and from common word processor formats and can even convert directly to EPUB. You can mess with templates to get the output really close to production-ready, but that's a little beyond the scope of our series here. In real terms, it’s not any faster than loading a prepared HTML file into a skeleton EPUB; both methods need a little cleanup afterwards.

This sounds at first like it’s just an alternative to using MultiMarkdown, but it goes a little farther than that. One problem with embedding HTML in your Markdown files is that none of it gets converted to other formats. So you can’t just take your MultiMarkdown file and create an OpenOffice file by running:

multimarkdown --to=odf story.md >story.fodt

Because all your section breaks disappear. Pandoc ignores embedded HTML as well… so again, what does Pandoc buy you?

Well, once you have your HTML file, you can use Pandoc to convert that HTML file to the word processor format of your choice.

pandoc -f html -t odt -o story.odt story.html

And there’s the answer to how you make your story available for beta readers who want a word processor file. If you’re willing to tolerate some sloppy typesetting, you could use it for your print document as well. Pandoc also supports docx and rtf as output formats.

Now What?


Now you can output your MultiMarkdown file in a number of formats, including eBook (direct and indirect) and common word processor formats.

Next week, we’ll look at some special features of MultiMarkdown that you might find useful.

Comments? Questions? Floor’s open!

Wednesday, March 01, 2017

Tech Tuesday: Roll Your Own Writing System, part 3: Publishing MultiMarkdown

The series rolls on…

In Part 1, we had a look at Markdown and the five or six formatting symbols that cover 97% of written fiction. Last week, we saw how you can use Markdown without leaving the comfort of Scrivener.

This week, it's time to build an eBook using MultiMarkdown output. If you have been cleaning up Scrivener’s EPUB output in Sigil, you should find the process familiar—only, without most of the cleanup part.

First thing, output an HTML file through MultiMarkdown. In Scrivener, click the Compile button and select MultiMarkdown→Web Page in the dropdown at the bottom of the screen.

Under the Overhead

Open Sigil, then import your HTML into a new eBook—or better yet, a “skeleton” eBook with all the boilerplate files already in place.

All you have to do now is to break the file into separate chapters and generate a table of contents. You can save even more time by creating a custom text and folder separator in the last part of Scrivener’s Compile Separators pane:

<hr class="sigil_split_marker"/>

Then, when you’ve imported your HTML file, just press F6 and Sigil breaks up the file for you. If you start with a skeleton EPUB file, you can have a perfectly-formatted EPUB in a matter of minutes. Seeing as it takes me an entire evening to clean spurious classes out of Scrivener’s direct EPUB output, this is a gigantic step forward.

One thing to watch out for: MultiMarkdown inserts a tag, <meta charset="utf-8"/>, at the beginning of the HTML output. EPUB validators choke on this, insisting on an older version of this definition, but all you have to do is remove the line before you split the file.

Breaking Free

Perhaps you want to slip the surly bonds of Scrivener. Maybe your computer died, and your temporary replacement does not have Scrivener—but you saved a Markdown version of the latest in your Dropbox, and your beta readers are waiting.

Scrivener bundles its own copy of MultiMarkdown inside the app, so to work outside Scrivener you’ll need to download MultiMarkdown yourself. It runs from the command line, which is not as scary as it sounds. In fact, Markdown and MultiMarkdown are very well-suited to a distraction-free writing environment.

After you’ve installed MultiMarkdown, start a Terminal (or Command Line on that Microsoft thing). On OSX, press Cmd-Space to bring up Spotlight. Type term, and that should be enough for Spotlight to complete Terminal. If you prefer, you can start it directly from /Applications/Utilities.

Next, move to the right directory. For example, if your file is in Dropbox/fiction, type cd Dropbox/fiction (remember to reverse the slash on the Microsoft thing).
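If you’re not sure where you landed, pwd prints the current directory and ls lists the files in it (cd with no arguments and dir, respectively, on the Microsoft thing):

pwd
ls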

Here we go…

multimarkdown mybook.md >mybook.html

Now you have an HTML file that you can import into Sigil (just don’t forget to remove that pesky meta tag).
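If you’d rather not open an editor just for that, sed can strip the tag on the way out. Here’s a sketch that deletes any line containing the tag during conversion (adjust the pattern if your MultiMarkdown version writes it a little differently):

multimarkdown mybook.md | sed '/<meta charset/d' > mybook.html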

Silly CSS Tricks

Last week, I mentioned a couple of things you can do with CSS to help things along.

First, when you Compile your Scrivener project to MultiMarkdown, click Meta-Data in the options list. You should see some pre-filled options: Title, Author, and Base Header Level. Click the + above the Compile button to add a new entry. Call the entry CSS then click in the text box below and enter ../Styles/styles.css—if you’re using Sigil, it puts all stylesheets in the Styles directory. You can give it another name if you have a stylesheet pre-defined (mine is called novel.css).

Pre-define your CSS

Now open your stylesheet, or create one if you need to. Add the following entries:

p.sectionbrk {
    text-indent:0; text-align:center;
    margin-top:0.2em; margin-bottom:0.2em
}
.sectionbrk + p { text-indent: 0; }
h1 + p { text-indent: 0; }

The first entry formats the sectionbrk class to be centered, with some extra space above and below. The second one is more interesting: it cancels the text indent for the paragraph after a section break. The third entry does the same thing for a paragraph following a chapter heading (you can do this for h2 if needed as well). This is the proper typographical way to format paragraphs following headings or breaks, and you don’t have to go look for each one and do it yourself. I told you this can save a ton of time!
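If you want to double-check that the CSS metadata survived the trip, grep the compiled HTML for the stylesheet link (mybook.html is a placeholder name):

grep stylesheet mybook.html

You should see a link element pointing at ../Styles/styles.css, or whatever name you gave it.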

Again… Now What?

Now you can work with MultiMarkdown within Scrivener. You can export it, generate an eBook, and work with the file outside of Scrivener.

Next week, I’ll show you another way to make an eBook from your MultiMarkdown file.

Comments? Questions? Floor’s open!

Tuesday, February 21, 2017 1 comment

Tech Tuesday: Roll Your Own Writing System, Part 2: Markdown in Scrivener

Last week, I gave you a brief introduction to Markdown. I only hinted at why you might want to use Markdown instead of comfortable old bold/italic (and other decorations). I’ll get detailed next week, but here’s a hint: you can save yourself an entire evening of work getting your eBook prepared for publication.

This week, though, we’re going to look at how Scrivener and Markdown work together. TL;DR: Very well, actually.

Scrivener supports a Markdown extension called MultiMarkdown. You don’t have to worry about the extensions unless you’re writing more technical fiction with tables and the like. For fiction, what I showed you last week should cover all but the decorative stuff.

Make a copy of your WIP. Got it open? Original one is closed? Okay, let’s get started.

In Scrivener, click the Scrivenings icon in the toolbar, then click the Draft or Manuscript icon in your Binder (whichever one your story is in). You should now see your entire story laid out in Scrivener.

Click anywhere in the story text, then go to Format menu→Convert→Bold and Italics to MultiMarkdown Syntax. If you use anything other than bold/italic in your writing—like typewriter font for text messages, or blockquotes for letters—you’ll have to go through your manuscript and mark those yourself. This is that other 10% I mentioned last week.

Stylin’

Scrivener has formatting presets rather than true styles; it only remembers the formatting, not the preset name, after you apply one. Not as good as real styles, but presets work for our purposes.

Markdown uses backticks (a/k/a accent grave) to define typewriter font: `this is a text message`. You can either insert your backticks by hand, or let Scrivener insert them when you publish. I have a preset called Typewriter for this, but we can define a new preset or redefine an existing one. Here’s how it works: any string of text marked “Preserve Formatting” (Format menu→Formatting→Preserve Formatting) gets the backtick treatment at Compile time.

So go find a text message or other small string of typewriter text in your manuscript, and select it. Apply Preserve Formatting as described above, and the text gets highlighted in cyan or light blue.

Now, go Format menu→Formatting and:

  • for a new preset: New Preset from Selection
  • to redefine a preset: Redefine Preset from Selection→(preset name)

For a new preset, enter the name in the dialog box. In both cases, select Save Character Attributes in the dropdown to create a text (as opposed to a paragraph) preset. Now, any time you mark a selection of text as Typewriter (or TextMsg, or whatever you called it), you’ll see it highlighted and in your designated typewriter font.

Looks good, gets converted to backticks. What’s not to like?

To make a block quote, put a > at the beginning of each paragraph in the block, and on any blank lines in between. Add a blank line at the end of the blockquote so the next paragraph doesn’t get picked up as well. Scrivener assumes that preserved-format paragraphs are code blocks, to be displayed as-is, so you can’t use its Block Quote preset this way unless you turn off Preserve Formatting. Either way, you’ll have to add the > characters yourself.
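Here’s a sketch of a block-quoted letter, plus a quick way to eyeball the conversion if you already have the multimarkdown command installed (next week’s topic); the heredoc works in a Mac or Linux terminal:

cat > letter.md <<'EOF'
> Dear Joe,
>
> Do NOT take the shortcut. Love, Mom.
EOF
multimarkdown letter.md

MultiMarkdown wraps the whole thing in a single blockquote element, with each paragraph tagged inside it.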

Okay, ship it!

Not quite. There are still a few things you need to set up before you can get to the Efficiency Nirvana that Scrivener and MultiMarkdown offer.

To see where we need to go, let’s have a look at the output. In Scrivener, click Compile, then go to the Compile For: dropdown at the bottom of the compile window and select MultiMarkdown. You could also try MultiMarkdown→Web Page. Don’t forget to check which directory it’s going in, so you’ll be able to find it. Open it in a text editor (TextEdit, Notepad, whatever you like).

You should now see a few lines at the top with the story and author name, followed by the rest of the story. If you don’t use blank lines between paragraphs, your paragraphs run together in one big blob. There may not be any chapter titles, and likely no section breaks beyond blank lines. So let’s start fixing things. You’ll only have to do these once, or (at worst) once for each project.

Close the file, go back to Compile, and click Separators in the list. For Text Separator, click Custom and then enter the following:

<p class="sectionbrk">&bull; &bull; &bull;</p>

This tells Scrivener to put three bullets between each scene. (Anything Markdown or MultiMarkdown can’t do directly, you can do with HTML.) You’ll want to create or edit a CSS file to format the sectionbrk class the way you want (most people want it centered with a little space above and below). We’ll go over how to automatically link the CSS file to your HTML in a later post.

Set the other parts to Single Return. That’s all you have to do for Separators. In the other options:

  • Formatting: Check Title for Level 1 (and lower levels, if needed) folders.
  • Transformations: Check:
    • Straighten Quotes
    • Convert em-dashes
    • Convert ellipses
    • Convert multiple spaces
    • Convert to plain text: Paragraph Spacing
  • Replacements:
    • Replace (Option-Return twice); With (Option-Return)<p class="sectionbrk">&nbsp;</p>(Option-Return twice)

The Transformations section sounds a little scary, but MultiMarkdown re-converts those text entries to their nice typographical equivalents. I suggest you do it this way for more consistent results. The Replacements entry just inserts a blank section break that won’t get deleted during a conversion. You could just insert a non-breaking space, but (again) a later blog post will show you how you can use this to eliminate formatting issues.

Converting paragraph spacing to plain text replaces each paragraph break with two returns, inserting a blank line between paragraphs as Markdown expects. It works if your Body paragraph format puts space at the beginning or end of the paragraph. If you use indents instead, try “Paragraph Spacing and Indents” and hope the indents are deep enough for Scrivener to catch.

If that doesn’t work, add two more entries to Replacements:

  • Replace (Option-Return); With (Option-Return twice)
  • Replace (Option-Return four times); With (Option-Return twice)

The two replacements are needed because of a bug in Scrivener: it converts one return to four instead of two, and the second replacement knocks the four back down to two.

Now hit Compile, then open the generated file in a text editor. You should see a plain text file, with a blank line between each paragraph and Markdown syntax for various highlighting. You can go back into Scrivener and try MultiMarkdown→Web Page to see what that looks like, too.

Now What?

Now that you can export a clean MultiMarkdown file from Scrivener, you can work with it in any text editor. Sometimes, just looking at the same text in a different way is enough to get you moving on a WIP and get it done. If you have an iPad, you can still edit your Markdown-ified project using Scrivener on iOS, or you can use an iOS Markdown editor like Byword to edit your Markdown file (and import it back into your Scrivener project later).

But that’s only scratching the surface. Next week, we’ll start looking at ways to prep your MultiMarkdown file for beta or final publishing.

Comments? Questions? Floor’s open!

Tuesday, February 14, 2017 No comments

Tech Tuesday: Roll Your Own Writing System, Part 1: Markdown

I’ve said this before, but for people who want to make a living (or even beer money) writing fiction, the best writing advice out there is still Kristine Kathryn Rusch’s: treat it like a business. Simple enough, but the ramifications are as wide as the world of commerce.
  • Watch your expenses, but don’t hesitate to spend where it’s going to improve your product.
  • Plan time for editing and marketing as well as writing (I haven’t done too well in that regard over the last year).
  • Set a budget, and track your expenses (and income) so you know if you’re meeting it.
  • Analyze your processes, and look for better ways to do things.
For someone like me, the latter can be dangerous. It’s really easy to go down a rathole, constantly tinkering with stuff instead of actually accomplishing anything. That goes double when I often have to write one-handed, with a baby in my lap who is trying to grab the keyboard or anything else within reach.

Still, I think I might have stumbled across something.
  • What if there was a way to write stories, using any computer (or tablet), anywhere you are?
  • What if you could preview your writing using a web browser?
  • What if you could output your writing in squeaky-clean HTML for producing EPUBs?
  • What if you could easily copy your entire oeuvre to a USB drive for backup or to continue writing when you’re offline?
  • And finally, what if you could play with different versions of your story to figure out what works best?
In other words, I think I might have found something better than Scrivener. That’s saying a lot; I’ve been using Scrivener for about five years now, and it’s close to ideal for the way I work. The fun part is, it’s possible to keep using Scrivener as long as you want, until you’re ready to let go.

So what is this miracle? Read on…

About Markdown

Markdown was created by John Gruber to make it easier to write blog posts. It has been extended every which way to work with more technical documents, but the vanilla version is well-suited for writing fiction as well as blog posts.

If you’ve ever decorated a text-only email, you already know how to use most of Markdown. Here’s an example that easily covers 90% of what you do in fiction writing:

# The Swamp Road

Night was falling,
yet Joe doggedly marched up the swamp road.
Time was pressing,
after all.

In the flickering light of his torch,
Joe saw two signs:

**SHORTCUT LEFT**

**Do NOT take the shortcut!**

He pondered the advice for only a second.
*Bah*, he thought,
*I have to get home*.

Taking the left fork,
Joe soon found himself sinking in the bog.

Let’s see how this looks when formatted:

The Swamp Road

Night was falling, yet Joe doggedly marched up the swamp road. Time was pressing, after all.
In the flickering light of his torch, Joe saw two signs:
SHORTCUT LEFT
Do NOT take the shortcut!
He pondered the advice for only a second. Bah, he thought, I have to get home.
Taking the left fork, Joe soon found himself sinking in the bog.
It’s pretty easy to see how this translates: blank lines start a new paragraph. Asterisks highlight text, *for italic* and **for bold**. The number of pound (or hash) characters sets the heading level: # Heading 1, ## Heading 2, and so on.

In the example above, I broke lines inside each paragraph so each line is a phrase. That’s not necessary; you can go long and run your paragraphs together, like you would in Scrivener or a word processor. Either way, you’ll get properly-formatted paragraphs.
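If you’re curious, and you already have the MultiMarkdown command-line tool installed (more on that in a couple of weeks), you can watch the translation happen:

echo 'He pondered the advice for *only* a second.' | multimarkdown

That prints something like <p>He pondered the advice for <em>only</em> a second.</p>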

But I LIKE Scrivener!

Hey, no problem. Scrivener has built-in Markdown support, and can use it to produce cleaner output for publishing than its direct word processor or eBook output. We’ll have a look at how to set things up, and a couple of things to look out for, next week.


Thoughts? Questions? Floor’s open!

Thursday, February 09, 2017 No comments

Tech Tuesday (on Thursday): Tines 1.11

I didn't get this posted Tuesday. Oh well.

The two biggest components in this release are compatibility enhancements:

  • Tines now compiles and runs on Cygwin, something I've wanted for a while. That involved changing escdelay from a variable to a CLI command. The change is backward-compatible; there’s no difference in how it works in .tinesrc and scripts.
  • Tines now preserves and (where it makes sense) uses the metadata in the OPML <head>. You can access and change metadata through text variables, and Tines uses reasonable defaults when necessary.

With 1.11.0, Tines is essentially complete. The Creeping Feature Creature will get hungry some day, no doubt, but for now I’m going to focus on making packages available for people who don’t want to compile the app themselves, making the code more robust, and squashing any bugs that turn up. I might tinker with alternative RC files to focus Tines on special purpose uses as well.

Tuesday, August 30, 2016 No comments

Tech Tuesday: Distraction-Free for Free

I’m a very technical boy. So I decided to get as crude
as possible. These days, though, you have to be pretty
technical before you can even aspire to crudeness.
— William Gibson, Johnny Mnemonic

The writing advice people (and websites) are always harping about minimizing distractions. Shut off Twitter (or your social media of choice), close the browser, fill your screen with the editor, and just write (look, a squirrel!). You can even buy special editors that fill the screen automatically… well, of course you can. Seems like everybody and their dogs are trying to make money off writers these days, aspiring and otherwise.

The thing is, it’s really easy to set up a distraction-free writing environment using the tools and apps that come standard with your operating system—at least for MacOSX and Linux. It’s probably true for the Microsoft thing as well, but I’ll have to look at it a little closer. Both MacOSX and Linux evolved from Unix, an operating system that dates back to when computers were more expensive than displays—so you would have a bunch of people using one computer, typing commands and text into terminals. That was back when timesharing didn’t refer to a sketchy way to sell the same condo to 50 people.

The interesting part is, all the code needed to support that circa-1980 hardware is still part of modern operating systems, and we can use that code to create our distraction-free environment. So let’s get to a shell prompt, the way we all interacted with computers before 1984.

Down and Dirty

If I really wanted to get down and dirty, I’d get a USB-to-serial adapter and hook up that old VT220 terminal I still have lying around. But we’re focusing on stuff you already have on your computer.

Personally, I like to have some music playing while I’m writing at home—it masks TV noise, kid noise, dog noise (unless there’s a thunderstorm, then she’s moaning under my feet), and noise from outside. But you might have a stereo in your writing room, or you find the music distracting, and you don’t need anything but a screen to type into.

Keeping the Johnny Mnemonic quote above in mind, Linux is more technical than MacOSX, so it’s easier to get to the crudeness you want using Linux. Press Ctrl+Alt+F1 (Alt is what Macs call Opt), and you’ll be presented with a glorious console with a login prompt. Most versions of Linux have six of these consoles; press Ctrl+Alt+F7 to get back to the graphical interface. I have never dug into the reason why Linux typically has six text consoles… I’m sure there’s a reason. Anyway, enter your usual login name and password at the prompts.

If you’ve set up MacOSX to automatically log you in when you start up… don’t. For one thing, you’re inviting anyone who gets into your house to poke through your stuff. For another, you can’t get to the one console that Apple provides. To fix this, open System Preferences, select “Users and Groups,” then click “Login Options” at the bottom of your list of user names.


Once you’re there, make sure “Automatic login” is set to Off. Next up, set “Display login window as” to “Name and password.”

While you’re in this screen, make sure your regular user name is not an administrative account. Set up a separate admin account if you need to, and remember that admin password. These are things that make it harder for malware (or your teenage niece) to do things they shouldn’t be doing on your computer.

But I digress. Next time you log in, instead of typing your usual user name, type >console and press Return. This immediately drops you into a text console and presents you with a login and password prompt.

So… Linux or Mac, you have a text console until you press Ctrl+D at a shell prompt to exit. Skip down to “Now What?” to see what’s next.

Work Within the System

If you’re not quite ready to abandon all hope (er, the graphical interface) entirely, because you might need to jump onto the Web to goof off (er, research something important), you can still eliminate most distractions… although all those distractions are still easily available if you can’t resist. Perhaps it’s a small price to pay to have your music, right?

Most Linux systems make it really easy to get a terminal app on the screen, whether through shortcuts or the application menu (look in Accessories or Utilities). Macs aren’t much more difficult—press Cmd-space to pop up Spotlight; typing term should be enough for it to complete Terminal (it’s in /Applications/Utilities if you want to do it from the Finder). Press Return, and it should start. If you’re using the Microsoft thing, look for “Command Prompt” or “PowerShell” in your Start menu. One or the other should be in Accessories.

Now that you have a terminal window up, you need to maximize it to keep the distractions at bay (look, a squirrel!). On Macs, press Ctrl-Cmd-F to enter full-screen mode (press it again to exit). On Linux, your distribution determines the keystroke; Ubuntu uses F11. You can always click the “maximize” button to expand the window, although this leaves extraneous window elements visible. You can also maximize a command window in the Microsoft thing.

One of the advantages of a terminal app over a console: you can increase the text size, either by using the terminal app’s preferences or by using a keystroke (Cmd+ on Macs). 18 points should be sufficient on a laptop; you might want 24 points or even huger on a big desktop screen.

Okay, you’re ready to go…

Now What?

Okay, now you have a screen full of nothing but white text on a black background. There’s a prompt at the top, usually ending with a $ symbol.

The distraction-free writing paradigm basically turns your computer into an electronic version of a manual typewriter. No going back, no editing on the fly, just type your story and hope the result isn’t too incoherent to salvage (says the guy who likes to edit as he composes).

There are few lower-level ways to input text than using a line editor, and Unix derivatives (including both Linux and MacOSX) include ed.  Johnny Mnemonic, that technical boy, would have been proud of ed. It’s about as crude as it gets. So let’s get crude! Type ed and press Return.

Nothing happened. Or did it?

Ed (as we’ll refer to ed for a while) is a program of very few words, which is exactly what you want when you’re going for a minimalist writing environment. If you give Ed a command he doesn’t understand, or one that might destroy your work, he responds with ? (a minimalist understands that means “huh?” or “you don’t really want to do that, right?”). Ed’s commands all consist of a single character; in some cases, you might include a range of lines or some other info. But right now, there are only three commands you really care about.

Right now, you should see a blank line. Type i and press Return. This enters input mode, where everything you type is copied into Ed’s buffer. Ed will happily ingest everything you type, until you enter a line containing only a . character. That tells him to return to command mode. The following screen shot shows an example.
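If you’d rather see that as text, here’s a minimal sketch of an input session (the story lines are just sample text):

$ ed
i
Night was falling,
yet Joe doggedly marched up the swamp road.
.

Note that Ed prints nothing at all: no prompts, no acknowledgments. The lone . drops you back into command mode.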


Now for the second command: saving what you entered. Type w and the name you want to give the story. Make sure you’re not using that name already, or you’ll overwrite what’s there! I reserve a few file names like foo, junk, and tmp for situations like this, when I either don’t need to keep what’s in the file or plan to do something else with it right away (like copy it into Scrivener). Anyway, after you use the w command, Ed responds with the number of characters it wrote into the file. If you want a rough word count, divide by 5 (I wrote 1458 characters, a shade under 300 words, in the above example).

All done! Type q and press Return, and you’ll return to the shell prompt. If you want to keep writing instead, type $a and press Return. This command means “go to the last line” ($) “and append.” Again, Ed will take everything you type as input until you enter a line containing only a . character. This time, you can just type w and press Return, because Ed remembers the last file name you used. Just remember to use q when you’re done.
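Say you saved a couple of lines to junk earlier; a follow-up session to tack on another line looks something like this (the numbers are byte counts, and yours will differ):

$ ed junk
63
$a
Time was pressing, after all.
.
w
93
q

The 63 is Ed reporting how many characters he read from the file; the 93 after w is the new size.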

Bonus Info

Now you’re at the shell prompt, and you want to know exactly how many words you typed? Type wc and the file name, and press Return. The info you get looks like this (using the above example):

Kahuna:fiction larry$ wc tollen.md 
      31     269    1458 tollen.md
Kahuna:fiction larry$

You get three numbers: lines, words, and characters. So that number in the middle, 269, is the actual number of words I wrote.
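If the word count is all you care about, wc takes a -w flag (there are matching -l and -c flags for lines and characters):

wc -w tollen.md
     269 tollen.md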

If you’re not enamored with Ed’s ultra-minimalism, try entering nano or pico instead. Both of these are simple screen-oriented text editors that include a little help at the bottom of the screen (but will let you arrow back and noodle with the text).

Like with any writing tool, you’ll improve with practice. Don’t give up right away; try a different editor or even a different color scheme (most terminal apps let you choose colors). And don’t forget to copy your text into your normal writing tool!

Your Turn!

Have you ever tried a minimalist writing environment? How did it work? Get as detailed (or as minimalist) as you like in the comments.
