Image source: openclipart.org
Last time I had to deal with an API, it was pulling a vendor’s documentation into our own system. Now, I have to document our own APIs.
OpenAPI, formerly known as Swagger, is quite popular in the REST API universe these days. And why not? One source builds a website, the hooks, documentation, and everything. At least online. If you want to provide a PDF document describing the API, though, there’s a little more to it.
- First, all those definition and summary strings need some attention. The point at which developers bring the technical writers into the process makes a huge difference in effort (at least on the writer side).
- Second, there’s more to documenting an API call than the definition and summary strings. There are path variables, query variables, examples, and the list goes on.
Fortunately, there are several utilities that extract documentation from an OpenAPI file. For my purposes, Widdershins works best—it produces a complete Markdown file—although it’s nowhere near ideal.
- One issue was definitely not the fault of the tool. The developers told me of a dozen categories (or tags in OpenAPI parlance) that didn’t need to be documented for customers. Widdershins groups all API calls with the same tag under the same section, and that helps a lot.
- The second issue could be either Widdershins or my personal preference. I didn’t like the order in which Widdershins presented data for each method. There were some other minor issues as well.
I had a big wad of text that was my nail, and awk once again is my hammer. I started pounding. I did consider using a YAML parser for a brief time, but realized Widdershins did a lot of busy work for me. It actually does a pretty good job of building a Markdown document, describing all the method calls and schemas. If only there were a way to fix the presentation order, it would be perfect.
My first goal was to reshuffle the internal sections of each method to get them in the order I wanted. Deleting the unneeded groups, I reasoned, was a one-time thing that I could deal with myself.
My script worked the first time, but scrambled a bunch of things on the second attempt. Worse, doing the search-and-delete on those unneeded sections took more time and care than I’d anticipated. I needed a re-think.
Fortunately, a Computerphile interview with Brian Kernighan (the “K” in awk) came around, right when I needed it. It gave me… if not the key to my problem, a map to the key. In a nutshell, Dr. Kernighan advocates against large, monolithic awk scripts. His 1986 paper Tools for Printing Indexes describes his approach as:
…a long pipeline of short awk scripts. This structure makes the programs easy to adapt or augment to meet the special requirements that arise in many indexes.
This approach can also be easier to debug, as you can replace the pipeline with temporary files and verify that the output of one stage is correct before feeding it to the next stage. Each stage refines the input further.
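Here is a minimal sketch of that debugging trick. The two stages are stand-ins invented for illustration (they are not the scripts described in this post), and the `stage1.md` and `stage2.md` file names are likewise assumptions. The point is only that the piped form and the temp-file form produce the same result, and the second one leaves every intermediate file behind for inspection.

```shell
# Stand-in stages (hypothetical): each is a small awk filter, stdin to stdout.
trim_trailing() { awk '{ sub(/[ \t]+$/, ""); print }'; }
squeeze_blank() { awk 'NF { blank = 0; print; next } !blank++ { print }'; }

# Normal form: one pipeline, nothing kept on disk.
run_piped() { trim_trailing | squeeze_blank; }

# Debug form: the same stages, but each output lands in a file
# that can be checked before the next stage reads it.
run_staged() {
    workdir=$1
    trim_trailing > "$workdir/stage1.md"
    squeeze_blank < "$workdir/stage1.md" > "$workdir/stage2.md"
    cat "$workdir/stage2.md"
}
```

Something like `run_staged /tmp/debug < raw.md` would then let you open `stage1.md` and confirm the first stage did its job before blaming the second.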
So I split the monolithic script into two medium-size scripts:
- Stage 1a (weed) fixes headings, weeds out unneeded HTML markup (mostly <a name="x"/> tags), and gets rid of those unneeded sections. Having less cleanup already makes this approach worth the effort.
- Stage 1b (shuffle) re-orders the remaining method descriptions. I learned that the input order is important for making this work; so if future versions of Widdershins move things around, it could break the script and I would need to fix it again.
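The two stages might look roughly like the sketch below. It is not the actual scripts, and it assumes a simplified Markdown layout that real Widdershins output may not match: tag sections start with `# Title`, methods with `## name`, and the parts of a method with `### Title`. The function names, the comma-separated arguments, and the section titles in the usage line are all hypothetical.

```shell
# Stage 1a sketch: drop whole tag sections and strip <a name="x"/> tags.
# $1 is a comma-separated list of section titles to remove.
weed() {
    awk -v skip="$1" '
    BEGIN { n = split(skip, names, ","); for (i = 1; i <= n; i++) drop[names[i]] = 1 }
    /^# / { title = substr($0, 3); dropping = (title in drop) }
    dropping { next }
    { gsub(/<a name="[^"]*"\/>/, ""); print }
    '
}

# Stage 1b sketch: reorder the ### subsections inside each ## method.
# $1 is a comma-separated list of subsection titles in the wanted order;
# subsections not named there follow in their original order.
shuffle() {
    awk -v order="$1" '
    function emit_method(   i) {
        if (!inmethod) return
        printf "%s", head
        for (i = 1; i <= n; i++) {
            printf "%s", body[want[i]]
            delete body[want[i]]
        }
        for (i = 1; i <= seen; i++) {
            if (titles[i] in body) {
                printf "%s", body[titles[i]]
                delete body[titles[i]]
            }
        }
        head = ""; cur = ""; seen = 0; inmethod = 0
    }
    BEGIN { n = split(order, want, ",") }
    /^## / { emit_method(); inmethod = 1; head = $0 "\n"; next }
    /^# /  { emit_method(); print; next }
    inmethod && /^### / {
        cur = substr($0, 5)
        titles[++seen] = cur
        body[cur] = body[cur] $0 "\n"
        next
    }
    !inmethod { print; next }
    cur == "" { head = head $0 "\n"; next }
    { body[cur] = body[cur] $0 "\n" }
    END { emit_method() }
    '
}
```

A run would then be a short pipeline such as `weed "Internal,Admin" < raw.md | shuffle "Parameters,Responses"`. The shuffle stage has to buffer a whole method before printing it, which is why a change in the input layout can break it.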
It takes maybe a second to process the raw Markdown through both stages under Cygwin, which is noticeably slower than a shell under a native POSIX system. I expect my 8-year-old iMac would be nearly instantaneous.
Now that I’ve cracked the code, so to speak, more stages are coming.
- Stage 2 throws out the schema definitions that none of the remaining methods refer to. A pair of scripts identifies which schemas need to be kept, then weeds out the others.
- Stage 3 fixes cross-references (mostly to schemas). The monolithic Markdown file uses a URL of the form #schemaxyz. Since the ultimate goal is to split the single Markdown file into topics, those URLs need to point to the eventual file name instead. A trio of scripts create file names that correspond to the URLs, replace the original #xyz name with the file name, then shuffle the schema’s description to the top of the topic.
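As a rough sketch of Stages 2 and 3, under assumptions that are mine rather than Widdershins facts: schemas live under a `# Schemas` heading, each one starts with `## Name`, and links to them look like `(#schemaname)` with the name in lowercase. The function names and the `schemaname.md` file naming are hypothetical too. This version only follows references from methods, so a schema used solely by another schema would be dropped.

```shell
# Stage 2 sketch: keep only schemas that some method links to.
# $1 is the path to the Markdown file; it is read twice, so it cannot be a pipe.
prune_schemas() {
    awk '
    FNR == 1 { in_schemas = 0; keep = 1 }
    /^# / { in_schemas = ($0 == "# Schemas") }
    NR == FNR {
        # First pass: collect every #schema... link outside the schema section.
        if (!in_schemas) {
            line = $0
            while (match(line, /#schema[a-z0-9_]+/)) {
                used[substr(line, RSTART + 1, RLENGTH - 1)] = 1
                line = substr(line, RSTART + RLENGTH)
            }
        }
        next
    }
    # Second pass: decide per schema heading whether to keep the section.
    in_schemas && /^## / {
        key = "schema" tolower(substr($0, 4))
        keep = (key in used)
    }
    !in_schemas || keep { print }
    ' "$1" "$1"
}

# Stage 3 sketch: point (#schemaname) links at a per-topic file instead.
relink_schemas() {
    awk '
    {
        out = ""; rest = $0
        while (match(rest, /\(#schema[a-z0-9_]+\)/)) {
            name = substr(rest, RSTART + 2, RLENGTH - 3)
            out = out substr(rest, 1, RSTART - 1) "(" name ".md)"
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

Chained together, that would be something like `prune_schemas weeded.md | relink_schemas > linked.md`.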
These stages take another second to process, so going from 13,000 lines of YAML to a monolithic Markdown file takes about two seconds. The mdsplit script, which splits the methods and schemas into topics and builds a Lightweight DITA (LwDITA) bookmap, takes less than ten seconds to complete. So I’m now at the point where it’s easier to regenerate the entire document if I run into a scripting issue, instead of pushing through the problem. Uplifting the LwDITA to full DITA takes maybe a minute. After the uplift, another script fixes the extensions, changing .md to .dita, and fixes the cross-references.
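The extension fix could be sketched along these lines. This is a guess at the shape of the job, not the real script: it assumes all uplifted topics sit in one directory, and that cross-references show up as attribute values ending in `.md"` or containing `.md#`. The function name `md_to_dita` is made up.

```shell
# Sketch: rename each topic from .md to .dita and update references to match.
# $1 is the directory holding the uplifted topics (hypothetical layout).
md_to_dita() {
    dir=$1
    for f in "$dir"/*.md; do
        [ -e "$f" ] || continue          # nothing matched the glob
        new="${f%.md}.dita"
        # Rewrite references while copying, then remove the old file.
        awk '{ gsub(/\.md"/, ".dita\""); gsub(/\.md#/, ".dita#"); print }' "$f" > "$new" &&
            rm "$f"
    done
}
```

Running `md_to_dita topics` would leave only `.dita` files behind. A bookmap kept under a different extension would need the same reference rewrite applied separately.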
At this point, I can focus on adding value: adding metadata, grouping related schema definitions, and the like. If I need to regenerate this thing again, I need only run the shell scripts that conduct the Geek Chorus.
Going forward, I’ll need to be able to compare versions, so I can replace topics with actual content changes, or add new topics. At that point, I could hand things off and intervene only when the input changes enough to make a difference. Or, we might decide to ditch the PDF entirely, and that would make things far easier on everyone.
Techcomm geeks never worry about automating themselves out of a job, by the way. There’s always a new presentation format, or new source document formats, or many new ways to streamline workflows. Handing off a system is a triumph; it means we have more time to focus on the next thing.
Edited 5 Sep: I didn’t realize I’d pasted this out of Logseq before going through it and fixing some things. Now we’re up to date.
Edited 7 Sep: All the tweaks have been made, and I now have a turnkey system.