
Tuesday, December 07, 2021

Computer-Aided Weeding

A couple weeks ago, I finally decided to start pulling all the notes I’d saved up in Evernote and Google Keep into Logseq. I started with Evernote, just because.

First, I had to update the Evernote app on my iMac, so I could actually access my stuff. That should tell you how long it’s been since I actively used it.

After exporting, I used a utility called Yarle to convert the notes in each notebook to Markdown.

Now the hard part: deciding what I wanted to keep, and what to toss. The even harder part: cleaning up the sloppy mess that most of those individual pages were. There were over 400. Cleaning them up in Logseq was doable, but slow. Lots of repeated stuff. This wasn’t a job for an outliner; it was a job for a high-powered text editor like Vim or Atom.

I went with Atom: unlike stock Vim, it sports a sidebar that displays all the files in the directory, and its regular expression search can match newlines. So I could find empty bullets using the expression ^- *\n (which means, “look for a line starting with a dash, followed by zero or more spaces, then a new line”) and get rid of them.
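The same clean-up could also be done from the shell. Here is a minimal sketch, assuming the pages are .md files sitting in one directory. The name strip_empty_bullets is made up for illustration, and because grep works one line at a time, the \n from the Atom expression turns into a $ (end of line) anchor.

```shell
# Sketch: delete empty bullets ("-" followed only by spaces) from every
# Markdown file in a directory. strip_empty_bullets is a hypothetical helper.
strip_empty_bullets() {
  dir=$1
  for f in "$dir"/*.md; do
    [ -f "$f" ] || continue          # skip if the glob matched nothing
    # grep -v keeps the lines that do NOT match; it exits non-zero when
    # nothing is left to keep, which is not an error here.
    grep -v -E '^- *$' "$f" > "$f.tmp" || true
    mv "$f.tmp" "$f"
  done
}
```

Calling it as strip_empty_bullets ../pages would rewrite each page in place, so try it on a copy first.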

But the even bigger time-saver: realizing a lot of those entries were long outdated (some dated back to 2013) and deleting them. By the time I was done with that pass, I had 109 “keepers” left. From there, it was a matter of applying search and replace to fix common issues.

With 3/4 of the pages deleted, and much of the boilerplate stuff stripped from the remaining pages as well (I just need the content, the source, and some info about the author), my assets folder still had 4852 items in it, and most of them were no longer being linked to.

Now… am I going to make 4852 passes through my pages, by hand, to see if a pic can be deleted?

The shell (aka Terminal) is my machine gun for blasting a job like this.

# assume we're in the assets directory
mkdir -p ../assets_removed
for i in *; do
  grep -q "$i" ../pages/* || mv "$i" ../assets_removed
done

Let’s pick this apart, for those who need it.

The first line is just a comment. An important one, all the same. You need to be in your Logseq database’s assets directory for this to work correctly. BAD THINGS will happen otherwise! One of the nice things about using macOS: if I eff something up, I can pull it out of the Time Machine backup and try again.
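If a comment feels like thin protection, a small guard can do the checking. This is only a sketch: in_assets_dir is a made-up name, and it assumes the usual Logseq layout where the assets and pages directories sit side by side.

```shell
# Sketch: succeed only when the current directory is named "assets" and has
# a "pages" directory next to it. in_assets_dir is a hypothetical name.
in_assets_dir() {
  [ "$(basename "$PWD")" = "assets" ] && [ -d ../pages ]
}

# Possible use at the top of a script:
#   in_assets_dir || { echo "not in the assets directory" >&2; exit 1; }
```

It only checks directory names, so it won't catch every mistake, but it stops the most likely one: running the loop from the wrong place.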

Next, we make a directory called assets_removed at the same level as the assets directory. Just in case we make a mistake, you know. The -p option is there to make the script shrug and move on if the directory already exists (if we’ve been here before, for example).

The third and fifth lines begin and end a loop, going through each of those >4800 graphic files.

Inside the loop, we search for the file name in the pages. The -q option is exactly what you want for a script like this; it keeps grep quiet, so all we get back is its exit status: success if grep finds the string and failure otherwise. The || (two vertical bars) means “execute the next part if the first part fails” (in this case, fails to find the file name)… and the next part moves the unused file to the assets_removed directory.
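For the cautious, the same grep-and-|| pattern works as a dry run that only lists the candidates instead of moving them. This is a sketch, not the script from this post: list_unused_assets is a made-up name, the pages location defaults to ../pages, and it adds grep's -F option so the file name is matched as plain text rather than as a regular expression.

```shell
# Sketch: list assets that no page mentions, without moving anything.
# list_unused_assets is a hypothetical name; run it from the assets directory.
list_unused_assets() {
  pages_dir=${1:-../pages}
  for i in *; do
    [ -f "$i" ] || continue          # ignore subdirectories
    # -q: no output, just the exit status; -F: treat the name as plain text.
    grep -qF -- "$i" "$pages_dir"/* || printf '%s\n' "$i"
  done
}
```

Piping the result through wc -l gives a count to sanity-check before anything actually gets moved.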

And I ended up with 255 files (out of nearly 5000) that were actually being used. The other ones are out of the way, and can be safely deleted once I verify that none of them are needed.

[UPDATE: After stepping through the pages again, I found 18 “false negatives” that had to be dragged back into the assets folder. That’s why you move them out of the way, instead of just nuking them.]
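A second pass over the removed files can help hunt for stragglers like these. The sketch below rests on a guess that this post does not confirm: that some pages refer to a file with its spaces written as %20, which a plain search for the file name would miss. The name find_false_negatives and both arguments are made up for illustration.

```shell
# Sketch: list files in the removed directory whose names appear in the pages
# either literally or with spaces written as %20 (an assumed cause, not a
# confirmed one). find_false_negatives is a hypothetical name.
find_false_negatives() {
  removed_dir=$1
  pages_dir=$2
  for path in "$removed_dir"/*; do
    [ -f "$path" ] || continue
    name=$(basename "$path")
    encoded=$(printf '%s' "$name" | sed 's/ /%20/g')
    if grep -qF -- "$name" "$pages_dir"/* ||
       grep -qF -- "$encoded" "$pages_dir"/*; then
      printf '%s\n' "$name"
    fi
  done
}
```

Run from the assets directory, find_false_negatives ../assets_removed ../pages prints the names worth dragging back. Anything it does not list still deserves a look by eye before deletion.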

It took about a minute to grind through the assets directory, and a couple of minutes to set up the script, but that beats the heck out of hours (or days) doing it by hand! I’m fond of saying, I’m lazy enough to get the computer to do my work for me. It doesn’t always pay off this big, but it does pay off.

Off to get the Google Keep notes…
