Fixing my blog (part 3) - Querying the Wayback Machine

21st April 2022

Last time I discovered I had a huge link rot problem on my blog, but I didn’t fancy copying and pasting all those broken URLs into the Internet Archive’s Wayback Machine to find the equivalent archive URL for each one.

What I need is a way to query the Wayback Machine via some kind of API. Like maybe this one? The API is quite simple to use.

To look up the snapshot for Nigel’s old blog site, I’d use the following request:

https://archive.org/wayback/available?url=http://blog.spencen.com/

That returns the following JSON result:

{
  "url": "http://blog.spencen.com/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "https://web.archive.org/web/20201129195822/http://blog.spencen.com/",
      "timestamp": "20201129195822"
    }
  }
}
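
As a rough sketch of how that result could be consumed from a script (this is not the code from any existing tool, and the function names `closest_snapshot` and `query_wayback` are made up for illustration), the snapshot URL can be pulled out of the `archived_snapshots.closest` object. The sketch assumes a URL with no capture comes back without a `closest` entry.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def closest_snapshot(result: dict) -> Optional[str]:
    """Return the snapshot URL from an availability result, or None."""
    # Assumption: when nothing was archived there is no "closest" entry.
    closest = result.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def query_wayback(url: str) -> dict:
    """Call the availability API for a single URL (needs network access)."""
    with urlopen(f"{API}?{urlencode({'url': url})}", timeout=30) as response:
        return json.load(response)


# The JSON result shown above, used here so the example runs offline.
sample = json.loads("""
{
  "url": "http://blog.spencen.com/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "https://web.archive.org/web/20201129195822/http://blog.spencen.com/",
      "timestamp": "20201129195822"
    }
  }
}
""")

print(closest_snapshot(sample))
```

Calling `closest_snapshot(query_wayback("http://blog.spencen.com/"))` would do the same thing against the live API.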

That’s pretty good, but we can do better. An optional timestamp can be supplied in the query string (using the format YYYYMMDDhhmmss), so that instead of the most recent available capture, the snapshot closest to that timestamp is returned. This is useful for URLs where the content might have changed over time. I realised that my blog posts include the date in the filename, so if I parsed the filename I could get a timestamp with a resolution of a specific day. That would be good enough to make the snapshot URLs more accurate.

So, given that the page that linked to Nigel’s blog was named 2009-09-13-tech-ed-2009-thursday.md, the query would become:

https://archive.org/wayback/available?url=http://blog.spencen.com/&timestamp=20090913

And now we get this JSON result:

{
  "url": "http://blog.spencen.com/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "https://web.archive.org/web/20100315160244/http://blog.spencen.com:80/",
      "timestamp": "20100315160244"
    }
  },
  "timestamp": "20090913"
}

The closest capture is from March 2010, so browsing to that new snapshot URL shows Nigel’s blog much as it would have been when I wrote that blog post back in 2009.
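
Building the timestamped query by hand gets tedious, so here is a minimal sketch of a helper that does it. The name `availability_url` is hypothetical, and it only uses day resolution (YYYYMMDD), matching the example query above.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode


def availability_url(url: str, when: Optional[date] = None) -> str:
    """Build an availability query, optionally targeting a specific day."""
    params = {"url": url}
    if when is not None:
        # Day resolution only, e.g. 2009-09-13 becomes "20090913".
        params["timestamp"] = when.strftime("%Y%m%d")
    # Keep ":" and "/" readable; other reserved characters are still escaped.
    return "https://archive.org/wayback/available?" + urlencode(params, safe=":/")


print(availability_url("http://blog.spencen.com/", date(2009, 9, 13)))
```

Leaving out the date gives the plain query from earlier, which returns the most recent capture.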

So how can we automate this a bit more? By creating a new GitHub Action of course.

Enter the Wayback Machine Query GitHub Action!

If you look at the inputs for this action, it’s no coincidence that the file it requires (the list of URLs to query the Wayback Machine with) is compatible with the one generated by the Lychee Broken Link Checker GitHub Action we used in the previous post.

My new action returns the results in two output properties:

missing is an array of URLs that had no snapshots on the Wayback Machine. These URLs were never archived, so we’ll have to deal with these differently.
replacements is an array of objects with the original URL and the Wayback Machine’s snapshot URL.
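
To make the shape of those two outputs a bit more concrete, here is an illustrative sketch that splits a set of lookups into the two lists. It is not the action's implementation, and the `original` and `snapshot` key names are assumptions made for this example; check the action's README for the real property names.

```python
from typing import Dict, List, Optional, Tuple


def partition_results(
    lookups: Dict[str, Optional[str]]
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split {original URL: snapshot URL or None} into missing and replacements."""
    missing: List[str] = []
    replacements: List[Dict[str, str]] = []
    for original, snapshot in lookups.items():
        if snapshot is None:
            # Never archived, so there is nothing to swap in.
            missing.append(original)
        else:
            # Hypothetical key names, chosen for illustration only.
            replacements.append({"original": original, "snapshot": snapshot})
    return missing, replacements


lookups = {
    "http://blog.spencen.com/": "https://web.archive.org/web/20100315160244/http://blog.spencen.com:80/",
    "http://example.com/never-archived": None,  # made-up URL for the example
}
missing, replacements = partition_results(lookups)
print(missing)
print(replacements)
```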

The action also has an input property that allows us to provide a regular expression that can be used to parse the filename to obtain the timestamp.

Here’s how I’m using it:

      - name: Wayback Machine Query
        uses: flcdrg/wayback-machine-query-action@v2
        id: wayback
        with:
          source-path: ./lychee/links.json
          timestamp-regex: '_posts\/(\d+)\/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)-'

Obviously, if your filenames or paths contain the date in a different format, you’d need to adjust the timestamp-regex value (or, if they don’t contain the date at all, leave that property unset).
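
If you want to try a regex against your own paths before putting it in a workflow, something like the following sketch can help. Two things here are assumptions: the example path (only the filename comes from earlier in this post), and the idea that the year, month and day groups are simply joined into a YYYYMMDD timestamp. Note that Python spells named groups `(?P<name>...)`, whereas the workflow value above uses the `(?<name>...)` form.

```python
import re
from typing import Optional

# Same pattern as the workflow input, rewritten with Python's named-group syntax.
TIMESTAMP_REGEX = re.compile(
    r"_posts/(\d+)/(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)-"
)


def timestamp_from_path(path: str) -> Optional[str]:
    """Return a YYYYMMDD timestamp parsed from a post path, or None."""
    match = TIMESTAMP_REGEX.search(path)
    if match is None:
        return None
    # Zero-pad each part so single-digit months and days still line up.
    return "{:04d}{:02d}{:02d}".format(
        int(match["year"]), int(match["month"]), int(match["day"])
    )


# Assumed directory layout, for illustration only.
print(timestamp_from_path("_posts/2009/2009-09-13-tech-ed-2009-thursday.md"))
print(timestamp_from_path("about.md"))
```

A path that doesn't match returns `None`, which corresponds to querying without a timestamp.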

We’ve now got a list of old and new URLs. In the next part we’ll update the files.