-
Fixing my blog (part 3) - Querying the Wayback Machine
Last time I'd discovered I had a huge link rot problem on my blog, but I didn't fancy copying and pasting all those broken URLs into the Internet Archive's Wayback Machine to search for the equivalent archive URL.
What I need is a way to query the Wayback Machine via some kind of API. Like maybe this one? The API is quite simple to use.
To look up the snapshot for Nigel's old blog site, I'd use the following request:
https://archive.org/wayback/available?url=http://blog.spencen.com/
That returns the following JSON result:

```json
{
  "url": "http://blog.spencen.com/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20201129195822/http://blog.spencen.com/",
      "timestamp": "20201129195822"
    }
  }
}
```
That's pretty good, but we can do better. An optional `timestamp` can be supplied in the query string (using the format `YYYYMMDDhhmmss`), so that instead of returning the most recent available capture, the snapshot closest to that timestamp is returned. This is useful for URLs where the content might have changed over time. I realised that my blog posts include the date in the filename, so if I parsed the filename I could get a timestamp with a resolution of a specific day - that would be good enough to make the snapshot URLs more accurate.

So given the page that linked to Nigel's blog was named `2009-09-13-tech-ed-2009-thursday.md`, the query would become:

https://archive.org/wayback/available?url=http://blog.spencen.com/&timestamp=20090913
And now we get this JSON result:

```json
{
  "url": "http://blog.spencen.com/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20100315160244/http://blog.spencen.com:80/",
      "timestamp": "20100315160244"
    }
  },
  "timestamp": "20090913"
}
```
Browsing to that new snapshot URL shows a representation of Nigel's blog as it would have been when I wrote that blog post back in 2009.
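To make that concrete, here's a hypothetical TypeScript sketch of the whole idea - parse the date out of the post's filename, then pass it along as the timestamp parameter. Again this assumes Node 18+, and it isn't the code the Action described below uses:

```typescript
// Hypothetical helper: derive a timestamp from a Jekyll-style post filename
// and ask for the snapshot closest to that date. Assumes Node 18+; this is
// not the code the GitHub Action uses.

function timestampFromFilename(filename: string): string | undefined {
  // e.g. "2009-09-13-tech-ed-2009-thursday.md" -> "20090913"
  const match = filename.match(/^(\d{4})-(\d{2})-(\d{2})-/);
  return match ? `${match[1]}${match[2]}${match[3]}` : undefined;
}

async function findClosestSnapshot(target: string, postFilename: string): Promise<string | undefined> {
  const timestamp = timestampFromFilename(postFilename);
  let api = `https://archive.org/wayback/available?url=${encodeURIComponent(target)}`;
  if (timestamp) {
    api += `&timestamp=${timestamp}`;
  }
  const response = await fetch(api);
  const data = await response.json();
  return data.archived_snapshots?.closest?.url;
}

findClosestSnapshot("http://blog.spencen.com/", "2009-09-13-tech-ed-2009-thursday.md")
  .then((url) => console.log(url ?? "No snapshot found"));
```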
So how can we automate this a bit more? By creating a new GitHub Action of course.
Enter the Wayback Machine Query GitHub Action!

If you look at the inputs for this action, it's no coincidence that the file it requires to provide the list of URLs to query just happens to be compatible with the one generated by the Lychee Broken Link Checker GitHub Action we used in the previous post.
My new action returns the results in two output properties:

- `missing` is an array of URLs that had no snapshots on the Wayback Machine. These URLs were never archived, so we'll have to deal with these differently.
- `replacements` is an array of objects with the original URL and the Wayback Machine's snapshot URL.
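To give a feel for how those outputs might be consumed later, here's a purely illustrative sketch. The environment variable names and the property names on each replacement object are my own assumptions for this example - the action's README defines the real shape:

```typescript
// Purely illustrative: a later script reading the two outputs after a
// workflow step has exposed them as environment variables. The variable
// names and the "url"/"waybackUrl" property names are assumptions for this
// example, not the action's documented contract.

interface Replacement {
  url: string;        // original (broken) link - assumed property name
  waybackUrl: string; // Wayback Machine snapshot - assumed property name
}

const missing: string[] = JSON.parse(process.env.WAYBACK_MISSING ?? "[]");
const replacements: Replacement[] = JSON.parse(process.env.WAYBACK_REPLACEMENTS ?? "[]");

for (const url of missing) {
  console.log(`No snapshot for ${url} - this one needs manual attention`);
}

for (const { url, waybackUrl } of replacements) {
  console.log(`Replace ${url} with ${waybackUrl}`);
}
```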
The action also has an input property that allows us to provide a regular expression that can be used to parse the filename to obtain the timestamp.
Here's how I'm using it:
```yaml
- name: Wayback Machine Query
  uses: flcdrg/wayback-machine-query-action@v2
  id: wayback
  with:
    source-path: ./lychee/links.json
    timestamp-regex: '_posts\/(\d+)\/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)-'
```
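As a quick sanity check, this is roughly what that regular expression pulls out of one of my post paths (exactly how the action combines the named groups into a timestamp is my assumption here):

```typescript
// Sketch only: check what the timestamp-regex captures from one of my post
// paths. How the action itself combines the named groups is my assumption.

const timestampRegex = /_posts\/(\d+)\/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)-/;

const path = "./_posts/2009/2009-09-13-tech-ed-2009-thursday.md";
const groups = path.match(timestampRegex)?.groups;

if (groups) {
  const { year, month, day } = groups;
  console.log(`${year}${month}${day}`); // 20090913
}
```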
Obviously, if your filenames or paths contain the date in a different format you'd need to adjust the `timestamp-regex` value (or if they don't contain the date at all, just don't set that property).

We've now got a list of old and new URLs. In the next part we'll update the files.
-
Fixing my blog (part 2) - Broken links
My first attempt to use the Accessibility Insights Action returned some actionable results, but it also crashed with an error. Not a great experience, but looking closer I realised it seemed to be checking an external link that was timing out.
Hmm. I've been blogging for a few years. I guess there's the chance that the odd link might be broken? Maybe I should do something about that first, and maybe I should automate it too.
A bit of searching turned up the Lychee Broken Link Checker GitHub Action. It turns out 'Lychee' is the name of the underlying link checker engine.
Let's create a GitHub workflow that uses this action and see what we get. I came up with something like this:
```yaml
name: Links

on:
  workflow_dispatch:

jobs:
  linkChecker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Link Checker
        id: lychee
        uses: lycheeverse/lychee-action@v1
        env:
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
        with:
          args: '--verbose ./_posts/**/*.md --exclude-mail --scheme http https'
          format: json
          output: ./lychee/links.json
          fail: false
```
Results are reported to the workflow output, but also saved to a local file. The contents of this file look similar to this:
{ "total": 5262, "successful": 4063, "failures": 1037, "unknown": 5, "timeouts": 125, "redirects": 0, "excludes": 32, "errors": 0, "cached": 381, "fail_map": { "./_posts/2009/2009-09-13-tech-ed-2009-thursday.md": [ { "url": "http://blog.spencen.com/", "status": "Timeout" }, { "url": "http://notgartner.wordpress.com/", "status": "Cached: Error (cached)" }, { "url": "http://adamcogan.spaces.live.com/", "status": "Failed: Network error" } ], "./_posts/2010/2010-01-22-tour-down-under-2010.md": [ { "url": "http://www.cannondale.com/bikes/innovation/caad7/", "status": "Failed: Network error" }, { "url": "http://lh3.ggpht.com/", "status": "Cached: Error (cached)" }, { "url": "http://www.tourdownunder.com.au/race/stage-4", "status": "Failed: Network error" } ],
(Note that there's a known issue where the output JSON file isn't actually valid JSON. Hopefully that will be fixed soon.)
The `fail_map` contains a property for each file that has failing links, and for each of those an array of all the links that failed (and the particular error observed).

Just by looking at the links, I know that some of those websites don't even exist anymore, some might still be up but with different content, and some could be transient errors. I had no idea I had so much link rot!

Good memories Nigel, Mitch and Adam (the writers of those first three old sites)!
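If you want to slice and dice that report yourself, the `fail_map` is easy enough to walk. Here's a minimal TypeScript sketch, assuming the output file has been made into valid JSON given the issue noted above:

```typescript
// Sketch: walk lychee's fail_map and list the broken links per post.
// Assumes the output file has been made valid JSON (see the note above).
import { readFileSync } from "node:fs";

interface FailEntry {
  url: string;
  status: string;
}

const report = JSON.parse(readFileSync("./lychee/links.json", "utf8"));
const failMap: Record<string, FailEntry[]> = report.fail_map;

for (const [file, failures] of Object.entries(failMap)) {
  console.log(file);
  for (const { url, status } of failures) {
    console.log(`  ${url} (${status})`);
  }
}
```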
Ok, let's start fixing them.
But what do you replace each broken link with? Sure, some of the content might still exist at a slightly different URL, but for many the content has long gone. Except maybe it hasn't. I remembered the Internet Archive operates the Wayback Machine. So maybe I can take each broken URL, paste it into the Wayback Machine, and if there's a match, use the archive's URL instead.
Except I had hundreds of broken links. Maybe I could automate this as well?
Find out in part 3...
-
Fixing my blog (part 1) - Introduction
I've been revisiting web accessibility. I remember first learning about accessibility many years ago at a training workshop run by Vision Australia, back when I worked at the University of South Australia. The web has progressed a little bit in the last 15-odd years, but the challenge of accessibility remains. More recently I had the opportunity to update my accessibility knowledge by attending a couple of presentations given by Larene Le Gassick (who also happens to be a fellow Microsoft MVP).
I wondered how accessible my blog was. Theoretically it should be pretty good, considering it is largely text with just a few images. There shouldn't be any complicated navigation system or confusing layout. Using tools to check accessibility, and in particular compliance with a given level of the Web Content Accessibility Guidelines (WCAG) standard, will not give you the complete picture. But it can identify some deficiencies and give you confidence that particular problems have been eliminated.
Ross Mullen wrote a great article showing how to use the `pa11y` GitHub Action as part of your continuous integration workflow to automatically scan files at build time. Pa11y is built on the axe-core library.

Further research brought me to Accessibility Insights - Android, browser and Windows desktop accessibility tools produced by Microsoft. From there I found that Microsoft had also made a GitHub Action (currently in development), the Accessibility Insights Action, which as I understand it also leverages axe-core.
The next few blog posts will cover my adventures working towards being able to run that action against my blog. I thought it would be simple, but it turns out I had some other issues with my blog that needed to be addressed along the way. Stay tuned!