Fixing my blog (part 2) - Broken links
My first attempt to use the Accessibility Insights Action returned some actionable results, but it also crashed with an error. Not a great experience, but looking closer I realised it seemed to be checking an external link that was timing out.
Hmm. I've been blogging for a few years. I guess there's the chance that the odd link might be broken? Maybe I should do something about that first, and maybe I should automate it too.
A bit of searching turned up the Lychee Broken Link Checker GitHub Action. It turns out 'Lychee' is the name of the underlying link checker engine.
Let's create a GitHub workflow that uses this action and see what we get. I came up with something like this:
```yaml
name: Links

on:
  workflow_dispatch:

jobs:
  linkChecker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Link Checker
        id: lychee
        uses: lycheeverse/[email protected]
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: '--verbose ./_posts/**/*.md --exclude-mail --scheme http https'
          format: json
          output: ./lychee/links.json
          fail: false
```
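One thing to note: with only `workflow_dispatch` as a trigger, the workflow runs just when I start it manually. If I wanted the check to run regularly, I could also add a schedule trigger, something like this (the weekly cron expression is just an example):

```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * 0' # example: every Sunday at midnight UTC
```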
Results are reported to the workflow output, but also saved to a local file. The contents of this file look similar to this:
```json
{
  "total": 5262,
  "successful": 4063,
  "failures": 1037,
  "unknown": 5,
  "timeouts": 125,
  "redirects": 0,
  "excludes": 32,
  "errors": 0,
  "cached": 381,
  "fail_map": {
    "./_posts/2009/2009-09-13-tech-ed-2009-thursday.md": [
      {
        "url": "http://blog.spencen.com/",
        "status": "Timeout"
      },
      {
        "url": "http://notgartner.wordpress.com/",
        "status": "Cached: Error (cached)"
      },
      {
        "url": "http://adamcogan.spaces.live.com/",
        "status": "Failed: Network error"
      }
    ],
    "./_posts/2010/2010-01-22-tour-down-under-2010.md": [
      {
        "url": "http://www.cannondale.com/bikes/innovation/caad7/",
        "status": "Failed: Network error"
      },
      {
        "url": "http://lh3.ggpht.com/",
        "status": "Cached: Error (cached)"
      },
      {
        "url": "http://www.tourdownunder.com.au/race/stage-4",
        "status": "Failed: Network error"
      }
    ],
```
(Note that there's a known issue where the output JSON file isn't actually valid JSON. Hopefully that will be fixed soon.)
The `fail_map` contains a property for each file that has failing links, and for each of those an array of all the links that failed (along with the particular error observed). Just by looking at the links, I know that some of those websites don't even exist anymore, some might still have a website but the content has changed, and some could be transient errors. I had no idea I had so much link rot!
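To get a quick overview of the damage, the report can be summarised with a few lines of Python. A minimal sketch (not my actual tooling), using the field names from the sample output above and assuming the file has been fixed up to be valid JSON, given the known issue mentioned earlier:

```python
import json
from pathlib import Path

# Summarise Lychee's fail_map as "file -> broken links".
report = json.loads(Path("./lychee/links.json").read_text())
fail_map = report["fail_map"]

total = sum(len(links) for links in fail_map.values())
print(f"{len(fail_map)} files contain {total} broken links")

for post, links in fail_map.items():
    print(post)
    for link in links:
        print(f"  {link['url']} ({link['status']})")
```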
Good memories, Nigel, Mitch and Adam (the writers of those first three old sites)!
Ok, let's start fixing them.
But what do you replace each broken link with? Sure, some of the content might still exist at a slightly different URL, but for many the content has long gone. Except maybe it hasn't. I remembered the Internet Archive operates the Wayback Machine. So maybe I can take each broken URL, paste it into the Wayback Machine, and if there's a match, use the archive's URL instead.
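It turns out the Internet Archive exposes that lookup as a simple availability API, so checking a single URL doesn't even need the browser. A minimal sketch (the response shape follows the API's documented format; error handling omitted):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Ask the Wayback Machine availability API for the closest archived
# snapshot of a URL. Returns the snapshot's URL, or None if no match.
def closest_snapshot(url: str) -> str | None:
    query = urlencode({"url": url})
    with urlopen(f"https://archive.org/wayback/available?{query}") as response:
        data = json.load(response)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(closest_snapshot("http://blog.spencen.com/"))
```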
Except I had hundreds of broken links. Maybe I could automate this as well?
Find out in part 3...
Categories: Blogging, GitHub Actions