• Fixing my blog (part 5) - Putting it all together with a PR

    In part 4 of this series of posts we used the Replace multiple strings in files GitHub Action to update files with the replacement values. The files are modified on disk, but what to do with the changes? I could just automatically commit the changes back to version control, but I prefer to take a more cautious approach and give myself a chance to review the changes to confirm if they look reasonable. Creating a pull request is a great way of doing that.

    An action I’ve used successfully before to do this is Peter Evans’ Create Pull Request GitHub Action. Using it is very easy:

    
    - name: Create Pull Request
      uses: peter-evans/[email protected]
      with:
        token: ${{ secrets.PAT_REPO_FULL }}
    
    

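    I pass in a personal access token rather than relying on the default GITHUB_TOKEN, because pull requests created with GITHUB_TOKEN don’t trigger other workflows. With the personal access token, any workflows that should run on creation of a pull request will be executed.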

    And with that, I have a full workflow for checking for broken links and repairing them. The only thing to watch out for is the issue I mentioned previously with the invalid JSON produced by the Lychee action. As a temporary measure, I inlined that action in my repo and applied a local code fix. Hopefully, once the Lychee action itself is updated, I can go back to using their implementation.

    
    name: Links
    
    on:
      workflow_dispatch:
        
    jobs:
      linkChecker:
        runs-on: ubuntu-latest
    
        steps:
          - uses: actions/checkout@v3
          - name: Link Checker
            id: lychee
            uses: ./.github/actions/lychee-action #lycheeverse/[email protected]
            env:
              GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
            with:
              args: '--verbose ./_posts/**/*.md --exclude-mail --scheme http https'
              format: json
              output: ./lychee/links.json
              fail: false
    
          - uses: actions/[email protected]
            with:
              path: ./lychee/links.json
    
          - name: Wayback Machine Query
            uses: flcdrg/wayback-machine-query-action@v2
            id: wayback
            with:
              source-path: ./lychee/links.json
              timestamp-regex: '_posts\/(\d+)\/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)-'
    
          - uses: actions/[email protected]
            with:
              path: ./wayback/replacements.json
              
          - name: Replacements
            uses: flcdrg/replace-multiple-action@v1
            with:
              find: ${{ steps.wayback.outputs.replacements }}
              prefix: '(^|\\s+|\\()'
              suffix: '($|\\s+|\\))'
              
          - name: Create Pull Request
            # if: ${{ github.ref == 'refs/heads/main' }}
            uses: peter-evans/[email protected]
            with:
              token: ${{ secrets.PAT_REPO_FULL }}
    
    

    Now that we’ve addressed the broken links, we’re in a better state to revisit the Accessibility Insights Action that started this whole adventure!

  • Fixing my blog (part 4) - Updating the files

    Last time we looked at using the Wayback Machine Query GitHub Action to automate querying the Wayback Machine for all our broken links. Now we need to apply the changes to our files.

    Ideally there’d be a GitHub Action that could take a list of our changes and apply them to a list of files. I did search for something like that but all I could find were actions that only made one single change. Time to create another action.

    Enter the Replace multiple strings in files GitHub Action.

    For example, to find all instances of ‘Multiple’ and replace them with ‘Many’ for all the .md files in the current directory you can do:

    - uses: flcdrg/replace-multiple-action@v1
      with:
        files: './*.md'
        find: '[{ "find": "Multiple", "replace": "Many" }]'
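
    To make the shape of the find input concrete, here is a rough Python sketch of what "apply a list of replacements" means. It is not the action's source code: apply_replacements is a hypothetical helper, and it assumes plain string replacement applied in list order.

```python
import json


def apply_replacements(text: str, find_json: str) -> str:
    """Apply each find/replace pair, in order, as a plain string replacement."""
    for pair in json.loads(find_json):
        text = text.replace(pair["find"], pair["replace"])
    return text


# Same JSON value as the `find` input in the example above.
find_input = '[{ "find": "Multiple", "replace": "Many" }]'
result = apply_replacements("Multiple strings, Multiple files", find_input)
print(result)
```

    In the real action the text would come from each file matched by the files input; the sketch works on a single string to stay small.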
    

    For my case I want something like this:

    - uses: flcdrg/replace-multiple-action@v1
      with:
        files: './*.md'
        find: '[{ "find": "http://localhost", "replace": "https://localhost"}, { "find": "http://davidgardiner.net.au", "replace": "https://david.gardiner.net.au" }]'
        prefix: '(^|\\s+|\\()'
        suffix: '($|\\s+|\\))'
    

    The prefix and suffix input properties need some explanation. Originally I was just using a plain string find and replace, but I discovered that there was a problem.

    Consider that I could have multiple broken links to the same site, e.g. blog.spencen.com/2010/09/04/word-puzzle-to-sliverlight-phonendashpart-3.aspx and blog.spencen.com.

    I replace all the instances of the first URL with http://web.archive.org/web/20100926212957/http://blog.spencen.com/2010/09/04/word-puzzle-to-sliverlight-phonendashpart-3.aspx.

    The problem comes with the next find/replace, because it is now looking for blog.spencen.com and that value also exists in the new snapshot URL! We potentially end up in a recursive ‘inception’ mess. To avoid this, we can supply some partial regular expressions that get concatenated before and after the broken URL when we’re searching. In my case those expressions mean that blog.spencen.com won’t be matched in the middle of the snapshot URL.

    I’ve made these properties so the action is more general than just my use case of updating my blog posts.
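
    The effect of the prefix and suffix can be sketched in Python. This is only an illustration of the idea, not the action's implementation: replace_bounded is a hypothetical helper, the patterns are written with single backslashes (assuming that is what the regex engine ends up seeing), and putting the matched prefix and suffix back around the replacement is my assumption.

```python
import re

# Same patterns as the workflow inputs, written with single backslashes
# (assumption: that is what the regex engine ends up seeing).
PREFIX = r"(^|\s+|\()"
SUFFIX = r"($|\s+|\))"

# Illustrative values taken from the examples in these posts.
SNAPSHOT_POST = (
    "http://web.archive.org/web/20100926212957/http://blog.spencen.com"
    "/2010/09/04/word-puzzle-to-sliverlight-phonendashpart-3.aspx"
)
SNAPSHOT_HOME = "http://web.archive.org/web/20100315160244/http://blog.spencen.com:80/"


def replace_bounded(text: str, find: str, replace: str) -> str:
    """Replace `find` only where it sits between the prefix and suffix patterns."""
    pattern = re.compile(PREFIX + re.escape(find) + SUFFIX, re.MULTILINE)
    # Put back whatever the prefix and suffix matched (whitespace, brackets).
    return pattern.sub(lambda m: m.group(1) + replace + m.group(2), text)


# Text in which an earlier replacement has already inserted a snapshot URL.
text = "Visit blog.spencen.com or read " + SNAPSHOT_POST

naive = text.replace("blog.spencen.com", SNAPSHOT_HOME)
bounded = replace_bounded(text, "blog.spencen.com", SNAPSHOT_HOME)

print(naive)    # the existing snapshot URL gets rewritten as well
print(bounded)  # only the bare link is replaced
```

    The plain replacement mangles the snapshot URL because the host name appears inside it, while the bounded version skips it: there the host name is preceded by a slash, which the prefix pattern doesn’t accept.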

    To actually use the action in concert with the previous actions, I’m using it thus:

    
    - name: Replacements
      uses: flcdrg/replace-multiple-action@v1
      with:
        find: ${{ steps.wayback.outputs.replacements }}
        prefix: '(^|\\s+|\\()'
        suffix: '($|\\s+|\\))'
    
    

    The key difference here is that I’m making use of the output property from the Wayback Machine Query action.

    Our files have been updated. Next we finish off the workflow by creating a pull request with the changes.

  • Fixing my blog (part 3) - Querying the Wayback Machine

    Last time I’d discovered I had a huge link rot problem on my blog, but I didn’t fancy copying and pasting all those broken URLs into the Internet Archive’s Wayback Machine to search for the equivalent archive URL.

    What I need is a way to query the Wayback Machine via some kind of API. Like maybe this one? The API is quite simple to use.

    To look up the snapshot for Nigel’s old blog site, I’d use the following request

    https://archive.org/wayback/available?url=http://blog.spencen.com/

    That returns the following JSON result

    {
      "url": "http://blog.spencen.com/",
      "archived_snapshots": {
        "closest": {
          "status": "200",
          "available": true,
          "url": "http://web.archive.org/web/20201129195822/http://blog.spencen.com/",
          "timestamp": "20201129195822"
        }
      }
    }
    

    That’s pretty good, but we can do better. An optional timestamp can be supplied in the query string (using the format YYYYMMDDhhmmss), so that instead of returning the most recent available capture, the API returns the snapshot closest to that timestamp. This is useful for URLs where the content might have changed over time. I realised that my blog posts include the date in the filename, so if I parsed the filename I could get a timestamp with a resolution of a specific day - that would be good enough to make the snapshot URLs more accurate.

    So given the page that linked to Nigel’s blog was named 2009-09-13-tech-ed-2009-thursday.md, then the query would become

    https://archive.org/wayback/available?url=http://blog.spencen.com/&timestamp=20090913

    And now we get this JSON result

    {
      "url": "http://blog.spencen.com/",
      "archived_snapshots": {
        "closest": {
          "status": "200",
          "available": true,
          "url": "http://web.archive.org/web/20100315160244/http://blog.spencen.com:80/",
          "timestamp": "20100315160244"
        }
      },
      "timestamp": "20090913"
    }
    

    Browsing to that new snapshot URL shows a representation of Nigel’s blog as it would have been when I wrote that blog post back in 2009.
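
    The two requests above can be sketched in Python using only the standard library. The function names (build_query_url, closest_snapshot, query_wayback) are hypothetical and this is not how the action is written; the sample data is the second JSON result shown above, so the sketch runs without touching the network.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"


def build_query_url(url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query, adding the optional timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    # Keep ':' and '/' unescaped so the result reads like the URLs above.
    return API + "?" + urlencode(params, safe=":/")


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest snapshot URL, or None if nothing usable came back."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def query_wayback(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Perform the request over the network (not called in this sketch)."""
    with urllib.request.urlopen(build_query_url(url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))


# The second JSON result shown above.
sample = json.loads("""
{
  "url": "http://blog.spencen.com/",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20100315160244/http://blog.spencen.com:80/",
      "timestamp": "20100315160244"
    }
  },
  "timestamp": "20090913"
}
""")

print(build_query_url("http://blog.spencen.com/", "20090913"))
print(closest_snapshot(sample))
```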

    So how can we automate this a bit more? By creating a new GitHub Action of course.

    Enter the Wayback Machine Query GitHub Action!

    If you look at the inputs for this action, it’s no coincidence that the file it requires (the list of URLs to query the Wayback Machine with) just happens to be compatible with the one generated by the Lychee Broken Link Checker GitHub Action we used in the previous post.

    My new action returns the results in two output properties:

    • missing is an array of URLs that had no snapshots on the Wayback Machine. These URLs were never archived, so we’ll have to deal with these differently.
    • replacements is an array of objects with the original URL and the Wayback Machine’s snapshot URL.
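
    As a rough picture of those two outputs, here is a small Python sketch that splits lookup results into the two lists. split_results is a hypothetical helper, and the "find" and "replace" key names are an assumption based on the format the Replace multiple strings action accepts; the action's real output may differ in detail.

```python
from typing import List, Optional, Tuple


def split_results(
    lookups: List[Tuple[str, Optional[str]]]
) -> Tuple[List[str], List[dict]]:
    """Split (original URL, snapshot URL or None) pairs into missing and replacements."""
    missing = [url for url, snapshot in lookups if snapshot is None]
    replacements = [
        {"find": url, "replace": snapshot}
        for url, snapshot in lookups
        if snapshot is not None
    ]
    return missing, replacements


# Made-up lookups: one URL with a snapshot, one that was never archived.
lookups = [
    (
        "http://blog.spencen.com/",
        "http://web.archive.org/web/20100315160244/http://blog.spencen.com:80/",
    ),
    ("http://example.invalid/never-archived", None),
]

missing, replacements = split_results(lookups)
print(missing)
print(replacements)
```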

    The action also has an input property that allows us to provide a regular expression that can be used to parse the filename to obtain the timestamp.

    Here’s how I’m using it:

          - name: Wayback Machine Query
            uses: flcdrg/wayback-machine-query-action@v2
            id: wayback
            with:
              source-path: ./lychee/links.json
              timestamp-regex: '_posts\/(\d+)\/(?<year>\d+)-(?<month>\d+)-(?<day>\d+)-'
    

    Obviously if your filenames or paths contain the date in a different format you’d need to adjust the timestamp-regex value (or if they don’t contain the date then don’t set that property at all).
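
    To show what the timestamp-regex is doing, here is a minimal Python sketch. Note that Python spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly. timestamp_from_path is a hypothetical helper, and the _posts/2009/ directory in the example path is an assumption that matches the pattern.

```python
import re
from typing import Optional

# The workflow's timestamp-regex, with named groups in Python syntax.
TIMESTAMP_REGEX = r"_posts\/(\d+)\/(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)-"


def timestamp_from_path(path: str, pattern: str = TIMESTAMP_REGEX) -> Optional[str]:
    """Return a YYYYMMDD timestamp parsed from the path, or None if no date is found."""
    match = re.search(pattern, path)
    if match is None:
        return None
    return match["year"] + match["month"] + match["day"]


# Assumed layout: posts live in a per-year folder under _posts.
example = timestamp_from_path("_posts/2009/2009-09-13-tech-ed-2009-thursday.md")
print(example)
```

    When the path doesn’t match, the helper returns None, which corresponds to querying without a timestamp.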

    We’ve now got a list of old and new URLs. In the next part we’ll update the files.