I've noticed some interesting variations in build times on Azure Pipelines when using Docker to build and/or run containers. I tracked the issue down to excessive time spent downloading Docker images, so I started investigating whether there are ways to cache or optimise the docker pull steps.

So what to do?

Self-hosted agents

First off, if you can run your own build agent, you're probably not going to see these problems. Having a dedicated agent means that Docker caches all images locally and can reuse them for subsequent build jobs. That's the issue with Microsoft-hosted agents: you get a brand new agent for each job and there's no way to persist any changes, so the value of Docker's image caching is reduced.
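As a rough sketch, pointing a pipeline at a self-hosted agent is a matter of naming an agent pool instead of a vmImage. The pool name 'Default' below is an assumption; substitute whichever pool your own agent is registered in.

```yaml
# Run on a self-hosted agent pool rather than a Microsoft-hosted image.
# 'Default' is an assumed pool name; use the pool your agent belongs to.
pool:
  name: Default

steps:
# Images pulled by earlier jobs on this machine should still be listed here.
- script: docker images --digests
  displayName: Docker images
```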

Cache Task

Next, I did some research into whether the Cache Task could be used to restore the Docker image cache more efficiently. I think the answer is probably 'no'. I've started using this task for another build involving NuGet packages, and in that case it makes the dotnet restore step much faster. The problem with Docker, I think, is that its files are stored in various places, so trying to cache C:\ProgramData\Docker\image\windowsfilter\layerdb\sha256 and/or C:\ProgramData\Docker\windowsfilter didn't seem to have any effect.

Docker Save/Load

There's a GitHub issue with some discussion about using docker save/load, and I can confirm what the comments on that issue say: this did not make things faster (in fact it made things slower).

Here's an example - a pipeline that's saving 4.8-windowsservercore-ltsc2019 in the cache.

pool:
  vmImage: 'windows-latest'

variables:
  dockerCache: $(Build.ArtifactStagingDirectory)\.dockercache

steps:

- script: docker images --digests
  displayName: Docker images

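# Restores $(dockerCache) when the key matches. On a miss, the task's post-job
# step ("5. Cache Save" in the timings) uploads the folder at the end of the job.
- task: Cache@2
  inputs:
    key: 'nuget | "$(Agent.OS)" | azure-pipelines-docker-cache.yml'
    path: '$(dockerCache)'
    cacheHitVar: DOCKER_CACHE_HIT
  displayName: 1. Cache Task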

- script: |
    if exist $(dockerCache)\4.8-windowsservercore-ltsc2019.tar docker load -i $(dockerCache)/4.8-windowsservercore-ltsc2019.tar
  displayName: 2. Docker Load

- script: docker pull mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
  displayName: 3. Docker Pull

- script: |
    if not exist $(dockerCache) mkdir $(dockerCache)
    docker image save -o $(dockerCache)/4.8-windowsservercore-ltsc2019.tar mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
  condition: ne(variables['DOCKER_CACHE_HIT'], 'true')
  displayName: 4. Docker Save

Build                  1. Cache Task   2. Docker Load   3. Docker Pull   4. Docker Save   5. Cache Save   Job time (total)
Build 1 (cache miss)   00:00:03        00:00:01         00:08:04         00:17:06         00:06:09        00:31:50
Build 2 (cache hit)    00:05:45        00:10:56         00:00:03         00:00:01         00:00:03        00:17:15

So yes, the second build was faster, but both of these are way slower than a build that had no caching at all (cf. just the first docker pull at 00:08:04). Even on a cache hit, restoring the cache and running docker load took 00:16:41 between them. So that's no help.

Buildctl

The issue also mentions using buildctl, which is part of BuildKit. The trouble is I'm working with Windows Containers and BuildKit isn't currently supported with those. If you're working with Linux containers, this does sound promising.

Existing images on the agent

It occurred to me: shouldn't Docker be making use of the existing images that are shipped on the hosted agent? The software and tools pre-installed on each agent are documented (for example, the Windows 2019 agent). This is not a static list; the agents are updated over time as patches and updates are issued for both the OS and applications.

You can see the pre-installed images in that list, or you can run docker images --digests in a pipeline step to confirm. Here's the output I got, which matches the documentation.

REPOSITORY                                   TAG                              DIGEST                                                                    IMAGE ID            CREATED             SIZE
mcr.microsoft.com/dotnet/framework/aspnet    4.8-windowsservercore-ltsc2019   sha256:dbf97206264133cdef6b49b06fa5d4028482845547c2858a086b5ce5c4513f00   8280f73a9be1        9 days ago          6.87GB
mcr.microsoft.com/dotnet/framework/runtime   4.8-windowsservercore-ltsc2019   sha256:bf47599181ae3877ec680428a99f76d43ffb26251155a6f0b0b76f4e70304c26   bcd511658148        9 days ago          6.51GB
mcr.microsoft.com/windows/servercore         ltsc2019                         sha256:2629881183feda906459163cb58fbdbc001bea76a92b2dc4695c8e5b14f747ae   561b89eac394        2 weeks ago         3.7GB
mcr.microsoft.com/windows/nanoserver         1809                             sha256:8e6807c213b52405fec8a861e0b766055ba9d4f941267adf49ee67526755b63a   9e7d556b2b51        2 weeks ago         251MB
microsoft/aspnetcore-build                   1.0-2.0                          sha256:9ecc7c5a8a7a11dca5f08c860165646cb30d084606360a3a72b9cbe447241c0c   5d8be0910d37        21 months ago       3.99GB

Out of curiosity, I added a docker pull mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019 step to the pipeline. That should be super-quick, as you can see that image is already cached. But it wasn't! It took almost 8 minutes.

docker pull mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
========================== Starting Command Output ===========================
"C:\windows\system32\cmd.exe" /D /E:ON /V:OFF /S /C "CALL "D:\a\_temp\d190c5d8-262b-4a70-9c04-216b9ac2b165.cmd""
4.8-windowsservercore-ltsc2019: Pulling from dotnet/framework/aspnet
4612f6d0b889: Already exists
eed17b4baac2: Pulling fs layer
565c587c68c2: Pulling fs layer
c732b140f2ad: Pulling fs layer
84ae672f9921: Pulling fs layer
cd04865d4563: Pulling fs layer
7c75100d3a4d: Pulling fs layer
bea74093ac0e: Pulling fs layer
6353217bf85b: Pulling fs layer
ca397bdd5ee0: Pulling fs layer
ef8702482a58: Pulling fs layer
84ae672f9921: Waiting
cd04865d4563: Waiting
7c75100d3a4d: Waiting
bea74093ac0e: Waiting
6353217bf85b: Waiting
ca397bdd5ee0: Waiting
ef8702482a58: Waiting
c732b140f2ad: Verifying Checksum
c732b140f2ad: Download complete
565c587c68c2: Verifying Checksum
565c587c68c2: Download complete
eed17b4baac2: Verifying Checksum
eed17b4baac2: Download complete
cd04865d4563: Verifying Checksum
cd04865d4563: Download complete
7c75100d3a4d: Verifying Checksum
7c75100d3a4d: Download complete
6353217bf85b: Verifying Checksum
6353217bf85b: Download complete
ca397bdd5ee0: Verifying Checksum
ca397bdd5ee0: Download complete
84ae672f9921: Verifying Checksum
84ae672f9921: Download complete
ef8702482a58: Verifying Checksum
ef8702482a58: Download complete
bea74093ac0e: Verifying Checksum
bea74093ac0e: Download complete
eed17b4baac2: Pull complete
565c587c68c2: Pull complete
c732b140f2ad: Pull complete
84ae672f9921: Pull complete
cd04865d4563: Pull complete
7c75100d3a4d: Pull complete
bea74093ac0e: Pull complete
6353217bf85b: Pull complete
ca397bdd5ee0: Pull complete
ef8702482a58: Pull complete
Digest: sha256:3579480a92f0795c37d6e551139b431eb7cafe798d257c7ce279e10adbd0cb6d
Status: Downloaded newer image for mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019

Why is it pulling all those layers? When does mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019 != mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019?

I then took a look at the Docker Hub page for ASP.NET. It lists the same tag, though interestingly the 'last modified' date was 19th of May (4 days ago). Compare that with the docker images list above - it says '9 days ago' - and on closer examination the sha256 values are different too!

So I'm pretty sure that's the problem - there's a lag between when a new image is published on Docker Hub and when that image will be included in the current hosted agent VM.
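One way to see the mismatch from inside a build is to print the digest of the copy that's already on the agent and compare it by eye with the digest shown on the registry page. This is only a sketch: the variable name image is my own, and it assumes the tag is present on the agent (docker image inspect fails if it isn't).

```yaml
variables:
  # Hypothetical variable name; the tag is the one used in this post.
  image: mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019

steps:
# Prints the cached copy's repository digest in name@sha256:... form.
- script: |
    docker image inspect --format "{{index .RepoDigests 0}}" $(image)
  displayName: Digest of cached image
```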

Image architecture

One other thing to watch out for. Notice that one of the images listed on the agent is mcr.microsoft.com/windows/servercore:ltsc2019? There are different 'architecture' options for container images. For Windows Containers, these are usually either 'multiarch' or 'amd64'. For example, see both listed for Windows Server Core. The trap is that these are two different images. If you specify the ltsc2019-amd64 tag, that won't match the image on the agent.

Possible solutions

So that seems like a reasonable hypothesis. Because we're either explicitly doing a docker pull or we're depending on images that were built with a different version of the base image, we're experiencing a cache miss and paying the penalty by needing to download an entirely new image.
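If the explicit pull is the culprit, one option is to pull only when the tag is missing from the agent. A minimal sketch, assuming a Windows agent (so the script runs under cmd.exe) and a variable name, image, of my own choosing:

```yaml
variables:
  # Hypothetical variable name; the tag is the one used in this post.
  image: mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019

steps:
# docker image inspect exits non-zero when the image isn't present locally,
# so the pull after || only runs on a genuine miss.
- script: |
    docker image inspect $(image) >NUL 2>&1 || docker pull $(image)
  displayName: Pull only if not cached
```

The trade-off is that when the tag is already cached, the build uses the agent's copy even if a newer one has been published under the same tag.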

I think the problem can be managed by ensuring that images are built against the current base images on the agent. If you're building images in different pipelines and then storing those in a private registry (Azure Container Registry for example), then you're probably going to need to refresh those as soon as the agent images are updated.

Following the releases in the GitHub Actions virtual-environments repo appears to be the easiest way to know when the agent software is changing. Yes, GitHub Actions and Azure Pipelines share the same agent configurations.

If you really need to fix on a version, don't just rely on the tags; your best bet is to reference the sha256. That way there's no ambiguity. But be aware you'll more than likely end up referring to an image that isn't cached. In that case, if build time matters, using a self-hosted agent is probably the best strategy.
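For illustration, pinning by digest looks something like this. The digest is the one reported in the pull output earlier in this post, so treat it as a placeholder and substitute the digest of the image you actually want.

```yaml
steps:
# name@sha256:<digest> identifies exactly one image, whatever the tag points at today.
- script: docker pull mcr.microsoft.com/dotnet/framework/aspnet@sha256:3579480a92f0795c37d6e551139b431eb7cafe798d257c7ce279e10adbd0cb6d
  displayName: Pull by digest
```

The same name@sha256:... form can be used in a Dockerfile FROM line.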