Docker performance on Azure Pipelines agents
I've noticed some interesting variations in build times on Azure Pipelines related to using Docker to build and/or run containers. I tracked the issue down to excessive time spent downloading Docker images, so I started investigating whether there are ways to cache or optimise the docker pull steps.
So what to do?
Self-hosted agents
First off, if you can run your own build agent you're probably not going to see these problems. Having a dedicated agent means that Docker caches all images locally and can reuse them for subsequent build jobs. That's the issue with Microsoft-hosted agents: you get a brand new agent for each job and there's no way to persist any changes, so the value of Docker's image caching is reduced.
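As a rough sketch, pointing a pipeline at a self-hosted pool looks something like this. The pool name MySelfHostedPool is hypothetical (substitute whatever your organisation's pool is called), and the demand simply asks for a Windows agent in that pool.

```yaml
# Sketch only: 'MySelfHostedPool' is a hypothetical pool name, not one from this post.
pool:
  name: MySelfHostedPool
  demands:
  - Agent.OS -equals Windows_NT   # only run on Windows agents in that pool

steps:
# On a long-lived agent this list persists between jobs, so later pulls can reuse it
- script: docker images --digests
  displayName: Docker images
```

Because the agent is not recycled after each job, any image pulled by one build is still in Docker's local store for the next one.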
Cache Task
Next, I did some research into whether the Cache Task could be used to restore the Docker image cache more efficiently. I think the answer is probably 'no'. I've started using this task for another build involving NuGet packages, and in that case it makes the dotnet restore step much faster. I think the problem with Docker is that the files are stored in various places, so trying to cache C:\ProgramData\Docker\image\windowsfilter\layerdb\sha256 and/or C:\ProgramData\Docker\windowsfilter didn't seem to have any effect.
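For comparison, the NuGet case where the Cache Task does help looks roughly like the sketch below. It follows the pattern in the Cache task documentation rather than my exact pipeline: it assumes the projects use packages.lock.json lock files, and the NUGET_PACKAGES location is a choice, not a requirement.

```yaml
variables:
  # Assumption: redirect the NuGet global packages folder to a single cacheable directory
  NUGET_PACKAGES: $(Pipeline.Workspace)/.nuget/packages

steps:
- task: Cache@2
  inputs:
    # Assumption: the repo has packages.lock.json files to key the cache on
    key: 'nuget | "$(Agent.OS)" | **/packages.lock.json,!**/bin/**,!**/obj/**'
    restoreKeys: |
      nuget | "$(Agent.OS)"
    path: $(NUGET_PACKAGES)
  displayName: Cache NuGet packages

- script: dotnet restore --locked-mode
  displayName: dotnet restore
```

This works well because NuGet packages live in one plain folder of files, which is exactly what the Docker layer store on Windows is not.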
Docker Save/Load
There's a GitHub issue with some discussion about using docker save/load, and I can confirm the comments on the issue: this did not make things faster (in fact it made things slower).
Here's an example - a pipeline that's saving 4.8-windowsservercore-ltsc2019 in the cache.
pool:
  vmImage: 'windows-latest'

variables:
  dockerCache: $(Build.ArtifactStagingDirectory)\.dockercache

steps:
- script: docker images --digests
  displayName: Docker images

- task: Cache@2
  inputs:
    key: 'nuget | "$(Agent.OS)" | azure-pipelines-docker-cache.yml'
    path: '$(dockerCache)'
    cacheHitVar: DOCKER_CACHE_HIT
  displayName: 1. Cache Task

- script: |
    if exist $(dockerCache)\4.8-windowsservercore-ltsc2019.tar docker load -i $(dockerCache)/4.8-windowsservercore-ltsc2019.tar
  displayName: 2. Docker Load

- script: docker pull mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
  displayName: 3. Docker Pull

- script: |
    if not exist $(dockerCache) mkdir $(dockerCache)
    docker image save -o $(dockerCache)/4.8-windowsservercore-ltsc2019.tar mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
  condition: ne(variables['DOCKER_CACHE_HIT'], 'true')
  displayName: 4. Docker Save
Build | 1. Cache Task | 2. Docker Load | 3. Docker Pull | 4. Docker Save | 5. Cache Save | Job time (total) |
---|---|---|---|---|---|---|
Build 1 (cache miss) | 00:00:03 | 00:00:01 | 00:08:04 | 00:17:06 | 00:06:09 | 00:31:50 |
Build 2 (cache hit) | 00:05:45 | 00:10:56 | 00:00:03 | 00:00:01 | 00:00:03 | 00:17:15 |
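So yes, the second build was faster, but both of these are way slower than a build with no caching at all (compare just the first docker pull at 00:08:04). Even on a cache hit, restoring the cache and loading the image took 00:16:41 combined, roughly twice as long as simply pulling it. So that's no help.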
Buildctl
The issue also mentions using buildctl, which is part of BuildKit. The trouble is I'm working with Windows Containers, and BuildKit isn't currently supported with those. If you're working with Linux containers, this does sound promising.
Existing images on the agent
It occurred to me: shouldn't Docker be making use of the existing images that are shipped on the hosted agent? The software and tools pre-installed on each agent are documented (for example, the Windows 2019 agent). This is not a static list; the agents are updated over time as patches and updates are issued for both the OS and applications.
The documentation lists the pre-installed images, or you can run docker images --digests in a pipeline step to confirm. Here's the output I got, which matches the documentation.
REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE
mcr.microsoft.com/dotnet/framework/aspnet 4.8-windowsservercore-ltsc2019 sha256:dbf97206264133cdef6b49b06fa5d4028482845547c2858a086b5ce5c4513f00 8280f73a9be1 9 days ago 6.87GB
mcr.microsoft.com/dotnet/framework/runtime 4.8-windowsservercore-ltsc2019 sha256:bf47599181ae3877ec680428a99f76d43ffb26251155a6f0b0b76f4e70304c26 bcd511658148 9 days ago 6.51GB
mcr.microsoft.com/windows/servercore ltsc2019 sha256:2629881183feda906459163cb58fbdbc001bea76a92b2dc4695c8e5b14f747ae 561b89eac394 2 weeks ago 3.7GB
mcr.microsoft.com/windows/nanoserver 1809 sha256:8e6807c213b52405fec8a861e0b766055ba9d4f941267adf49ee67526755b63a 9e7d556b2b51 2 weeks ago 251MB
microsoft/aspnetcore-build 1.0-2.0 sha256:9ecc7c5a8a7a11dca5f08c860165646cb30d084606360a3a72b9cbe447241c0c 5d8be0910d37 21 months ago 3.99GB
Out of curiosity, I added a docker pull mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019 step to the pipeline. That should be super-quick, as you can see that image is already cached. But it wasn't! It took almost 8 minutes.
docker pull mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
========================== Starting Command Output ===========================
"C:\windows\system32\cmd.exe" /D /E:ON /V:OFF /S /C "CALL "D:\a\_temp\d190c5d8-262b-4a70-9c04-216b9ac2b165.cmd""
4.8-windowsservercore-ltsc2019: Pulling from dotnet/framework/aspnet
4612f6d0b889: Already exists
eed17b4baac2: Pulling fs layer
565c587c68c2: Pulling fs layer
c732b140f2ad: Pulling fs layer
84ae672f9921: Pulling fs layer
cd04865d4563: Pulling fs layer
7c75100d3a4d: Pulling fs layer
bea74093ac0e: Pulling fs layer
6353217bf85b: Pulling fs layer
ca397bdd5ee0: Pulling fs layer
ef8702482a58: Pulling fs layer
84ae672f9921: Waiting
cd04865d4563: Waiting
7c75100d3a4d: Waiting
bea74093ac0e: Waiting
6353217bf85b: Waiting
ca397bdd5ee0: Waiting
ef8702482a58: Waiting
c732b140f2ad: Verifying Checksum
c732b140f2ad: Download complete
565c587c68c2: Verifying Checksum
565c587c68c2: Download complete
eed17b4baac2: Verifying Checksum
eed17b4baac2: Download complete
cd04865d4563: Verifying Checksum
cd04865d4563: Download complete
7c75100d3a4d: Verifying Checksum
7c75100d3a4d: Download complete
6353217bf85b: Verifying Checksum
6353217bf85b: Download complete
ca397bdd5ee0: Verifying Checksum
ca397bdd5ee0: Download complete
84ae672f9921: Verifying Checksum
84ae672f9921: Download complete
ef8702482a58: Verifying Checksum
ef8702482a58: Download complete
bea74093ac0e: Verifying Checksum
bea74093ac0e: Download complete
eed17b4baac2: Pull complete
565c587c68c2: Pull complete
c732b140f2ad: Pull complete
84ae672f9921: Pull complete
cd04865d4563: Pull complete
7c75100d3a4d: Pull complete
bea74093ac0e: Pull complete
6353217bf85b: Pull complete
ca397bdd5ee0: Pull complete
ef8702482a58: Pull complete
Digest: sha256:3579480a92f0795c37d6e551139b431eb7cafe798d257c7ce279e10adbd0cb6d
Status: Downloaded newer image for mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019
Why is it pulling all those layers? When does mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019 != mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019?
I then took a look at the Docker Hub page for ASP.NET. It lists the same tag, though interestingly the 'last modified' date was 19th of May (4 days ago). Compare that with the docker images list above - it says '9 days ago' - and on closer examination the sha256 values are different too! The image on the agent has a digest starting dbf97206, while the pull reported a digest starting 3579480a.
So I'm pretty sure that's the problem - there's a lag between when a new image is published on Docker Hub and when that image will be included in the current hosted agent VM.
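One way to check this hypothesis in a pipeline is to print the digest of the local image before and after the pull. This is only a sketch: the aspnetImage variable name is made up for the example, and it uses docker image inspect with a Go template to read the first entry of RepoDigests.

```yaml
variables:
  # 'aspnetImage' is a hypothetical variable name used only in this sketch
  aspnetImage: mcr.microsoft.com/dotnet/framework/aspnet:4.8-windowsservercore-ltsc2019

steps:
- script: |
    docker image inspect --format "{{index .RepoDigests 0}}" $(aspnetImage)
  displayName: Digest cached on the agent
  continueOnError: true   # the image may not be pre-installed on every agent image

- script: docker pull $(aspnetImage)
  displayName: Docker Pull

- script: |
    docker image inspect --format "{{index .RepoDigests 0}}" $(aspnetImage)
  displayName: Digest after pull
```

If the two digests differ, the tag has moved since the agent image was built and the pull had to download the newer layers.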
Image architecture
One other thing to watch out for. Notice that one of the images listed on the agent is mcr.microsoft.com/windows/servercore:ltsc2019? There are different 'architecture' options for container images. For Windows Containers, these are usually either 'multiarch' or 'amd64'. For example, see both listed for Windows Server Core. The trap is that these are two different images. If you specify the ltsc2019-amd64 tag, that won't match the image on the agent.
Possible solutions
So that seems like a reasonable hypothesis. Because we're either explicitly doing a docker pull or depending on images that were built with a different version of the base image, we're experiencing a cache miss and paying the penalty of downloading an entirely new image.
I think the problem can be managed by ensuring that images are built against the current base images on the agent. If you're building images in different pipelines and then storing those in a private registry (Azure Container Registry for example), then you're probably going to need to refresh those as soon as the agent images are updated.
Following the releases in the GitHub Actions virtual-environments repo appears to be the easiest way to know when the agent software is changing. Yes, GitHub Actions and Azure Pipelines share the same agent configurations.
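To make it easier to correlate a slow build with an agent update, a step like the following can log which agent image the job ran on. It assumes the hosted images expose ImageOS and ImageVersion environment variables; treat those names as an assumption and check the output on your own agents.

```yaml
steps:
- script: |
    rem Assumption: ImageOS and ImageVersion are set by the hosted agent image
    echo Agent image: %ImageOS% %ImageVersion%
    docker images --digests
  displayName: Log agent image version and cached images
```

Comparing the logged version against the release notes in that repo should show whether a jump in pull time lines up with a new agent image.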
If you really need to fix on a version, don't just rely on the tags - your best bet is to reference the sha256. That way there's no ambiguity. But be aware you'll more than likely end up referring to an image that isn't cached. In that case, if build time matters, using a self-hosted agent is probably the best strategy.
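Pinning by digest looks something like the sketch below. The digest is the one reported by the docker pull output earlier in this post, and the aspnetImage variable name is just for illustration.

```yaml
variables:
  # Digest copied from the 'docker pull' output above; 'aspnetImage' is a hypothetical name
  aspnetImage: mcr.microsoft.com/dotnet/framework/aspnet@sha256:3579480a92f0795c37d6e551139b431eb7cafe798d257c7ce279e10adbd0cb6d

steps:
- script: docker pull $(aspnetImage)
  displayName: Docker Pull (pinned by digest)
```

The same image@sha256:... form can be used in a Dockerfile FROM line, so the build keeps using that exact image even after the tag moves.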