end at 0:15 0:09 0:08 0:08; 0:07 - how servo can have fast CI for not a lot of money

Web engine CI
on a shoestring budget

delan azabani (she/her)
azabani.com
December 2025
end at 2:43 2:01 2:00 1:53; 1:49 - servo: greenfield web browser engine → demanding CI requirements - we’re currently on github and github actions - early on that was for their merits, but as time goes on, it’s more for the network effects - wouldn’t be surprised if we moved to codeberg in the next year or two

Servo’s situation

end at 6:17 5:07 4:19 3:21; 3:34 - very useful for small workloads, plus the logic that coordinates workloads - things like turning a tryjob request for “linux” into a run that just builds for linux - or a tryjob request for “full” into a run that builds all the platforms and runs all the tests - for a project of our scale, these runners fall apart for anything beyond that
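
a minimal sketch of that coordination logic; this is a hypothetical workflow, not servo’s - the `try` input, job names, and mapping are illustrative:

# sketch: expand a tryjob request into a build matrix
on:
  workflow_dispatch:
    inputs:
      try:
        description: tryjob request, e.g. "linux" or "full"
        required: true
jobs:
  decision:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.plan.outputs.matrix }}
    steps:
      - id: plan
        run: |
          case "${{ inputs.try }}" in
            linux) echo 'matrix=["ubuntu-latest"]' >> "$GITHUB_OUTPUT" ;;
            full) echo 'matrix=["ubuntu-latest","windows-latest","macos-latest"]' >> "$GITHUB_OUTPUT" ;;
          esac
  build:
    needs: decision
    strategy:
      matrix:
        platform: ${{ fromJSON(needs.decision.outputs.matrix) }}
    runs-on: ${{ matrix.platform }}
    steps:
      - run: echo "build (and test) on ${{ matrix.platform }}"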

GitHub-hosted runners

end at 12:34 10:52 8:37 6:26; 6:47 - alternatives we considered - third-party runner providers - namespace, warpbuild, … - often just as expensive per hour as github’s first-party runners - key selling point tends to be better caching - “dumb terminal” jobs → tricky to do without losing access to actions ecosystem - but you should probably avoid it anyway: it’s platform lock-in and yaml sucks - and if it weren’t for forgejo actions, it would be vendor lock-in too - self-host a whole CI service - some CI services like jenkins and bamboo have built-in container orchestration - none of them have really solved the problem of virtual machine orchestration - we lacked the dedicated personnel to operate something on the critical path

Alternatives considered

end at 14:17 12:26 9:49 7:27; 7:38

Self-hosted runners

end at ...; 8:31

How much faster?

end at 17:09 14:35 11:45 9:49; 10:41 - common to all versions of this system - augments the built-in CI service of the forge (github / forgejo) - almost transparent user experience - just one or two extra jobs per job, and some unique ids
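
the general shape, as a sketch with illustrative names - a decision job mints a unique id, and the workload and watchdog jobs are tied together by it:

# sketch: one or two extra jobs per workload job, tied by a unique id
jobs:
  decision:
    runs-on: ubuntu-latest
    outputs:
      unique-id: ${{ steps.gen.outputs.unique-id }}
    steps:
      - id: gen
        run: echo "unique-id=$(uuidgen)" >> "$GITHUB_OUTPUT"
  workload:
    needs: decision
    name: workload [${{ needs.decision.outputs.unique-id }}]
    runs-on: ubuntu-latest
    steps:
      - run: echo "the real job goes here"
  timeout:
    needs: decision
    runs-on: ubuntu-latest
    steps:
      - run: echo "watchdog for ${{ needs.decision.outputs.unique-id }}"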

What makes our system unique

end at -:-- 16:09 13:09 11:10; 12:11 - unfair comparison, because it assumes we would need the same amount of hours

Completely self-hosted
so it’s dirt cheap^

^ operating expenses, not necessarily labour
end at ...; ...;

Three ways to use runners

end at --:-- --:-- 29:25 28:10; 16:13 - faster checkouts

Faster checkouts

end at --:-- --:-- 30:17 29:18; 17:43

Incremental builds

end at -:-- 17:06 13:58 11:55; 13:08

Servo’s deployment

^ not including OpenHarmony runners
end at -:-- 18:37 15:01 13:15; 14:36

How does it work?

end at ...; 18:58



Graceful fallback

(skip)
end at -:-- 21:45 17:18 15:27; skipped

Graceful fallback

end at -:-- 22:54 18:22 16:46; skipped

Decision jobs

end at -:-- 24:38 20:11 18:34; skipped - problem: most solutions known at the time were inherently racy - solved: *reserve* a runner by applying a label to it - these labels are of the form `reserved-for:uuidv4` - then the workload job can `runs-on: reserved-for:uuidv4`
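
a sketch of the reservation; the label-adding call is github’s real “add custom labels to a self-hosted runner” endpoint, but the RUNNER_ADMIN_TOKEN secret and the RUNNER_ID selection are elided / hypothetical:

# sketch: reserve by labelling, then target the label
decision:
  runs-on: ubuntu-latest
  outputs:
    unique-id: ${{ steps.reserve.outputs.unique-id }}
  steps:
    - id: reserve
      env:
        GH_TOKEN: ${{ secrets.RUNNER_ADMIN_TOKEN }}  # hypothetical secret
      run: |
        unique_id=$(uuidgen)
        # choosing an idle RUNNER_ID is elided; the label is applied
        # via github's "add custom labels to a self-hosted runner" api
        gh api --method POST \
          "repos/$GITHUB_REPOSITORY/actions/runners/$RUNNER_ID/labels" \
          -f "labels[]=reserved-for:$unique_id"
        echo "unique-id=$unique_id" >> "$GITHUB_OUTPUT"
workload:
  needs: decision
  runs-on: reserved-for:${{ needs.decision.outputs.unique-id }}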

Decisions must be serialised

end at -:-- 26:38 21:55 20:49; skipped - initial architecture (#33081) - monitor service but no monitor api - reserve runner by applying a runner label - serialise* decision jobs with job concurrency - “serialise” as in concurrency, not encoding - known caveats - decision job needs github api token - downtime after reservation? timeout job - job concurrency could fail under contention - uh oh (#33276) - run id labels (#33283) - if we label runners to take them, why not label them with more metadata?
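
a sketch of that serialisation, assuming a job-level concurrency group - note that github keeps at most one pending job per group and cancels the rest, which is exactly the contention caveat above:

# sketch: serialise decision jobs with a job-level concurrency group
decision:
  # at most one job in this group runs at a time; github also keeps
  # at most one *pending* job per group and cancels any others,
  # which is the "could fail under contention" failure mode above
  concurrency:
    group: runner-reservation
  runs-on: ubuntu-latest
  steps:
    - run: echo "pick a runner and apply its reserved-for label here"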

Decisions must be serialised

end at -:-- 28:16 23:45 22:29; skipped - monitor api (#33315) - move reservation into the servers managing the runners - serialise the monitor api requests - problem: what happens if the runner fails to materialise? - jobs are `queued`, then they are `in_progress` - problem: you can only declare a time limit for `in_progress`, not `queued`
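
concretely, `timeout-minutes` (the only built-in limit) starts counting at `in_progress`, so a workload stuck in `queued` waits indefinitely - hence the timeout jobs on the next slide:

workload:
  needs: decision
  runs-on: reserved-for:${{ needs.decision.outputs.unique-id }}
  # bounds in_progress time only; a job stuck in queued is not
  # covered, hence the separate timeout jobs
  timeout-minutes: 90
  steps:
    - run: echo "the real job goes here"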

Decisions must be serialised

end at -:-- 30:39 24:56 24:06; skipped - timeout jobs - each timeout job is a watchdog for your workload job, ensuring that it actually gets a runner - query the github api for that job run id, check `status` / `created_at`
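
a sketch of the watchdog, assuming JOB_ID is already known (finding it is the next problem); the real check can compare `created_at` from the same api response instead of a local deadline:

timeout:
  needs: decision
  runs-on: ubuntu-latest
  steps:
    - env:
        GH_TOKEN: ${{ github.token }}
      run: |
        # fail loudly if the workload is still queued after ten minutes
        deadline=$(( $(date +%s) + 600 ))
        while [ "$(gh api "repos/$GITHUB_REPOSITORY/actions/jobs/$JOB_ID" --jq .status)" = queued ]; do
          if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "runner never materialised for $JOB_ID" >&2
            exit 1
          fi
          sleep 30
        done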

Timeout jobs

end at -:-- 32:52 27:02 26:02; skipped - unique ids - problem: you can’t know the job run id of the workload job - they can be instantiated multiple times via workflow calls - timeout job does not (and cannot) express any dependency on the workload job - in other words, the workload job and the timeout job are just two jobs

Uniquely identifying jobs

end at --:-- --:-- 27:50 26:41; skipped - solved: tie them together with the uuidv4 generated in the decision job - in the friendly / display `name` of the job. yes, really, we string-match
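
a sketch of the string-match - the decision job’s uuidv4 goes into the workload’s display name, and the timeout job greps the “list jobs for a workflow run” response for it:

workload:
  needs: decision
  name: workload [${{ needs.decision.outputs.unique-id }}]
  runs-on: reserved-for:${{ needs.decision.outputs.unique-id }}
  steps:
    - run: echo "the real job goes here"
timeout:
  needs: decision
  runs-on: ubuntu-latest
  env:
    UNIQUE_ID: ${{ needs.decision.outputs.unique-id }}
    GH_TOKEN: ${{ github.token }}
  steps:
    - run: |
        # resolve the workload's job run id by string-matching its
        # display name, then check its status as before
        gh api "repos/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID/jobs" \
          --paginate --jq \
          ".jobs[] | select(.name | contains(\"$UNIQUE_ID\")) | .id"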

Uniquely identifying jobs

end at --:-- --:-- 31:59 30:45; skipped - tokenless api - with the monitor api, the workflow now needs a secret - otherwise anyone could deny service by reserving all of the runners - so while the workflow no longer needs a privileged GitHub API token, we’ve got the same problem, just in a different place

Tokenless API

end at --:-- --:-- 33:06 31:34; skipped - we can publish an artifact representing the request - this is unforgeable!
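
a sketch: the request rides on a workflow artifact, which only a genuine run in the repo can produce, so the monitor can trust it without the workflow holding any secret (artifact name and contents here are illustrative):

decision:
  runs-on: ubuntu-latest
  steps:
    - run: echo "runner-profile=linux" > reservation-request.txt
    - uses: actions/upload-artifact@v4
      with:
        name: reservation-request
        path: reservation-request.txt
# the monitor polls recent workflow runs for artifacts with this
# name, then reserves a runner and applies the label itself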

Tokenless API

end at --:-- --:-- 35:17 34:02; skipped

Global queue





Runner images

(skip)
end at --:-- --:-- 37:27 36:17; 19:26

Runner images

end at --:-- --:-- 38:45 37:17

Runner images

end at --:-- --:-- 39:41 37:56

Linux runners

#!/bin/sh
# first boot: configure the image, mark it built, power off;
# every later boot: act as a runner
if ! [ -e built ]; then
    # Image config
    touch built
    poweroff
else
    # Runner boot
    :
fi
end at --:-- --:-- 40:57 38:58

Windows runners

end at --:-- --:-- 43:04 40:34

macOS runners

end at --:-- --:-- 47:03 44:34; 24:37

Future directions





github.com/servo/ci-runners

Slides: go.daz.cat/3tdhp

Transcript: go.daz.cat/2ra8x