Sadly, pony.social had a bigger outage today.

During my work hours, the Docker registry went AWOL along with its worker node, which led to a bunch of services having to restart (and in Mastodon's case, it had to be evicted so that more crucial system services stayed available).

Once that cleared, k8s attempted to recover, but on the new node, where all the services had to start, no Docker images were cached, so things kinda collapsed from there on out.

Lesson learned: a Docker registry does not belong inside the k8s cluster that pulls from it.

I've set stricter resource limits, which should eliminate further node crashes, and I will have to investigate options for the registry.
