Digitalcourage @digitalcourage

**Joe Tinoco** @oprimo@mastodon.social · 2d

I usually send #hugops on notable outages, but Sportsnet+ essentially repeated the *exact same* outage of game 3, at exactly the same inning, on the first #WorldSeries run in 32 years. They had all they needed to make things right for game 7 and still didn't. What a shame.

**Chris. R.** @haploc@fedi.cr-net.be · 3d

Chris. R. @haploc@fedi.cr-net.be

A great post-mortem read from Matrix.org

https://matrix.org/blog/2025/10/post-mortem/

matrix.org · 6dPost-mortem of the September 2 outageBy Matthew Hodgson

#postgresql #postmortem #hugops

**dch** @dch@bsd.network · 4d

dch @dch@bsd.network

@jimsalter @joeress here's one for your 2.5 admins. #25admins #zfs #postgresql

https://matrix.org/blog/2025/10/post-mortem/ great write-up and #HugOps to those involved over that very stressful period.

- it's great to see they had a comprehensive backup/recovery strategy in place
- well done, really well done. Multiple fallback layers. Sad that they had to use them, but hey that's why we do this.
- Kudos to @beasts hosting for moral & technical support, once again I keep hearing good things about them

A story in SQL backup/recovery from matrix.org with three key lessons:

- always do critical recovery work with 2+ people checking and reviewing together (they did this), and rotate regions because sleep is critical
- never actually delete stuff during a crisis. Ever (narrator: they learned this the hard way)
- ZFS would have made this recovery significantly easier, in so many ways

It would have been almost trivial to recover from their failed storage with ZFS, and perhaps avoid either the failover, or the remote restore.

Scheduled ZFS snapshots would have meant a rollback instead of a recover in at least 2 of their high-risk moments.

It would have also meant higher storage costs because snapshots are almost but not quite free.

ZFS snapshots can be sent/received from an alternate system, over LAN at very high rates, much much faster than a remote S3-based streaming restore.

matrix.org · 6dPost-mortem of the September 2 outageBy Matthew Hodgson

Continued thread

**Jan Wildeboer** @jwildeboer@social.wildeboer.net · 5d

Jan Wildeboer @jwildeboer@social.wildeboer.net

2nd UPDATE: They now say "While we don't have an ETA yet. customers can consider implementing failover strategies with Azure Traffic Manager [..]" Uh-oh. Gut feeling: This is going to take a while to get fixed. Dear friends at Microsoft, here's my #hugops. May you succeed and implement solutions that avoid this situation for the future.

**Hazel Weakly** @hazelweakly@hachyderm.io · Oct 20

Oct 20

Hazel Weakly @hazelweakly@hachyderm.io

#hugOps story time! Quote this and tell me the biggest incident you ever saw in production. It’s inevitable, it’s gonna happen, and learning from incidents is way better than shitting on people trying to fix them.

I’ll start :)

https://hazelweakly.me/blog/mother-of-all-outages/

Hazel WeaklyMother of All Outages | Hazel WeaklyY’all ready for a story about one of the wildest fuckups production outages I ever took part in? Buckle up; we’re going for a ride far, far away from any...

**der.hans** @lufthans@mastodon.social · Oct 20

Oct 20

der.hans @lufthans@mastodon.social

Did AWS join the government shut down?

#USpol #AWS #GeekHumor

**Jan Wildeboer** @jwildeboer@social.wildeboer.net · Oct 20 *

Oct 20 *

Jan Wildeboer @jwildeboer@social.wildeboer.net

Hey Amazon friends at US EAST 1. You had a shitty few hours. But you got stuff fixed. Thank you. #hugops

**coldclimate** @coldclimate@hachyderm.io · Oct 20

Oct 20

coldclimate @coldclimate@hachyderm.io

I am seeing *so many* bad takes about the AWS outage, so many.
Everybody is smug until they get fucked with their pants on.
Even if you've done everything to run your own stuff and host it, it is very difficult to avoid services that will be impacted and there's *no way* the majority of businesses are running all that stuff themselves.
Comms tools, status pages, payment systems, monitoring systems, build tool, deployment tools, planning tools, the list goes on.
If you're telling me you're running absolutely everything yourself, well done, I have no idea how, and if you're better at all of that than all the SAAS providers, I struggle to believe you've time left in the day to run your actual business.
100% you should be doing your due diligence to make sure you're resilient where it matters, but being caught up in something like this is almost inevitable if you're a non trivial online company.
Stop throwing rocks, start sending #hugOps

**Trending** @trending@mastodon.bot · Oct 20

Oct 20

Trending @trending@mastodon.bot

#HugOps is now trending across Mastodon

**Grafana Labs** @grafana@grafana.social · Oct 20

Oct 20

Grafana Labs @grafana@grafana.social

#HugOps to those dealing with today's outage and those impacted by it.

**bert hubert** @bert_hubert@eupolicy.social · Oct 20 *

Oct 20 *

bert hubert @bert_hubert@eupolicy.social

https://health.aws.amazon.com/health/status has a fascinating stream of updates if you are into that kind of thing. #hugops for everyone involved, it looks like an uphill battle. #aws

**Scott Francis** @darkuncle@infosec.exchange · Oct 20

Oct 20

Scott Francis @darkuncle@infosec.exchange

much as I love AWS (and I do), having a third of the Internet on a single provider is probably not ideal from a resiliency perspective.

#hugops to everybody on service teams today (and all the enterprise TAMs who are getting paged by customers constantly)

been there

**dch** @dch@bsd.network · Oct 20 *

Oct 20 *

dch @dch@bsd.network

#HugOps to all the IT and business people whose start of week has been borked by the AWS outage, while they try to restart everything from the ground up.

**quintessence** @quintessence@hachyderm.io · Oct 20

Oct 20

quintessence @quintessence@hachyderm.io

Massive #HugOps to AWS and their ecosystem. Outages are stressful, let alone international ones, so let's be extra kind to each other today

**Pozorvlak** @pozorvlak@mathstodon.xyz · Oct 20

Oct 20

Pozorvlak @pozorvlak@mathstodon.xyz

Is #hugops still a thing? #hugops to all those dealing with the AWS outage today, many of whom probably pleaded with their bosses for time to make critical services multi-cloud or at least multi-region.