Two categories of large-scale production system overloading issues
Top-Down overload or "Reddit Hug of Death": This is what Bluesky experienced today - suddenly there was a HUGE demand surge and the servers just couldn't keep up for a while. This also happens after Super Bowl ads, when pop stars announce tours, or during DDoS attacks.
Bottom-up: This is the less obvious and more common scenario, when something inside the system fails in a way that makes the system unable to serve normal load. If you lose a Redis cache and everything is reading from the DB, you will drastically reduce your ability to serve requests.
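
To put rough numbers on the bottom-up case, here is a minimal back-of-the-envelope sketch. It assumes a simple cache-aside setup in which every cache miss becomes exactly one database query. The function names and all the figures (10,000 requests per second, a 95% hit ratio, a database that tops out at 2,000 queries per second) are hypothetical and chosen only for illustration; they do not come from the thread.

```python
def db_queries_per_second(request_rate: float, cache_hit_ratio: float) -> float:
    """Database load when every cache miss falls through to the DB."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return request_rate * (1.0 - cache_hit_ratio)


def max_servable_request_rate(db_capacity_qps: float, cache_hit_ratio: float) -> float:
    """Highest request rate the DB can sustain at a given cache hit ratio."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - cache_hit_ratio
    if miss_ratio == 0.0:
        return float("inf")  # nothing ever reaches the DB
    return db_capacity_qps / miss_ratio


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    normal_load = 10_000   # requests per second
    hit_ratio = 0.95       # cache healthy
    db_capacity = 2_000    # queries per second the DB can handle

    # Cache healthy: only about 5% of requests reach the DB.
    print(f"DB load, cache up:   {db_queries_per_second(normal_load, hit_ratio):.0f} qps")
    # Cache lost: hit ratio drops to 0 and every request reaches the DB.
    print(f"DB load, cache down: {db_queries_per_second(normal_load, 0.0):.0f} qps")
    # Servable traffic shrinks from roughly 40,000 rps to the DB's own capacity.
    print(f"Capacity, cache up:   {max_servable_request_rate(db_capacity, hit_ratio):.0f} rps")
    print(f"Capacity, cache down: {max_servable_request_rate(db_capacity, 0.0):.0f} rps")
```

Under these assumed numbers, the traffic itself never changes, yet losing the cache multiplies database load by about twenty and pushes it well past what the database can handle, which is why this kind of failure leaves the system unable to serve even normal load.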
The whole thread is worth reading.
- Software Engineering