There's a pretty cool article about the scale of S3: Building and operating a pretty big storage system called S3.
It's pretty long: ~6.5k words, ~36k chars. Here's my tl;dr.
Numbers
- 280 trillion objects
- 100 million requests per second
- 125 billion event notifications per day to serverless applications
- 100PB data moved per week for S3 Replication
- 1PB per day restored from Glacier
- 4 billion checksums per second
- Millions of hard disks
280 TRILLION objects, with a T! 4 billion checksums per second! MILLIONS of hard disks!
Holy fucking shit.
I suspected Amazon dealt with numbers like these, since it hosts a large chunk of the internet, but seeing them spelled out still feels unreal.
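To make the big numbers a bit more tangible, here's a quick back-of-envelope conversion of the weekly/daily figures into sustained per-second rates (decimal units; just arithmetic on the article's numbers):

```python
# Back-of-envelope: turn the article's headline numbers into per-second rates.
SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

replication_bytes_per_week = 100e15   # 100 PB moved per week by S3 Replication
glacier_bytes_per_day = 1e15          # 1 PB restored from Glacier per day

replication_gb_per_sec = replication_bytes_per_week / SECONDS_PER_WEEK / 1e9
glacier_gb_per_sec = glacier_bytes_per_day / SECONDS_PER_DAY / 1e9

print(f"Replication: ~{replication_gb_per_sec:.0f} GB/s sustained")   # ~165 GB/s
print(f"Glacier restores: ~{glacier_gb_per_sec:.1f} GB/s sustained")  # ~11.6 GB/s
```

So Replication alone sustains roughly the write bandwidth of hundreds of fast SSDs, around the clock.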
Hard disk bit error rates
Hard disks have a bit error rate of roughly 1 in 10^15 bits read. Mortal humans usually don't even need to think about this. S3 actually encounters bit errors routinely and has to account for them.
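A quick sketch of why: multiply that tiny rate by a huge fleet. The fleet size and per-disk throughput below are my own illustrative assumptions (the article only says "millions of disks"), not figures from the article:

```python
# Rough sketch: expected media bit errors across a large disk fleet.
# ASSUMPTIONS (mine, for illustration): 1 million disks, each sustaining
# 50 MB/s of reads. Only the 1-in-10^15 bit error rate is from the post.
BIT_ERROR_RATE = 1e-15                  # errors per bit read
disks = 1_000_000                       # assumed fleet size
read_bytes_per_sec_per_disk = 50e6      # assumed per-disk read throughput

bits_read_per_sec = disks * read_bytes_per_sec_per_disk * 8
errors_per_sec = bits_read_per_sec * BIT_ERROR_RATE
errors_per_day = errors_per_sec * 86_400

print(f"~{errors_per_sec:.1f} bit errors/sec, ~{errors_per_day:,.0f}/day")
```

Even under these modest assumptions that's tens of thousands of flipped bits a day, which is presumably why S3 runs billions of checksum computations per second.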
How to deal with eventual consistency
Don't. Just let there be inconsistency — it isn't a big deal.
Dynamo was eventually consistent, so it was possible for your shopping cart to be wrong. [...] ultimately, these conflicts were rare, and you could resolve them by getting support staff involved and making a human decision.
This is a key takeaway for me. If a problem is rare enough to hand-wave away even at Amazon's scale, it's almost certainly rare enough for me to ignore.
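The Dynamo stance above can be sketched in a few lines: auto-merge what you can (e.g. take the union of divergent cart replicas) and just flag the rare disagreement for a human. The data shapes here are made up for illustration:

```python
# Toy sketch of Dynamo-style conflict handling: merge automatically,
# escalate rare disagreements to a human instead of preventing them upfront.

def merge_carts(replica_a: set[str], replica_b: set[str]) -> set[str]:
    """Auto-resolve divergent cart replicas by taking the union."""
    return replica_a | replica_b

def resolve(replica_a: set[str], replica_b: set[str]):
    merged = merge_carts(replica_a, replica_b)
    # If the replicas disagreed at all, flag it; support staff only ever
    # see the rare cart that a customer actually disputes.
    needs_human_review = replica_a != replica_b
    return merged, needs_human_review

cart, review = resolve({"book", "mug"}, {"book", "lamp"})
print(sorted(cart), review)  # ['book', 'lamp', 'mug'] True
```

The point isn't the merge function; it's that the escalation path is cheap because conflicts are rare.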
Leadership, Motivation, Ownership
Explain what the problem is and let people come up with their own solutions.
It’s a lot harder to get invested in an idea that you don’t own. I consciously spend a lot more time trying to develop problems, and to do a really good job of articulating them, rather than trying to pitch solutions.
Super biased and with a ton of caveats, but the idea is interesting.