A distributed system is a set of computers (nodes) that communicate over a network to accomplish a shared task.
Martin Kleppmann’s Course
Notes from Martin Kleppmann’s Distributed Systems Course. He has a set of course notes on his teaching site as well.
How do we share data amongst different concurrent entities?
- Recommended Reading
- “Distributed Systems” by van Steen & Tanenbaum: implementation-heavy, more practical
- “Introduction to Reliable and Secure Distributed Programming” (2nd ed) by Cachin, Guerraoui & Rodrigues: theory-heavy
- “Designing Data-Intensive Applications” by Kleppmann: oriented toward distributed databases
- “Operating Systems: Concurrent and Distributed Software Design” by Bacon & Harris (published by Addison-Wesley): links the material to operating systems
Why distributed?
- Things are inherently distributed: sending a message from your phone to your friend’s phone
- Reliability: even if one node fails, the system as a whole keeps functioning
- Performance: get data from a nearby node rather than one centralized server halfway around the world
- Solve bigger problems: some datasets are too large to fit on a single machine
Why not distributed?
- Communication may fail (and the sender might not even know it has failed)
- Processes may crash (and the other nodes might not know)
- All of this can happen nondeterministically
- Thus we need fault tolerance: the system as a whole should keep working even when some parts fail
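
The failure modes above can be made concrete with a small simulation. This is a minimal sketch, not taken from the course: the names `Fault`, `Node`, `call`, and `observe` are hypothetical, and a return value of `None` stands in for a timeout. It illustrates that a lost request, a crashed node, and a lost reply all look identical to the caller, even though the remote node's state differs.

```python
import enum
from typing import Optional, Tuple


class Fault(enum.Enum):
    """Hypothetical fault types injected into a single request."""
    NONE = "none"
    REQUEST_LOST = "request lost"
    NODE_CRASHED = "node crashed"
    REPLY_LOST = "reply lost"


class Node:
    """A toy remote node that counts how many requests it actually processed."""

    def __init__(self) -> None:
        self.processed = 0

    def handle(self, msg: str) -> str:
        self.processed += 1
        return f"ack:{msg}"


def call(node: Node, msg: str, fault: Fault) -> Optional[str]:
    """Send msg to node; return its reply, or None to model a timeout."""
    if fault in (Fault.REQUEST_LOST, Fault.NODE_CRASHED):
        return None  # the request was never processed
    reply = node.handle(msg)
    if fault is Fault.REPLY_LOST:
        return None  # the work was done, but the reply never arrived
    return reply


def observe(fault: Fault) -> Tuple[Optional[str], int]:
    """Return (what the caller sees, what really happened on the node)."""
    node = Node()
    reply = call(node, "ping", fault)
    return reply, node.processed


if __name__ == "__main__":
    for fault in Fault:
        reply, processed = observe(fault)
        print(f"{fault.value:>13}: caller sees {reply!r}, node processed {processed}")
```

In all three faulty cases the caller only sees a timeout, so it cannot tell whether the request took effect. Retrying blindly is therefore only safe if the operation can be repeated without changing the result.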