RemNote Community

Introduction to Distributed Computing

Understand the fundamentals, motivations, and design trade‑offs of distributed computing.

Summary

Definition and Fundamentals of Distributed Computing

What is Distributed Computing?

Distributed computing is a computational approach that breaks down complex problems into smaller tasks executed simultaneously by multiple independent computers, called nodes, that communicate over a network. Instead of relying on a single powerful machine, distributed systems leverage the combined power of many machines working in concert to solve problems faster, handle larger datasets, or provide better reliability.

In a distributed system, each node runs its own local program and maintains its own local data. However, nodes must constantly coordinate with one another, exchanging messages to ensure that their collective actions remain synchronized and correct. Think of it like a team project where each person works independently on their own task but must communicate regularly to keep everyone working toward the same goal.

Why Distribute Work?

There are four primary motivations for using distributed systems rather than a single powerful computer:

Scalability. As problems grow larger or user demand increases, you can simply add more nodes to the system. This horizontal scaling lets the system handle increased workloads without a complete redesign. For example, if a web service becomes popular and receives 100 times more traffic, you can add more server nodes rather than replacing the entire system with a faster one.

Performance Enhancement. When computation can be divided among multiple machines, tasks complete much faster through parallel execution. If you need to sort a billion-record dataset, distributing the work across 100 machines can theoretically complete the task roughly 100 times faster than a single machine. Each node processes its portion of the data simultaneously, not sequentially.

Fault Tolerance. Distributed systems are inherently more resilient to failures. If one node crashes, the remaining nodes can continue operating.
The system achieves this by replicating both data and tasks across multiple nodes, so losing one machine doesn't lose critical information. From the user's perspective, the system may continue without noticeable interruption.

Geographic Proximity Benefits. By placing computational nodes near the data they need to process, systems reduce latency (the time for data to travel across the network) and conserve bandwidth (the amount of data that must travel across networks). For instance, a global company might place data processing centers on different continents so that each region's users experience faster service.

Fundamental Challenges in Distributed Systems

Distributing work across multiple machines introduces significant challenges that don't exist in single-machine programs:

Network Communication. Nodes communicate through message passing using protocols like TCP/IP (Transmission Control Protocol/Internet Protocol). Unlike local function calls, which return almost instantly, network messages experience delays and can fail to arrive entirely. This makes coordination more complex than traditional programming.

Concurrency. Multiple nodes operate simultaneously, which means actions can overlap in unexpected ways. Without proper mechanisms to manage this concurrency, the system can produce incorrect results. Distributed systems employ locks (mechanisms that prevent multiple nodes from accessing the same resource simultaneously), timestamps (to determine the order of events), and consensus algorithms such as Paxos and Raft (to ensure all nodes agree on important decisions).

Synchronization and Coordination. Nodes must maintain a consistent understanding of the system's state. Synchronization ensures that nodes don't interfere with each other's operations, while coordination signals help all nodes maintain an agreed-upon view of what's happening in the system. This is much harder when nodes can't instantly read each other's memory, as they can within a single machine.
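One classic way to implement the event-ordering timestamps mentioned above is the Lamport logical clock. The summary doesn't name this technique explicitly, so treat the following as an illustrative sketch, not as the mechanism any particular framework uses:

```python
class LamportClock:
    """Logical clock: produces timestamps that give all nodes a consistent
    ordering of causally related events, without synchronized wall clocks."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event happened: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the current logical time.
        return self.tick()

    def receive(self, msg_time):
        # On receipt, jump past the sender's timestamp so the receive
        # event is always ordered after the corresponding send.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Node A sends a message to node B.
a, b = LamportClock(), LamportClock()
stamp = a.send()   # A's clock advances to 1
b.receive(stamp)   # B's clock jumps to 2: the receive is ordered after the send
```

The key property is that if one event causally precedes another (for example, a send precedes its receive), its timestamp is strictly smaller, which is exactly the agreed-upon ordering the passage above describes.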
Core Concepts and Techniques

Data Partitioning and Sharding

When datasets become too large for a single machine, distributed systems split the data into chunks called shards and store each shard on a different node. For example, a customer database might be partitioned so that customers with IDs 1-1,000,000 are stored on Node A, IDs 1,000,001-2,000,000 on Node B, and so forth. This approach allows the system to:

Store datasets larger than any single machine can hold
Process data faster by letting each node work on its local partition
Improve performance by reducing the amount of data each node must search through

Load Balancing

Work must be distributed evenly across all available nodes to prevent bottlenecks. Load balancing ensures that no single node becomes overloaded while others sit idle. A well-balanced system distributes incoming requests proportionally across all nodes so that each node carries roughly equal computational load. If a node becomes slower or fails, the load balancer can redirect work to other healthy nodes.

Fault Detection

Distributed systems must detect when nodes fail so they can respond appropriately. A common technique is the heartbeat mechanism: each node periodically sends a small "I'm alive" message to a monitoring service. If a node stops sending heartbeats, the system assumes it has failed and initiates recovery procedures.

Recovery Techniques

When failures occur, systems must recover without losing data or progress. Checkpointing is a key recovery technique: the system periodically saves intermediate states (snapshots of what has been computed so far) to persistent storage. If a node crashes, work can resume from the most recent checkpoint rather than starting from scratch.

Practical Framework Example: Hadoop and Apache Spark

To understand how these concepts work in practice, consider Hadoop and Apache Spark, popular distributed data-processing frameworks.
These systems allow programmers to write a single program that processes a massive dataset. The programmer writes code as if it will run on one machine, but the framework automatically:

Splits the input dataset into chunks
Distributes these chunks to different nodes
Executes the program on each chunk in parallel
Handles network communication between nodes
Performs load balancing to distribute work evenly
Detects node failures and re-executes failed tasks
Combines results from all nodes into a final answer

The programmer doesn't need to explicitly write code for communication, failure handling, or load balancing; the framework handles these distributed-systems complexities transparently. This makes it practical for developers to leverage distributed computing without becoming experts in distributed systems.

<extrainfo>
Design Trade-offs in Distributed Systems

Every design decision in distributed computing involves trade-offs:

Speed Versus Complexity. Faster execution often requires more sophisticated coordination and communication mechanisms, which add complexity. You must decide how much complexity is worth the performance gain.

Reliability Versus Overhead. Adding fault-tolerance mechanisms like replication makes the system more reliable but requires storing and transmitting extra copies of data. This increases both storage costs and network usage.

Coordination Needs. Effective distributed systems must balance the need for coordinated actions (to maintain consistency) with the desire to minimize synchronization delays (which slow down the system). Too much synchronization makes the system slow; too little causes inconsistencies.
</extrainfo>
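The range-based sharding scheme from the customer-database example earlier (IDs 1-1,000,000 on Node A, 1,000,001-2,000,000 on Node B, and so on) amounts to a simple lookup function. The node names and shard size below are illustrative, not taken from any real deployment:

```python
NODES = ["Node A", "Node B", "Node C"]   # illustrative node names
SHARD_SIZE = 1_000_000                   # IDs per shard, as in the example

def shard_for(customer_id):
    """Range partitioning: map a customer ID to the node holding its shard."""
    index = (customer_id - 1) // SHARD_SIZE
    return NODES[index % len(NODES)]     # wrap around past the last node

print(shard_for(42))         # Node A
print(shard_for(1_500_000))  # Node B
```

Because every node can compute the same mapping locally, a request can be routed to the right shard without consulting a central directory; the trade-off is that adding nodes later changes the mapping, which is why real systems often use more elaborate schemes.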
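The heartbeat mechanism described under Fault Detection can be sketched as a monitor that records when each node last checked in and presumes any node silent past a timeout has failed. The timeout value and node names here are assumptions for illustration:

```python
class HeartbeatMonitor:
    """Marks a node as failed if no 'I'm alive' message arrives in time."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout     # seconds of silence before assuming failure
        self.last_seen = {}

    def heartbeat(self, node, now):
        # A node periodically reports in, stamped with the current time.
        self.last_seen[node] = now

    def failed_nodes(self, now):
        # Any node silent longer than the timeout is presumed dead,
        # and recovery procedures would be initiated for it.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

m = HeartbeatMonitor(timeout=5.0)
m.heartbeat("node-a", now=0.0)
m.heartbeat("node-b", now=0.0)
m.heartbeat("node-a", now=4.0)   # node-a keeps reporting; node-b goes silent
print(m.failed_nodes(now=8.0))   # ['node-b']
```

Note the hedge built into the technique itself: a silent node may merely be slow or partitioned from the network, so "stopped sending heartbeats" is evidence of failure, not proof.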
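Checkpointing, the recovery technique above, can likewise be sketched in a few lines: periodically persist intermediate state, and on restart resume from the last snapshot instead of from scratch. The file name, snapshot format, and checkpoint interval are all illustrative choices:

```python
import json
import os

def process_with_checkpoints(items, state_file="checkpoint.json", every=100):
    """Sum `items`, periodically saving progress so a crashed run can resume."""
    done, total = 0, 0
    if os.path.exists(state_file):
        # Recover the most recent snapshot rather than starting over.
        with open(state_file) as f:
            saved = json.load(f)
        done, total = saved["done"], saved["total"]
    for i in range(done, len(items)):
        total += items[i]            # the unit of work
        if (i + 1) % every == 0:     # checkpoint: persist intermediate state
            with open(state_file, "w") as f:
                json.dump({"done": i + 1, "total": total}, f)
    return total
```

If the process dies mid-run, a second invocation with the same `state_file` redoes only the work since the last checkpoint, which is exactly the trade-off checkpointing buys: a little storage and write overhead in exchange for bounded lost progress.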
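Finally, a toy version of the split/distribute/combine pattern that frameworks like Hadoop and Spark automate, applied to word counting. Threads stand in for separate machines here, and the failure handling and network communication a real framework provides are deliberately omitted:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_phase(chunk):
    # Each "node" counts the words in its own chunk of the input.
    return Counter(chunk.split())

def word_count(chunks, n_workers=4):
    # The framework's job: run the map phase on every chunk in
    # parallel, then reduce the partial counts into one final answer.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(map_phase, chunks)
    return reduce(lambda a, b: a + b, partials, Counter())

counts = word_count(["to be or not to be", "not to worry"])
print(counts["to"])   # 3
```

The caller writes only `map_phase` and the combining step; everything about where the chunks run is hidden, which mirrors how these frameworks let developers use distributed computing without writing distribution code themselves.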
Flashcards
What is the purpose of synchronization in distributed systems?
To ensure nodes do not interfere with each other's operations.
What is scalability in the context of distributed systems?
The ability to handle larger problems or higher traffic by adding more nodes without redesigning software.
How does a distributed system achieve fault tolerance?
By replicating data and tasks so remaining nodes can continue operating if one node crashes.
What is the benefit of placing nodes in close geographic proximity to the data they need?
Reduction in latency and bandwidth usage.
What is the process of splitting large data sets into chunks and storing them on different nodes called?
Sharding (or data partitioning).
What is the goal of load balancing strategies?
To distribute work evenly across nodes to prevent any single node from becoming a bottleneck.
What mechanism is used to regularly monitor the health of each node?
Heartbeat messages.
What is checkpointing in the context of recovery techniques?
Saving intermediate states so that work can be resumed after a failure.
What tasks do distributed data-processing frameworks handle on behalf of the programmer?
Communication, load balancing, and fault tolerance.
What is a common trade-off for achieving faster execution in distributed systems?
Increased complexity in coordination and communication mechanisms.
What are the overhead costs of adding replication for reliability?
Increased storage and network overhead.

Key Concepts
Distributed Systems Concepts
Distributed computing
Consensus algorithm
Fault tolerance
Sharding
Load balancing
Scalability
Concurrency (distributed systems)
Distributed Processing Frameworks
Hadoop
Apache Spark
Synchronization Mechanisms
Synchronization (computer science)