Introduction to Distributed Computing
Understand the fundamentals, motivations, and design trade‑offs of distributed computing.
Summary
Definition and Fundamentals of Distributed Computing
What is Distributed Computing?
Distributed computing is a computational approach that breaks down complex problems into smaller tasks executed simultaneously by multiple independent computers—called nodes—that communicate over a network. Instead of relying on a single powerful machine, distributed systems leverage the combined power of many machines working in concert to solve problems faster, handle larger datasets, or provide better reliability.
In a distributed system, each node runs its own local program and maintains its own local data. However, nodes must constantly coordinate with one another, exchanging messages to ensure that their collective actions remain synchronized and correct. Think of it like a team project where each person works independently on their own task, but must communicate regularly to ensure everyone is working toward the same goal.
Why Distribute Work?
There are four primary motivations for using distributed systems rather than relying on a single powerful computer:
Scalability
As problems grow larger or user demand increases, you can simply add more nodes to the system. This horizontal scaling allows the system to handle increased workloads without requiring a complete redesign. For example, if a web service becomes popular and receives 100 times more traffic, you can add more server nodes rather than replacing the entire system with a faster one.
Performance Enhancement
When computation can be divided among multiple machines, tasks complete much faster through parallel execution. If you need to sort a billion-record dataset, distributing the work across 100 machines can theoretically complete the task roughly 100 times faster than a single machine, although in practice communication and result-merging overhead reduce that speedup. Each node processes its portion of the data simultaneously, not sequentially.
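The divide-sort-merge pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a real cluster: threads stand in for separate machines, and `distributed_sort` is a name invented for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def distributed_sort(data, workers=4):
    """Split the data into chunks, sort each 'node's' chunk in
    parallel, then merge the sorted runs into one result."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # heapq.merge combines already-sorted sequences efficiently.
    return list(heapq.merge(*sorted_chunks))

print(distributed_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))
```

In a real distributed sort, each chunk would live on a different machine and the merge step itself would be distributed; the structure (partition, parallel local work, merge) is the same.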
Fault Tolerance
Distributed systems are inherently more resilient to failures. If one node crashes, the remaining nodes can continue operating. The system achieves this by replicating both data and tasks across multiple nodes, so losing one machine doesn't lose critical information. From the user's perspective, the system may continue without noticeable interruption.
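The replication idea can be shown with a toy in-memory store. This is a deliberately simplified sketch (the `Node` and `ReplicatedStore` classes are invented for illustration); real systems must also handle writes that reach only some replicas.

```python
class Node:
    """One machine in the cluster, with its own local data."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

class ReplicatedStore:
    """Writes go to every live replica; reads come from any live one."""
    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, key, value):
        for node in self.nodes:
            if node.alive:
                node.data[key] = value

    def get(self, key):
        for node in self.nodes:
            if node.alive and key in node.data:
                return node.data[key]
        raise KeyError(key)

nodes = [Node("A"), Node("B"), Node("C")]
store = ReplicatedStore(nodes)
store.put("order:42", "shipped")
nodes[0].alive = False            # simulate a crash of node A
print(store.get("order:42"))      # still served by a surviving replica
```

Because every write landed on all three nodes, losing node A does not lose the record; from the caller's perspective, nothing changed.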
Geographic Proximity Benefits
By placing computational nodes near the data they need to process, systems reduce latency (the time for data to travel across the network) and conserve bandwidth (the amount of data that must travel across networks). For instance, a global company might place data processing centers on different continents so that each region's users experience faster service.
Fundamental Challenges in Distributed Systems
Distributing work across multiple machines introduces significant challenges that don't exist in single-machine programs:
Network Communication
Nodes communicate through message passing using protocols like TCP/IP (Transmission Control Protocol/Internet Protocol). Unlike local function calls, which return almost immediately, network messages can be delayed, arrive out of order, or fail to arrive at all. This makes coordination far more complex than in traditional single-machine programming.
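Message passing over TCP can be demonstrated with Python's standard `socket` module. Here one thread plays the listening node so the example is self-contained; note that, unlike a local call, the sender must set an explicit timeout because the reply may never come.

```python
import socket
import threading

def echo_server(sock):
    """'Node B': accept one connection and acknowledge its message."""
    conn, _ = sock.accept()
    with conn:
        data = conn.recv(1024)          # may block; real systems need timeouts
        conn.sendall(b"ack:" + data)

# Node B listens on a local TCP port.
server = socket.socket()
server.bind(("127.0.0.1", 0))           # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

# Node A sends a message and waits for the reply.
client = socket.create_connection(("127.0.0.1", port), timeout=5)
client.sendall(b"hello")
reply = client.recv(1024)
client.close()
print(reply)  # b'ack:hello'
```

The `timeout=5` on the client side is the key difference from a function call: distributed code must always decide what to do when the other side is slow or silent.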
Concurrency
Multiple nodes operate simultaneously, which means actions can overlap in unexpected ways. Without proper mechanisms to manage this concurrency, the system can produce incorrect results. Distributed systems employ locks (mechanisms that prevent multiple nodes from accessing the same resource simultaneously), timestamps (to determine the order of events), and consensus algorithms like Paxos and Raft (to ensure all nodes agree on important decisions).
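The timestamp idea mentioned above is usually implemented with Lamport logical clocks, which order events without requiring synchronized physical clocks. A minimal sketch:

```python
class LamportClock:
    """Logical clock: assigns an ordering to events across nodes."""
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event (or a send)."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """On receipt, jump past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                 # A does local work: A's clock -> 1
t = a.tick()             # A sends a message stamped 2
b.receive(t)             # B receives: max(0, 2) + 1 = 3
print(a.time, b.time)    # 2 3
```

Because B's clock jumps ahead of the message's timestamp, any event B performs after the receipt is guaranteed a larger timestamp than the send, preserving cause-before-effect ordering.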
Synchronization and Coordination
Nodes must maintain a consistent understanding of the system's state. Synchronization ensures that nodes don't interfere with each other's operations, while coordination signals help all nodes maintain an agreed-upon view of what's happening in the system. This is much more difficult when nodes can't instantly read each other's memory as they can in a single machine.
Core Concepts and Techniques
Data Partitioning and Sharding
When datasets become too large for a single machine, distributed systems split the data into chunks called shards and store each shard on a different node. For example, a customer database might be partitioned so that customers with IDs 1-1,000,000 are stored on Node A, IDs 1,000,001-2,000,000 on Node B, and so forth. This approach allows the system to:
Store datasets larger than any single machine can hold
Process data faster by letting each node work on its local partition
Improve performance by reducing the amount of data each node must search through
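The ID-range partitioning from the customer-database example maps directly to code. In this sketch plain dictionaries stand in for nodes, and `shard_for` is an illustrative name:

```python
SHARD_SIZE = 1_000_000

def shard_for(customer_id, num_shards=3):
    """IDs 1..1,000,000 -> shard 0 (Node A), the next million -> shard 1, ..."""
    return min((customer_id - 1) // SHARD_SIZE, num_shards - 1)

shards = [dict() for _ in range(3)]   # one dict stands in for each node

def put(customer_id, record):
    shards[shard_for(customer_id)][customer_id] = record

def get(customer_id):
    # Only one shard needs to be searched, not the whole dataset.
    return shards[shard_for(customer_id)][customer_id]

put(42, "Alice")          # lands on shard 0 (Node A)
put(1_500_000, "Bob")     # lands on shard 1 (Node B)
print(shard_for(42), shard_for(1_500_000))  # 0 1
```

The routing function is the whole trick: any node (or client) can compute which shard holds a given ID without consulting a central directory.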
Load Balancing
Work must be distributed evenly across all available nodes to prevent bottlenecks. Load balancing ensures that no single node becomes overloaded while others sit idle. A well-balanced system distributes incoming requests proportionally across all nodes so that each node carries roughly equal computational load. If a node becomes slower or fails, the load balancer can redirect work to other healthy nodes.
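A round-robin balancer that skips failed nodes captures both behaviors described above: even distribution and redirection away from unhealthy nodes. The class below is an invented illustration, not a production design.

```python
import itertools

class LoadBalancer:
    """Round-robin over healthy nodes; failed nodes are skipped."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.healthy = set(nodes)
        self._cycle = itertools.cycle(nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def pick(self):
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")

lb = LoadBalancer(["A", "B", "C"])
first = [lb.pick() for _ in range(3)]   # ['A', 'B', 'C']
lb.mark_down("B")                       # B fails; traffic shifts away
after = [lb.pick() for _ in range(2)]   # ['A', 'C']
print(first, after)
```

Real balancers often weight nodes by measured load or response time rather than cycling blindly, but the failover logic is the same: requests simply stop reaching nodes that are marked unhealthy.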
Fault Detection
Distributed systems must detect when nodes fail so they can respond appropriately. A common technique is the heartbeat mechanism—each node periodically sends a small "I'm alive" message to a monitoring service. If a node stops sending heartbeats, the system assumes it has failed and initiates recovery procedures.
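The heartbeat mechanism can be sketched as a monitor that records when each node last reported in. Time is passed in explicitly here so the example runs instantly; a real monitor would use the system clock and run continuously.

```python
class HeartbeatMonitor:
    """A node is presumed failed if no heartbeat arrives within `timeout`."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record an 'I'm alive' message from a node."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("A", now=0.0)
monitor.heartbeat("B", now=0.0)
monitor.heartbeat("A", now=2.0)       # A keeps reporting; B goes silent
print(monitor.failed_nodes(now=4.5))  # ['B']
```

Note the inherent ambiguity: a "failed" node may merely be slow or temporarily unreachable. Choosing the timeout is itself a trade-off between fast detection and false alarms.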
Recovery Techniques
When failures occur, systems must recover without losing data or progress. Checkpointing is a key recovery technique: the system periodically saves intermediate states (snapshots of what has been computed so far) to persistent storage. If a node crashes, work can resume from the most recent checkpoint rather than starting from scratch.
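Checkpointing can be demonstrated with a job that sums a list, saving its state after each item. The file name and `process` function are invented for this sketch; real systems checkpoint far less frequently because each save costs I/O.

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)              # start the demo from a clean slate

def save_checkpoint(state):
    """Persist intermediate state so a restarted node can resume."""
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}   # fresh start

def process(items):
    state = load_checkpoint()          # resume from the last saved state
    for i in range(state["next_item"], len(items)):
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(state)
    return state["total"]

items = [10, 20, 30, 40]
process(items[:2])        # "crash" after the first two items
result = process(items)   # restart: resumes at item 2, not item 0
print(result)             # 100
os.remove(CHECKPOINT)
```

The second call never re-adds the first two items: the saved `next_item` tells the restarted worker exactly where to pick up.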
Practical Framework Example: Hadoop and Apache Spark
To understand how these concepts work in practice, consider Hadoop and Apache Spark, popular distributed data-processing frameworks. These systems allow programmers to write a single program that processes a massive dataset. The programmer writes code as if it will run on one machine, but the framework automatically:
Splits the input dataset into chunks
Distributes these chunks to different nodes
Executes the program on each chunk in parallel
Handles network communication between nodes
Performs load balancing to distribute work evenly
Detects node failures and reexecutes failed tasks
Combines results from all nodes into a final answer
The programmer doesn't need to explicitly write code for communication, failure handling, or load balancing—the framework handles these distributed systems complexities transparently. This makes it practical for developers to leverage distributed computing without becoming experts in distributed systems.
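The programmer-facing side of this model can be imitated in plain Python with a word count, the classic MapReduce example. This sketch uses threads in place of cluster nodes and omits the framework's failure handling entirely; it shows only the split/parallel-map/combine shape.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """The per-chunk ('map') step each node runs on its local data."""
    return Counter(chunk.split())

def word_count(lines, workers=3):
    # 1. The input is already split into chunks (one line per chunk).
    # 2. Run the map step on all chunks in parallel.
    # 3. Combine ('reduce') the per-chunk counts into one final answer.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, lines)
    total = Counter()
    for partial in partials:
        total += partial
    return total

lines = ["to be or not to be", "be quick", "not now"]
print(word_count(lines))
```

The user-written logic is only `count_words`; everything else (splitting, distribution, combining) is what frameworks like Hadoop and Spark provide, plus the fault tolerance this toy version lacks.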
Design Trade-offs in Distributed Systems
Every design decision in distributed computing involves trade-offs:
Speed Versus Complexity
Faster execution often requires more sophisticated coordination and communication mechanisms, which add complexity. You must decide how much complexity is worth the performance gain.
Reliability Versus Overhead
Adding fault-tolerance mechanisms like replication makes the system more reliable but requires storing and transmitting extra copies of data. This increases both storage costs and network usage.
Coordination Needs
Effective distributed systems must balance the need for coordinated actions (to maintain consistency) with the desire to minimize synchronization delays (which slow down the system). Too much synchronization makes the system slow; too little causes inconsistencies.
Flashcards
What is the purpose of synchronization in distributed systems?
To ensure nodes do not interfere with each other's operations.
What is scalability in the context of distributed systems?
The ability to handle larger problems or higher traffic by adding more nodes without redesigning software.
How does a distributed system achieve fault tolerance?
By replicating data and tasks so remaining nodes can continue operating if one node crashes.
What is the benefit of placing nodes in close geographic proximity to the data they need?
Reduction in latency and bandwidth usage.
What is the process of splitting large data sets into chunks and storing them on different nodes called?
Sharding (or data partitioning).
What is the goal of load balancing strategies?
To distribute work evenly across nodes to prevent any single node from becoming a bottleneck.
What mechanism is used to regularly monitor the health of each node?
Heartbeat messages.
What is checkpointing in the context of recovery techniques?
Saving intermediate states so that work can be resumed after a failure.
What tasks do distributed data-processing frameworks handle on behalf of the programmer?
Communication
Load balancing
Fault tolerance
What is a common trade-off for achieving faster execution in distributed systems?
Increased complexity in coordination and communication mechanisms.
What are the overhead costs of adding replication for reliability?
Increased storage and network overhead.
Quiz
Introduction to Distributed Computing Quiz
Question 1: What does sharding refer to in distributed systems?
- Splitting large data sets into chunks stored on different nodes (correct)
- Encrypting data before transmission across the network
- Duplicating the entire database on every node
- Running the same program simultaneously on all nodes without data division
Question 2: What is the main purpose of load balancing in a distributed system?
- Distribute work evenly across nodes to avoid bottlenecks (correct)
- Increase the number of nodes required
- Eliminate the need for network communication
- Ensure each node stores the same amount of data
Question 3: What is a trade‑off of adding replication for fault tolerance in a distributed system?
- It increases storage and network overhead (correct)
- It eliminates any need for security measures
- It guarantees zero latency for all operations
- It reduces the number of nodes required
Question 4: In a distributed computing system, what does each individual node typically execute?
- Its own local program while cooperating with other nodes (correct)
- A centralized master program that controls all other nodes
- A duplicate of the entire system’s code base
- Only network communication protocols without computation
Question 5: What is the main purpose of sending regular heartbeat messages in a distributed system?
- To monitor the health and availability of each node (correct)
- To synchronize the clocks of all nodes to a global time
- To distribute data blocks evenly across the network
- To encrypt all inter‑node communications
Question 6: When using Hadoop or Apache Spark, what does the programmer provide to the framework?
- A single program that the framework automatically divides into many tasks (correct)
- A pre‑partitioned dataset that the framework processes without modification
- Separate programs for each node that must be manually coordinated
- A low‑level assembly code for each cluster machine
Question 7: Why must distributed systems balance the need for coordinated actions with the desire to minimize synchronization delays?
- Because excessive coordination can slow down overall performance (correct)
- Because coordination eliminates the need for fault tolerance
- Because synchronization delays are always negligible in large clusters
- Because minimizing coordination always leads to data inconsistency
Key Concepts
Distributed Systems Concepts
Distributed computing
Consensus algorithm
Fault tolerance
Sharding
Load balancing
Scalability
Concurrency (distributed systems)
Distributed Processing Frameworks
Hadoop
Apache Spark
Synchronization Mechanisms
Synchronization (computer science)
Definitions
Distributed computing
A field of computer science where multiple independent computers (nodes) collaborate over a network to solve problems more efficiently than a single machine could.
Consensus algorithm
A protocol that enables distributed nodes to agree on a single data value or system state despite failures or message delays.
Fault tolerance
The ability of a distributed system to continue operating correctly in the presence of node or component failures.
Sharding
The practice of partitioning a large dataset into smaller, distinct pieces that are stored across multiple nodes to improve performance and scalability.
Load balancing
The process of distributing work evenly across multiple nodes to prevent any single node from becoming a performance bottleneck.
Hadoop
An open-source framework that provides distributed storage and processing of large data sets across clusters of commodity hardware.
Apache Spark
A fast, in‑memory data‑processing engine that supports distributed computation for big data analytics and machine learning.
Synchronization (computer science)
Mechanisms that coordinate the timing of operations among distributed nodes to avoid conflicts and ensure consistency.
Scalability
The capacity of a system to handle increased workload by adding resources such as more nodes without redesigning the architecture.
Concurrency (distributed systems)
The execution of multiple operations simultaneously across different nodes, requiring coordination to manage shared resources.