RemNote Community

Foundations of Distributed Computing

Understand the core concepts, key algorithms, and complexity foundations of distributed computing.


Summary

Introduction

Distributed computing is the study of computer systems in which multiple independent computers work together over a network to solve a common problem. Rather than having a single powerful computer execute a program, distributed systems spread computation across multiple machines that communicate by passing messages. This approach enables systems to scale to massive sizes, tolerate failures, and solve problems that would be impractical on a single machine. However, the distributed nature introduces significant challenges: computers have incomplete information, communication is slow compared to local memory access, and components can fail independently.

What is a Distributed System?

A distributed system is a collection of autonomous computational entities located on different networked computers that coordinate their actions by passing messages to achieve a common goal. Each entity is independent, with its own local memory, and they work together without a single central controller. A distributed program is any computer program designed to run within a distributed system, and distributed programming is the art and science of writing such programs.

Key Definition: The Importance of Message Passing

The fundamental characteristic that defines distributed systems is message passing. Unlike a single computer, where different programs can directly access shared memory, computers in a distributed system cannot directly read each other's memory. Instead, they must explicitly send and receive messages over a network. This seemingly simple distinction has profound implications for algorithm design and system behavior.

Challenges of Distributed Systems

Distributed systems must deal with several inherent challenges that would not exist in a centralized system:

Autonomous Components with Incomplete Information. Each computer in a distributed system is autonomous: it makes decisions based only on information it has locally received.
No single computer has complete knowledge of the entire system state. For example, in a distributed database, each server might not know the current values stored on other servers.

Unknown System Structure. The network topology (how computers are connected), the communication latency (the delay in delivering messages), and even the total number of computers may be unknown in advance or may change during execution. A distributed system must be designed to work correctly even when these parameters vary.

Failure Tolerance. Individual computers can fail independently. A robust distributed system must continue operating correctly even when some nodes crash or become unreachable. This is essential for long-running systems such as cloud services.

Limited Local View. Each computer may know only a part of the input data. For example, in a sensor network, each node might measure only its local environment. The system must aggregate information to solve the global problem.

Dynamic Structure. The system structure may change during execution: computers may join or leave the network, connections may be established or severed, and the number of active participants may vary over time.

Parallel Computing vs. Distributed Computing

While distributed systems and parallel systems both involve multiple processors working on a problem simultaneously, they differ fundamentally in their architecture and communication model.

Shared-Memory Parallelism

In parallel computing, all processors have access to a common shared memory. Each processor can directly read from and write to this memory, which acts as a communication channel: one processor writes data to memory, and another processor later reads that data.

Distributed-Memory Parallelism

In distributed computing, each processor has private memory that only it can access directly. Processors communicate exclusively through message passing: sending and receiving explicit messages over a network.
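The contrast between the two models can be made concrete with a small sketch. This is a toy illustration, not a real distributed framework: the "shared memory" is an ordinary dictionary visible to two threads, while the "network link" is a queue that stands in for message passing between nodes with private state. All names here (node_a, inbox, etc.) are illustrative assumptions.

```python
import queue
import threading

# --- Shared-memory style: both threads read/write one common structure ---
shared = {"value": 0}
lock = threading.Lock()

def shared_writer():
    with lock:
        shared["value"] = 42       # immediately visible to every thread

def shared_reader(out):
    with lock:
        out.append(shared["value"])

result = []
t = threading.Thread(target=shared_writer)
t.start(); t.join()
t = threading.Thread(target=shared_reader, args=(result,))
t.start(); t.join()

# --- Message-passing style: each "node" owns private state; the only way
# to share data is an explicit send/receive over a channel ---
inbox = queue.Queue()              # stands in for a network link

def node_a():
    private = 42                   # node A's local memory; B cannot read it
    inbox.put(("A", private))      # the value must be sent explicitly

def node_b(received):
    sender, value = inbox.get()    # blocks until a message arrives
    received.append(value)

received = []
threading.Thread(target=node_a).start()
b = threading.Thread(target=node_b, args=(received,))
b.start(); b.join()

print(result[0], received[0])      # both reach 42, via different mechanisms
```

The shared-memory version relies on both parties seeing one address space; the message-passing version would keep working unchanged if the queue were replaced by a real network socket, which is precisely why it models a distributed system.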
The image illustrates these architectures: diagram (b) shows a distributed system in which each processor has its own private memory and must communicate via messages, while diagram (c) shows a shared-memory parallel system in which all processors share access to a common memory space. Diagram (a) shows how computers in a distributed system form a network topology.

The Coupling Distinction

Parallel systems are tightly coupled: processors operate in sync with minimal communication delay, memory access is fast and predictable, and the failure of one processor typically brings down the entire system. Distributed systems are loosely coupled: processors operate asynchronously with unpredictable message delays, failures are independent, and the system continues operating despite component failures.

This distinction is not merely architectural; it fundamentally changes how algorithms must be designed. Algorithms for shared-memory systems (parallel algorithms) often cannot be directly translated to message-passing systems (distributed algorithms) because they may assume instant, synchronized memory access.

Fundamental Problems in Distributed Computing

Distributed computing addresses several canonical problems that appear repeatedly in practical systems:

Leader Election. In some distributed systems, one node needs to be designated as a leader or coordinator. The leader election problem asks: how can the nodes collaboratively and reliably choose a leader, especially when nodes may fail or new nodes may join?

Spanning Tree Construction. Nodes need to organize themselves into a tree structure (a connected graph with no cycles). This enables efficient communication patterns and is often a building block for more complex algorithms.

Consensus. Multiple nodes must agree on a single value despite potential failures. For example, a distributed database must ensure all replicas agree on which transactions have been committed.
Consensus is surprisingly difficult when nodes can fail or messages can be delayed arbitrarily.

Complexity Measures for Distributed Algorithms

Distributed algorithms are evaluated using different complexity metrics than sequential algorithms. Instead of counting CPU operations, we measure:

Message Complexity. How many messages must be sent between processors? Sending messages over a network is expensive and is often the dominant cost in a distributed system.

Bit Complexity. How many total bits of data are transmitted across the network? This accounts for the fact that a smaller message is preferable to a larger one, even though both count as a single message.

Time Complexity (Rounds of Communication). How many rounds of synchronous message exchange are required? In this model, all processors send their messages simultaneously in each round, then receive the messages sent to them by other processors.

These three measures often present trade-offs. An algorithm might reduce the number of messages by having each message carry more data, increasing bit complexity. Another algorithm might add extra rounds of communication to reduce the total bits transmitted. Understanding these trade-offs is crucial for designing efficient distributed algorithms.

<extrainfo>
Self-Stabilization

Self-stabilizing algorithms are designed to be robust in the face of transient faults or arbitrarily corrupted states. Rather than assuming the system starts in a known correct state, a self-stabilizing algorithm guarantees that no matter what arbitrary initial configuration the system finds itself in, it will eventually converge to a correct state and remain correct thereafter. This is a powerful property for real-world systems, where recovering from crashes or corrupted memory may be difficult.
</extrainfo>

<extrainfo>
Computational Complexity: Models and Classes

Understanding distributed computing requires grounding in computational complexity theory, which studies the inherent difficulty of problems.
Models of Computation. Different formal models (such as Turing machines and Boolean circuits) define which problems are solvable and what resources they require. These models reveal the fundamental limits of computation.

Complexity Classes. Problems are grouped into complexity classes based on their resource requirements. The class P contains problems solvable in polynomial time; NP contains problems whose solutions can be verified in polynomial time; PSPACE contains problems solvable with polynomial memory. Understanding these classes helps identify which problems are tractable and which may be inherently difficult. The existence of problems believed to lie outside P (such as NP-complete problems) suggests that no algorithm can solve certain problems efficiently for all inputs. This is why approximate solutions and randomized algorithms become valuable.

Approximation Algorithms. When exact computation is infeasible (perhaps because the problem is NP-hard), approximation algorithms aim to compute a solution that is "close enough" to optimal. For example, an approximation algorithm might guarantee finding a solution within 10% of the optimal value.

Randomized Algorithms. Rather than always following the same deterministic sequence of steps, randomized algorithms make probabilistic choices during execution. This can sometimes achieve better expected performance than any deterministic algorithm, or solve problems where deterministic algorithms are stuck. For instance, randomized algorithms are powerful tools for leader election and consensus in distributed systems.
</extrainfo>
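Leader election and the three complexity measures can be illustrated together with a minimal sketch: every node repeatedly floods the largest identifier it has seen, and the maximum identifier wins. This is a simplified, centrally simulated version under assumed synchronous rounds, not a specific published algorithm; the graph, IDs, and the global termination check are illustrative conveniences (a real distributed node cannot see whether the whole system has stabilized, and would instead stop after a known bound such as the network diameter).

```python
def elect_leader(adjacency):
    """Simulate synchronous leader election by flooding the maximum ID.

    Each node starts knowing only its own ID. In every synchronous round,
    every node sends the largest ID it has seen to all its neighbors.
    Returns the elected leader plus the message and round counts, the
    complexity measures discussed above.
    """
    known = {v: v for v in adjacency}   # each node's local view of the max
    messages = 0
    rounds = 0
    changed = True
    while changed:                      # global check, for illustration only
        rounds += 1
        changed = False
        # Snapshot before sending: models simultaneous sends in one round.
        outgoing = {v: known[v] for v in adjacency}
        for v, neighbors in adjacency.items():
            for u in neighbors:
                messages += 1           # one message per directed edge
                if outgoing[v] > known[u]:
                    known[u] = outgoing[v]
                    changed = True
    leader = max(adjacency)
    assert all(known[v] == leader for v in adjacency)
    return leader, messages, rounds

# A 4-node ring: 0-1-2-3-0 (hypothetical example topology).
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
leader, messages, rounds = elect_leader(ring)
print(leader)  # 3: the maximum ID wins
```

On this ring the simulation takes 3 rounds (2 rounds to propagate the maximum, plus one quiescent round to confirm nothing changes) and 8 messages per round, showing the trade-off space directly: fewer rounds would require messages to travel further per round, i.e., a different communication model.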
Flashcards
What two elements do real-time distributed systems combine?
Time-critical constraints and distributed execution.
What do real-time distributed systems require for coordinated actions across nodes?
Predictable latency.
Where are the components located in a distributed computing system?
On different networked computers.
How do components in a distributed system communicate and coordinate actions?
By passing messages.
How do processors interact with memory in shared-memory parallel computing?
All processors share a common memory they can read and write directly.
In terms of coupling, how are parallel systems classified compared to distributed systems?
Parallel systems are tightly coupled, while distributed systems are loosely coupled.
What is the primary communication method used by processors in distributed algorithms?
Message passing.
Which three metrics are used to measure the complexity of distributed algorithms?
Number of messages, bits transmitted, and rounds of communication.
How is a self-stabilizing algorithm defined?
An algorithm that converges to a correct state from any arbitrary initial configuration.

Key Concepts
Distributed Systems
Distributed computing
Real‑time distributed systems
Distributed algorithms
Self‑stabilization
Algorithm Types
Complexity classes
Approximation algorithms
Randomized algorithms
Parallel Computing Models
Parallel computing
Shared‑memory parallelism
Distributed‑memory parallelism