Big data - Architecture and Processing Technologies
Understand big‑data architecture patterns, core processing frameworks like MapReduce and Spark, and tensor‑based analytic techniques.
Summary
Architecture and Technologies for Big Data
Introduction to Big Data Architecture
Big data systems are fundamentally different from traditional data processing because they must handle unprecedented volumes of information across distributed networks. Rather than storing all data in a single centralized database, modern big data architectures spread computation and storage across many machines. This approach enables dramatically faster processing speeds and more flexible data management, though it introduces significant complexity in coordinating work across these distributed systems.
Distributed Parallel Processing Architectures
The core principle of big data architecture is parallelization: breaking large computational problems into smaller pieces that can be solved simultaneously on different machines. When you have 1000 servers working in parallel rather than one machine working alone, you can process vastly larger datasets in reasonable time frames.
The Shift from Centralized Control to Data Lakes
Traditionally, organizations used data warehouses with strict centralized control: data had to be cleaned, validated, and formatted before entering the warehouse. This approach was rigid and slow.
Data lakes invert this model. Instead of cleaning and validating data before storage, a data lake stores raw, heterogeneous data from many sources and defers organization and analysis until the data is actually needed. This enables rapid experimentation: analysts can quickly pull subsets of data for testing without waiting for centralized approval. The trade-off is that data lakes require more sophistication during the analysis phase, since the data has not been pre-cleaned.
MapReduce: The Foundational Processing Model
MapReduce is the foundational programming model for distributed data processing. It works in two phases:
The Map Phase: The input data is split across many parallel nodes. Each node independently applies a function to its portion of the data, producing intermediate key-value pairs. For example, if you're counting word frequencies in documents, the map phase might be: "Node 1 processes documents A-C and counts words, Node 2 processes documents D-F and counts words, etc."
The Reduce Phase: All intermediate results with the same key are collected together and combined by a reducer function. In the word frequency example, all occurrences of "the" from different nodes would be summed together, all occurrences of "and" would be summed together, and so on.
A key strength of MapReduce is its fault tolerance. If one node fails during processing, only that node's portion of the work needs to be re-run on another machine; the overall job continues. This is crucial when running computations across hundreds or thousands of commodity machines, where failures are inevitable.
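The word-frequency example can be sketched in a few lines of Python. This is a single-machine illustration only: in a real cluster each map call would run on a different node and the framework would perform the shuffle step between the phases. The function names below are invented for illustration, not part of any framework's API.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "node" independently turns its documents
# into intermediate (word, 1) key-value pairs.
def map_phase(documents):
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

# Shuffle: group intermediate pairs by key (in a real system the
# framework does this between the map and reduce phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all values that share a key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

# Two "nodes", each holding its own partition of the input.
node_1 = ["the cat sat", "the dog ran"]
node_2 = ["the cat ran"]

intermediate = chain(map_phase(node_1), map_phase(node_2))
print(reduce_phase(shuffle(intermediate)))
# {'the': 3, 'cat': 2, 'sat': 1, 'dog': 1, 'ran': 2}
```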
Hadoop: The Open-Source MapReduce Implementation
Hadoop is the open-source framework that implements the MapReduce paradigm. It is not a different model; it is a practical system that lets you run MapReduce jobs on clusters of inexpensive machines. Hadoop became transformative because it democratized distributed data processing: organizations no longer needed specialized, expensive hardware.
Hadoop remains important for understanding big data systems, though it's being supplemented by more modern frameworks.
Apache Spark: Beyond Simple Map-Reduce
While MapReduce is powerful, it has limitations. Each MapReduce job reads its input from disk and writes its results back to disk. For analyses that chain multiple jobs together (which is common in practice), this repeated disk I/O creates bottlenecks.
Apache Spark solves this by keeping intermediate data in memory rather than writing to disk between operations. This can make certain workloads 10-100x faster. Additionally, Spark supports more complex operation pipelines beyond simple map-reduce patterns. You can express iterative algorithms, complex joins, and machine learning workflows more naturally in Spark than in pure MapReduce.
Importantly, Spark still runs on the same distributed cluster philosophy as Hadoop—it just handles the data flow more efficiently.
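As a sketch of how this looks in practice, here is the word-count pipeline expressed with PySpark's RDD API. A local Spark installation is assumed, and the input path "documents.txt" is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("documents.txt")

# The whole pipeline is expressed declaratively; Spark keeps
# intermediate results in memory rather than writing them to disk
# between the flatMap/map and reduceByKey stages.
counts = (
    lines.flatMap(lambda line: line.lower().split())  # map: emit words
         .map(lambda word: (word, 1))                 # map: key-value pairs
         .reduceByKey(lambda a, b: a + b)             # reduce: sum per word
)

# cache() pins the result in memory, so iterative algorithms that
# reuse it avoid recomputation, which is the key advantage over
# disk-based MapReduce.
counts.cache()
print(counts.take(10))

spark.stop()
```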
Storage Technologies
Behind any processing framework, you need fast access to your data.
Distributed file systems (like the Hadoop Distributed File System) spread data across many machines while presenting it as a single logical filesystem to applications. Files are typically replicated across multiple nodes for fault tolerance.
Burst buffers are a related technology: they use fast storage (like solid-state drives or RAM) as a temporary cache for high-throughput access during intense computation phases, then write results to slower long-term storage afterward.
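The burst-buffer idea reduces to a simple pattern: absorb bursts of writes in a fast tier, then drain them to a slow tier in bulk. The toy sketch below illustrates only the pattern; real burst buffers are storage-system layers, not application code, and every name here is invented:

```python
# Illustrative only: a list in RAM stands in for the fast tier,
# another list stands in for slow permanent storage.
class BurstBuffer:
    def __init__(self, flush_threshold, slow_storage):
        self.buffer = []                       # fast SSD/RAM tier
        self.flush_threshold = flush_threshold
        self.slow_storage = slow_storage       # slow disk/tape tier

    def write(self, record):
        self.buffer.append(record)             # fast path during computation
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.slow_storage.extend(self.buffer)  # slow path, done in bulk
        self.buffer.clear()

archive = []
bb = BurstBuffer(flush_threshold=3, slow_storage=archive)
for i in range(7):
    bb.write(f"result-{i}")   # the compute phase sees only fast writes
bb.flush()                    # drain the remainder at the end
print(archive)                # all 7 records now in long-term storage
```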
These storage technologies are essential infrastructure, though they're usually transparent to you as an analyst—the frameworks handle the coordination.
Representing Multidimensional Data: OLAP Cubes and Tensors
Many datasets naturally have multiple dimensions. For example, sales data might have dimensions: product, region, time period, and customer type.
OLAP (Online Analytical Processing) cubes represent this multidimensional data as hypercubes. Each cell in the cube contains an aggregated value (like total sales), and you can slice, dice, and drill down along any dimension. An OLAP cube lets you answer questions like "What were sales of Product X in Region Y during Q3?" with a single lookup rather than scanning a large table.
Tensors are the mathematical equivalent—multidimensional arrays where each element is indexed by multiple coordinates. A 3D tensor might be indexed as $T[i,j,k]$, a 4D tensor as $T[i,j,k,l]$, and so on.
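The correspondence between OLAP operations and tensor indexing can be illustrated with NumPy. The cube dimensions and the axis ordering below are assumptions made for this sketch:

```python
import numpy as np

# A hypothetical sales cube: 3 products x 2 regions x 4 quarters.
rng = np.random.default_rng(0)
sales = rng.integers(0, 100, size=(3, 2, 4))

# "Slice": sales of product 0 in region 1 during the third quarter,
# answered by a single indexed lookup rather than a table scan.
print(sales[0, 1, 2])

# "Dice": a sub-cube covering products 0-1, all regions, Q1-Q2.
print(sales[0:2, :, 0:2])

# "Roll-up": aggregate away the quarter dimension to get totals
# per product and region.
print(sales.sum(axis=2))
```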
Array database systems are specialized databases optimized for storing and querying tensor data. They're important when you have truly multidimensional data (scientific simulations, medical imaging, financial time series across many markets) rather than traditional tabular data.
Understanding OLAP cubes and tensor representation is important because they're fundamentally different from relational tables and require different query approaches.
Query Languages for Data Integration
Analyzing big data often means combining data from many different sources—some structured (databases), some semi-structured (JSON, XML), some unstructured (logs, text).
Data-mining and integration query languages extend SQL with operators for handling heterogeneous data. Rather than forcing all data into relational form first, these languages can express complex joins across sources with different structures and perform pattern recognition and data mining operations directly in the query.
This is crucial in practice because data preparation—joining, cleaning, and combining data from multiple sources—often consumes 80% of the time in data analysis projects.
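As an illustrative sketch, here is how such a cross-source query might look in Spark SQL, one widely used engine of this kind. The file names, column names, and nested JSON layout are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Integration").getOrCreate()

# Structured source: a relational-style CSV table.
customers = spark.read.csv("customers.csv", header=True)

# Semi-structured source: Spark infers a schema from nested JSON,
# so it can be queried with SQL without flattening it first.
events = spark.read.json("events.json")

customers.createOrReplaceTempView("customers")
events.createOrReplaceTempView("events")

# One SQL query spanning both sources; dotted paths reach into
# the nested JSON structure directly.
result = spark.sql("""
    SELECT c.name, COUNT(*) AS n_events
    FROM customers c
    JOIN events e ON c.id = e.user.id
    GROUP BY c.name
""")
result.show()
```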
<extrainfo>
Analytic Techniques and Visualization
Once data is processed and integrated, analysts use various techniques to extract insights:
A/B testing: Running controlled experiments comparing two variants to measure impact on user behavior (see the sketch after this list)
Machine learning: Training models on historical data to make predictions or classify new data
Natural language processing: Extracting meaning from text data
Data mining: Discovering patterns and relationships in large datasets
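For instance, a minimal A/B-test evaluation can be done with a chi-square test of independence. The conversion counts below are invented for illustration:

```python
from scipy.stats import chi2_contingency

#                converted, not converted
observed = [[120, 880],    # variant A: 12.0% conversion
            [150, 850]]    # variant B: 15.0% conversion

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests a real difference
```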
Visualization tools transform raw data into charts, graphs, and interactive dashboards that humans can understand. Rather than looking at millions of numbers, a well-designed visualization can reveal patterns instantly. Modern tools often support interactive exploration—clicking on a chart to drill down into details, filtering data dynamically, and combining multiple linked views.
These techniques are important in practice, though the specific algorithms may not be heavily tested in an exam focusing on architecture and technologies.
</extrainfo>
Flashcards
What shift in focus characterizes the move to data lakes?
A shift from centralized control to shared models.
What happens during the "reduce" step of a MapReduce query?
Results are aggregated.
On what type of hardware does MapReduce typically provide fault-tolerant processing?
Clusters of commodity hardware.
What is Hadoop's relationship to the MapReduce paradigm?
It is the open-source implementation of it.
What are the two main features Apache Spark adds beyond simple map-reduce?
In-memory processing
Support for complex operation pipelines
What is the primary function of big data visualization tools?
Transforming raw data into charts, graphs, and interactive dashboards.
In what two ways can multidimensional data be represented?
OLAP cubes
Tensors (mathematical representation)
What type of database systems support the mathematical representation of multidimensional data as tensors?
Array database systems.
How do data-mining and integration query languages extend SQL to support heterogeneous sources?
By adding operators for semi-structured data.
Quiz
Question 1: In the MapReduce paradigm, what is the main purpose of the “map” step?
- It splits the input data across parallel nodes for independent processing. (correct)
- It aggregates the intermediate results from all nodes.
- It stores the final output in persistent storage.
- It visualizes the computation results.
Question 2: Which analytic technique is specifically used to compare two variants in big‑data experiments?
- A/B testing (correct)
- Machine learning
- Natural language processing
- Data mining
Question 3: Which representations are used for multidimensional data in big‑data environments?
- OLAP cubes and tensors (correct)
- Relational tables only
- Key‑value stores exclusively
- Document‑oriented databases
Question 4: How do memory‑resident implementations of MapReduce improve performance?
- By keeping intermediate data in RAM. (correct)
- By storing all intermediate data on SSDs.
- By reducing the number of map tasks.
- By compressing data before each reduce step.
Question 5: Which combination of technologies provides high‑throughput access to large data sets?
- Distributed file systems with burst buffers (correct)
- Local SSD drives
- Tape archival systems
- Cloud object storage without caching
Question 6: What kind of model represents multi‑dimensional arrays and enables multilinear subspace learning?
- Tensor‑based model (correct)
- Vector‑based model
- Decision‑tree model
- Linear‑regression model
Question 7: How do distributed parallel architectures achieve faster processing of big‑data workloads?
- By spreading data across many servers to enable concurrent computation (correct)
- By compressing data to reduce storage requirements
- By relying on a single high‑performance server for all tasks
- By employing quantum‑computing techniques for data analysis
Question 8: Data‑mining and integration query languages extend SQL with new operators. Which data format can now be directly queried using these operators?
- Semi‑structured data such as JSON or XML (correct)
- Only fully normalized relational tables
- Plain text logs without any structure
- Binary image or video files without metadata
Key Concepts
Big Data Technologies
MapReduce
Hadoop
Apache Spark
Distributed File System
Data Lake
Burst Buffer
Data Analysis Techniques
Machine Learning
Natural Language Processing
A/B Testing
Data Mining
OLAP Cube
Tensor (computing)
Definitions
MapReduce
A programming model that splits tasks across parallel nodes (map) and then aggregates the results (reduce) for fault‑tolerant large‑scale processing.
Hadoop
An open‑source framework that implements the MapReduce model and provides a distributed file system for big‑data storage and computation.
Apache Spark
A fast, in‑memory data‑processing engine that extends MapReduce with advanced APIs for streaming, machine learning, and graph analytics.
Data Lake
A centralized repository that stores raw, unstructured, and structured data at any scale, enabling flexible analysis and model building.
Distributed File System
A storage architecture that spreads data across multiple servers to provide high‑throughput, fault‑tolerant access to large data sets.
Burst Buffer
A high‑speed intermediate storage layer that accelerates I/O for data‑intensive applications by buffering data between compute nodes and permanent storage.
Machine Learning
A field of artificial intelligence that develops algorithms enabling computers to learn patterns from data and make predictions or decisions.
Natural Language Processing
A discipline that combines linguistics and computer science to enable machines to understand, interpret, and generate human language.
A/B Testing
An experimental method that compares two variants of a product or feature to determine which performs better based on statistical analysis.
Tensor (computing)
A multi‑dimensional array data structure used in high‑performance computing and deep learning to represent complex, multi‑modal data.
OLAP Cube
A multidimensional data model that allows fast analytical queries by pre‑aggregating data across multiple dimensions for business intelligence.
Data Mining
The process of discovering hidden patterns, correlations, and anomalies in large data sets using statistical and machine learning techniques.