Big data - Architecture and Processing Technologies
Understand big‑data architecture patterns, core processing frameworks like MapReduce and Spark, and tensor‑based analytic techniques.
Summary
Architecture and Technologies for Big Data
Introduction to Big Data Architecture
Big data systems are fundamentally different from traditional data processing because they must handle unprecedented volumes of information across distributed networks. Rather than storing all data in a single centralized database, modern big data architectures spread computation and storage across many machines. This approach enables dramatically faster processing speeds and more flexible data management, though it introduces significant complexity in coordinating work across these distributed systems.
Distributed Parallel Processing Architectures
The core principle of big data architecture is parallelization: breaking large computational problems into smaller pieces that can be solved simultaneously on different machines. When you have 1000 servers working in parallel rather than one machine working alone, you can process vastly larger datasets in reasonable time frames.
The Shift from Centralized Control to Data Lakes
Traditionally, organizations used data warehouses with strict centralized control: data had to be cleaned, validated, and formatted before entering the warehouse. This approach was rigid and slow.
Data lakes invert this model. Instead of cleaning and validating data before storage, a data lake stores raw, heterogeneous data from many sources and defers organization and analysis until the data is actually needed. This enables rapid experimentation: analysts can quickly pull subsets of data for testing without waiting for centralized approval. The trade-off is that data lakes require more sophistication during the analysis phase, since the data has not been pre-cleaned.
MapReduce: The Foundational Processing Model
MapReduce is the foundational programming model for distributed data processing. It works in two phases:
The Map Phase: The input data is split across many parallel nodes. Each node independently applies a function to its portion of the data, producing intermediate key-value pairs. For example, if you're counting word frequencies in documents, the map phase might be: "Node 1 processes documents A-C and counts words, Node 2 processes documents D-F and counts words, etc."
The Reduce Phase: All intermediate results with the same key are collected together and combined by a reducer function. In the word frequency example, all occurrences of "the" from different nodes would be summed together, all occurrences of "and" would be summed together, and so on.
A key strength of MapReduce is its fault tolerance. If one node fails during processing, only that node's portion of the work needs to be re-run on another machine; the overall job continues. This is crucial when running computations across hundreds or thousands of commodity machines, where failures are inevitable.
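The word-frequency example can be sketched in a few lines of Python. This is a single-machine illustration only: in a real cluster each map call would run on a different node and the framework would perform the shuffle step between the phases. The function names below are invented for illustration, not part of any framework's API.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "node" independently turns its documents
# into intermediate (word, 1) key-value pairs.
def map_phase(documents):
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

# Shuffle: group intermediate pairs by key (in a real system the
# framework does this between the map and reduce phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all values that share a key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

# Two "nodes", each holding its own partition of the input.
node_1 = ["the cat sat", "the dog ran"]
node_2 = ["the cat ran"]

intermediate = chain(map_phase(node_1), map_phase(node_2))
print(reduce_phase(shuffle(intermediate)))
# {'the': 3, 'cat': 2, 'sat': 1, 'dog': 1, 'ran': 2}
```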
Hadoop: The Open-Source MapReduce Implementation
Hadoop is the open-source framework that implements the MapReduce paradigm. It is not a different model; it is a practical system that lets you run MapReduce jobs on clusters of inexpensive machines. Hadoop became transformative because it democratized distributed data processing: organizations no longer needed specialized, expensive hardware.
Hadoop remains important for understanding big data systems, though it's being supplemented by more modern frameworks.
Apache Spark: Beyond Simple Map-Reduce
While MapReduce is powerful, it has limitations. Each MapReduce job reads its input from disk and writes its results back to disk. For analyses that chain multiple jobs together (which is common in practice), this repeated disk I/O creates bottlenecks.
Apache Spark solves this by keeping intermediate data in memory rather than writing to disk between operations. This can make certain workloads 10-100x faster. Additionally, Spark supports more complex operation pipelines beyond simple map-reduce patterns. You can express iterative algorithms, complex joins, and machine learning workflows more naturally in Spark than in pure MapReduce.
Importantly, Spark still runs on the same distributed cluster philosophy as Hadoop—it just handles the data flow more efficiently.
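As a sketch of how this looks in practice, here is the word-count pipeline expressed with PySpark's RDD API. A local Spark installation is assumed, and the input path "documents.txt" is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("documents.txt")

# The whole pipeline is expressed declaratively; Spark keeps
# intermediate results in memory rather than writing them to disk
# between the flatMap/map and reduceByKey stages.
counts = (
    lines.flatMap(lambda line: line.lower().split())  # map: emit words
         .map(lambda word: (word, 1))                 # map: key-value pairs
         .reduceByKey(lambda a, b: a + b)             # reduce: sum per word
)

# cache() pins the result in memory, so iterative algorithms that
# reuse it avoid recomputation, which is the key advantage over
# disk-based MapReduce.
counts.cache()
print(counts.take(10))

spark.stop()
```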
Storage Technologies
Behind any processing framework, you need fast access to your data.
Distributed file systems (like the Hadoop Distributed File System) spread data across many machines while presenting it as a single logical filesystem to applications. Files are typically replicated across multiple nodes for fault tolerance.
Burst buffers are a related technology: they use fast storage (like solid-state drives or RAM) as a temporary cache for high-throughput access during intense computation phases, then write results to slower long-term storage afterward.
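The burst-buffer idea reduces to a simple pattern: absorb bursts of writes in a fast tier, then drain them to a slow tier in bulk. The toy sketch below illustrates only the pattern; real burst buffers are storage-system layers, not application code, and every name here is invented:

```python
# Illustrative only: a list in RAM stands in for the fast tier,
# another list stands in for slow permanent storage.
class BurstBuffer:
    def __init__(self, flush_threshold, slow_storage):
        self.buffer = []                       # fast SSD/RAM tier
        self.flush_threshold = flush_threshold
        self.slow_storage = slow_storage       # slow disk/tape tier

    def write(self, record):
        self.buffer.append(record)             # fast path during computation
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.slow_storage.extend(self.buffer)  # slow path, done in bulk
        self.buffer.clear()

archive = []
bb = BurstBuffer(flush_threshold=3, slow_storage=archive)
for i in range(7):
    bb.write(f"result-{i}")   # the compute phase sees only fast writes
bb.flush()                    # drain the remainder at the end
print(archive)                # all 7 records now in long-term storage
```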
These storage technologies are essential infrastructure, though they're usually transparent to you as an analyst—the frameworks handle the coordination.
Representing Multidimensional Data: OLAP Cubes and Tensors
Many datasets naturally have multiple dimensions. For example, sales data might have dimensions: product, region, time period, and customer type.
OLAP (Online Analytical Processing) cubes represent this multidimensional data as hypercubes. Each cell in the cube contains an aggregated value (like total sales), and you can slice, dice, and drill down along any dimension. An OLAP cube lets you answer questions like "What were sales of Product X in Region Y during Q3?" with a single lookup rather than scanning a large table.
Tensors are the mathematical equivalent—multidimensional arrays where each element is indexed by multiple coordinates. A 3D tensor might be indexed as $T[i,j,k]$, a 4D tensor as $T[i,j,k,l]$, and so on.
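The correspondence between OLAP operations and tensor indexing can be illustrated with NumPy. The cube dimensions and the axis ordering below are assumptions made for this sketch:

```python
import numpy as np

# A hypothetical sales cube: 3 products x 2 regions x 4 quarters.
rng = np.random.default_rng(0)
sales = rng.integers(0, 100, size=(3, 2, 4))

# "Slice": sales of product 0 in region 1 during the third quarter,
# answered by a single indexed lookup rather than a table scan.
print(sales[0, 1, 2])

# "Dice": a sub-cube covering products 0-1, all regions, Q1-Q2.
print(sales[0:2, :, 0:2])

# "Roll-up": aggregate away the quarter dimension to get totals
# per product and region.
print(sales.sum(axis=2))
```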
Array database systems are specialized databases optimized for storing and querying tensor data. They're important when you have truly multidimensional data (scientific simulations, medical imaging, financial time series across many markets) rather than traditional tabular data.
Understanding OLAP cubes and tensor representation is important because they're fundamentally different from relational tables and require different query approaches.
Query Languages for Data Integration
Analyzing big data often means combining data from many different sources—some structured (databases), some semi-structured (JSON, XML), some unstructured (logs, text).
Data-mining and integration query languages extend SQL with operators for handling heterogeneous data. Rather than forcing all data into relational form first, these languages can express complex joins across sources with different structures and perform pattern recognition and data mining operations directly in the query.
This is crucial in practice because data preparation—joining, cleaning, and combining data from multiple sources—often consumes 80% of the time in data analysis projects.
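As an illustrative sketch, here is how such a cross-source query might look in Spark SQL, one widely used engine of this kind. The file names, column names, and nested JSON layout are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Integration").getOrCreate()

# Structured source: a relational-style CSV table.
customers = spark.read.csv("customers.csv", header=True)

# Semi-structured source: Spark infers a schema from nested JSON,
# so it can be queried with SQL without flattening it first.
events = spark.read.json("events.json")

customers.createOrReplaceTempView("customers")
events.createOrReplaceTempView("events")

# One SQL query spanning both sources; dotted paths reach into
# the nested JSON structure directly.
result = spark.sql("""
    SELECT c.name, COUNT(*) AS n_events
    FROM customers c
    JOIN events e ON c.id = e.user.id
    GROUP BY c.name
""")
result.show()
```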
<extrainfo>
Analytic Techniques and Visualization
Once data is processed and integrated, analysts use various techniques to extract insights:
A/B testing: Running controlled experiments comparing two variants to measure impact on user behavior (see the sketch after this list)
Machine learning: Training models on historical data to make predictions or classify new data
Natural language processing: Extracting meaning from text data
Data mining: Discovering patterns and relationships in large datasets
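For instance, a minimal A/B-test evaluation can be done with a chi-square test of independence. The conversion counts below are invented for illustration:

```python
from scipy.stats import chi2_contingency

#                converted, not converted
observed = [[120, 880],    # variant A: 12.0% conversion
            [150, 850]]    # variant B: 15.0% conversion

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests a real difference
```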
Visualization tools transform raw data into charts, graphs, and interactive dashboards that humans can understand. Rather than looking at millions of numbers, a well-designed visualization can reveal patterns instantly. Modern tools often support interactive exploration—clicking on a chart to drill down into details, filtering data dynamically, and combining multiple linked views.
These techniques are important in practice, though the specific algorithms may not be heavily tested in an exam focusing on architecture and technologies.
</extrainfo>
Flashcards
What shift in focus characterizes the move to data lakes?
A shift from centralized control to shared models.
What happens during the "reduce" step of a MapReduce query?
Results are aggregated.
On what type of hardware does MapReduce typically provide fault-tolerant processing?
Clusters of commodity hardware.
What is Hadoop's relationship to the MapReduce paradigm?
It is the open-source implementation of it.
What are the two main features Apache Spark adds beyond simple map-reduce?
In-memory processing
Support for complex operation pipelines
What is the primary function of big data visualization tools?
Transforming raw data into charts, graphs, and interactive dashboards.
In what two ways can multidimensional data be represented?
OLAP cubes
Tensors (mathematical representation)
What type of database systems support the mathematical representation of multidimensional data as tensors?
Array database systems.
How do data-mining and integration query languages extend SQL to support heterogeneous sources?
By adding operators for semi-structured data.
Quiz
Question 1: In the MapReduce paradigm, what is the main purpose of the “map” step?
- It splits the input data across parallel nodes for independent processing. (correct)
- It aggregates the intermediate results from all nodes.
- It stores the final output in persistent storage.
- It visualizes the computation results.
Question 2: Which analytic technique is specifically used to compare two variants in big‑data experiments?
- A/B testing (correct)
- Machine learning
- Natural language processing
- Data mining
Question 3: Which representations are used for multidimensional data in big‑data environments?
- OLAP cubes and tensors (correct)
- Relational tables only
- Key‑value stores exclusively
- Document‑oriented databases
Question 4: How do memory‑resident implementations of MapReduce improve performance?
- By keeping intermediate data in RAM. (correct)
- By storing all intermediate data on SSDs.
- By reducing the number of map tasks.
- By compressing data before each reduce step.
Question 5: Which combination of technologies provides high‑throughput access to large data sets?
- Distributed file systems with burst buffers (correct)
- Local SSD drives
- Tape archival systems
- Cloud object storage without caching
Question 6: What kind of model represents multi‑dimensional arrays and enables multilinear subspace learning?
- Tensor‑based model (correct)
- Vector‑based model
- Decision‑tree model
- Linear‑regression model
Question 7: How do distributed parallel architectures achieve faster processing of big‑data workloads?
- By spreading data across many servers to enable concurrent computation (correct)
- By compressing data to reduce storage requirements
- By relying on a single high‑performance server for all tasks
- By employing quantum‑computing techniques for data analysis
Question 8: Data‑mining and integration query languages extend SQL with new operators. Which data format can now be directly queried using these operators?
- Semi‑structured data such as JSON or XML (correct)
- Only fully normalized relational tables
- Plain text logs without any structure
- Binary image or video files without metadata
Key Concepts
Big Data Technologies
MapReduce
Hadoop
Apache Spark
Distributed File System
Data Lake
Burst Buffer
Data Analysis Techniques
Machine Learning
Natural Language Processing
A/B Testing
Data Mining
OLAP Cube
Tensor (computing)
Definitions
MapReduce
A programming model that splits tasks across parallel nodes (map) and then aggregates the results (reduce) for fault‑tolerant large‑scale processing.
Hadoop
An open‑source framework that implements the MapReduce model and provides a distributed file system for big‑data storage and computation.
Apache Spark
A fast, in‑memory data‑processing engine that extends MapReduce with advanced APIs for streaming, machine learning, and graph analytics.
Data Lake
A centralized repository that stores raw, unstructured, and structured data at any scale, enabling flexible analysis and model building.
Distributed File System
A storage architecture that spreads data across multiple servers to provide high‑throughput, fault‑tolerant access to large data sets.
Burst Buffer
A high‑speed intermediate storage layer that accelerates I/O for data‑intensive applications by buffering data between compute nodes and permanent storage.
Machine Learning
A field of artificial intelligence that develops algorithms enabling computers to learn patterns from data and make predictions or decisions.
Natural Language Processing
A discipline that combines linguistics and computer science to enable machines to understand, interpret, and generate human language.
A/B Testing
An experimental method that compares two variants of a product or feature to determine which performs better based on statistical analysis.
Tensor (computing)
A multi‑dimensional array data structure used in high‑performance computing and deep learning to represent complex, multi‑modal data.
OLAP Cube
A multidimensional data model that allows fast analytical queries by pre‑aggregating data across multiple dimensions for business intelligence.
Data Mining
The process of discovering hidden patterns, correlations, and anomalies in large data sets using statistical and machine learning techniques.