Deep learning - Core Architectures and Training Techniques
Understand core deep learning architectures, essential training and optimization techniques, and the main challenges with regularization solutions.
Summary
Deep Learning Architectures and Training Methods
Introduction
Deep learning involves training artificial neural networks with multiple layers to solve complex tasks. This guide covers the main architectures used in practice, the methods for training them effectively, and the key challenges practitioners face. Understanding these concepts is essential for working with modern machine learning systems.
Deep Learning Architectures
Fully Connected Networks
A fully connected network (also called a dense network) is the simplest neural network architecture. Every neuron in one layer connects to every neuron in the next layer. Each connection has a weight that gets learned during training.
Fully connected networks work well for problems where the input data doesn't have inherent spatial structure—for example, predicting house prices from numerical features. However, they become impractical for images because they treat each pixel independently, missing the spatial relationships that make images meaningful.
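A minimal NumPy sketch of one dense layer's forward pass; the shapes and the ReLU activation are illustrative choices, not from the text:

```python
import numpy as np

def dense_forward(x, W, b):
    """Forward pass through one fully connected layer: every input
    feature contributes to every output unit via the weight matrix W."""
    return np.maximum(0.0, x @ W + b)  # ReLU activation

# Hypothetical shapes: 4 input features -> 3 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))   # one example
W = rng.normal(size=(4, 3))   # 4 * 3 = 12 learned connection weights
b = np.zeros(3)
h = dense_forward(x, W, b)
print(h.shape)  # (1, 3)
```

Note how the weight count grows with input size: a 224x224 RGB image already needs over 150,000 weights per hidden unit, which is why this architecture scales poorly to images.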
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are specifically designed to process images and other data with spatial structure. The key insight is that images have local patterns (like edges or textures) that matter more than individual pixels.
CNNs use convolutional layers, which apply small filters across the entire image. Each filter learns to detect a particular pattern. By using the same filter across different positions, CNNs are much more efficient than fully connected networks and can capture spatial hierarchies. Early layers detect simple patterns like edges, while deeper layers combine these to recognize complex shapes and objects.
CNNs also include downsampling layers (like pooling) that reduce the spatial dimensions, making the network faster and more robust to small shifts in the input.
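The convolution and pooling operations above can be sketched directly in NumPy; the 6x6 image and the 1x2 edge filter here are made-up examples:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one filter across the image (stride 1, no padding).
    The same weights are reused at every spatial position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2x2(x):
    """Downsample by taking the maximum over 2x2 windows."""
    H, W = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, -1.0]])    # responds to horizontal changes
fmap = conv2d_valid(image, edge_filter)  # (6, 5) feature map
pooled = max_pool2x2(fmap)               # (3, 2) after downsampling
```

The filter has only two weights regardless of image size, illustrating the efficiency gain over a fully connected layer.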
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) handle sequential data like text, speech, or time series. Unlike feedforward architectures where information only flows forward, RNNs contain cycles: the hidden state computed at one time step is fed back into the network at the next, allowing information to persist across time steps.
At each time step, an RNN takes a new input and combines it with information from previous time steps (stored in a hidden state). This allows the network to process sequences of variable length and maintain memory of past context. This is crucial for tasks like language modeling or machine translation, where the meaning depends on word order and context.
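A toy sketch of the recurrent update, assuming a tanh activation and made-up dimensions; the same weight matrices are reused at every time step:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrent step: combine the new input with the previous
    hidden state, so past context persists across time steps."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                    # initial hidden state
sequence = rng.normal(size=(7, input_dim))  # any sequence length works
for x_t in sequence:                        # same weights at every step
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print(h.shape)  # (5,)
```

Because the loop runs for however many steps the sequence has, the same network handles variable-length inputs.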
Transformers
Transformers are a more modern architecture that processes entire sequences in parallel using self-attention mechanisms. Instead of processing one element at a time like RNNs, transformers compute relationships between all pairs of elements simultaneously.
Self-attention allows the network to weigh the importance of different parts of the input when processing each element. For example, when processing the word "bank" in a sentence, self-attention helps determine whether it refers to a financial institution or a river bank by looking at surrounding context.
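Self-attention can be sketched as scaled dot-product attention, the standard transformer formulation; the sequence length and embedding size here are arbitrary:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole sequence at once.
    Every position attends to every other position in parallel."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # context-weighted values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))              # e.g. word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

The `scores` matrix holds all pairwise relationships at once, which is what allows the whole sequence to be processed in parallel rather than step by step.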
Transformers have become the dominant architecture for natural language processing tasks and power models like GPT and BERT. They're also increasingly used in vision and other domains.
<extrainfo>
Generative Adversarial Networks
Generative Adversarial Networks (GANs) consist of two competing neural networks: a generator and a discriminator. The generator tries to create realistic synthetic data (like images) from random noise, while the discriminator learns to distinguish between real and generated data. These two networks compete in a zero-sum game where improving one makes the other's job harder. This adversarial training process can produce remarkably realistic synthetic data.
</extrainfo>
Training Methods and Optimization
Stochastic Gradient Descent
Training a neural network means adjusting millions of weights to minimize the error on training data. Stochastic Gradient Descent (SGD) is the fundamental algorithm for this.
The basic idea is simple: compute how much the network's error changes with respect to each weight (the gradient), then adjust weights in the opposite direction of the gradient. The term "stochastic" means we don't use the entire dataset to compute gradients—instead, we use a small random sample each time. This introduces noise, but it actually helps training by allowing the algorithm to escape local minima.
The update rule is: $w \leftarrow w - \alpha \nabla L(w)$, where $\alpha$ is the learning rate (step size) and $\nabla L(w)$ is the gradient of the loss function.
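The update rule can be seen on a toy one-parameter regression; the data, the learning rate, and the true weight 2.0 are all invented for illustration:

```python
import numpy as np

# Fit y = 2x with a single weight by minimizing L(w) = (w*x - y)^2,
# taking one SGD step per example.
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = 2.0 * xs                     # hypothetical data, true weight 2.0

w, alpha = 0.0, 0.1               # initial weight and learning rate
for x, y in zip(xs, ys):
    grad = 2 * (w * x - y) * x    # dL/dw for this one example
    w -= alpha * grad             # w <- w - alpha * grad
print(round(w, 3))                # close to 2.0
```

Each single-example gradient is noisy, but the updates still drive the weight toward the true value.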
Mini-Batching
In practice, weights are rarely updated using a single example at a time. Instead, mini-batching computes gradients using a small batch of examples (typically 32-256 examples), then updates weights once based on the average gradient.
Mini-batching has two major advantages: First, it's much more computationally efficient because modern hardware can process multiple examples in parallel. Second, averaging gradients over multiple examples smooths out noisy updates, making training more stable and reliable.
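The same toy one-weight problem from the SGD section, now with averaged mini-batch gradients; the batch size and number of passes over the data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=256)
ys = 2.0 * xs                            # hypothetical data, true weight 2.0
batch_size, w, alpha = 32, 0.0, 0.1

for epoch in range(10):
    for start in range(0, len(xs), batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        grads = 2 * (w * xb - yb) * xb   # per-example gradients, in parallel
        w -= alpha * grads.mean()        # one update from the averaged gradient
```

The per-example gradients are computed as one vectorized operation (the parallelism benefit), and averaging them reduces the noise of any single example (the stability benefit).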
Learning Rate and Weight Initialization
The learning rate $\alpha$ controls how large a step we take when updating weights. This is a critical hyperparameter:
If $\alpha$ is too small, training progresses very slowly
If $\alpha$ is too large, the algorithm may overshoot and diverge
Effective learning rates typically range from 0.001 to 0.1, and many practitioners start with 0.01.
Weight initialization (setting initial weights before training) is equally important. If all weights start at zero or the same value, all neurons behave identically, which is useless. Instead, weights are randomly initialized with small values (often from a normal distribution). This breaks symmetry and allows different neurons to learn different features. Poor initialization can cause training to fail entirely.
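A quick demonstration of why zero initialization fails, using a single tanh layer with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Zero initialization: every hidden unit computes the same output,
# receives the same gradient, and so stays identical forever.
W_zero = np.zeros((4, 3))
h_zero = np.tanh(x @ W_zero)   # all units output the same value (zero)

# Small random initialization breaks the symmetry.
W_rand = rng.normal(scale=0.1, size=(4, 3))
h_rand = np.tanh(x @ W_rand)   # units differ and can specialize
```

With identical outputs and identical gradients, the three units would remain interchangeable no matter how long training runs; random initialization is what lets them diverge into distinct feature detectors.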
Regularization by Weight Decay
Weight decay is a simple regularization technique that prevents weights from growing too large. The idea is to add a penalty term proportional to the squared magnitude of weights to the loss function:
$$L_{\text{total}} = L_{\text{original}} + \lambda \sum_i w_i^2$$
where $\lambda$ controls the strength of the penalty. Large weights often indicate overfitting (fitting to noise), so penalizing them encourages the model to find simpler, more generalizable solutions.
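A sketch of the penalized loss and its gradient on a toy linear model; the data and the $\lambda$ value are invented, but the comparison shows the decayed weights end up smaller:

```python
import numpy as np

def grad_with_decay(w, X, y, lam):
    """Gradient of mean squared error plus lam * sum(w_i^2).
    The extra 2*lam*w term shrinks ('decays') each weight toward zero."""
    residual = X @ w - y
    return 2 * X.T @ residual / len(y) + 2 * lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=50)   # hypothetical targets

w_plain, w_decay = np.zeros(5), np.zeros(5)
for _ in range(200):
    w_plain -= 0.05 * grad_with_decay(w_plain, X, y, 0.0)  # no penalty
    w_decay -= 0.05 * grad_with_decay(w_decay, X, y, 0.1)  # with penalty
print(np.sum(w_plain**2) > np.sum(w_decay**2))   # True: decayed weights are smaller
```

This L2 penalty is the same objective known as ridge regression in the linear case.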
Dropout Regularization
Dropout randomly and temporarily disables a fraction of hidden units (typically 50%) during each training iteration. This prevents co-adaptation, where neurons become overly specialized and dependent on each other to work.
By forcing the network to learn redundant representations through dropout, the model becomes more robust. At test time, all units are active but their outputs are scaled down to account for the training-time dropout. Dropout is particularly effective for preventing overfitting in large networks.
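A sketch of dropout using the "inverted" variant common in practice, which rescales the surviving units during training so that test-time outputs need no adjustment (the text describes the original formulation, which instead scales outputs down at test time):

```python
import numpy as np

def dropout(h, p_drop, training, rng):
    """Inverted dropout: zero out a fraction p_drop of units during
    training and rescale the survivors to keep the expected activation."""
    if not training:
        return h                       # all units active at test time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)   # rescale surviving activations

rng = np.random.default_rng(0)
h = np.ones(10)
h_train = dropout(h, 0.5, training=True, rng=rng)   # roughly half zeroed
h_test = dropout(h, 0.5, training=False, rng=rng)   # unchanged
```

Because a different random mask is drawn every iteration, no unit can rely on any particular other unit being present.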
Data Augmentation
Data augmentation artificially expands the training set by applying realistic transformations to existing data. For images, this might include:
Cropping sections of images
Rotating images by small angles
Flipping images horizontally
Adjusting brightness or contrast
The key principle is that transformations should preserve the original label. Data augmentation effectively gives the network more diverse training examples, improving generalization. It's especially valuable when collecting more data is expensive.
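The listed transformations can be sketched for a hypothetical 32x32 grayscale image; the crop size and brightness range are arbitrary choices:

```python
import numpy as np

def augment(image, rng):
    """Apply simple label-preserving transformations to one image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]            # horizontal flip
    top = rng.integers(0, 5)              # random 28x28 crop from 32x32
    left = rng.integers(0, 5)
    image = image[top:top+28, left:left+28]
    factor = rng.uniform(0.8, 1.2)        # brightness adjustment
    return np.clip(image * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32))              # hypothetical grayscale image
augmented = augment(image, rng)
print(augmented.shape)  # (28, 28)
```

Each call produces a slightly different variant of the same image, and none of the transformations change what the image depicts, so the label is preserved.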
Challenges in Deep Learning
Overfitting Risk
Overfitting occurs when a network memorizes training data instead of learning general patterns. Deep networks are particularly prone to overfitting because each additional layer adds flexibility—and extra capacity can model rare, irrelevant patterns in the training data rather than true underlying relationships.
Signs of overfitting include: high accuracy on training data but poor performance on test data, or continued improvement on training data while validation performance plateaus. All the regularization techniques above (weight decay, dropout, data augmentation) exist primarily to combat overfitting.
The Vanishing Gradient Problem
When training very deep networks, gradients propagate backward through many layers via the chain rule. In this process, gradients can become exponentially smaller, approaching zero. When gradients vanish, weights in early layers barely update, making training extremely slow or ineffective.
This is particularly severe in RNNs, which apply the same operations repeatedly across time steps. The vanishing gradient problem was a major obstacle that limited deep learning for many years.
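The exponential shrinkage is easy to see numerically: the backpropagated gradient is a product of per-layer derivative terms, and if each term is at most 0.25 (the maximum of the sigmoid's derivative), fifty layers multiply it down to almost nothing:

```python
# Gradient magnitude after passing backward through n sigmoid layers,
# assuming the best case where each layer contributes sigmoid'(0) = 0.25.
grad = 1.0
for layer in range(50):
    grad *= 0.25
print(grad)  # ~7.9e-31: early layers receive essentially no signal
```

With the gradient this small, the weight updates in the early layers are negligible, which is exactly the training failure described above.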
Solutions to Vanishing Gradients
Two major solutions emerged:
Long Short-Term Memory (LSTM) networks address the vanishing gradient problem in RNNs through gated recurrent connections. LSTMs use special gates (input, forget, and output gates) that control information flow, allowing gradients to flow directly across many time steps without vanishing. This lets them learn long-range dependencies in sequences.
Residual connections (used in ResNets) add identity shortcuts that skip one or more layers. Instead of computing $y = f(x)$, a residual block computes $y = f(x) + x$. The identity term $x$ provides a direct path for gradients to flow backward, preventing them from vanishing. This allows networks to be trained successfully with hundreds of layers, far deeper than previous architectures.
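A residual block $y = f(x) + x$ in NumPy; the two-layer transform and the small initialization scale are illustrative:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = f(x) + x: the identity shortcut gives gradients a direct path
    backward, since dy/dx includes an identity term."""
    f = np.maximum(0.0, x @ W1) @ W2   # small two-layer transform f(x)
    return f + x                       # add the input back unchanged

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
# Near-zero weights: the block starts out close to the identity function,
# so even a very deep stack of such blocks passes signal through cleanly.
W1 = rng.normal(scale=0.01, size=(d, d))
W2 = rng.normal(scale=0.01, size=(d, d))
y = residual_block(x, W1, W2)
print(np.allclose(y, x, atol=0.01))  # True
```

Because the derivative of the `+ x` term is the identity, the gradient reaching earlier layers never has to pass entirely through `f`, which is what prevents it from vanishing.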
Both solutions became foundational techniques that enabled modern deep learning.
<extrainfo>
Computational Cost
Training state-of-the-art deep networks requires enormous computational resources. Large models can take weeks to train on specialized hardware like GPUs or TPUs, and the energy consumption is substantial. This high cost motivates research into more efficient architectures, better algorithms, and specialized hardware. However, the computational expense also creates barriers to entry and raises environmental concerns.
</extrainfo>
Flashcards
How are neurons connected in a fully connected network?
Every neuron in one layer connects to every neuron in the next layer.
What structural feature of recurrent neural networks allows them to process sequences?
Cycles in their connectivity.
Which types of layers do convolutional neural networks use to extract spatial hierarchies from images?
Convolutional layers
Down-sampling layers
In a generative adversarial network, what is the role of the generator?
It generates synthetic data (such as images) from random noise, trying to fool the discriminator.
What type of game do the generator and discriminator compete in within a GAN?
A zero-sum game.
Which mechanism do transformers use to process sequences?
Self-attention mechanisms.
How does stochastic gradient descent update network weights?
By computing gradients on small random batches of training examples.
What are the two primary benefits of using mini-batching during training?
Speeds up computation
Smooths updates
What does the learning rate control during the training of a neural network?
The step size of weight updates.
What is the primary purpose of using random weight initialization?
To break symmetry among neurons.
How does weight decay discourage overly large parameters?
It adds a penalty proportional to the squared magnitude of the weights.
How does dropout prevent the co-adaptation of features during training?
By randomly omitting hidden units during each training iteration.
What happens to gradients in very deep networks as they propagate backward?
They can diminish (vanish), making training difficult.
How do Long Short-Term Memory (LSTM) networks preserve gradients over long sequences?
By using gated recurrent connections.
How do residual connections in ResNet solve the vanishing gradient problem?
They allow gradients to flow directly through identity shortcuts.
What two resources are required in large amounts to train deep networks, motivating the need for specialized hardware?
Compute time
Memory
Quiz
Deep learning - Core Architectures and Training Techniques Quiz Question 1: What is the defining characteristic of a fully connected (dense) neural network layer?
- Every neuron in one layer connects to every neuron in the next layer. (correct)
- Neurons are arranged in a grid and share weights across spatial locations.
- Only a subset of neurons are connected based on a random mask.
- Connections are recurrent, allowing cycles in the network.
Deep learning - Core Architectures and Training Techniques Quiz Question 2: Which mechanism allows Transformers to process whole sequences in parallel without using recurrence?
- Self‑attention mechanisms (correct)
- Convolutional filters
- Recurrent connections
- Pooling layers
Key Concepts
Neural Network Architectures
Fully Connected Network
Recurrent Neural Network
Convolutional Neural Network
Generative Adversarial Network
Transformer
Long Short‑Term Memory
Residual Network (ResNet)
Optimization and Regularization Techniques
Stochastic Gradient Descent
Dropout
Weight Decay
Data Augmentation
Vanishing Gradient Problem
Definitions
Fully Connected Network
A neural network architecture where each neuron in one layer is connected to every neuron in the subsequent layer.
Recurrent Neural Network
A network with cyclic connections that enables processing of sequential data by maintaining internal state.
Convolutional Neural Network
An architecture that uses convolutional and pooling layers to learn spatial hierarchies in image data.
Generative Adversarial Network
A framework consisting of a generator and a discriminator that compete in a zero‑sum game to model data distributions.
Transformer
A sequence‑processing model that relies on self‑attention mechanisms and has become dominant in natural language processing.
Stochastic Gradient Descent
An optimization method that updates model parameters using gradients computed on randomly selected mini‑batches.
Dropout
A regularization technique that randomly deactivates hidden units during training to prevent co‑adaptation of features.
Weight Decay
A regularization method that adds a penalty proportional to the squared magnitude of weights, discouraging large parameters.
Data Augmentation
The practice of expanding a training dataset by applying transformations such as cropping, rotation, or flipping.
Vanishing Gradient Problem
The difficulty in training very deep networks because gradients shrink exponentially as they propagate backward.
Long Short‑Term Memory
A gated recurrent network architecture designed to preserve gradients over long sequences and mitigate vanishing gradients.
Residual Network (ResNet)
A deep convolutional architecture that uses identity shortcut connections to allow gradients to flow directly through layers.