RemNote Community

Deep learning - Core Architectures and Training Techniques

Understand core deep learning architectures, essential training and optimization techniques, and the main challenges together with the regularization solutions that address them.


Summary

Deep Learning Architectures and Training Methods

Introduction

Deep learning involves training artificial neural networks with multiple layers to solve complex tasks. This guide covers the main architectures used in practice, the methods for training them effectively, and the key challenges practitioners face. Understanding these concepts is essential for working with modern machine learning systems.

Deep Learning Architectures

Fully Connected Networks

A fully connected network (also called a dense network) is the simplest neural network architecture. Every neuron in one layer connects to every neuron in the next layer, and each connection has a weight that is learned during training. Fully connected networks work well for problems where the input data doesn't have inherent spatial structure, for example predicting house prices from numerical features. However, they become impractical for images because they treat each pixel independently, missing the spatial relationships that make images meaningful.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are specifically designed to process images and other data with spatial structure. The key insight is that images have local patterns (like edges or textures) that matter more than individual pixels. CNNs use convolutional layers, which apply small filters across the entire image. Each filter learns to detect a particular pattern. By using the same filter across different positions, CNNs are much more efficient than fully connected networks and can capture spatial hierarchies: early layers detect simple patterns like edges, while deeper layers combine these to recognize complex shapes and objects. CNNs also include downsampling layers (like pooling) that reduce the spatial dimensions, making the network faster and more robust to small shifts in the input.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) handle sequential data like text, speech, or time series.
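Stepping back to the fully connected case for a concrete example, the dense forward pass described above can be sketched in a few lines of NumPy. The layer sizes and random weights here are purely illustrative, not a trained model:

```python
import numpy as np

# Toy fully connected (dense) layer: every input unit feeds every output unit.
rng = np.random.default_rng(0)

x = rng.normal(size=(4, 3))        # batch of 4 examples, 3 features each
W = rng.normal(size=(3, 5)) * 0.1  # small random initialization breaks symmetry
b = np.zeros(5)

# Forward pass: affine transform x @ W + b followed by a ReLU nonlinearity.
h = np.maximum(0.0, x @ W + b)

print(h.shape)  # (4, 5): each example now has 5 hidden activations
```

Every entry of `W` participates in every output, which is exactly why this layout becomes expensive when the input is a full image.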
Unlike feedforward architectures, where information only flows forward, RNNs have cycles in their connectivity that feed each step's state back into the network, allowing information to persist across time steps. At each time step, an RNN takes a new input and combines it with information from previous time steps (stored in a hidden state). This allows the network to process sequences of variable length and maintain memory of past context, which is crucial for tasks like language modeling or machine translation, where the meaning depends on word order and context.

Transformers

Transformers are a more modern architecture that processes entire sequences in parallel using self-attention mechanisms. Instead of processing one element at a time like RNNs, transformers compute relationships between all pairs of elements simultaneously. Self-attention allows the network to weigh the importance of different parts of the input when processing each element. For example, when processing the word "bank" in a sentence, self-attention helps determine whether it refers to a financial institution or a river bank by looking at surrounding context. Transformers have become the dominant architecture for natural language processing tasks and power models like GPT and BERT. They're also increasingly used in vision and other domains.

<extrainfo>
Generative Adversarial Networks

Generative Adversarial Networks (GANs) consist of two competing neural networks: a generator and a discriminator. The generator tries to create realistic synthetic data (like images) from random noise, while the discriminator learns to distinguish between real and generated data. These two networks compete in a zero-sum game where improving one makes the other's job harder. This adversarial training process can produce remarkably realistic synthetic data.
</extrainfo>

Training Methods and Optimization

Stochastic Gradient Descent

Training a neural network means adjusting millions of weights to minimize the error on training data.
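Before turning to training, the pairwise-comparison idea behind self-attention from the Transformers section can be sketched as follows. This deliberately simplified version omits the learned query/key/value projections and multiple heads that real transformers use:

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention: every position attends to every position.

    Real transformers first project X into separate query, key, and value
    matrices; this sketch compares the raw inputs directly for clarity.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # similarity of every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ X                              # each output mixes all positions

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 "tokens", 3 features each
out = self_attention(X)
print(out.shape)  # (4, 3): same shape, but each row now blends in context
```

Because every output row is a weighted average over all input rows, every position "sees" the whole sequence in a single parallel step, unlike an RNN.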
Stochastic Gradient Descent (SGD) is the fundamental algorithm for this. The basic idea is simple: compute how much the network's error changes with respect to each weight (the gradient), then adjust weights in the opposite direction of the gradient. The term "stochastic" means we don't use the entire dataset to compute gradients; instead, we use a small random sample each time. This introduces noise, but it actually helps training by allowing the algorithm to escape local minima. The update rule is $w \leftarrow w - \alpha \nabla L(w)$, where $\alpha$ is the learning rate (step size) and $\nabla L(w)$ is the gradient of the loss function.

Mini-Batching

In practice, weights are rarely updated from a single example at a time. Instead, mini-batching computes gradients over a small batch of examples (typically 32-256), then updates weights once based on the average gradient. Mini-batching has two major advantages. First, it's much more computationally efficient because modern hardware can process multiple examples in parallel. Second, averaging gradients over multiple examples smooths out noisy updates, making training more stable and reliable.

Learning Rate and Weight Initialization

The learning rate $\alpha$ controls how large a step we take when updating weights. This is a critical hyperparameter: if $\alpha$ is too small, training progresses very slowly; if $\alpha$ is too large, the algorithm may overshoot and diverge. Effective learning rates typically range from 0.001 to 0.1, and many practitioners start with 0.01. Weight initialization (setting initial weights before training) is equally important. If all weights start at zero or the same value, all neurons behave identically, which is useless. Instead, weights are randomly initialized with small values (often drawn from a normal distribution). This breaks symmetry and allows different neurons to learn different features. Poor initialization can cause training to fail entirely.
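The SGD update rule and mini-batching described above can be combined in a short sketch. The problem here is a toy linear least-squares fit with made-up data and hyperparameters, chosen only so the loop converges quickly:

```python
import numpy as np

# Mini-batch SGD on a least-squares problem: w <- w - alpha * grad L(w),
# with the gradient averaged over a small random batch each step.
rng = np.random.default_rng(0)

true_w = np.array([2.0, -3.0])
X = rng.normal(size=(256, 2))
y = X @ true_w + 0.01 * rng.normal(size=256)  # slightly noisy targets

w = np.zeros(2)     # (zero init is fine here: a linear model has no symmetry to break)
alpha = 0.05        # learning rate: too small is slow, too large diverges
batch_size = 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # average gradient of squared error
    w -= alpha * grad

print(w)  # close to the true weights [2.0, -3.0]
```

Each step sees only 32 of the 256 examples, so individual updates are noisy, yet the averaged direction still pulls `w` toward the minimum.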
Regularization by Weight Decay

Weight decay is a simple regularization technique that prevents weights from growing too large. The idea is to add a penalty term proportional to the squared magnitude of the weights to the loss function:

$$L_{\text{total}} = L_{\text{original}} + \lambda \sum_i w_i^2$$

where $\lambda$ controls the strength of the penalty. Large weights often indicate overfitting (fitting to noise), so penalizing them encourages the model to find simpler, more generalizable solutions.

Dropout Regularization

Dropout randomly and temporarily disables a fraction of hidden units (typically 50%) during each training iteration. This prevents co-adaptation, where neurons become overly specialized and dependent on each other to work. By forcing the network to learn redundant representations, dropout makes the model more robust. At test time, all units are active but their outputs are scaled down to account for the training-time dropout. Dropout is particularly effective for preventing overfitting in large networks.

Data Augmentation

Data augmentation artificially expands the training set by applying realistic transformations to existing data. For images, this might include cropping sections of images, rotating images by small angles, flipping images horizontally, and adjusting brightness or contrast. The key principle is that transformations should preserve the original label. Data augmentation effectively gives the network more diverse training examples, improving generalization. It's especially valuable when collecting more data is expensive.

Challenges in Deep Learning

Overfitting Risk

Overfitting occurs when a network memorizes training data instead of learning general patterns. Deep networks are particularly prone to overfitting because each additional layer adds flexibility, and that extra capacity can model rare, irrelevant patterns in the training data rather than true underlying relationships.
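Returning to dropout for a moment: a widely used implementation variant, "inverted" dropout, rescales the surviving units at training time instead of scaling outputs down at test time, so the test-time forward pass needs no adjustment. A minimal sketch:

```python
import numpy as np

# Inverted dropout: zero a random fraction of activations during training
# and rescale the survivors so the expected activation is unchanged.
rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    if not training:
        return h                        # test time: all units active, no scaling needed
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)    # survivors scaled up by 1 / (1 - p_drop)

h = np.ones((2, 8))
print(dropout(h, p_drop=0.5))       # roughly half the entries zeroed, the rest become 2.0
print(dropout(h, training=False))   # unchanged at test time
```

Because a different random mask is drawn every iteration, no hidden unit can rely on any particular other unit being present, which is what breaks co-adaptation.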
Signs of overfitting include high accuracy on training data but poor performance on test data, or continued improvement on training data while validation performance plateaus. All the regularization techniques above (weight decay, dropout, data augmentation) exist primarily to combat overfitting.

The Vanishing Gradient Problem

When training very deep networks, gradients propagate backward through many layers via the chain rule. In this process, gradients can become exponentially smaller, approaching zero. When gradients vanish, weights in early layers barely update, making training extremely slow or ineffective. This is particularly severe in RNNs, which apply the same operations repeatedly across time steps. The vanishing gradient problem was a major obstacle that limited deep learning for many years.

Solutions to Vanishing Gradients

Two major solutions emerged. Long Short-Term Memory (LSTM) networks address the vanishing gradient problem in RNNs through gated recurrent connections. LSTMs use special gates (input, forget, and output gates) that control information flow, allowing gradients to flow directly across many time steps without vanishing. This lets them learn long-range dependencies in sequences. Residual connections (used in ResNets) add identity shortcuts that skip one or more layers. Instead of computing $y = f(x)$, a residual block computes $y = f(x) + x$. The identity term $x$ provides a direct path for gradients to flow backward, preventing them from vanishing. This allows networks to be trained successfully with hundreds of layers, far deeper than previous architectures. Both solutions became foundational techniques that enabled modern deep learning.

<extrainfo>
Computational Cost

Training state-of-the-art deep networks requires enormous computational resources. Large models can take weeks to train on specialized hardware like GPUs or TPUs, and the energy consumption is substantial.
This high cost motivates research into more efficient architectures, better algorithms, and specialized hardware. However, the computational expense also creates barriers to entry and raises environmental concerns.
</extrainfo>
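The residual connection $y = f(x) + x$ described above can be sketched directly. Here $f$ is a toy two-layer transform with made-up small weights, just to show the shape-preserving shortcut:

```python
import numpy as np

# A residual block computes y = f(x) + x: the identity shortcut gives
# gradients a direct path backward past the learned transform f.
rng = np.random.default_rng(0)

d = 4
W1 = rng.normal(size=(d, d)) * 0.1  # illustrative small weights, not trained
W2 = rng.normal(size=(d, d)) * 0.1

def residual_block(x):
    f_x = np.maximum(0.0, x @ W1) @ W2  # the learned transform f(x)
    return f_x + x                      # identity shortcut added back

x = np.ones(d)
y = residual_block(x)
print(y.shape)  # (4,): the output keeps the input's shape, by construction
```

Note that $f$ must preserve the input's shape for the addition to work, which is why residual blocks keep the same width across the shortcut.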
Flashcards
How are neurons connected in a fully connected network?
Every neuron in one layer connects to every neuron in the next layer.
What structural feature of recurrent neural networks allows them to process sequences?
Cycles in their connectivity.
Which types of layers do convolutional neural networks use to extract spatial hierarchies from images?
Convolutional layers and down-sampling (pooling) layers.
In a generative adversarial network, what is the role of the generator?
It creates realistic synthetic data (such as images) from random noise.
What type of game do the generator and discriminator compete in within a GAN?
A zero-sum game.
Which mechanism do transformers use to process sequences?
Self-attention mechanisms.
How does stochastic gradient descent update network weights?
By computing gradients on small random batches of training examples.
What are the two primary benefits of using mini-batching during training?
It speeds up computation (parallel hardware) and smooths out noisy updates.
What does the learning rate control during the training of a neural network?
The step size of weight updates.
What is the primary purpose of using random weight initialization?
To break symmetry among neurons.
How does weight decay discourage overly large parameters?
It adds a penalty proportional to the squared magnitude of the weights.
How does dropout prevent the co-adaptation of features during training?
By randomly omitting hidden units during each training iteration.
What happens to gradients in very deep networks as they propagate backward?
They can diminish (vanish), making training difficult.
How do Long Short-Term Memory (LSTM) networks preserve gradients over long sequences?
By using gated recurrent connections.
How do residual connections in ResNet solve the vanishing gradient problem?
They allow gradients to flow directly through identity shortcuts.
What two resources are required in large amounts to train deep networks, motivating the need for specialized hardware?
Compute time and energy.

Key Concepts
Neural Network Architectures
Fully Connected Network
Recurrent Neural Network
Convolutional Neural Network
Generative Adversarial Network
Transformer
Long Short‑Term Memory
Residual Network (ResNet)
Optimization and Regularization Techniques
Stochastic Gradient Descent
Dropout
Weight Decay
Data Augmentation
Vanishing Gradient Problem