Deep Learning Study Guide
📖 Core Concepts
Deep Learning: A subfield of machine learning that uses neural networks with many layers (the “deep” part) to learn hierarchical representations automatically.
Hierarchical Feature Transformation: Data passes through successive layers, each extracting progressively more abstract features (e.g., edges → shapes → objects).
Credit Assignment Path (CAP): The chain of transformations from input to output; depth = #hidden layers + 1. Depth > 2 is generally considered “deep”.
Learning Paradigms: Can be supervised, semi‑supervised, or unsupervised.
Universal Approximation: A single‑hidden‑layer feed‑forward net of finite size can approximate any continuous function, but deep nets are far more parameter‑efficient.
📌 Must Remember
Deep ≈ many hidden layers (often > 3, up to thousands).
CAP depth > 2 → deep learning; depth = 2 → universal approximator but “shallow”.
Key architectures: Fully‑connected (FC), Convolutional (CNN), Recurrent (RNN/LSTM), Generative Adversarial (GAN), Transformer.
Training staples: Stochastic Gradient Descent (SGD), mini‑batching, learning‑rate, random weight init, weight decay, dropout, data augmentation.
Regularization combats overfitting; vanishing gradients are mitigated by LSTM gates, ResNet shortcuts, and careful initialization.
Major milestones: AlexNet (2012), GAN (2014), ResNet (2015), Transformer (2017, self‑attention).
🔄 Key Processes
Weight Update (SGD)
$$ w \leftarrow w - \eta \, \nabla_{w} L(\mathbf{x}, \mathbf{y}) $$
– η: learning rate; L: loss on a mini‑batch.
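The update rule above can be sketched in NumPy on a toy linear least‑squares model; `sgd_step`, `eta`, and the synthetic data are illustrative assumptions, not part of the source:

```python
import numpy as np

# One SGD step: w <- w - eta * grad_w L(X, y), for a linear model with
# mean-squared-error loss. A minimal sketch, not a full training loop API.
def sgd_step(w, X, y, eta=0.01):
    preds = X @ w                          # forward pass
    grad = 2 * X.T @ (preds - y) / len(y)  # gradient of MSE w.r.t. w
    return w - eta * grad                  # step against the gradient

# Recover known weights from noiseless synthetic data.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(64, 2))
y = X @ w_true
w = np.zeros(2)
for _ in range(500):
    w = sgd_step(w, X, y, eta=0.1)
```

In full mini‑batch SGD the gradient is computed on a random subset of the data each step; here the whole (tiny) dataset serves as the batch.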
Forward Pass – Input → layer‑wise transformations → output (e.g., Softmax for classification).
Backward Pass (Back‑propagation) – Compute gradients of loss w.r.t. each weight, propagate through CAP.
Dropout Regularization – Randomly zero out hidden units each iteration; at test time scale activations by keep‑probability.
Residual Connection (ResNet) – Add identity shortcut: $ \mathbf{y}=F(\mathbf{x})+\mathbf{x} $ to preserve gradient flow.
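The dropout and residual steps above can be sketched in NumPy. Note that modern implementations usually use “inverted” dropout, which rescales during training so inference needs no adjustment; `keep_prob` and `F` are illustrative names:

```python
import numpy as np

# Inverted dropout: zero units randomly during training and rescale by
# 1/keep_prob so the expected activation is unchanged; at inference the
# input passes through untouched.
def dropout(h, keep_prob=0.5, training=True, rng=None):
    if not training:
        return h
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

# Residual connection: y = F(x) + x, the identity shortcut that lets
# gradients bypass the transformation F.
def residual_block(x, F):
    return F(x) + x
```

With the identity shortcut, even if the gradient through `F` vanishes, the `+ x` term still carries gradient straight back to earlier layers.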
🔍 Key Comparisons
FC vs. CNN – FC: every neuron connects to all neurons in next layer; CNN: local receptive fields + weight sharing → spatial hierarchy, far fewer parameters.
RNN vs. LSTM – RNN: simple recurrence, suffers from vanishing gradients; LSTM: gated cells preserve long‑range dependencies.
GAN vs. Autoencoder – GAN: generator vs. discriminator in a zero‑sum game, learns to sample realistic data; Autoencoder: reconstructs input, learns compact latent codes.
Transformer vs. RNN – Transformer: self‑attention processes all positions in parallel; RNN: sequential processing, harder to parallelize.
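The Transformer side of the last comparison can be sketched as single‑head scaled dot‑product self‑attention, with no learned projections (an illustrative simplification); every position attends to every other in one matrix product, with no sequential recurrence:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over all positions at once.

    X has shape (positions, features). Using X as query, key, and value
    (no learned projections) keeps the sketch minimal.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                        # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ X                                   # mix of all positions
```

An RNN would instead compute a hidden state position by position, which is exactly the sequential dependency that limits parallelism.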
⚠️ Common Misunderstandings
“Deep = better” – More layers increase capacity but also risk overfitting and training difficulty.
“One hidden layer is enough” – The universal approximation theorem holds, yet deep nets can reach the same accuracy with far fewer parameters.
“Dropout is a regularizer only at training” – With classic dropout, activations must be scaled by the keep‑probability at inference (inverted dropout performs this scaling during training instead); forgetting the adjustment yields mis‑calibrated predictions.
“Vanishing gradients only affect RNNs” – Very deep feed‑forward nets also suffer; residual connections and careful initialization mitigate the problem in both settings.
🧠 Mental Models / Intuition
Layer as “feature extractor”: early layers = edges/phonemes; deeper layers = objects/concepts.
CAP as a pipeline: think of water flowing through pipes; each pipe (layer) reshapes the flow (representation).
Residual shortcut = “express lane” for gradients, allowing them to bypass many “traffic lights” (layers).
🚩 Exceptions & Edge Cases
Weight Decay vs. L2 Regularization – Equivalent for plain SGD, but not for adaptive optimizers such as Adam; hence decoupled weight decay (AdamW) as a separate, optimizer‑level mechanism.
Batch size extremes: very small batches → noisy gradients (good for exploration); very large batches → smoother updates but may need learning‑rate scaling.
Transfer learning: pre‑trained CNNs on ImageNet work well even for unrelated visual tasks, but may fail if target domain is dramatically different (e.g., medical imaging with different modalities).
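The batch‑size edge case above is often handled with the “linear scaling” heuristic: scale the learning rate in proportion to the batch size. A minimal sketch; the base values are illustrative assumptions:

```python
# Linear scaling heuristic: if the batch grows by a factor k, grow the
# learning rate by the same factor k (often combined with warm-up).
def scaled_lr(base_lr, base_batch, batch):
    return base_lr * batch / base_batch

# 4x the batch -> 4x the learning rate.
lr = scaled_lr(0.1, 256, 1024)
```

This is a heuristic, not a guarantee: very large batches may still need warm‑up schedules or other adjustments to train stably.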
📍 When to Use Which
Image data → CNN (spatial hierarchies).
Sequential audio/text → LSTM/GRU or Transformer (self‑attention for long dependencies).
Generating realistic samples → GAN (generator‑discriminator game).
Limited labeled data → Semi‑supervised methods, data augmentation, transfer learning.
High‑dimensional structured data (graphs, molecules) → Graph Neural Networks.
👀 Patterns to Recognize
Edge → Shape → Object pattern in CNN activation maps.
Vanishing gradient symptoms: loss plateaus early, early layers’ weights change little.
Adversarial vulnerability: tiny pixel perturbations cause large output swings; look for unusually high confidence on odd inputs.
Overfitting signature: training accuracy ≫ validation accuracy, especially after many epochs.
🗂️ Exam Traps
“Any network with one hidden layer can replace a deep network.” – True for approximation theory, false for practical efficiency and generalization.
Confusing weight decay with learning‑rate decay. – They are unrelated; weight decay penalizes large weights, learning‑rate decay changes step size.
Assuming dropout works the same for convolutional layers. – In practice dropout is often applied after fully‑connected layers; spatial dropout is a variant for CNNs.
Mixing up ResNet depth (hundreds of layers) with shallow universal approximators. – Residual shortcuts specifically enable training of those very deep nets.