
Study Guide

📖 Core Concepts

- Deep Learning: A subfield of machine learning that uses neural networks with many layers (the “deep” part) to learn hierarchical representations automatically.
- Hierarchical Feature Transformation: Data passes through successive layers, each extracting more abstract features (e.g., edges → shapes → objects).
- Credit Assignment Path (CAP): The chain of transformations from input to output; depth = number of hidden layers + 1. Depth > 2 is generally considered “deep”.
- Learning Paradigms: Training can be supervised, semi-supervised, or unsupervised.
- Universal Approximation: A single-hidden-layer feed-forward net of finite size can approximate any continuous function, but deep nets are far more parameter-efficient.

📌 Must Remember

- Deep ≈ many hidden layers (often > 3, up to thousands).
- CAP depth > 2 → deep learning; depth = 2 → universal approximator but “shallow”.
- Key architectures: Fully-connected (FC), Convolutional (CNN), Recurrent (RNN/LSTM), Generative Adversarial (GAN), Transformer.
- Training staples: Stochastic Gradient Descent (SGD), mini-batching, learning rate, random weight initialization, weight decay, dropout, data augmentation.
- Regularization combats overfitting; vanishing gradients are mitigated by LSTM gates, ResNet shortcuts, and careful initialization.
- Major milestones: AlexNet (2012), GAN (2014), ResNet (2015), Transformer (self-attention, NLP).

🔄 Key Processes

- Weight Update (SGD): $$ w \leftarrow w - \eta \, \nabla_{w} L(\mathbf{x}, \mathbf{y}) $$ where η is the learning rate and L is the loss on a mini-batch.
- Forward Pass: Input → layer-wise transformations → output (e.g., Softmax for classification).
- Backward Pass (Back-propagation): Compute gradients of the loss w.r.t. each weight and propagate them back through the CAP.
- Dropout Regularization: Randomly zero out hidden units each iteration; at test time, scale activations by the keep-probability.
- Residual Connection (ResNet): Add an identity shortcut, $ \mathbf{y}=F(\mathbf{x})+\mathbf{x} $, to preserve gradient flow.

🔍 Key Comparisons
- FC vs. CNN – FC: every neuron connects to all neurons in the next layer; CNN: local receptive fields + weight sharing → spatial hierarchy, far fewer parameters.
- RNN vs. LSTM – RNN: simple recurrence, suffers from vanishing gradients; LSTM: gated cells preserve long-range dependencies.
- GAN vs. Autoencoder – GAN: generator vs. discriminator in a zero-sum game, learns to sample realistic data; Autoencoder: reconstructs its input, learns compact latent codes.
- Transformer vs. RNN – Transformer: self-attention processes all positions in parallel; RNN: sequential processing, harder to parallelize.

⚠️ Common Misunderstandings

- “Deep = better” – More layers increase capacity but also increase the risk of overfitting and training difficulty.
- “One hidden layer is enough” – The universal approximation theorem is true, yet deep nets achieve the same accuracy with far fewer parameters.
- “Dropout is a regularizer only at training” – At inference, activations must be scaled; forgetting this yields mis-calibrated predictions.
- “Vanishing gradients only affect RNNs” – Very deep feed-forward nets also suffer; residual connections address both.

🧠 Mental Models / Intuition

- Layer as “feature extractor”: early layers = edges/phonemes; deeper layers = objects/concepts.
- CAP as a pipeline: think of water flowing through pipes; each pipe (layer) reshapes the flow (representation).
- Residual shortcut = an “express lane” for gradients, letting them bypass many “traffic lights” (layers).

🚩 Exceptions & Edge Cases

- Weight Decay vs. L2 Regularization – Mathematically the same effect, but implementations differ (explicit loss penalty vs. optimizer-level update).
- Batch size extremes: very small batches → noisy gradients (good for exploration); very large batches → smoother updates but may need learning-rate scaling.
- Transfer learning: CNNs pre-trained on ImageNet work well even for unrelated visual tasks, but may fail if the target domain is dramatically different (e.g., medical imaging with different modalities).
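The dropout pitfall above can be sketched in a few lines of NumPy. This shows classic (non-inverted) dropout, where test-time activations are scaled by the keep-probability; the function names and `keep_prob` value are illustrative, not from any particular framework (modern libraries usually use “inverted” dropout, scaling by 1/keep_prob at training time instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, keep_prob=0.8):
    # Training: each unit survives independently with probability keep_prob,
    # otherwise it is zeroed out for this iteration.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask

def dropout_inference(activations, keep_prob=0.8):
    # Inference: no units are dropped, so activations are scaled by keep_prob
    # to match the expected magnitude the next layer saw during training.
    return activations * keep_prob

h = np.ones(10_000)
train_mean = dropout_train(h).mean()      # close to keep_prob on average
test_mean = dropout_inference(h).mean()   # exactly keep_prob
```

Skipping the inference scaling leaves every activation roughly 1/keep_prob too large, which is the source of the mis-calibrated predictions mentioned above.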
📍 When to Use Which

- Image data → CNN (spatial hierarchies).
- Sequential audio/text → LSTM/GRU or Transformer (self-attention for long dependencies).
- Generating realistic samples → GAN (generator-discriminator game).
- Limited labeled data → semi-supervised methods, data augmentation, transfer learning.
- High-dimensional structured data (graphs, molecules) → Graph Neural Networks.

👀 Patterns to Recognize

- Edge → shape → object progression in CNN activation maps.
- Vanishing-gradient symptoms: loss plateaus early; early layers’ weights change little.
- Adversarial vulnerability: tiny pixel perturbations cause large output swings; look for unusually high confidence on odd inputs.
- Overfitting signature: training accuracy ≫ validation accuracy, especially after many epochs.

🗂️ Exam Traps

- “Any network with one hidden layer can replace a deep network.” – True for approximation theory, false for practical efficiency and generalization.
- Confusing weight decay with learning-rate decay – They are unrelated; weight decay penalizes large weights, learning-rate decay shrinks the step size over time.
- Assuming dropout works the same for convolutional layers – In practice dropout is most often applied after fully-connected layers; spatial dropout is a variant for CNNs.
- Mixing up ResNet depth (hundreds of layers) with shallow universal approximators – Residual shortcuts are precisely what make those very deep nets trainable.
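The weight-decay vs. learning-rate-decay trap can be made concrete with a minimal SGD sketch. This is an illustrative NumPy implementation of the update rule from Key Processes, with hypothetical function names; it is not any library’s API:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    # Weight decay adds weight_decay * w to the gradient (equivalent to an
    # L2 penalty (weight_decay/2) * ||w||^2 in the loss): it shrinks large
    # weights toward zero on every step.
    return w - lr * (grad + weight_decay * w)

def decayed_lr(lr0, step, decay_rate=0.99):
    # Learning-rate decay changes the step size eta over time; it never
    # touches the weights directly.
    return lr0 * decay_rate ** step

w = np.array([2.0, -1.0])
grad = np.array([0.5, 0.5])
w = sgd_step(w, grad, lr=decayed_lr(0.1, step=0), weight_decay=0.01)
```

Note the separation: `weight_decay` modifies *what* is subtracted (an extra pull toward zero), while `decayed_lr` modifies *how large* each step is.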