
Study Guide

📖 Core Concepts

- Machine Learning (ML) – Algorithms that learn patterns from data and improve performance on unseen inputs without explicit programming.
- Supervised Learning – Uses labeled data to learn a mapping \(x \rightarrow y\); tasks: classification (categorical) and regression (continuous).
- Unsupervised Learning – Finds structure in unlabeled data (e.g., clustering, dimensionality reduction).
- Reinforcement Learning – Trains an agent to maximise cumulative reward through interaction with an environment modeled as a Markov Decision Process (MDP).
- Empirical Risk Minimisation (ERM) – Choose a hypothesis \(f\) that minimises the average loss
  $$\min_{f}\ \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)$$
  where \(L\) is a loss function.
- Generalisation vs. Over/Under-fitting – Generalisation = good performance on new data; under-fitting = model too simple; over-fitting = model too complex, captures noise.
- Bias–Variance Trade-off – Error = Bias\(^2\) + Variance + irreducible noise; increasing model complexity reduces bias but raises variance.
- Performance Metrics – Accuracy, precision, recall, F1, ROC-AUC for classification; MSE/MAE for regression.
- Model Evaluation – Holdout split, k-fold cross-validation, bootstrap sampling.

📌 Must Remember

- ML Definition (Tom Mitchell) – A program learns from experience \(E\) with respect to tasks \(T\) and performance measure \(P\) if its performance on \(T\), as measured by \(P\), improves with experience \(E\).
- ERM Objective – Minimise the average loss over the training set.
- PAC Learning – Provides probabilistic bounds on algorithm performance; feasible problems are learnable in polynomial time.
- Decision-Tree Ensembles – Random forests average many trees to reduce variance and avoid over-fitting.
- SVM Margin – The optimal hyperplane maximises the margin (distance) between classes.
- Kernel Trick – Implicitly maps data to a high-dimensional space without explicit computation.
- Cross-Validation – With \(k\) folds, each observation is used for validation exactly once; the average metric across folds estimates true performance.
- Bias Sources – Training data reflecting societal prejudices → algorithmic bias.

🔄 Key Processes

(Illustrative code sketches for several of these processes follow the list.)

- Training via ERM
  1. Choose a loss \(L\) (e.g., squared error, cross-entropy).
  2. Initialise the model parameters.
  3. Apply gradient-based optimisation (or another solver) to minimise the empirical risk.
- k-Fold Cross-Validation
  1. Split the data into \(k\) equal folds.
  2. For each fold: train on the other \(k-1\) folds, validate on the held-out fold.
  3. Aggregate the validation scores → estimate of the generalisation error.
- Random Forest Construction
  1. For each tree, draw a bootstrap sample from the training data.
  2. At each split, consider a random subset of features.
  3. Grow the tree to full depth (or stop early).
  4. Aggregate the predictions (average for regression, majority vote for classification).
- Support Vector Machine Training
  1. Formulate the quadratic optimisation: maximise the margin while penalising misclassifications.
  2. If a non-linear boundary is needed, replace the inner product with a kernel function \(K(x_i, x_j)\).
  3. Solve for the support vectors → decision function.
- Neural Network Forward & Backward Pass
  1. Forward: compute activations layer by layer using weighted sums + non-linearities.
  2. Backward: compute gradients of the loss w.r.t. the weights (back-propagation) and update them via gradient descent.
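
To make the ERM recipe concrete, here is a minimal sketch (not part of the original outline): a linear model fitted by batch gradient descent on the squared-error loss. The synthetic data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Hypothetical data: a noisy linear relationship (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

# Empirical risk with squared-error loss: (1/n) * sum_i (y_i - f(x_i))^2
def empirical_risk(w, X, y):
    residuals = y - X @ w
    return np.mean(residuals ** 2)

# Batch gradient descent: every update uses the full training set.
w = np.zeros(3)
lr = 0.1
for step in range(500):
    grad = -2.0 / len(y) * X.T @ (y - X @ w)   # gradient of the empirical risk
    w -= lr * grad

print("learned weights:", w)                    # should be close to true_w
print("final empirical risk:", empirical_risk(w, X, y))
```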
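
The k-fold procedure might be run as follows with scikit-learn; the synthetic dataset, the decision-tree estimator, and \(k = 5\) are assumptions made purely for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Each observation lands in the validation fold exactly once.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print("per-fold accuracy:", np.round(scores, 3))
print("cross-validated estimate:", np.mean(scores))
```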
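
A hand-rolled version of the random-forest construction steps, assuming scikit-learn decision trees as base learners and an arbitrary choice of 25 trees; in practice one would simply use sklearn.ensemble.RandomForestClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (illustrative only).
X, y = make_classification(n_samples=400, n_features=12, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):                               # number of trees is arbitrary here
    # 1. Bootstrap sample: draw n rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Random feature subset at each split via max_features="sqrt".
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X[idx], y[idx])                      # 3. Grow each tree on its bootstrap sample.
    trees.append(tree)

# 4. Aggregate: majority vote across trees (classification).
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the hand-rolled forest:", (forest_pred == y).mean())
```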
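
Finally, a toy forward/backward pass: one hidden layer with tanh and a squared-error loss, with the architecture and hyperparameters chosen only to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # a non-linear target

# One hidden layer with tanh; squared-error loss (choices made for brevity).
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros((1, 1))
lr = 0.5

for step in range(2000):
    # Forward pass: weighted sums + non-linearity, layer by layer.
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: chain rule from the loss back to each weight matrix.
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_h = d_yhat @ W2.T * (1 - h ** 2)            # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)

    # Gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final training loss:", loss)
```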

🔍 Key Comparisons

- Supervised vs. Unsupervised – Supervised uses \((x, y)\) pairs → predict \(y\); unsupervised uses only \(x\) → discover patterns (clusters, low-dimensional structure).
- Decision Tree vs. Random Forest – A single tree has high variance and is prone to over-fitting; a random forest is an ensemble of trees → lower variance, better generalisation.
- Linear SVM vs. Kernel SVM – Linear: separates data with a straight hyperplane. Kernel: maps data to higher dimensions to handle non-linear separation.
- Batch vs. Stochastic Gradient Descent – Batch: uses the full dataset for each update → stable but slow. Stochastic: uses one (or a few) samples per update → faster, noisier updates.

⚠️ Common Misunderstandings

- "More features always improve performance." – Extra features can increase variance and cause over-fitting.
- "Higher training accuracy ⇒ better model." – It may indicate over-fitting; always check validation/test performance.
- "SVM outputs probabilities." – A base SVM is non-probabilistic; Platt scaling is needed for calibrated probabilities (see the sketch at the end of this guide).
- "Unsupervised learning needs no data preprocessing." – Scaling/normalisation is still crucial for algorithms like k-means or PCA.

🧠 Mental Models / Intuition

- Bias–Variance as a tug-of-war – Imagine a rubber band: pulling tighter (higher complexity) reduces bias but makes the model wobble (higher variance).
- Ensemble as "wisdom of the crowd" – Each tree gives a vote; the majority smooths out individual errors.
- Kernel Trick as "implicit zoom" – Rather than actually drawing a bigger picture, you ask a clever question ("are these two points similar?") that behaves as if you had zoomed in.

🚩 Exceptions & Edge Cases

- Non-i.i.d. data – PAC bounds and standard cross-validation assume independent, identically distributed samples; time-series data violate this.
- Imbalanced classes – Accuracy can be misleading; use ROC-AUC, precision–recall, or balanced accuracy.
- Very small datasets – Bootstrapping may over-estimate performance; prefer leave-one-out CV.

📍 When to Use Which

- Classification with a clear margin → Linear SVM (fast, interpretable).
- Non-linear boundaries + moderate data size → Kernel SVM or Random Forest.
- Large, high-dimensional data → Random Forest or Gradient-Boosted Trees (handle many features).
- Image / speech / raw signal → Deep Neural Networks (learn hierarchical features).
- Few labels, many unlabeled points → Semi-supervised learning (combine a small labeled set with a large unlabeled set).
- Need for interpretability → Decision trees, rule-based models, or linear models with regularisation.

👀 Patterns to Recognize

- Training loss ↓ while validation loss ↑ → over-fitting (stop early, add regularisation).
- High bias (both training and validation error high) → under-fitting (increase model capacity).
- Sparse weight vectors → likely L1 regularisation (feature selection).
- Sharp decision boundaries in feature space → possibly an SVM with high \(C\) (low regularisation).

🗂️ Exam Traps

- "SVM always provides probabilities." – Raw SVM outputs are scores; probabilities require Platt scaling.
- Confusing random-forest regression with random-forest classification – Regression averages numeric outputs; classification uses a majority vote.
- Assuming cross-validation eliminates all bias – CV still suffers from data leakage if preprocessing (e.g., scaling) is done before the split (see the pipeline sketch below).
- Mixing up "bias" (statistical) with "bias" (ethical) – In ML theory, bias means systematic error; in ethics, it refers to unfairness.
- Choosing a kernel SVM for huge datasets – The kernel matrix scales as \(O(n^2)\), which is computationally infeasible for large \(n\).
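
To see the "SVM outputs probabilities" trap in code, a short scikit-learn sketch (the synthetic data and RBF kernel are assumptions): a plain SVC exposes margin scores via decision_function, while probability=True fits a Platt-scaling calibrator on top of them.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# A plain SVM exposes decision-function scores, not probabilities.
svm = SVC(kernel="rbf").fit(X, y)
print("raw margin scores:", svm.decision_function(X[:3]))

# probability=True fits a Platt-scaling (sigmoid) calibrator on the scores.
svm_prob = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
print("calibrated probabilities:", svm_prob.predict_proba(X[:3]))
```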
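
And a sketch of avoiding the preprocessing-before-split leak: placing the scaler inside a pipeline means it is re-fitted on each training fold only, so no validation statistics leak into training. The dataset and estimator are again illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Wrong: scale the full dataset, then cross-validate (leaks fold statistics).
# Right: put the scaler inside the pipeline so it is fitted per training fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=5)
print("leak-free CV accuracy:", scores.mean())
```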