Data Mining Study Guide
📖 Core Concepts
Data Mining – The step in Knowledge Discovery in Databases (KDD) that automatically extracts useful patterns from large datasets using machine learning, statistics, and database techniques.
KDD Process – A pipeline: Selection → Pre‑processing → Transformation → Data Mining → Interpretation/Evaluation.
CRISP‑DM – A widely‑used, business‑oriented version of KDD: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment.
Task Types –
Classification – Assign a record to a predefined category (e.g., spam vs. legit).
Regression – Predict a continuous value by fitting a function that minimizes error.
Clustering – Group records by similarity without pre‑defined labels.
Association‑Rule Learning – Find frequent co‑occurrences (e.g., market‑basket “bread ⇒ butter”).
Anomaly Detection – Spot records that deviate markedly from the norm.
Summarization – Produce a compact description or visual report of the data.
📌 Must Remember
Data mining ≠ data collection or reporting – those belong to other KDD stages.
Overfitting – Model fits training data perfectly but fails on unseen data; always validate with a separate test set.
ROC Curve – Plots true‑positive rate vs. false‑positive rate; area under the curve (AUC) measures classification quality.
Privacy Risk – Aggregated, “anonymized” data can still re‑identify individuals.
Association Rule Metrics – Support (how often itemset appears) and Confidence (conditional probability).
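The two association-rule metrics above can be computed directly. A minimal sketch on an invented set of market-basket transactions, for the rule “bread ⇒ butter”:

```python
# Support and confidence for the rule {bread} => {butter}.
# The transactions are toy data invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)
count_both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = count_both / n               # how often the itemset appears overall
confidence = count_both / count_bread  # P(butter | bread)

print(support, confidence)  # -> 0.6 0.75
```

Note that the rule holds in 3 of 5 baskets (support 0.6) but only 3 of the 4 baskets containing bread (confidence 0.75) — the two metrics answer different questions.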
🔄 Key Processes
General KDD Pipeline
Selection – Pick relevant data sources.
Pre‑processing – Clean (remove noise, fill/mask missing values) and integrate data.
Transformation – Encode, normalize, or aggregate into mining‑ready format.
Data Mining – Run algorithms (e.g., decision tree, k‑means).
Interpretation/Evaluation – Convert patterns into actionable knowledge; use metrics (accuracy, ROC, silhouette).
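The five stages above can be sketched end to end in a few lines. This is a toy run on invented records (a 1-nearest-neighbour rule stands in for a real mining algorithm):

```python
# A toy KDD pipeline: select fields, clean missing values, normalize,
# mine with 1-nearest-neighbour, then evaluate. All values are invented.
raw = [
    {"id": 1, "amount": 20.0, "label": "legit"},
    {"id": 2, "amount": None, "label": "legit"},   # missing value
    {"id": 3, "amount": 95.0, "label": "fraud"},
    {"id": 4, "amount": 90.0, "label": "fraud"},
    {"id": 5, "amount": 30.0, "label": "legit"},
]

# Selection: keep only the fields relevant to mining.
data = [(r["amount"], r["label"]) for r in raw]

# Pre-processing: fill missing amounts with the mean of the known ones.
known = [a for a, _ in data if a is not None]
mean = sum(known) / len(known)
data = [(a if a is not None else mean, y) for a, y in data]

# Transformation: min-max normalize amounts into [0, 1].
lo, hi = min(a for a, _ in data), max(a for a, _ in data)
data = [((a - lo) / (hi - lo), y) for a, y in data]

# Data mining: predict a record's label from its nearest neighbour.
def predict(x, train):
    return min(train, key=lambda t: abs(t[0] - x))[1]

# Interpretation/Evaluation: leave-one-out accuracy.
hits = sum(predict(x, data[:i] + data[i + 1:]) == y
           for i, (x, y) in enumerate(data))
print(f"accuracy = {hits / len(data):.2f}")
```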
CRISP‑DM Cycle (iterative)
Start with Business Understanding → define success criteria.
Move through Data Understanding → Preparation → Modeling.
Evaluate against business goals; if unsatisfied, loop back to earlier steps.
Deploy the final model into production.
🔍 Key Comparisons
Classification vs. Regression
Classification: Discrete class labels (spam/legit).
Regression: Continuous output (price, temperature).
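The contrast is easiest to see on one shared feature. A minimal sketch with invented study-hours data: regression fits a least-squares line to a continuous score, while classification applies a rule that outputs a discrete label.

```python
# One feature, two task framings (toy numbers invented for illustration).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]          # regression target (continuous)
passed = [s >= 60 for s in scores]     # classification target (discrete)

# Regression: closed-form least-squares slope/intercept for score ~ hours.
n = len(hours)
mx, my = sum(hours) / n, sum(scores) / n
num = sum((x - mx) * (y - my) for x, y in zip(hours, scores))
den = sum((x - mx) ** 2 for x in hours)
slope = num / den
intercept = my - slope * mx

# Classification: a simple threshold rule on the same feature.
def classify(h, cutoff=2.5):
    return h > cutoff   # True = "pass", False = "fail"

print(round(slope, 2), round(intercept, 2))  # -> 4.5 46.9
```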
Clustering vs. Classification
Clustering: No prior labels; discovers natural groups.
Classification: Uses existing labeled examples to predict labels.
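To make the “no prior labels” point concrete, here is a minimal 1-D k-means sketch (k = 2) on invented unlabeled points — the two groups emerge from distances alone. It is a sketch only: initialization is naive and empty clusters are not handled.

```python
# Minimal 1-D k-means (k=2): clustering discovers the two groups
# without ever seeing a label. Points are invented toy data.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]          # naive initialization

for _ in range(10):                        # fixed iteration budget
    groups = [[], []]
    for p in points:
        # bool index: False (0) = closer to centers[0], True (1) = centers[1]
        groups[abs(p - centers[0]) > abs(p - centers[1])].append(p)
    centers = [sum(g) / len(g) for g in groups]

print(centers)  # -> [1.5, 10.5]
```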
Anomaly Detection vs. Outlier Removal
Anomaly Detection: Intentional search for rare but potentially valuable cases (fraud).
Outlier Removal: Discard noisy points to improve model quality.
⚠️ Common Misunderstandings
“Data mining = data analysis.” – Mining focuses on automated pattern discovery in massive data; analysis tests hypotheses on data of any size.
“High support automatically means a good rule.” – A rule can be frequent but have low confidence; both metrics matter.
“If a model scores >90 % accuracy, it’s perfect.” – Accuracy can be misleading on imbalanced data; check ROC/AUC and confusion matrix.
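The accuracy trap is easy to demonstrate. On this invented imbalanced sample, a model that always predicts the majority class scores 95% accuracy while catching zero fraud cases:

```python
# Why accuracy misleads on imbalanced data: the "always-majority" model
# looks excellent by accuracy yet has zero recall on the rare class.
actual = ["fraud"] * 5 + ["legit"] * 95    # invented imbalanced sample
predicted = ["legit"] * 100                # predict majority class always

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_pos = sum(a == p == "fraud" for a, p in zip(actual, predicted))
recall = true_pos / actual.count("fraud")

print(accuracy, recall)  # -> 0.95 0.0
```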
🧠 Mental Models / Intuition
“Fishing vs. Harvesting” – Data mining is like casting a net (automated, broad) to harvest patterns; traditional analysis is like a fisherman targeting a specific species (hypothesis‑driven).
“Training vs. Test as a ‘final exam.’” – Think of the test set as a surprise exam that the model never saw; only passing it proves true learning.
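The “surprise exam” is set up by holding out rows before training. A hand-rolled 80/20 split on stand-in data (no libraries assumed beyond the standard library):

```python
# Hold-out split: the model may only be evaluated on rows it never saw.
import random

rows = list(range(100))            # stand-ins for dataset records
random.seed(0)                     # fixed seed for a reproducible split
random.shuffle(rows)

split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]

assert not set(train) & set(test)  # no leakage between the two sets
print(len(train), len(test))       # -> 80 20
```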
🚩 Exceptions & Edge Cases
Small Sample Size – Using mining algorithms on tiny datasets leads to data dredging (spurious significance).
Highly Imbalanced Classes – Accuracy inflates; use precision, recall, or AUC instead.
Non‑stationary Data – Patterns may drift over time (e.g., fraud tactics); models need periodic re‑training.
📍 When to Use Which
Classification – When you have labeled examples and need categorical decisions (spam filter, credit‑risk).
Regression – When predicting a numeric outcome (sales forecast, house price).
Clustering – When you lack labels and want to explore natural groupings (customer segmentation).
Association Rules – When you want to discover co‑occurring items (market‑basket, cross‑selling).
Anomaly Detection – When rare, high‑impact events matter (fraud, intrusion detection).
Summarization – When stakeholders need a quick, visual overview rather than raw results.
👀 Patterns to Recognize
High support + low confidence → Rule is common but not predictive.
Sharp drop in ROC curve → Model struggles with a specific threshold region.
Cluster silhouette close to 1 → Well‑separated groups; silhouette near 0 → overlapping clusters.
Sudden spike in error on validation set → Possible overfitting or data leakage.
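The silhouette pattern above can be checked by hand for a single point, using s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A sketch on invented 1-D clusters:

```python
# Silhouette for one point: close to 1 means well separated from the
# nearest other cluster. Both clusters are invented 1-D toy data.
own = [1.0, 1.2, 0.8]      # the point's own cluster (point = own[0])
other = [9.0, 9.5, 10.0]   # nearest other cluster

p = own[0]
a = sum(abs(p - q) for q in own[1:]) / (len(own) - 1)  # mean intra-cluster dist
b = sum(abs(p - q) for q in other) / len(other)        # mean dist to other cluster
s = (b - a) / max(a, b)
print(round(s, 3))  # -> 0.976 (well-separated)
```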
🗂️ Exam Traps
Distractor: “Data mining includes data collection.” – Wrong; collection is a separate KDD step.
Distractor: “Anomaly detection always improves model accuracy.” – Misleading; it may introduce noise if anomalies are mislabeled.
Distractor: “High support alone guarantees a useful association rule.” – Ignoring confidence leads to weak predictive power.
Distractor: “If a model overfits, adding more features will fix it.” – Actually, regularization or more data is needed; extra features often worsen overfit.