Data Mining Study Guide
📖 Core Concepts
Data Mining – The step in Knowledge Discovery in Databases (KDD) that automatically extracts useful patterns from large datasets using machine learning, statistics, and database techniques.
KDD Process – A pipeline: Selection → Pre‑processing → Transformation → Data Mining → Interpretation/Evaluation.
CRISP‑DM – A widely‑used, business‑oriented version of KDD: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment.
Task Types –
Classification – Assign a record to a predefined category (e.g., spam vs. legit).
Regression – Predict a continuous value by fitting a function that minimizes error.
Clustering – Group records by similarity without pre‑defined labels.
Association‑Rule Learning – Find frequent co‑occurrences (e.g., market‑basket “bread ⇒ butter”).
Anomaly Detection – Spot records that deviate markedly from the norm.
Summarization – Produce a compact description or visual report of the data.
📌 Must Remember
Data mining ≠ data collection or reporting – those belong to other KDD stages.
Overfitting – Model fits training data perfectly but fails on unseen data; always validate with a separate test set.
ROC Curve – Plots true‑positive rate vs. false‑positive rate; area under the curve (AUC) measures classification quality.
Privacy Risk – Aggregated, “anonymized” data can still re‑identify individuals.
Association Rule Metrics – Support (how often itemset appears) and Confidence (conditional probability).
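The two association-rule metrics above can be computed directly. A minimal sketch on an invented set of market-basket transactions, for the rule “bread ⇒ butter”:

```python
# Support and confidence for the rule {bread} => {butter}.
# The transactions are toy data invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)
count_both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = count_both / n               # how often the itemset appears overall
confidence = count_both / count_bread  # P(butter | bread)

print(support, confidence)  # -> 0.6 0.75
```

Note that the rule holds in 3 of 5 baskets (support 0.6) but only 3 of the 4 baskets containing bread (confidence 0.75) — the two metrics answer different questions.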
🔄 Key Processes
General KDD Pipeline
Selection – Pick relevant data sources.
Pre‑processing – Clean (remove noise, fill/mask missing values) and integrate data.
Transformation – Encode, normalize, or aggregate into mining‑ready format.
Data Mining – Run algorithms (e.g., decision tree, k‑means).
Interpretation/Evaluation – Convert patterns into actionable knowledge; use metrics (accuracy, ROC, silhouette).
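The five stages above can be sketched end to end in a few lines. This is a toy run on invented records (a 1-nearest-neighbour rule stands in for a real mining algorithm):

```python
# A toy KDD pipeline: select fields, clean missing values, normalize,
# mine with 1-nearest-neighbour, then evaluate. All values are invented.
raw = [
    {"id": 1, "amount": 20.0, "label": "legit"},
    {"id": 2, "amount": None, "label": "legit"},   # missing value
    {"id": 3, "amount": 95.0, "label": "fraud"},
    {"id": 4, "amount": 90.0, "label": "fraud"},
    {"id": 5, "amount": 30.0, "label": "legit"},
]

# Selection: keep only the fields relevant to mining.
data = [(r["amount"], r["label"]) for r in raw]

# Pre-processing: fill missing amounts with the mean of the known ones.
known = [a for a, _ in data if a is not None]
mean = sum(known) / len(known)
data = [(a if a is not None else mean, y) for a, y in data]

# Transformation: min-max normalize amounts into [0, 1].
lo, hi = min(a for a, _ in data), max(a for a, _ in data)
data = [((a - lo) / (hi - lo), y) for a, y in data]

# Data mining: predict a record's label from its nearest neighbour.
def predict(x, train):
    return min(train, key=lambda t: abs(t[0] - x))[1]

# Interpretation/Evaluation: leave-one-out accuracy.
hits = sum(predict(x, data[:i] + data[i + 1:]) == y
           for i, (x, y) in enumerate(data))
print(f"accuracy = {hits / len(data):.2f}")
```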
CRISP‑DM Cycle (iterative)
Start with Business Understanding → define success criteria.
Move through Data Understanding → Preparation → Modeling.
Evaluate against business goals; if unsatisfied, loop back to earlier steps.
Deploy the final model into production.
🔍 Key Comparisons
Classification vs. Regression
Classification: Discrete class labels (spam/legit).
Regression: Continuous output (price, temperature).
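The contrast is easiest to see on one shared feature. A minimal sketch with invented study-hours data: regression fits a least-squares line to a continuous score, while classification applies a rule that outputs a discrete label.

```python
# One feature, two task framings (toy numbers invented for illustration).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]          # regression target (continuous)
passed = [s >= 60 for s in scores]     # classification target (discrete)

# Regression: closed-form least-squares slope/intercept for score ~ hours.
n = len(hours)
mx, my = sum(hours) / n, sum(scores) / n
num = sum((x - mx) * (y - my) for x, y in zip(hours, scores))
den = sum((x - mx) ** 2 for x in hours)
slope = num / den
intercept = my - slope * mx

# Classification: a simple threshold rule on the same feature.
def classify(h, cutoff=2.5):
    return h > cutoff   # True = "pass", False = "fail"

print(round(slope, 2), round(intercept, 2))  # -> 4.5 46.9
```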
Clustering vs. Classification
Clustering: No prior labels; discovers natural groups.
Classification: Uses existing labeled examples to predict labels.
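To make the “no prior labels” point concrete, here is a minimal 1-D k-means sketch (k = 2) on invented unlabeled points — the two groups emerge from distances alone. It is a sketch only: initialization is naive and empty clusters are not handled.

```python
# Minimal 1-D k-means (k=2): clustering discovers the two groups
# without ever seeing a label. Points are invented toy data.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]          # naive initialization

for _ in range(10):                        # fixed iteration budget
    groups = [[], []]
    for p in points:
        # bool index: False (0) = closer to centers[0], True (1) = centers[1]
        groups[abs(p - centers[0]) > abs(p - centers[1])].append(p)
    centers = [sum(g) / len(g) for g in groups]

print(centers)  # -> [1.5, 10.5]
```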
Anomaly Detection vs. Outlier Removal
Anomaly Detection: Intentional search for rare but potentially valuable cases (fraud).
Outlier Removal: Discard noisy points to improve model quality.
⚠️ Common Misunderstandings
“Data mining = data analysis.” – Mining focuses on automated pattern discovery in massive data; analysis tests hypotheses on data of any size.
“High support automatically means a good rule.” – A rule can be frequent but have low confidence; both metrics matter.
“If a model scores >90 % accuracy, it’s perfect.” – Accuracy can be misleading on imbalanced data; check ROC/AUC and confusion matrix.
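The accuracy trap is easy to demonstrate. On this invented imbalanced sample, a model that always predicts the majority class scores 95% accuracy while catching zero fraud cases:

```python
# Why accuracy misleads on imbalanced data: the "always-majority" model
# looks excellent by accuracy yet has zero recall on the rare class.
actual = ["fraud"] * 5 + ["legit"] * 95    # invented imbalanced sample
predicted = ["legit"] * 100                # predict majority class always

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_pos = sum(a == p == "fraud" for a, p in zip(actual, predicted))
recall = true_pos / actual.count("fraud")

print(accuracy, recall)  # -> 0.95 0.0
```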
🧠 Mental Models / Intuition
“Fishing vs. Harvesting” – Data mining is like casting a net (automated, broad) to harvest patterns; traditional analysis is like a fisherman targeting a specific species (hypothesis‑driven).
“Training vs. Test as a ‘final exam.’” – Think of the test set as a surprise exam that the model never saw; only passing it proves true learning.
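The “surprise exam” is set up by holding out rows before training. A hand-rolled 80/20 split on stand-in data (no libraries assumed beyond the standard library):

```python
# Hold-out split: the model may only be evaluated on rows it never saw.
import random

rows = list(range(100))            # stand-ins for dataset records
random.seed(0)                     # fixed seed for a reproducible split
random.shuffle(rows)

split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]

assert not set(train) & set(test)  # no leakage between the two sets
print(len(train), len(test))       # -> 80 20
```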
🚩 Exceptions & Edge Cases
Small Sample Size – Using mining algorithms on tiny datasets leads to data dredging (spurious significance).
Highly Imbalanced Classes – Accuracy inflates; use precision, recall, or AUC instead.
Non‑stationary Data – Patterns may drift over time (e.g., fraud tactics); models need periodic re‑training.
📍 When to Use Which
Classification – When you have labeled examples and need categorical decisions (spam filter, credit‑risk).
Regression – When predicting a numeric outcome (sales forecast, house price).
Clustering – When you lack labels and want to explore natural groupings (customer segmentation).
Association Rules – When you want to discover co‑occurring items (market‑basket, cross‑selling).
Anomaly Detection – When rare, high‑impact events matter (fraud, intrusion detection).
Summarization – When stakeholders need a quick, visual overview rather than raw results.
👀 Patterns to Recognize
High support + low confidence → Rule is common but not predictive.
Sharp drop in ROC curve → Model struggles with a specific threshold region.
Cluster silhouette close to 1 → Well‑separated groups; silhouette near 0 → overlapping clusters.
Sudden spike in error on validation set → Possible overfitting or data leakage.
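The silhouette pattern above can be checked by hand for a single point, using s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A sketch on invented 1-D clusters:

```python
# Silhouette for one point: close to 1 means well separated from the
# nearest other cluster. Both clusters are invented 1-D toy data.
own = [1.0, 1.2, 0.8]      # the point's own cluster (point = own[0])
other = [9.0, 9.5, 10.0]   # nearest other cluster

p = own[0]
a = sum(abs(p - q) for q in own[1:]) / (len(own) - 1)  # mean intra-cluster dist
b = sum(abs(p - q) for q in other) / len(other)        # mean dist to other cluster
s = (b - a) / max(a, b)
print(round(s, 3))  # -> 0.976 (well-separated)
```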
🗂️ Exam Traps
Distractor: “Data mining includes data collection.” – Wrong; collection is a separate KDD step.
Distractor: “Anomaly detection always improves model accuracy.” – Misleading; it may introduce noise if anomalies are mislabeled.
Distractor: “High support alone guarantees a useful association rule.” – Ignoring confidence leads to weak predictive power.
Distractor: “If a model overfits, adding more features will fix it.” – Actually, regularization or more data is needed; extra features often worsen overfit.