Recommender Systems Study Guide
📖 Core Concepts
Recommender System – an information‑filtering engine that suggests items most relevant to a user.
Collaborative Filtering (CF) – predicts preferences by exploiting similarity between users or items based on past interactions.
Content‑Based Filtering (CBF) – matches items to a user by comparing item attribute vectors with a user’s weighted feature profile.
Hybrid Recommender – combines two or more techniques (CF, CBF, etc.) to offset each other’s weaknesses (cold‑start, sparsity, lack of diversity).
Cold‑Start Problem – insufficient data for new users or new items, leading to unreliable recommendations.
Implicit vs. Explicit Feedback – implicit: observed behavior (views, clicks); explicit: direct ratings, rankings, or likes.
Evaluation Dimensions – accuracy (e.g., RMSE, precision), diversity, novelty, coverage, serendipity, trust.
Session‑Based Recommender – uses only the sequence of actions within the current session; no long‑term profile needed.
Two‑Tower Model – separate neural nets encode user and item features into a shared embedding space; similarity (dot‑product or cosine) drives ranking.
---
📌 Must Remember
CF Assumption: “People who agreed in the past will agree in the future.”
k‑Nearest‑Neighbor (k‑NN) and Pearson correlation are the classic similarity measures for memory‑based CF.
Matrix Factorization learns latent user/item factors; core for model‑based CF.
tf‑idf converts text/item attributes into weighted vectors for CBF.
Hybrid Strategies: weighted, switching, mixed, cascade, meta‑level.
Accuracy Metrics:
$ \text{MSE} = \frac{1}{N}\sum_i ( \hat{r}_i - r_i )^2 $
$ \text{RMSE} = \sqrt{\text{MSE}} $
Precision, Recall, DCG.
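The MSE/RMSE formulas above can be checked with a minimal Python sketch (the ratings are made-up toy values):

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between true and predicted ratings."""
    n = len(true_ratings)
    mse = sum((p - t) ** 2 for t, p in zip(true_ratings, predicted_ratings)) / n
    return math.sqrt(mse)

# Predictions off by 1, 0, and 2 stars -> sqrt((1 + 0 + 4) / 3) ≈ 1.291
print(rmse([4, 3, 5], [5, 3, 3]))
```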
Beyond Accuracy: higher diversity → better user satisfaction; novelty = unexpectedness; coverage = % of catalog ever recommended.
Cold‑Start Mitigation: multi‑armed bandits, hybridization, content features, meta‑level models.
Session‑Based Models often use RNNs or Transformers to capture order‑dependent signals.
---
🔄 Key Processes
Memory‑Based CF (User‑Based):
Build rating vector for target user.
Compute similarity (e.g., Pearson) to all other users.
Select top‑$k$ similar users (neighbors).
Aggregate their ratings (weighted by similarity) to predict missing items.
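The four steps above can be sketched in plain Python. This is a toy illustration, not a production implementation; the user dictionaries (`alice`, `others`) are hypothetical, and negative-similarity neighbours are dropped for simplicity:

```python
import math

def pearson(a, b):
    """Pearson correlation over the items two users have both rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = math.sqrt(sum((a[i] - ma) ** 2 for i in common)) \
        * math.sqrt(sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(target, others, item, k=2):
    """Similarity-weighted average of the top-k positively correlated neighbours."""
    sims = sorted(((pearson(target, o), o) for o in others if item in o),
                  key=lambda t: t[0], reverse=True)
    sims = [(s, o) for s, o in sims if s > 0][:k]
    num = sum(s * o[item] for s, o in sims)
    den = sum(s for s, _ in sims)
    return num / den if den else None

# Hypothetical user -> {item: rating} dictionaries
alice = {"A": 5, "B": 3, "C": 4}
others = [{"A": 4, "B": 2, "C": 5, "D": 4},
          {"A": 1, "B": 5, "D": 2}]
print(predict(alice, others, "D"))  # → 4.0 (only the first user is a positive neighbour)
```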
Model‑Based CF (Matrix Factorization):
Initialize latent factor matrices $U \in \mathbb{R}^{m \times f}$ (users) and $V \in \mathbb{R}^{n \times f}$ (items).
Minimize $ \sum_{(i,j)\in \mathcal{K}} (r_{ij} - U_i^\top V_j)^2 + \lambda (\|U_i\|^2 + \|V_j\|^2) $.
Use SGD or ALS to learn $U, V$.
Predict rating: $\hat{r}_{ij}=U_i^\top V_j$.
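A minimal SGD sketch of this objective, assuming a toy 3×3 rating matrix and illustrative hyperparameters (learning rate, regularization, and epoch count are not tuned):

```python
import random

def train_mf(ratings, m, n, f=2, lr=0.02, reg=0.1, epochs=1000, seed=0):
    """SGD on the regularized squared-error objective above.

    ratings: list of (user_index, item_index, rating) triples.
    Returns latent factor matrices U (m x f) and V (n x f) as nested lists.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(m)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n)]
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - sum(U[i][k] * V[j][k] for k in range(f))
            for k in range(f):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (err * v - reg * u)  # gradient step on U_i
                V[j][k] += lr * (err * u - reg * v)  # gradient step on V_j
    return U, V

# Toy observed ratings; the (2, 2) entry is missing and gets predicted
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 0, 1), (2, 1, 1)]
U, V = train_mf(data, m=3, n=3)
pred = sum(U[2][k] * V[2][k] for k in range(2))  # predicted r_{2,2}
```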
Content‑Based Recommendation:
Extract item attributes → vector $x_i$ (e.g., tf‑idf).
Build user profile $p_u = \sum_{i\in \text{liked}} w_i x_i$ (weights reflect preference strength).
Score candidate items by similarity $s(u,i)=\cos(p_u, x_i)$.
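These three steps can be sketched with a tiny tf‑idf implementation. The item descriptions are hypothetical, and the user profile is simplified to a single liked item with weight 1:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn token lists into sparse tf-idf dicts (idf = log(N / df))."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item descriptions (already tokenized)
items = [["space", "opera", "action"],
         ["space", "documentary"],
         ["romance", "drama"]]
x = tfidf_vectors(items)

# User profile p_u: weighted sum of liked-item vectors (here: item 0 only)
profile = x[0]
scores = [cosine(profile, xi) for xi in x]  # item 2 shares no terms -> score 0
```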
Hybrid Weighted Fusion:
Compute CF score $s_{\text{CF}}$ and CBF score $s_{\text{CB}}$.
Final score $s = \alpha s_{\text{CF}} + (1-\alpha) s_{\text{CB}}$, where $0\le\alpha\le1$ is tuned on validation data.
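As a sketch, the fusion step might look like this; the per-item scores and $\alpha = 0.6$ are illustrative values, not tuned ones:

```python
def hybrid_rank(cf_scores, cb_scores, alpha=0.6):
    """Blend per-item CF and CBF scores and return items best-first.

    Items missing from one component get a score of 0 from it.
    """
    fused = {i: alpha * cf_scores.get(i, 0.0) + (1 - alpha) * cb_scores.get(i, 0.0)
             for i in set(cf_scores) | set(cb_scores)}
    return sorted(fused, key=fused.get, reverse=True)

cf = {"A": 0.9, "B": 0.2}   # e.g. matrix-factorization scores
cb = {"B": 0.8, "C": 0.6}   # e.g. cosine similarity to the user profile
print(hybrid_rank(cf, cb))  # → ['A', 'B', 'C']
```

Note how item B, mediocre under CF alone, is lifted by its strong CBF score; this is exactly the compensation effect a weighted hybrid is meant to provide.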
Two‑Tower Retrieval:
Encode user features → $e_u$; encode item features → $e_i$.
Pre‑compute $e_i$ for all items and store in an ANN index.
At inference, retrieve top‑$k$ items with highest $e_u \cdot e_i$ (dot product).
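The retrieval step can be sketched with a brute-force scorer standing in for the ANN index (the embeddings below are hypothetical; a real system would use a library such as Faiss rather than an exhaustive scan):

```python
def top_k(user_emb, item_embs, k=2):
    """Brute-force stand-in for ANN retrieval: rank items by dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = sorted(item_embs.items(),
                    key=lambda kv: dot(user_emb, kv[1]), reverse=True)
    return [item for item, _ in scored[:k]]

# Hypothetical pre-computed embeddings from the two towers
e_u = [0.5, 1.0]
e_items = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [0.7, 0.7]}
print(top_k(e_u, e_items))  # → ['z', 'y']
```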
---
🔍 Key Comparisons
CF vs. CBF
Data needed: CF → user‑item interaction matrix; CBF → item attribute metadata.
Cold‑start: CF suffers; CBF handles new items if attributes exist.
Memory‑Based vs. Model‑Based CF
Scalability: Memory‑based requires $O(mn)$ similarity calculations; model‑based scales with latent dimension $f \ll \min(m,n)$.
Weighted Hybrid vs. Switching Hybrid
Weighted: always combines scores; Switching: picks one technique based on context (e.g., enough data → CF, otherwise CBF).
Session‑Based vs. Long‑Term CF
Session‑based: no user profile, captures short‑term intent; Long‑term CF: leverages historical preferences.
---
⚠️ Common Misunderstandings
“More data always improves CF.” → Sparsity can persist; adding noisy implicit signals may degrade performance.
“Content‑based equals diversity.” → Pure CBF often narrows to the same genre; diversity must be explicitly optimized.
“Offline RMSE guarantees higher click‑through in production.” → Offline accuracy often poorly correlates with real‑world engagement.
“Hybrid = just add the scores.” → Naïve addition can overweight a weak component; proper weighting or switching is essential.
---
🧠 Mental Models / Intuition
Similarity as “Neighbourhood” – imagine users/items plotted in a high‑dimensional space; the closer two points, the more likely they share taste.
Latent Factors as “Hidden Interests” – each dimension captures an abstract preference (e.g., “action movies”); users and items align on these hidden axes.
Hybrid as “Recipe” – think of CF as the base broth, CBF as spices; the final dish’s flavor (recommendations) depends on the right blend.
---
🚩 Exceptions & Edge Cases
Extreme Sparsity: When <1 % of the matrix is filled, even matrix factorization may over‑fit; consider adding side‑information (metadata) or using bandits.
Highly Dynamic Catalog: Fast‑changing item pool (news) → pre‑computing item embeddings may become stale; schedule frequent re‑training or use session‑based models.
Cold‑Start New Users with Rich Profiles: If demographic or social data are available, CBF or meta‑level hybrids can bypass pure CF cold‑start.
---
📍 When to Use Which
Cold‑Start New Item: Use content‑based (tf‑idf, metadata) or meta‑level hybrid that feeds item features into a CF model.
Large‑Scale Production (billions of items): Deploy two‑tower model with ANN retrieval for sub‑millisecond latency.
Short Sessions, No Profile: Choose session‑based RNN/Transformer; ignore long‑term CF history.
Need High Accuracy on Dense Rating Data: Prefer matrix factorization or neural collaborative filtering.
Goal: Increase Diversity/Serendipity: Apply cascade or mixed hybrid where a diversity‑oriented component re‑ranks CF results.
---
👀 Patterns to Recognize
“Long Tail” Rating Distribution → expect high sparsity; look for hybrid or side‑information solutions.
Sharp Drop in Precision after Top‑5 → may indicate over‑fitting to popular items; consider diversity/novelty regularization.
Session Click‑Stream Shows Repeated Category Switches → signals for a session‑based sequential model rather than static CF.
Consistently Low Offline RMSE but Flat CTR → suspect offline metrics mis‑aligned; need online A/B testing.
---
🗂️ Exam Traps
Confusing “implicit feedback” with “explicit rating.” Implicit signals are observed behavior; they are noisy and usually binary or count‑based.
Assuming k‑NN always outperforms matrix factorization. k‑NN scales poorly and suffers from sparsity; MF is usually stronger on large datasets.
Choosing “hybrid = weighted sum” without justification. The exam may ask for the reason why a particular hybrid strategy (e.g., cascade) is better for cold‑start.
Mixing up “novelty” and “diversity.” Novelty = new to the user; diversity = variety within the recommendation list.
Selecting “session‑based” for users with extensive histories. Session models shine when long‑term data is unavailable or irrelevant.
---