Recommender system - Collaborative Filtering In-Depth
Understand collaborative filtering assumptions, memory‑ vs model‑based approaches, and cold‑start mitigation techniques.
Summary
Collaborative Filtering: Building Recommendations from User Behavior
Introduction
Collaborative filtering is a recommendation system technique that generates personalized suggestions by analyzing patterns in how users rate or interact with items. Rather than analyzing what makes an item appealing (content-based approaches), collaborative filtering focuses on the principle that if two users agreed on items in the past, they will likely agree on new items in the future.
Think of it this way: if you and another user both loved the same movies, games, and books, then a movie that other user enjoyed but you haven't seen yet is probably worth recommending to you. The system doesn't need to know why you both liked those items—it only needs to recognize that your tastes align.
This approach is powerful because it requires no knowledge of item features and works across any domain, from movies to music to games to books. However, it does require user behavior data to function.
Core Assumptions and How Collaborative Filtering Works
Collaborative filtering rests on two key assumptions:
First, users who have agreed in the past (rated items similarly) will continue to agree in the future. If you and I both gave five stars to the same book, we probably have compatible tastes.
Second, users will like items in the future that they liked before. If you loved action movies last year, you'll probably enjoy new action movies released this year.
Based on these assumptions, the system operates by identifying peer users (other users with similar rating patterns) or similar items (items with similar rating histories from users). Once these neighbors are found, recommendations are generated from items that your peers liked but you haven't yet encountered.
Explicit vs. Implicit Feedback: How We Learn User Preferences
Before collaborative filtering can work, the system must collect data about what users like. This happens in two fundamentally different ways.
Explicit feedback comes directly from users expressing their opinions:
Asking users to rate items on a numerical scale (1-5 stars)
Having users rank collections of items
Asking users to choose between two options
Allowing users to create lists of favorite items
Explicit feedback is precise and unambiguous. When a user gives a 5-star rating, the system knows exactly how much they liked something.
Implicit feedback comes from observing user behavior without requiring explicit judgments:
Tracking which items users view or click on
Measuring time spent on items (longer viewing = more interest)
Recording purchase histories
Logging media consumption (what songs were played, what videos were watched)
Analyzing social network activity and follows
Implicit feedback is noisier—clicking on an item doesn't necessarily mean someone liked it (maybe they clicked by accident, or they were reviewing it to critique it)—but it's abundant and requires no extra effort from users.
In practice, many systems combine both types. For example, a music streaming platform might use implicit data (what users actually listened to) combined with explicit ratings when available.
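One common way to fold implicit events into a single preference signal is a weighted sum per (user, item) pair. A minimal sketch, where the event names and weights are illustrative assumptions rather than standard values:

```python
# Illustrative weights for turning implicit events into one preference
# score; real systems tune these against held-out explicit feedback.
EVENT_WEIGHTS = {"click": 1.0, "play": 2.0, "purchase": 5.0}

def implicit_score(events):
    """Sum weighted implicit events for a single (user, item) pair."""
    return sum(EVENT_WEIGHTS.get(event, 0.0) for event in events)

implicit_score(["click", "click", "purchase"])  # → 7.0
```

The resulting scores can then be fed into collaborative filtering in place of (or alongside) explicit ratings.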
Two Fundamental Approaches: Memory-Based and Model-Based
Collaborative filtering systems are typically categorized into two broad approaches based on how they compute recommendations.
Memory-based collaborative filtering (also called neighborhood-based) works by directly comparing user or item rating vectors in their original form. The most common memory-based algorithm is the user-based approach:
To recommend items to User A, find other users with similar rating histories
Look at items that these similar users rated highly
Recommend those items to User A
This approach is simple and interpretable—you can explain recommendations by saying "users like you also enjoyed this item"—but it requires storing and comparing many user profiles.
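The three user-based steps above can be sketched with plain Python dictionaries. The toy users, items, and similarity weighting are invented for illustration, assuming cosine similarity over co-rated items:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity over the co-rated items of two {item: rating} dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def recommend(target, ratings, k=2):
    """Recommend unseen items to `target`, weighted by neighbor similarity."""
    # Step 1: find the k users most similar to the target
    sims = sorted(
        ((cosine_sim(ratings[target], r), u)
         for u, r in ratings.items() if u != target),
        reverse=True,
    )[:k]
    # Steps 2-3: score items the neighbors rated that the target hasn't seen
    scores = {}
    for sim, u in sims:
        for item, rating in ratings[u].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "alice": {"m1": 5, "m2": 4},
    "bob":   {"m1": 5, "m2": 5, "m3": 4},
    "carol": {"m1": 1, "m3": 5},
}
recommend("alice", ratings)  # → ["m3"]
```

A production system would normalize by the sum of similarities and handle users with no overlapping ratings, but the shape of the computation is the same.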
Model-based collaborative filtering learns a compressed representation of user and item characteristics through a machine learning model. The most popular model-based approach is matrix factorization:
Represent users and items as vectors in a lower-dimensional latent factor space
Learn these vectors from the existing rating data
Use the learned vectors to predict ratings for user-item pairs that users haven't encountered
Matrix factorization is more computationally efficient at prediction time and often produces better recommendations because the learned factors capture hidden patterns that users and items share. However, the learned factors are harder to interpret—you can't always explain why a recommendation was made.
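The three steps above can be sketched with plain stochastic gradient descent. This is a toy implementation under illustrative hyperparameters (k, learning rate, regularization are invented for the example), not a production recipe:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=2000):
    """Learn latent user/item vectors by SGD on observed (u, i, r) triples."""
    random.seed(0)  # deterministic toy run
    P = [[random.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # step on user factor
                Q[i][f] += lr * (err * pu - reg * qi)  # step on item factor
    return P, Q

# Observed (user, item, rating) triples for a tiny 3x3 rating matrix
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 5)]
P, Q = factorize(data, n_users=3, n_items=3)
# Predict the unobserved rating of user 0 on item 2 from the learned vectors:
prediction = sum(P[0][f] * Q[2][f] for f in range(2))
```

The learned `P[u]` and `Q[i]` vectors are the latent factors; predicting any user-item pair is just a dot product, which is what makes prediction cheap once training is done.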
Common Algorithms for Computing Similarity
Both memory-based and model-based approaches need to measure how similar two users or items are. Two tools appear frequently: a neighbor-selection algorithm (k-NN) and a similarity metric (Pearson correlation).
The k-Nearest Neighbors (k-NN) algorithm identifies the most similar users or items based on a distance metric. The algorithm:
Computes the distance (or similarity) between a target user/item and all others
Selects the k nearest neighbors (where k is typically 10-50)
Uses their preferences to generate recommendations
k-NN is intuitive and works well in practice, but comparing a user against all others is computationally expensive for large systems.
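Once pairwise similarities are computed, selecting the k nearest neighbors is a top-k scan. A minimal sketch (the user IDs and scores are invented):

```python
import heapq

def top_k_neighbors(sims, k=10):
    """Pick the k most similar users from a {user_id: similarity} map.

    The scan over every candidate is the O(N) cost that makes exact
    k-NN expensive on large user bases.
    """
    return heapq.nlargest(k, sims.items(), key=lambda kv: kv[1])

sims = {"u1": 0.9, "u2": 0.4, "u3": 0.7, "u4": 0.1}
top_k_neighbors(sims, k=2)  # → [('u1', 0.9), ('u3', 0.7)]
```

Large systems avoid the full scan with approximate nearest-neighbor indexes, trading a little accuracy for large speedups.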
Pearson Correlation Coefficient quantifies linear similarity between two rating vectors. If User A and User B have given consistent ratings to items they've both encountered, their Pearson correlation will be high. This metric is useful because it accounts for the fact that some users might always give high ratings while others are more critical—what matters is whether they rate items consistently relative to each other.
Mathematically, the Pearson correlation between two users' ratings is:
$$r_{u,v} = \frac{\sum_{i} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i}(r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i}(r_{v,i} - \bar{r}_v)^2}}$$
where $r_{u,i}$ is user u's rating of item i, $\bar{r}_u$ is user u's average rating, and the sums run over the items both users have rated.
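A direct translation of this formula, assuming each user's mean is taken over the co-rated items (a common convention in collaborative filtering):

```python
import math

def pearson(u, v):
    """Pearson correlation over the items both users have rated."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0  # not enough overlap to measure correlation
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    du = math.sqrt(sum((u[i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((v[i] - mv) ** 2 for i in common))
    if du == 0 or dv == 0:
        return 0.0  # a user who rates everything identically
    return num / (du * dv)

# A generous rater and a critical rater who agree in relative terms:
a = {"m1": 5, "m2": 4, "m3": 5}
b = {"m1": 3, "m2": 1, "m3": 3}
pearson(a, b)  # → 1.0 despite very different absolute ratings
```

The mean-centering is what lets the metric ignore each user's rating scale and focus on relative agreement.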
The Cold Start Problem and Multi-Armed Bandits
Collaborative filtering has a critical weakness: the cold start problem. New users have provided no ratings (or very few), so the system cannot find similar users to make recommendations from. Similarly, new items have no ratings from any users, making them invisible to recommendation algorithms.
One effective solution to cold start is the multi-armed bandit algorithm, which balances two competing goals:
Exploitation: recommending items that match known user preferences (using collaborative filtering as normal)
Exploration: occasionally recommending new items to learn whether the user likes them
The bandit algorithm gradually learns which new items are genuinely good by recommending them to small groups of users. If those users rate the new item highly, it becomes available for recommendation to similar users. If the new item receives poor ratings, the system stops promoting it.
This approach is called a "bandit" because it resembles a gambler choosing between slot machines (arms) to maximize rewards while still trying new machines occasionally. The algorithm mathematically balances the risk of wasting recommendations on bad items against the benefit of discovering valuable new items.
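One simple bandit strategy is epsilon-greedy: with probability epsilon, explore a random item; otherwise exploit the item with the best observed average reward. A sketch under illustrative assumptions (binary "user liked it" rewards, invented item names, epsilon chosen arbitrarily):

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy bandit over a fixed set of items (arms)."""

    def __init__(self, items, epsilon=0.1):
        self.items = list(items)
        self.epsilon = epsilon
        self.counts = {i: 0 for i in self.items}
        self.totals = {i: 0.0 for i in self.items}

    def select(self):
        # Explore: occasionally try a random item (e.g. a new, unrated one).
        if random.random() < self.epsilon:
            return random.choice(self.items)
        # Exploit: pick the item with the best observed average reward.
        return max(self.items,
                   key=lambda i: self.totals[i] / self.counts[i]
                   if self.counts[i] else 0.0)

    def update(self, item, reward):
        """Record the user's response to a recommended item."""
        self.counts[item] += 1
        self.totals[item] += reward

bandit = EpsilonGreedy(["old_hit", "new_item"], epsilon=0.1)
bandit.update("old_hit", 1.0)   # known good item
bandit.update("new_item", 0.0)  # new item got a poor first response
```

Smarter variants (UCB, Thompson sampling) replace the fixed epsilon with an uncertainty-aware exploration rule, but the exploit/explore split is the same.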
<extrainfo>
Other cold start solutions include:
Content-based hybrid approaches: using item features (genre, director, etc.) to recommend items similar to ones a new user rated highly
Contextual bandit algorithms: using user context (location, time of day, device type) to make smarter exploration decisions
Social recommendations: using a new user's social network and trust relationships to find suitable peer users
</extrainfo>
Flashcards
What is the core assumption of collaborative filtering regarding user preferences?
Users who agreed in the past will agree in the future and like similar items.
How are recommendations generated in collaborative filtering systems?
By locating peer users or items with similar rating histories.
What technique is often employed in model-based collaborative filtering to learn latent factors?
Matrix factorization.
Which algorithm measures similarity between users or items based on their nearest neighbors?
k-nearest-neighbor (k-NN) algorithm.
What metric is used to quantify the linear similarity between rating vectors?
Pearson correlation coefficient.
Which algorithm mitigates the cold start problem by balancing exploration and exploitation?
Multi-armed bandit algorithm.
Quiz
Recommender system - Collaborative Filtering In-Depth Quiz Question 1: Which of the following is an example of explicit data collection for a recommender system?
- Asking users to rate items (correct)
- Logging how long a user views an item
- Tracking items a user purchases
- Observing a user's social‑network activity
Question 2: Which algorithm measures similarity by identifying the k nearest neighbors?
- k‑nearest‑neighbor (k‑NN) algorithm (correct)
- Pearson correlation coefficient
- Matrix factorization
- Decision tree classifier
Question 3: Which algorithm is commonly used to mitigate cold‑start by balancing exploration and exploitation?
- Multi‑armed bandit algorithm (correct)
- k‑nearest‑neighbor (k‑NN) algorithm
- Pearson correlation coefficient
- Matrix factorization
Question 4: Which method is commonly employed in model‑based collaborative filtering to learn latent representations of users and items?
- Matrix factorization (correct)
- User‑based nearest‑neighbor clustering
- Content similarity scoring
- Random item selection
Key Concepts
Collaborative Filtering Techniques
Collaborative Filtering
Neighborhood Methods
Memory‑Based Collaborative Filtering
Model‑Based Collaborative Filtering
k‑Nearest Neighbor (k‑NN) Algorithm
Data Collection Methods
Explicit Data Collection
Implicit Data Collection
Recommendation Algorithms
Matrix Factorization
Pearson Correlation Coefficient
Multi‑Armed Bandit Algorithm
Definitions
Collaborative Filtering
A recommendation technique that predicts user preferences based on the preferences of similar users or items.
Neighborhood Methods
Approaches that generate recommendations by identifying peer users or items with similar rating histories.
Memory‑Based Collaborative Filtering
A user- or item-based method that directly compares rating vectors to compute similarity.
Model‑Based Collaborative Filtering
Techniques that learn latent factors (e.g., via matrix factorization) to predict preferences.
Matrix Factorization
A mathematical decomposition that represents users and items in a lower‑dimensional latent space for recommendation.
Explicit Data Collection
Gathering user feedback through direct actions such as ratings, rankings, or liked‑item lists.
Implicit Data Collection
Inferring preferences from observed user behavior like views, clicks, purchases, or social activity.
k‑Nearest Neighbor (k‑NN) Algorithm
A similarity‑based method that finds the k most similar users or items to make recommendations.
Pearson Correlation Coefficient
A statistical measure of linear similarity between two rating vectors.
Multi‑Armed Bandit Algorithm
An exploration‑exploitation strategy used to address the cold‑start problem by balancing new item trials with known preferences.