Notes from A Programmer's Guide to Data Mining

Chapter 2: Collaborative filtering

I like what you like

1. If your data is dense (almost all attributes have non-zero values) and the magnitude of the attribute values is important, use distance measures such as Euclidean or Manhattan.

  • Manhattan Distance -> Fast computation
  • Euclidean Distance (Pythagorean Theorem) -> Slow computation
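Both distance measures can be sketched in a few lines of plain Python; this is a minimal illustration assuming the users' ratings are equal-length lists of numbers (the function names are my own):

```python
from math import sqrt

def manhattan(x, y):
    # Sum of absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Straight-line distance via the Pythagorean theorem.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Manhattan avoids the squaring and square root, which is why it is the faster of the two.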

Minkowski Distance Metric

d(x, y) = ( Σₖ |xₖ − yₖ|ʳ )^(1/r)

When
r = 1: The formula is Manhattan Distance
r = 2: The formula is Euclidean Distance
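The general metric is a one-liner; a minimal sketch assuming equal-length lists (the function name is my own):

```python
def minkowski(x, y, r):
    # Generalized distance: r=1 gives Manhattan, r=2 gives Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)
```

Larger values of r increasingly emphasize the dimension with the biggest difference.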

2. If the data is subject to grade-inflation (different users may be using different scales) use Pearson's R.
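Pearson's r can be computed in a single pass with the common sum-of-products form; this is a minimal sketch assuming two equal-length rating lists with no missing values (the function name is my own):

```python
from math import sqrt

def pearson(x, y):
    # Single-pass form of the Pearson correlation coefficient.
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = sum_xy - (sum_x * sum_y) / n
    den = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return num / den if den != 0 else 0.0
```

Because r compares how ratings co-vary rather than their absolute values, two users who rank items in the same order score close to 1 even if one of them rates everything a couple of points higher.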

3. If the data is sparse, consider using Cosine Similarity.

cos(x, y) = (x · y) / (‖x‖ × ‖y‖)

where · indicates the dot product and ‖x‖ indicates the length of the vector x, calculated as ‖x‖ = √(Σ xᵢ²).
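The formula maps directly to code; a minimal sketch assuming two equal-length vectors as Python lists (the function name is my own):

```python
from math import sqrt

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    len_x = sqrt(sum(a * a for a in x))
    len_y = sqrt(sum(b * b for b in y))
    return dot / (len_x * len_y)
```

Because zero-valued attributes contribute nothing to the dot product, cosine similarity effectively ignores the many unrated items in sparse data.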

Relying on a single “most similar” person is a problem: any quirk that person has gets passed on as a recommendation. Instead, we can use K-nearest neighbor.

K-nearest neighbor

We use k most similar people to determine recommendations. The best value for k is application specific—you will need to do some experimentation.
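A minimal sketch of the idea, using Manhattan distance over co-rated items and a plain average of the k neighbors' ratings (the book's version weights neighbors by similarity; the function names and the dict-of-dicts ratings layout here are my own assumptions):

```python
def manhattan(r1, r2):
    # Distance over the items both users have rated.
    return sum(abs(r1[item] - r2[item]) for item in r1 if item in r2)

def knn_recommend(target, ratings, k=2):
    # Find the k users closest to the target.
    neighbours = sorted(
        (u for u in ratings if u != target),
        key=lambda u: manhattan(ratings[target], ratings[u]))[:k]
    # Collect the neighbours' ratings for items the target hasn't seen.
    seen = ratings[target]
    scores = {}
    for u in neighbours:
        for item, rating in ratings[u].items():
            if item not in seen:
                scores.setdefault(item, []).append(rating)
    # Average each unseen item's ratings, best candidates first.
    return sorted(((item, sum(v) / len(v)) for item, v in scores.items()),
                  key=lambda pair: -pair[1])
```

Raising k smooths out individual quirks but dilutes the influence of the truly closest matches, which is why the best k has to be found by experiment.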