Notes from A Programmer's Guide to Data Mining

Chapter 2: Collaborative filtering

I like what you like

1. If your data is dense (almost all attributes have non-zero values) and the magnitude of the attribute values is important, use distance measures such as Euclidean or Manhattan.

  • Manhattan Distance -> Fast computation
  • Euclidean Distance (Pythagorean Theorem) -> Slow computation
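Both distance measures can be sketched in a few lines of plain Python; this is a minimal illustration assuming the users' ratings are equal-length lists of numbers (the function names are my own):

```python
from math import sqrt

def manhattan(x, y):
    # Sum of absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Straight-line distance via the Pythagorean theorem.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Manhattan avoids the squaring and square root, which is why it is the faster of the two.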

Minkowski Distance Metric

d(x, y) = ( Σₖ |xₖ − yₖ|ʳ )^(1/r)

When
r = 1: The formula is Manhattan Distance
r = 2: The formula is Euclidean Distance
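The general metric is a one-liner; a minimal sketch assuming equal-length lists (the function name is my own):

```python
def minkowski(x, y, r):
    # Generalized distance: r=1 gives Manhattan, r=2 gives Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)
```

Larger values of r increasingly emphasize the dimension with the biggest difference.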

2. If the data is subject to grade-inflation (different users may be using different scales) use Pearson's R.
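Pearson's r can be computed in a single pass with the common sum-of-products form; this is a minimal sketch assuming two equal-length rating lists with no missing values (the function name is my own):

```python
from math import sqrt

def pearson(x, y):
    # Single-pass form of the Pearson correlation coefficient.
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = sum_xy - (sum_x * sum_y) / n
    den = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return num / den if den != 0 else 0.0
```

Because r compares how ratings co-vary rather than their absolute values, two users who rank items in the same order score close to 1 even if one of them rates everything a couple of points higher.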

3. If the data is sparse, consider using Cosine Similarity.

cos(x, y) = (x · y) / (‖x‖ × ‖y‖)

where · indicates the dot product and ‖x‖ indicates the length of the vector x, calculated as ‖x‖ = √(Σ xᵢ²).
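The formula maps directly to code; a minimal sketch assuming two equal-length vectors as Python lists (the function name is my own):

```python
from math import sqrt

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    len_x = sqrt(sum(a * a for a in x))
    len_y = sqrt(sum(b * b for b in y))
    return dot / (len_x * len_y)
```

Because zero-valued attributes contribute nothing to the dot product, cosine similarity effectively ignores the many unrated items in sparse data.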

Relying on a single “most similar” person is a problem: any quirk that person has gets passed on as a recommendation. Instead, we can use K-nearest neighbor.

K-nearest neighbor

We use k most similar people to determine recommendations. The best value for k is application specific—you will need to do some experimentation.
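A minimal sketch of the idea, using Manhattan distance over co-rated items and a plain average of the k neighbors' ratings (the book's version weights neighbors by similarity; the function names and the dict-of-dicts ratings layout here are my own assumptions):

```python
def manhattan(r1, r2):
    # Distance over the items both users have rated.
    return sum(abs(r1[item] - r2[item]) for item in r1 if item in r2)

def knn_recommend(target, ratings, k=2):
    # Find the k users closest to the target.
    neighbours = sorted(
        (u for u in ratings if u != target),
        key=lambda u: manhattan(ratings[target], ratings[u]))[:k]
    # Collect the neighbours' ratings for items the target hasn't seen.
    seen = ratings[target]
    scores = {}
    for u in neighbours:
        for item, rating in ratings[u].items():
            if item not in seen:
                scores.setdefault(item, []).append(rating)
    # Average each unseen item's ratings, best candidates first.
    return sorted(((item, sum(v) / len(v)) for item, v in scores.items()),
                  key=lambda pair: -pair[1])
```

Raising k smooths out individual quirks but dilutes the influence of the truly closest matches, which is why the best k has to be found by experiment.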