I'm in the process of designing a website built around the concept of recommending items to users based on their tastes (e.g. items they've rated, items they've added to their favorites list, etc.). Some examples of this are Amazon, MovieLens, and Netflix.
Now, my problem is that I'm not sure where to start with the mathematical part of this system. I'm willing to learn whatever math is required; I just don't know what kind of math that is.
I've looked at a few of the publications over at GroupLens.org, specifically "Towards a Scalable kNN CF Algorithm: Exploring Effective Applications of Clustering" (pdf). I can follow everything up to page 5, "Prediction Generation".
P.S. I'm not exactly looking for an explanation of what's going on in that section (though that might be helpful); I'm more interested in the math I need to know, so that I can understand it myself.
Let me explain the procedure that the authors introduced (as I understood it): given the training data (users, items, and their ratings) and a target user, it predicts the target user's rating for a particular item. This can be repeated for a bunch of items, and then we return the top-N items (those with the highest predicted ratings).
The algorithm is very similar to the naive kNN method: search all the training data to find users whose ratings are similar to the target user's, then combine their ratings to produce a prediction (voting). This simple method does not scale well as the number of users/items increases.
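For intuition, here is a minimal sketch of that naive user-based approach in Python. The data layout (a dict mapping each user to an {item: rating} dict) and the function names are my own assumptions, not from the paper:

```python
import numpy as np

def pearson(a, b):
    """Correlation between two users, computed over the items both rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    x = np.array([a[i] for i in common], dtype=float)
    y = np.array([b[i] for i in common], dtype=float)
    if x.std() == 0.0 or y.std() == 0.0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def predict_naive_knn(ratings, target, item, k=20):
    """Scan every other user, keep the k most similar ones who rated
    `item`, and average their ratings weighted by similarity (voting)."""
    neighbors = []
    for user, user_ratings in ratings.items():
        if user == target or item not in user_ratings:
            continue
        sim = pearson(ratings[target], user_ratings)
        if sim > 0:
            neighbors.append((sim, user_ratings[item]))
    neighbors.sort(reverse=True)
    top = neighbors[:k]
    if not top:
        return None
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

# ratings = {"alice": {"item1": 4, "item2": 3}, "bob": {"item1": 5}, ...}
# predict_naive_knn(ratings, "alice", "item2")
```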
The algorithm proposed is to first cluster the training users into K groups (groups of people who rated items similarly), where K << N (N is the total number of users).
Then we scan those clusters to find which one the target user is closest to (instead of looking at all the training users).
Finally we pick the l closest of those clusters and make our prediction as an average weighted by the similarity (correlation) to those l clusters.
Note that the similarity measure used is the correlation coefficient, and the clustering algorithm is bisecting k-means. We could simply use standard k-means instead, and we could use other similarity metrics as well, such as Euclidean distance or cosine distance.
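Here is a rough sketch of that clustered variant, assuming a dense user-item matrix with NaN for unrated entries and using scikit-learn's plain KMeans rather than the paper's bisecting version; the mean-filling step and the function names are my own simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(R, K=50, seed=0):
    """R: (n_users x n_items) matrix with NaN where a user has not rated.
    Fill missing entries with each user's mean, cluster the users, and
    return the fitted model plus each cluster's mean rating vector."""
    user_means = np.nanmean(R, axis=1, keepdims=True)
    filled = np.where(np.isnan(R), user_means, R)
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(filled)
    centroids = np.vstack([filled[km.labels_ == c].mean(axis=0) for c in range(K)])
    return km, centroids

def closest_clusters(target_vector, centroids, l=5):
    """Rank clusters by correlation with the target user and keep the top l."""
    sims = np.array([np.corrcoef(target_vector, c)[0, 1] for c in centroids])
    top = np.argsort(sims)[::-1][:l]
    return top, sims[top]
```

The similarities returned by closest_clusters then serve as the weights in the prediction formula below.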
The first formula on page 5 is the definition of the (Pearson) correlation, computed over the items that both users have rated:
corr(x,y) = cov(x,y) / (std(x) * std(y))
          = sum_i (x_i - mean(x)) * (y_i - mean(y)) / (n * std(x) * std(y))

where n is the number of co-rated items and std is the population standard deviation.
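As a quick sanity check on that formula (the rating vectors here are made up, and I use the population standard deviation so the normalizations match NumPy's):

```python
import numpy as np

x = np.array([4.0, 3.0, 5.0, 2.0])   # target user's ratings on co-rated items
y = np.array([5.0, 2.0, 4.0, 1.0])   # other user's ratings on the same items

manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) * x.std() * y.std())
print(manual)                   # ~0.8485
print(np.corrcoef(x, y)[0, 1])  # same value
```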
The second formula is basically a weighted average:
predRating = sum_i( rating_i * corr(target, cluster_i) ) / sum_i( corr(target, cluster_i) )

where i runs over the selected top-l clusters, rating_i is cluster i's (centroid) rating for the item, and corr(target, cluster_i) is the target user's correlation with that cluster.
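In code, that weighted average is just a few lines; the argument names and the numbers in the example are mine:

```python
def predict_rating(cluster_ratings, sims):
    """cluster_ratings: each selected cluster's (centroid) rating for the item.
    sims: the target user's correlation with each of those clusters."""
    den = sum(sims)
    if den == 0:
        return None
    return sum(r * s for r, s in zip(cluster_ratings, sims)) / den

# three clusters rate the item 4, 3 and 5; similarities to the target are 0.9, 0.5, 0.2
print(predict_rating([4, 3, 5], [0.9, 0.5, 0.2]))  # 3.8125
```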
Hope this clarifies things a little bit :)
Programming Collective Intelligence is a really user-friendly introduction to the field, with lots of example code in Python. At the very least, it will help set the stage for understanding the math in the academic papers on the topic.
Algorithms of the Intelligent Web (H. Marmanis, D. Babenko; Manning Publications) is an introductory text on the subject. It also covers search concepts, but its main focus is on classification, recommendation systems, and the like. It should be a good primer for your project, allowing you to ask the right questions and to dig deeper where things look more promising or practical in your situation.
The book also includes a "refresher" of relevant math topics (mainly linear algebra), but this refresher is minimal; you'll do better on the web.
A pleasant way to discover or get back into linear algebra is to follow Prof. Gilbert Strang's 18.06 lecture series, available on MIT OpenCourseWare.
Linear algebra is not the only path to salvation ;-) You may also find it useful to brush up on basic statistics concepts such as distributions, covariance, and Bayesian inference...
You probably ought to know:
Nice to have:
That said, you can go far with just common sense. If you have a list of properties you want your system to satisfy, you will be able to do a lot just by writing code that satisfies those properties.
Examples might be: