This is a post recording my effort on implementing ALS in Ligra.
The reference implementation that I started with is GraphChi’s Alternating Least Squares implementation. If I have the time, I will try add an Stochastic Gradient Descent implementation.
The bets paper I found that details how SGD and ALS collaborative filtering works is the following
We model our code after this particular implementation, which is supposed to be modeled after GraphLab (Datos)
To run this implementation with a small input, I followed the instruction of the following post
Ligra is a high performance graph framework created by Julian Shun at CMU for his Phd. It has performance comparable to Galois on many applications. Its PageRank is about 2-3x slower than best published results obtained by GraphMat as of Sep 2015. But the framework is very easy to work with and the API is much more flexible, making it a good target to try out my optimizations.
The git repository is here
One thing to note is that for collaborative filtering, we will be working with a weighted graph, that is each edge has a weight associated with it (the rating). The first thing I had to deal with is to get the right adjacency format into the framework,
(1) The header has to be WeightedAdjacencyGraph (2) The weights are at the end of the edges,
This is a sample weighted adjacency matrix I created with 5 nodes and 7 edges
To implement ALS, I decided to adopt a strategy of using vertexMap. But I plan to try out different ways, such as using an edge map to accumulate observation and latent vectors. But the problem I see with that approach is that a large intermediate array of data has to be created.
Now that I have some basic program working with Ligra here that can scan all vertices and their in-neighbors
I have to figure out a format that would work for both GraphChi and Ligra. GraphChi’s collaborative filtering uses a Matrix Market format, that is a little bit annoying. It is not an adjacency matrix format, it is not a N by N matrix (N is the number of vertices). Instead it is a N by M matrix (N is the number of users, M is the number of movies). The content of each cell is the rating for the particular movie by the user.
Well, it seems like we just have to do something special for collaborative filtering when it comes to graph format. Apparently, the original netflix dataset also comes in a similar to Matrix Market format, not really framed as a graph problem.
- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
I have yet to figure out the Yahoo music dataset format. In this post, I would be happy if we can get Netflix data.
The way I think I am going to go about getting the datasets into graphs is writing a MatrixMarket to EdgeList converter. Unfortunately, EdgeList is still the best common data structure.
GraphChi does this in toolkilts/collaborative_filtering/io.hpp, convert_matrix_market
sharderobj.preprocessing_add_edge(I, (M==N && allow_square)?J:M + J, als_edge_type((float)val));
M is the number of rows. What this does is assign a new ID to the destination, to make sure destination doesn’t overlap with the source. M+J, is maximum user ID + the movie ID.
I plan to follow the same procedure in generating the graph edge list. I generated Intel format of edge graph and had a converter that generates Ligra weighted graph from our CSR representation of the graph. One more thing to note is that we need to make the graph symmetric, as ALS first go through the users and then the movies, the connections need to be both ways.
Additionally, the user IDs need to be contiguous and the movie IDs need to be contiguous too to make the algorithm work. I need to keep this in mind when we start sorting the vertices based on degree later. This would require a different set of sorting mechanisms that sort the first contiguous users’ vertices and then the movies’ vertices.
Now coming back to implement the update function and the compute function, I decided to reuse as much code from GraphChi as possible. There are a few simple operations, such as triangular form and solve it using Cholesky decomposition.
The original code is here
I decided to reuse the vertex data structures and the Eigen library, along with the Eigen Wrapper.
It turns out that Eigen is a really really easy to use library as it is all header files. All you need to do is copy the eigen directory in the apps directory and copy over the following wrapper file
The INSTALL instruction for Eigen is easy to read, use method 1 and copy the folder to the apps directory.
After that, include eigen_wrapper in the ALS file and it all compiles. There might be some warning messages, I ignored the warning messages for now for convenience.
Once we have the functions and everything, we can quickly copy over the relevant code to get ALS implemented as this version of ALS file
At this stage, we surpassingly have the functionalities of a complete collaborative filtering. However, we have no way of verifying correctness. As a result, the next steps will be computing and printing the least square errors and compare it against GraphChi’s output. Additionally, we might need a smaller common test dataset if we want to speed up the process.
The output for running on smallnetflix data with the command shown in the blog post
yunming@lanka:graphchi-cpp$ ./toolkits/collaborative_filtering/als –training=data/smallnetflix_mm –validation=data/smallnetflix_mme –lambda=0.065 –minval=1 –maxval=5 –max_iter=6 –quiet=1
WARNING: common.hpp(print_copyright:214): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to firstname.lastname@example.org
[training] => [data/smallnetflix_mm]
[validation] => [data/smallnetflix_mme]
[lambda] => [0.065]
[minval] => 
[maxval] => 
[max_iter] => 
[quiet] => 
=== REPORT FOR sharder() ===
edata_flush: 0.782395s (count: 3, min: 0.095316s, max: 0.432129, avg: 0.260798s)
execute_sharding: 1.67286 s
finish_shard.sort: 0.019674 s
preprocessing: 0.301614 s
shard_final: 1.1544 s
6.75944) Iteration: 0 Training RMSE: 1.49727 Validation RMSE: 1.25809
10.0998) Iteration: 1 Training RMSE: 0.791493 Validation RMSE: 1.18716
13.4461) Iteration: 2 Training RMSE: 0.706001 Validation RMSE: 1.16641
16.7715) Iteration: 3 Training RMSE: 0.675242 Validation RMSE: 1.16117
20.0819) Iteration: 4 Training RMSE: 0.656819 Validation RMSE: 1.15222
23.404) Iteration: 5 Training RMSE: 0.654305 Validation RMSE: 1.1495
We used the square root RMSE and we only measure the training phase as we don’t have validation phase set up quite yet.
One last thing I had to fix is that when calculating RMSE, we really want to sum up for the users, not the movies. So we need have some way of identifying which nodes are users, which are movies. GraphChi does this by recording the out degree. The movie have 0 out degree. However, this is was difficult to implement in Ligra because I doubled the number of edges and movies have out degree as well.
One iteration in the program would be one iteration on users with movies fixed, then one iteration on movies with user fixed (alternating iteration).
Currently, I pass a command line argument -nusers to the program, so it will accumulate RMSE for only the user nodes.
One other caveat, the number of edges that we divide by in calculating RMSE in graphchi is the number of ratings. However, we doubled the number of edges in Ligra. As a result, we had to reduce the number of edges by two.
The relevant code in Graphchi for calculating RMSE is here
By default, the loss_type is 1 (starting from 0), so it would be RMSE, sqrt(rmse/num_edges)
We mimic the same way of calculating the RMSE.
Right now with all the components I have, my work flow is
First get the matrix market data, run it through my parser to convert it to a graph format. Run a page rank (weighted compilation, has to be a push version because Ligra’s input adjacency graph is a push graph) to generate a CSR. (In the future, there should some converter that converts an edgelist to a binary CSR format, I should put this on schedule).
Once I have the CSR binary, I can run writeAdj to write a adjacency format that Ligra can recognize. Then I can run my version of ALS and compare against GraphChi.