This is a post documenting my effort to build a large collaborative filtering dataset.
Netflix has 29.4 million users and 36,000 movies, according to some of the reports below.
However, the Netflix dataset created for the Netflix Prize has only
480,189 (~0.48M) users, 17,770 movies, and 99,072,112 ratings.
The number of users is far too small, the number of movies is about a factor of 2 less than the real catalog, and we have no idea how many ratings exist in total.
To create a more realistic dataset, we try to double the number of users and the number of movies, assuming the expanded data follows the same distribution. We start with the Matrix Market format, with the following content:
480189 17770 99072112
1 1 3
2 1 5
… (more ratings)
480189 (#users) 17770 (#movies) 99072112 (#ratings)
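A minimal sketch of parsing this format in Python (the function name and the in-memory representation are my own choices, not part of the original workflow):

```python
def read_ratings(lines):
    """Parse Matrix Market-style rating lines.

    The first line is the header "#users #movies #ratings"; every
    following line is one rating "user movie rating".
    Returns (header, ratings), where ratings is a list of tuples.
    """
    it = iter(lines)
    n_users, n_movies, n_ratings = map(int, next(it).split())
    ratings = [tuple(map(int, line.split())) for line in it if line.strip()]
    return (n_users, n_movies, n_ratings), ratings
```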
To do this, we use a two-phase process. First we double the number of movies by creating an extra non-zero entry for every rating (each line is a non-zero entry), offsetting the movie id of the duplicate by the total number of movies (17770), so the first extra entry looks like
1 17771 3
This preserves the mapping and distribution of the original movies, since there is a one-to-one relationship between each movie and its clone.
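The movie-doubling phase can be sketched as follows (a sketch under my own naming, not the author's actual script; ratings are assumed to be (user, movie, rating) tuples):

```python
def double_movies(ratings, n_movies):
    """Phase 1: for every rating (user, movie, r), also emit a
    duplicate rating against a cloned movie whose id is offset
    by the original movie count n_movies."""
    out = []
    for user, movie, r in ratings:
        out.append((user, movie, r))
        out.append((user, movie + n_movies, r))
    return out
```

Applied to the first rating `1 1 3` with `n_movies=17770`, this produces the extra entry `1 17771 3` shown above.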
We can do the same for the number of users, adding the total number of users (480189) to the first number (the user id) of each line:
480190 1 3
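The user-doubling phase is symmetric (again a sketch with hypothetical names):

```python
def double_users(ratings, n_users):
    """Phase 2: for every rating (user, movie, r), also emit a
    duplicate rating from a cloned user whose id is offset
    by the original user count n_users."""
    out = []
    for user, movie, r in ratings:
        out.append((user, movie, r))
        out.append((user + n_users, movie, r))
    return out
```

Running phase 1 and then phase 2 doubles the rating count twice, which is where the 4x below comes from.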
Once both phases are done, we have expanded the dataset by 4x.
We can also run either phase on its own, for example to create only more user data.
The new header would look like
960378 35540 396288448
960378 (2x #users) 35540 (2x #movies) 396288448 (4x #ratings)
When we convert this into a graph, we have only doubled the number of nodes (#users + #movies), but we have created 4x more edges (#ratings), hopefully maintaining the power law degree distribution of the original graph.