This is a post documenting my effort to build a large collaborative filtering dataset.

Netflix has **29.4 million users and 36,000 movies, according to some of the reports below:**

https://en.wikipedia.org/wiki/Netflix

https://www.quora.com/Sounds-Downlaod-Do/How-many-movies-does-Netflix-have

However, the Netflix dataset created for the Netflix prize has only

480,189 (~0.5M) users, 17,770 movies, and 99,072,112 (~99M) ratings.

The number of users is far too small, and the number of movies is about a factor of 2 less than the real catalog. We also have no idea how many ratings exist in total on the real service.

To create a more realistic dataset, we try to double both the number of users and the number of movies, assuming the same distribution holds. We start with the Matrix Market format, with the following content:

480189 17770 99072112

1 1 3

2 1 5

… (more ratings)

480189 (#users) 17770 (#movies) 99072112 (#ratings)

To do this, we use a two-phase process. First we double the number of movies: for every rating (each line is a non-zero entry), we create a duplicate entry whose movie id is offset by the total number of movies (17,770). So the first duplicate looks like

1 17771 3

This preserves the mapping and distribution of the original movies, since there is a one-to-one relationship between each original movie and its copy.
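The movie-doubling phase can be sketched as a few lines of Python. The function name and in-memory triple representation are illustrative (the post works on the raw text file), but the offset logic is exactly as described:

```python
def double_movies(ratings, n_movies):
    """For each (user, movie, rating) triple, also emit a copy
    whose movie id is offset by n_movies (17770 for Netflix)."""
    out = []
    for user, movie, rating in ratings:
        out.append((user, movie, rating))
        out.append((user, movie + n_movies, rating))
    return out

ratings = [(1, 1, 3), (2, 1, 5)]
doubled = double_movies(ratings, 17770)
# the copy of (1, 1, 3) is (1, 17771, 3), matching the example above
```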

We can do the same for users, adding the total number of users (480,189) to the first number (the user id) of each line:

480190 1 3

Once both phases are done, we have expanded the number of ratings by 4x.
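Putting the two phases together, a minimal self-contained sketch (function names are mine, not from the post) shows why the result is 2x users, 2x movies, and 4x ratings:

```python
def expand(ratings, n_users, n_movies):
    """Run both phases: double movies, then double users.
    Each original rating ends up as 4 entries."""
    # Phase 1: duplicate each rating with the movie id offset.
    phase1 = []
    for u, m, r in ratings:
        phase1.append((u, m, r))
        phase1.append((u, m + n_movies, r))
    # Phase 2: duplicate again with the user id offset.
    phase2 = []
    for u, m, r in phase1:
        phase2.append((u, m, r))
        phase2.append((u + n_users, m, r))
    return phase2

sample = [(1, 1, 3), (2, 1, 5)]
expanded = expand(sample, 480189, 17770)

# The new header is derived from the old one:
header = (2 * 480189, 2 * 17770, 4 * 99072112)
```

Each rating is copied once in phase 1 and both copies are copied again in phase 2, which is where the 4x comes from.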

Either phase can also be run on its own (or repeatedly) to create more user or movie data separately.

The new header would look like

960378 35540 396288448

960378 (2x #users) 35540 (2x #movies) 396288448 (4x #ratings)

When we convert this into a graph, we have only doubled the number of nodes (#users + #movies), but we have created 4x more edges (#ratings), hopefully maintaining the same power-law distribution as the original graph.
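A quick sanity check on the node and edge counts, treating the dataset as a bipartite user-movie graph (numbers are the ones from the post):

```python
# Bipartite graph: one node per user and per movie, one edge per rating.
users, movies, ratings = 480189, 17770, 99072112

nodes = users + movies               # original node count
edges = ratings                      # original edge count

new_nodes = 2 * users + 2 * movies   # both id spaces doubled
new_edges = 4 * ratings              # each rating copied 4 ways

# Nodes double while edges quadruple, so the average degree
# (2 * edges / nodes) doubles in the expanded graph.
```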
