Expanding the Netflix dataset

This post documents my effort to build a large collaborative filtering dataset.

According to some published reports, Netflix has 29.4 million users and 36,000 movies.



However, the dataset released for the Netflix Prize has only

480,189 (about 0.48M) users, 17,770 movies, and 99,072,112 ratings.

That is about 60 times fewer users than the real service, and about half as many movies. We have no idea how many ratings the real service holds in total.

To create a more realistic dataset, we double both the number of users and the number of movies, assuming the new entries follow the same distribution as the originals. We start from a file in Matrix Market format with the following content:

480189 17770 99072112

1   1  3

2   1  5

… (more ratings)

The header line gives 480189 (#users), 17770 (#movies) and 99072112 (#ratings). Every following line is one rating: user id, movie id, rating value.

We do this in two phases. First we double the number of movies: for every rating (each line is one non-zero entry) we create an extra entry whose movie id is the original id plus the total number of movies (17770), so the first extra entry looks like

1   17771  3

This preserves the rating distribution of the original movies, since every new movie corresponds one-to-one to an original movie.

In the second phase we do the same for users, adding the total number of users (480189) to the first number of each line:

480190   1   3

Once both phases are done, the dataset has 4x as many ratings.
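The two phases above can be sketched in Python as follows. This is only a minimal illustration, not the script actually used for this post: the function names (double_movies, double_users, expand) are made up for the example, and ratings are assumed to be already parsed into (user, movie, rating) tuples with 1-based ids.

```python
def double_movies(ratings, n_movies):
    """Phase 1: for each rating, also emit a copy whose movie id is offset by n_movies."""
    for user, movie, rating in ratings:
        yield user, movie, rating
        yield user, movie + n_movies, rating


def double_users(ratings, n_users):
    """Phase 2: for each rating, also emit a copy whose user id is offset by n_users."""
    for user, movie, rating in ratings:
        yield user, movie, rating
        yield user + n_users, movie, rating


def expand(ratings, n_users, n_movies):
    """Run phase 1, then phase 2 on its output, giving 4x the ratings."""
    return list(double_users(double_movies(ratings, n_movies), n_users))


# Toy input (hypothetical): 2 users, 1 movie, 2 ratings.
toy = [(1, 1, 3), (2, 1, 5)]
expanded = expand(toy, n_users=2, n_movies=1)
```

Because both phases are generators, the real file could be streamed line by line instead of being held in memory; only the toy example materializes the result as a list.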

We can also run a single phase on its own, for example only the user phase to create more user data.

The new header would look like

960378 35540 396288448

960378 (2x #users) 35540 (2x #movies) 396288448 (4x ratings)
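The header arithmetic can be checked with a few lines of Python. The helper name expanded_header is hypothetical and exists only for this check.

```python
def expanded_header(n_users, n_movies, n_ratings):
    """Header after doubling users and movies: 2x rows, 2x columns, 4x entries."""
    return f"{2 * n_users} {2 * n_movies} {4 * n_ratings}"


header = expanded_header(480189, 17770, 99072112)
```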

When we convert this into a graph, we only double the number of nodes (#users + #movies) but create 4x as many edges (#ratings), hopefully maintaining the power-law distribution of the original graph.
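One way to see what happens to the degrees is to count ratings per user before and after the expansion on a toy input. The sketch below is an illustration with made-up names (expand_4x, user_degrees) and made-up data, not a measurement on the real dataset. Every original user ends up with twice as many ratings (because each of their movies now has a copy), and every copied user has the same doubled degree. The degree histogram therefore keeps its shape with all degrees scaled by 2, which is why we would expect the slope on a log-log plot to stay the same.

```python
from collections import Counter


def expand_4x(ratings, n_users, n_movies):
    """Apply both doubling phases to (user, movie, rating) triples."""
    out = []
    for user, movie, rating in ratings:
        for user_offset in (0, n_users):
            for movie_offset in (0, n_movies):
                out.append((user + user_offset, movie + movie_offset, rating))
    return out


def user_degrees(ratings):
    """Number of ratings (edges) attached to each user node."""
    return Counter(user for user, _movie, _rating in ratings)


# Toy input (hypothetical): user 1 rated two movies, user 2 rated one.
toy = [(1, 1, 3), (1, 2, 4), (2, 1, 5)]
before = user_degrees(toy)
after = user_degrees(expand_4x(toy, n_users=2, n_movies=2))
```

The same count over movie ids would show the mirror effect: each movie's degree doubles because every user now has a copy.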
