{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab assignment 1\n",
"## RecSys Hello World!\n",
"Code some very basic recommending algorithm from scratch and evaluate usefulness of such recommendations.\n",
"\n",
"- Start with a MovieLens dataset (choose which one you prefer, ML-Latest-small already available from labs folder, others for download from https://grouplens.org/datasets/movielens/)\n",
" - if you select larger datasets (ML-latest), apply some pre-processing to limit its size (e.g. only movies from last few years)\n",
"- Expand the dataset with a new user (you) and your preference on some of the movies (at least 5-10)\n",
"- Implement some variant of user-based or item-based KNN \n",
" - **Describe in comments, what exactly you did: what similarity metric, what aggregation, any pre-processing, any specialties?**\n",
"- Use the system to recommend to you. \n",
" - **Self-evaluate the recommendations. Did you like the recommendations? If not, why not?**\n",
" \n",
"### Assignment completion \n",
"\n",
"- Option 1: Finalize during the labs, let the teacher check your code and give you the credits.\n",
"- Option 2: Upload your solution to the SIS study roaster (Labs #1). Your solution should be evaluated top-down (i.e., output of all cells visible). Do not forget to provide comments / assesments wherever asked to do so. **Deadline: sunday before the next labs**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pre-processing\n",
"### Load dataset\n",
"- using Pandas to load the MovieLens-latest dataset. \n",
"- make some basic stats to simplify you finding relevant movies"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
title
\n",
"
genres
\n",
"
RatingCount
\n",
"
year
\n",
"
\n",
"
\n",
"
movieId
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
122912
\n",
"
Avengers: Infinity War - Part I (2018)
\n",
"
Action|Adventure|Sci-Fi
\n",
"
13.0
\n",
"
2018.0
\n",
"
\n",
"
\n",
"
187593
\n",
"
Deadpool 2 (2018)
\n",
"
Action|Comedy|Sci-Fi
\n",
"
12.0
\n",
"
2018.0
\n",
"
\n",
"
\n",
"
122918
\n",
"
Guardians of the Galaxy 2 (2017)
\n",
"
Action|Adventure|Sci-Fi
\n",
"
27.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
168252
\n",
"
Logan (2017)
\n",
"
Action|Sci-Fi
\n",
"
25.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
122916
\n",
"
Thor: Ragnarok (2017)
\n",
"
Action|Adventure|Sci-Fi
\n",
"
20.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
176371
\n",
"
Blade Runner 2049 (2017)
\n",
"
Sci-Fi
\n",
"
18.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
122926
\n",
"
Untitled Spider-Man Reboot (2017)
\n",
"
Action|Adventure|Fantasy
\n",
"
16.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
168250
\n",
"
Get Out (2017)
\n",
"
Horror
\n",
"
15.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
143355
\n",
"
Wonder Woman (2017)
\n",
"
Action|Adventure|Fantasy
\n",
"
13.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
174055
\n",
"
Dunkirk (2017)
\n",
"
Action|Drama|Thriller|War
\n",
"
13.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
177765
\n",
"
Coco (2017)
\n",
"
Adventure|Animation|Children
\n",
"
13.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
179819
\n",
"
Star Wars: The Last Jedi (2017)
\n",
"
Action|Adventure|Fantasy|Sci-Fi
\n",
"
12.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
122906
\n",
"
Black Panther (2017)
\n",
"
Action|Adventure|Sci-Fi
\n",
"
11.0
\n",
"
2017.0
\n",
"
\n",
"
\n",
"
122904
\n",
"
Deadpool (2016)
\n",
"
Action|Adventure|Comedy|Sci-Fi
\n",
"
54.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
152081
\n",
"
Zootopia (2016)
\n",
"
Action|Adventure|Animation|Children|Comedy
\n",
"
32.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
166528
\n",
"
Rogue One: A Star Wars Story (2016)
\n",
"
Action|Adventure|Fantasy|Sci-Fi
\n",
"
27.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
164179
\n",
"
Arrival (2016)
\n",
"
Sci-Fi
\n",
"
26.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
122920
\n",
"
Captain America: Civil War (2016)
\n",
"
Action|Sci-Fi|Thriller
\n",
"
22.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
122922
\n",
"
Doctor Strange (2016)
\n",
"
Action|Adventure|Sci-Fi
\n",
"
22.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
136864
\n",
"
Batman v Superman: Dawn of Justice (2016)
\n",
"
Action|Adventure|Fantasy|Sci-Fi
\n",
"
16.0
\n",
"
2016.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title \\\n",
"movieId \n",
"122912 Avengers: Infinity War - Part I (2018) \n",
"187593 Deadpool 2 (2018) \n",
"122918 Guardians of the Galaxy 2 (2017) \n",
"168252 Logan (2017) \n",
"122916 Thor: Ragnarok (2017) \n",
"176371 Blade Runner 2049 (2017) \n",
"122926 Untitled Spider-Man Reboot (2017) \n",
"168250 Get Out (2017) \n",
"143355 Wonder Woman (2017) \n",
"174055 Dunkirk (2017) \n",
"177765 Coco (2017) \n",
"179819 Star Wars: The Last Jedi (2017) \n",
"122906 Black Panther (2017) \n",
"122904 Deadpool (2016) \n",
"152081 Zootopia (2016) \n",
"166528 Rogue One: A Star Wars Story (2016) \n",
"164179 Arrival (2016) \n",
"122920 Captain America: Civil War (2016) \n",
"122922 Doctor Strange (2016) \n",
"136864 Batman v Superman: Dawn of Justice (2016) \n",
"\n",
" genres RatingCount year \n",
"movieId \n",
"122912 Action|Adventure|Sci-Fi 13.0 2018.0 \n",
"187593 Action|Comedy|Sci-Fi 12.0 2018.0 \n",
"122918 Action|Adventure|Sci-Fi 27.0 2017.0 \n",
"168252 Action|Sci-Fi 25.0 2017.0 \n",
"122916 Action|Adventure|Sci-Fi 20.0 2017.0 \n",
"176371 Sci-Fi 18.0 2017.0 \n",
"122926 Action|Adventure|Fantasy 16.0 2017.0 \n",
"168250 Horror 15.0 2017.0 \n",
"143355 Action|Adventure|Fantasy 13.0 2017.0 \n",
"174055 Action|Drama|Thriller|War 13.0 2017.0 \n",
"177765 Adventure|Animation|Children 13.0 2017.0 \n",
"179819 Action|Adventure|Fantasy|Sci-Fi 12.0 2017.0 \n",
"122906 Action|Adventure|Sci-Fi 11.0 2017.0 \n",
"122904 Action|Adventure|Comedy|Sci-Fi 54.0 2016.0 \n",
"152081 Action|Adventure|Animation|Children|Comedy 32.0 2016.0 \n",
"166528 Action|Adventure|Fantasy|Sci-Fi 27.0 2016.0 \n",
"164179 Sci-Fi 26.0 2016.0 \n",
"122920 Action|Sci-Fi|Thriller 22.0 2016.0 \n",
"122922 Action|Adventure|Sci-Fi 22.0 2016.0 \n",
"136864 Action|Adventure|Fantasy|Sci-Fi 16.0 2016.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#moviesDF: information about movies (pandas DataFrame)\n",
"#df: ratings in UID - OID - rating - timestamp format (pandas DataFrame)\n",
"\n",
"moviesDF = pd.read_csv(\"movies.csv\", sep=\",\")\n",
"moviesDF.movieId = moviesDF.movieId.astype(int)\n",
"moviesDF.set_index(\"movieId\", inplace=True)\n",
"\n",
"df = pd.read_csv(\"ratings.csv\", sep=\",\")\n",
"df.columns=[\"UID\",\"OID\",\"rating\",\"timestamp\"]\n",
"\n",
"ratingCounts = df.groupby(\"OID\")[\"UID\"].count()\n",
"moviesDF[\"RatingCount\"] = ratingCounts\n",
"moviesDF[\"year\"] = moviesDF.title.str.extract(r'\\(([0-9]+)\\)')\n",
"moviesDF[\"year\"] = moviesDF.year.astype(\"float\")\n",
"moviesDF.fillna(0, inplace=True)\n",
"\n",
"#use this or similar conditions to find movies you can rate. \n",
"moviesDF.loc[moviesDF.RatingCount >= 10].sort_values([\"year\",\"RatingCount\"], ascending=False).head(20)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
title
\n",
"
genres
\n",
"
RatingCount
\n",
"
year
\n",
"
\n",
"
\n",
"
movieId
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
122904
\n",
"
Deadpool (2016)
\n",
"
Action|Adventure|Comedy|Sci-Fi
\n",
"
54.0
\n",
"
2016.0
\n",
"
\n",
"
\n",
"
187593
\n",
"
Deadpool 2 (2018)
\n",
"
Action|Comedy|Sci-Fi
\n",
"
12.0
\n",
"
2018.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title genres RatingCount \\\n",
"movieId \n",
"122904 Deadpool (2016) Action|Adventure|Comedy|Sci-Fi 54.0 \n",
"187593 Deadpool 2 (2018) Action|Comedy|Sci-Fi 12.0 \n",
"\n",
" year \n",
"movieId \n",
"122904 2016.0 \n",
"187593 2018.0 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#you can also try to search for particular movie\n",
"moviesDF.loc[moviesDF.title.str.contains(\"Deadpool\")]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# update this with your real ratings (note that OID = movieId from the moviesDF)\n",
"myRatings = {\n",
" \"UID\":[611,611],\n",
" \"OID\":[174055,152081],\n",
" \"rating\":[4.0,5.0],\n",
" \"timestamp\":[0,0]\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
UID
\n",
"
OID
\n",
"
rating
\n",
"
timestamp
\n",
"
movieTitle
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
1
\n",
"
4.0
\n",
"
964982703
\n",
"
Toy Story (1995)
\n",
"
\n",
"
\n",
"
1
\n",
"
1
\n",
"
3
\n",
"
4.0
\n",
"
964981247
\n",
"
Grumpier Old Men (1995)
\n",
"
\n",
"
\n",
"
2
\n",
"
1
\n",
"
6
\n",
"
4.0
\n",
"
964982224
\n",
"
Heat (1995)
\n",
"
\n",
"
\n",
"
3
\n",
"
1
\n",
"
47
\n",
"
5.0
\n",
"
964983815
\n",
"
Seven (a.k.a. Se7en) (1995)
\n",
"
\n",
"
\n",
"
4
\n",
"
1
\n",
"
50
\n",
"
5.0
\n",
"
964982931
\n",
"
Usual Suspects, The (1995)
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
100833
\n",
"
610
\n",
"
168250
\n",
"
5.0
\n",
"
1494273047
\n",
"
Get Out (2017)
\n",
"
\n",
"
\n",
"
100834
\n",
"
610
\n",
"
168252
\n",
"
5.0
\n",
"
1493846352
\n",
"
Logan (2017)
\n",
"
\n",
"
\n",
"
100835
\n",
"
610
\n",
"
170875
\n",
"
3.0
\n",
"
1493846415
\n",
"
The Fate of the Furious (2017)
\n",
"
\n",
"
\n",
"
100836
\n",
"
611
\n",
"
174055
\n",
"
4.0
\n",
"
0
\n",
"
Dunkirk (2017)
\n",
"
\n",
"
\n",
"
100837
\n",
"
611
\n",
"
152081
\n",
"
5.0
\n",
"
0
\n",
"
Zootopia (2016)
\n",
"
\n",
" \n",
"
\n",
"
100838 rows × 5 columns
\n",
"
"
],
"text/plain": [
" UID OID rating timestamp movieTitle\n",
"0 1 1 4.0 964982703 Toy Story (1995)\n",
"1 1 3 4.0 964981247 Grumpier Old Men (1995)\n",
"2 1 6 4.0 964982224 Heat (1995)\n",
"3 1 47 5.0 964983815 Seven (a.k.a. Se7en) (1995)\n",
"4 1 50 5.0 964982931 Usual Suspects, The (1995)\n",
"... ... ... ... ... ...\n",
"100833 610 168250 5.0 1494273047 Get Out (2017)\n",
"100834 610 168252 5.0 1493846352 Logan (2017)\n",
"100835 610 170875 3.0 1493846415 The Fate of the Furious (2017)\n",
"100836 611 174055 4.0 0 Dunkirk (2017)\n",
"100837 611 152081 5.0 0 Zootopia (2016)\n",
"\n",
"[100838 rows x 5 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#append your ratings to the dataFrame df. \n",
"df = pd.concat([df,pd.DataFrame(myRatings)], ignore_index=True)\n",
"\n",
"#add movieTitles to the df (for clarity)\n",
"movieTitles = moviesDF.title.loc[df.OID]\n",
"df[\"movieTitle\"] = movieTitles.values\n",
"#display the ratings dataframe. Note the new values in the last rows\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Auxiliary methods\n",
"depending on which variant of KNN you chose, the functions below might be useful for you\n",
"- jaccard similarity is probably the simplest reasonable measure of similarity between users or items\n",
" - Check https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html for Pearsons correlation\n",
" - Check https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html for Cosine sim\n",
"- if you use scikit-learn for similarity, its output is a numpy array, which is considers continuous zero-based indexes\n",
" - Note that in many datasets (MovieLens included), there are gaps in user and item IDs. Therefore, you may need to transform UIDs/OIDs to their continuous zero-based variants"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Calculate jaccard similarity between two **sets** of IDs (users or items)\n",
"def jaccard_sim(a,b):\n",
" return len(a.intersection(b))/len(a.union(b))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# for each user, get a set of all rated items\n",
"def get_itemsets(df):\n",
" itemsets = df.groupby(\"UID\").agg({\n",
" \"OID\": lambda x: set(x)\n",
" })\n",
" return itemsets\n",
"\n",
"#for each item, get a set of all users who rated this item\n",
"def get_usersets(df):\n",
" usersets = df.groupby(\"OID\").agg({\n",
" \"UID\": lambda x: set(x)\n",
" })\n",
" return usersets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Your code goes here"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"#return list of OIDs (with the size = top_k) ordered from the most relevant item\n",
"def predict(currUser, dataTrain, top_k):\n",
" #dummy code - implement\n",
" return [1,3,10,91529]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display recommendations given to yourself:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"