{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab assignment 2: RS frameworks\n", "\n", "### How to avoid writing everything yourself?\n", "- use one of the many available RecSys frameworks (https://github.com/ACMRecSys/recsys-evaluation-frameworks)\n", "- they vary in their coverage of algorithms, metrics, use cases, etc.\n", "- we want to start simple => LensKit\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1: Setting things up\n", "\n", "- install LensKit (https://lkpy.readthedocs.io/, https://github.com/lenskit/lkpy)\n", "- familiarize yourself with the framework and what it supports (data, algorithms, evaluation methods)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\lpesk\\AppData\\Roaming\\Python\\Python38\\site-packages\\pandas\\core\\computation\\expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).\n", " from pandas.core.computation.check import NUMEXPR_INSTALLED\n" ] } ], "source": [ "from lenskit import batch, topn, util\n", "from lenskit import crossfold as xf\n", "from lenskit.algorithms import Recommender, funksvd, item_knn, basic\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2: Simple recommender output\n", "- continue with the data from lab 1 (including your own ratings)\n", "- familiarize yourself with the lenskit.Recommender methods\n", "- implement a basic training loop for one algorithm (e.g., ItemItem KNN)\n", "- generate recommendations for yourself" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting code" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# update this with your real ratings (note that OID = movieId from the moviesDF)\n", "myRatings = {\n", " \"UID\":[611,611],\n", " 
\"OID\":[174055,152081],\n", " \"rating\":[4.0,5.0],\n", " \"timestamp\":[0,0]\n", " }" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>title</th><th>genres</th><th>RatingCount</th><th>year</th></tr>\n",
"    <tr><th>movieId</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>193581</th><td>Black Butler: Book of the Atlantic (2017)</td><td>Action|Animation|Comedy|Fantasy</td><td>1.0</td><td>2017.0</td></tr>\n",
"    <tr><th>193583</th><td>No Game No Life: Zero (2017)</td><td>Animation|Comedy|Fantasy</td><td>1.0</td><td>2017.0</td></tr>\n",
"    <tr><th>193585</th><td>Flint (2017)</td><td>Drama</td><td>1.0</td><td>2017.0</td></tr>\n",
"    <tr><th>193587</th><td>Bungo Stray Dogs: Dead Apple (2018)</td><td>Action|Animation</td><td>1.0</td><td>2018.0</td></tr>\n",
"    <tr><th>193609</th><td>Andrew Dice Clay: Dice Rules (1991)</td><td>Comedy</td><td>1.0</td><td>1991.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title \\\n", "movieId \n", "193581 Black Butler: Book of the Atlantic (2017) \n", "193583 No Game No Life: Zero (2017) \n", "193585 Flint (2017) \n", "193587 Bungo Stray Dogs: Dead Apple (2018) \n", "193609 Andrew Dice Clay: Dice Rules (1991) \n", "\n", " genres RatingCount year \n", "movieId \n", "193581 Action|Animation|Comedy|Fantasy 1.0 2017.0 \n", "193583 Animation|Comedy|Fantasy 1.0 2017.0 \n", "193585 Drama 1.0 2017.0 \n", "193587 Action|Animation 1.0 2018.0 \n", "193609 Comedy 1.0 1991.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#moviesDF and df: same as in labs1\n", "\n", "moviesDF = pd.read_csv(\"movies.csv\", sep=\",\")\n", "moviesDF.movieId = moviesDF.movieId.astype(int)\n", "moviesDF.set_index(\"movieId\", inplace=True)\n", "\n", "df = pd.read_csv(\"ratings.csv\", sep=\",\")\n", "df.columns=[\"UID\",\"OID\",\"rating\",\"timestamp\"]\n", "\n", "ratingCounts = df.groupby(\"OID\")[\"UID\"].count()\n", "moviesDF[\"RatingCount\"] = ratingCounts\n", "moviesDF[\"year\"] = moviesDF.title.str.extract(r'\\(([0-9]+)\\)')\n", "moviesDF[\"year\"] = moviesDF.year.astype(\"float\")\n", "moviesDF.fillna(0, inplace=True)\n", "moviesDF.tail()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>user</th><th>item</th><th>rating</th><th>timestamp</th><th>title</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>100833</th><td>610</td><td>168250</td><td>5.0</td><td>1494273047</td><td>Get Out (2017)</td></tr>\n",
"    <tr><th>100834</th><td>610</td><td>168252</td><td>5.0</td><td>1493846352</td><td>Logan (2017)</td></tr>\n",
"    <tr><th>100835</th><td>610</td><td>170875</td><td>3.0</td><td>1493846415</td><td>The Fate of the Furious (2017)</td></tr>\n",
"    <tr><th>100836</th><td>611</td><td>174055</td><td>4.0</td><td>0</td><td>Dunkirk (2017)</td></tr>\n",
"    <tr><th>100837</th><td>611</td><td>152081</td><td>5.0</td><td>0</td><td>Zootopia (2016)</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " user item rating timestamp title\n", "100833 610 168250 5.0 1494273047 Get Out (2017)\n", "100834 610 168252 5.0 1493846352 Logan (2017)\n", "100835 610 170875 3.0 1493846415 The Fate of the Furious (2017)\n", "100836 611 174055 4.0 0 Dunkirk (2017)\n", "100837 611 152081 5.0 0 Zootopia (2016)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# append your ratings to the DataFrame df\n", "df = pd.concat([df,pd.DataFrame(myRatings)], ignore_index=True)\n", "\n", "# add movie titles to df (for clarity)\n", "movieTitles = moviesDF.title.loc[df.OID]\n", "df[\"movieTitle\"] = movieTitles.values\n", "df.columns = [\"user\",\"item\",\"rating\",\"timestamp\",\"title\"] # LensKit requires the \"user\", \"item\", \"rating\" column names\n", "\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Usage of the \"most popular\" algorithm from LensKit\n", "- Change this to use the methods you're interested in" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>item</th><th>score</th><th>user</th><th>rank</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>356</td><td>1.000000</td><td>611</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>318</td><td>0.996737</td><td>611</td><td>2</td></tr>\n",
"    <tr><th>2</th><td>296</td><td>0.993594</td><td>611</td><td>3</td></tr>\n",
"    <tr><th>3</th><td>593</td><td>0.990549</td><td>611</td><td>4</td></tr>\n",
"    <tr><th>4</th><td>2571</td><td>0.987782</td><td>611</td><td>5</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " item score user rank\n", "0 356 1.000000 611 1\n", "1 318 0.996737 611 2\n", "2 296 0.993594 611 3\n", "3 593 0.990549 611 4\n", "4 2571 0.987782 611 5" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_k = 20\n", "\n", "pop_alg = basic.PopScore(score_method='quantile')\n", "pop_alg_clone = util.clone(pop_alg) # some algorithms behave strangely when fitted multiple times, so fit a fresh clone\n", "pop_rec = Recommender.adapt(pop_alg_clone) # wrapper around an algorithm (selects the top-scoring items as recommendations by default)\n", "pop_rec.fit(df) # normally, a train-test split should be performed before this\n", "users = [611] # the users you want recommendations for; normally, all users in the test set\n", "recs = batch.recommend(pop_rec, users, top_k)\n", "recs.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>title</th><th>genres</th><th>RatingCount</th><th>year</th></tr>\n",
"    <tr><th>movieId</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>356</th><td>Forrest Gump (1994)</td><td>Comedy|Drama|Romance|War</td><td>329.0</td><td>1994.0</td></tr>\n",
"    <tr><th>318</th><td>Shawshank Redemption, The (1994)</td><td>Crime|Drama</td><td>317.0</td><td>1994.0</td></tr>\n",
"    <tr><th>296</th><td>Pulp Fiction (1994)</td><td>Comedy|Crime|Drama|Thriller</td><td>307.0</td><td>1994.0</td></tr>\n",
"    <tr><th>593</th><td>Silence of the Lambs, The (1991)</td><td>Crime|Horror|Thriller</td><td>279.0</td><td>1991.0</td></tr>\n",
"    <tr><th>2571</th><td>Matrix, The (1999)</td><td>Action|Sci-Fi|Thriller</td><td>278.0</td><td>1999.0</td></tr>\n",
"    <tr><th>260</th><td>Star Wars: Episode IV - A New Hope (1977)</td><td>Action|Adventure|Sci-Fi</td><td>251.0</td><td>1977.0</td></tr>\n",
"    <tr><th>480</th><td>Jurassic Park (1993)</td><td>Action|Adventure|Sci-Fi|Thriller</td><td>238.0</td><td>1993.0</td></tr>\n",
"    <tr><th>110</th><td>Braveheart (1995)</td><td>Action|Drama|War</td><td>237.0</td><td>1995.0</td></tr>\n",
"    <tr><th>589</th><td>Terminator 2: Judgment Day (1991)</td><td>Action|Sci-Fi</td><td>224.0</td><td>1991.0</td></tr>\n",
"    <tr><th>527</th><td>Schindler's List (1993)</td><td>Drama|War</td><td>220.0</td><td>1993.0</td></tr>\n",
"    <tr><th>2959</th><td>Fight Club (1999)</td><td>Action|Crime|Drama|Thriller</td><td>218.0</td><td>1999.0</td></tr>\n",
"    <tr><th>1</th><td>Toy Story (1995)</td><td>Adventure|Animation|Children|Comedy|Fantasy</td><td>215.0</td><td>1995.0</td></tr>\n",
"    <tr><th>1196</th><td>Star Wars: Episode V - The Empire Strikes Back...</td><td>Action|Adventure|Sci-Fi</td><td>211.0</td><td>1980.0</td></tr>\n",
"    <tr><th>2858</th><td>American Beauty (1999)</td><td>Drama|Romance</td><td>204.0</td><td>1999.0</td></tr>\n",
"    <tr><th>50</th><td>Usual Suspects, The (1995)</td><td>Crime|Mystery|Thriller</td><td>204.0</td><td>1995.0</td></tr>\n",
"    <tr><th>47</th><td>Seven (a.k.a. Se7en) (1995)</td><td>Mystery|Thriller</td><td>203.0</td><td>1995.0</td></tr>\n",
"    <tr><th>780</th><td>Independence Day (a.k.a. ID4) (1996)</td><td>Action|Adventure|Sci-Fi|Thriller</td><td>202.0</td><td>1996.0</td></tr>\n",
"    <tr><th>150</th><td>Apollo 13 (1995)</td><td>Adventure|Drama|IMAX</td><td>201.0</td><td>1995.0</td></tr>\n",
"    <tr><th>1198</th><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>Action|Adventure</td><td>200.0</td><td>1981.0</td></tr>\n",
"    <tr><th>4993</th><td>Lord of the Rings: The Fellowship of the Ring,...</td><td>Adventure|Fantasy</td><td>198.0</td><td>2001.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title \\\n", "movieId \n", "356 Forrest Gump (1994) \n", "318 Shawshank Redemption, The (1994) \n", "296 Pulp Fiction (1994) \n", "593 Silence of the Lambs, The (1991) \n", "2571 Matrix, The (1999) \n", "260 Star Wars: Episode IV - A New Hope (1977) \n", "480 Jurassic Park (1993) \n", "110 Braveheart (1995) \n", "589 Terminator 2: Judgment Day (1991) \n", "527 Schindler's List (1993) \n", "2959 Fight Club (1999) \n", "1 Toy Story (1995) \n", "1196 Star Wars: Episode V - The Empire Strikes Back... \n", "2858 American Beauty (1999) \n", "50 Usual Suspects, The (1995) \n", "47 Seven (a.k.a. Se7en) (1995) \n", "780 Independence Day (a.k.a. ID4) (1996) \n", "150 Apollo 13 (1995) \n", "1198 Raiders of the Lost Ark (Indiana Jones and the... \n", "4993 Lord of the Rings: The Fellowship of the Ring,... \n", "\n", " genres RatingCount year \n", "movieId \n", "356 Comedy|Drama|Romance|War 329.0 1994.0 \n", "318 Crime|Drama 317.0 1994.0 \n", "296 Comedy|Crime|Drama|Thriller 307.0 1994.0 \n", "593 Crime|Horror|Thriller 279.0 1991.0 \n", "2571 Action|Sci-Fi|Thriller 278.0 1999.0 \n", "260 Action|Adventure|Sci-Fi 251.0 1977.0 \n", "480 Action|Adventure|Sci-Fi|Thriller 238.0 1993.0 \n", "110 Action|Drama|War 237.0 1995.0 \n", "589 Action|Sci-Fi 224.0 1991.0 \n", "527 Drama|War 220.0 1993.0 \n", "2959 Action|Crime|Drama|Thriller 218.0 1999.0 \n", "1 Adventure|Animation|Children|Comedy|Fantasy 215.0 1995.0 \n", "1196 Action|Adventure|Sci-Fi 211.0 1980.0 \n", "2858 Drama|Romance 204.0 1999.0 \n", "50 Crime|Mystery|Thriller 204.0 1995.0 \n", "47 Mystery|Thriller 203.0 1995.0 \n", "780 Action|Adventure|Sci-Fi|Thriller 202.0 1996.0 \n", "150 Adventure|Drama|IMAX 201.0 1995.0 \n", "1198 Action|Adventure 200.0 1981.0 \n", "4993 Adventure|Fantasy 198.0 2001.0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rec_items = recs[\"item\"].values.tolist()\n", "moviesDF.loc[rec_items]" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "## Task 3: Explore parameter space\n", "- check how your recommendations change if you vary, e.g., the number of neighbors, the aggregation method, or the feedback type\n", "- try at least 5-10 configurations\n", "- note which configurations gave you good/bad results, **mark the best one**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4: Another algorithm\n", "- select another algorithm from the matrix factorization family (e.g., FunkSVD, lenskit_implicit.BPR, or ALS.BiasedMF)\n", "- construct the initial training loop and experiment a bit with the hyperparameters (e.g., learning rate, regularization, #features, #iterations for FunkSVD)\n", "- note which configurations gave you good/bad results, **mark the best one**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 5: Hyperparameter Tuning\n", "### Try to get the best hyperparameter values automatically, using off-line evaluation\n", "The following steps are described in the LensKit Getting Started tutorial (https://lkpy.readthedocs.io/en/stable/GettingStarted.html#Running-the-Evaluation)\n", "- Split the data into train and validation sets\n", "- Use grid search: define a few reasonable values for the most meaningful parameters\n", "- For each configuration:\n", "  - Train the recommending algorithm (using the train set)\n", "  - Let the algorithm recommend for all users\n", "  - Evaluate the recommendations (select a target metric of your choice, e.g., nDCG)\n", "- Select the best configuration (i.e., the one with the highest value of the target metric, e.g., nDCG)\n", "\n", "- **Use this new configuration to generate recommendations for yourself** - how good/bad were the recommendations?\n", "\n", "- Alternatively, check LensKitAuto, which can do some of the heavy lifting for you (https://github.com/ISG-Siegen/lenskit-auto/tree/main)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(80674, 5) (20164, 5)\n" ] } ], "source": [ "#TODO: define 
hyperparameter configurations to be evaluated (a simple grid search is just fine)\n", "for train, test in xf.partition_users(df, 1, xf.SampleFrac(0.2)): # one randomly sampled train/test split (20% of each user's ratings go to the test set)\n", "    print(train.shape, test.shape)\n", "    #TODO: for each hyperparameter setting, fit the algorithm, evaluate its recommendations, and store the results\n", "    #TODO: identify the best hyperparameter setting\n", "    #TODO: fit the best setting on all data and generate recommendations for yourself" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }