Collaborative filtering is the most common technique used when building intelligent recommender systems that can learn to give better recommendations as more information about users is collected.
Most websites, like Amazon, YouTube, and Netflix, use collaborative filtering as a part of their sophisticated recommendation systems. You can use this technique to build recommenders that give suggestions to a user on the basis of the likes and dislikes of similar users. In this article, you'll learn about the following topics:
- Collaborative filtering and its types
- Data needed to build a recommender
- Libraries available in Python to build recommenders
- Use cases and challenges of collaborative filtering
What Is Collaborative Filtering?
Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.
It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.
There are many ways to decide which users are similar and combine their choices to create a list of recommendations. This article will show you how to do that with Python.
The Dataset
To experiment with recommendation algorithms, you'll need data that contains a set of items and a set of users who have reacted to some of the items.
The reaction can be explicit (a rating on a scale of 1 to 5, likes or dislikes) or implicit (viewing an item, adding it to a wish list, the time spent on an article).
While working with such data, you'll mostly see it in the form of a matrix consisting of the reactions given by a set of users to some items from a set of items. Each row would contain the ratings given by a user, and each column would contain the ratings received by an item. A matrix with five users and five items could look like this:
Rating Matrix
The matrix shows five users who have rated some of the items on a scale of 1 to 5. For example, the first user has given a rating of 4 to the third item.
In most cases, the cells in the matrix are empty, as users only rate a few items. It's highly unlikely for every user to rate or react to every item available. A matrix with mostly empty cells is called sparse, and the opposite (a mostly filled matrix) is called dense.
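As a quick sketch (using a made-up 5 x 5 matrix, not the exact one in the figure above), you can represent such sparse data in numpy, with NaN marking the missing ratings:

import numpy as np

# A hypothetical 5x5 user-item matrix; np.nan marks items a user hasn't rated
ratings = np.array([
    [4.0, np.nan, 4.0, np.nan, np.nan],
    [np.nan, 5.0, np.nan, 3.0, np.nan],
    [np.nan, np.nan, 2.0, np.nan, np.nan],
    [1.0, np.nan, np.nan, np.nan, 4.5],
    [np.nan, np.nan, np.nan, 2.0, np.nan],
])

# Fraction of empty cells: the closer to 1, the sparser the matrix
sparsity = np.isnan(ratings).sum() / ratings.size
print(f"Sparsity: {sparsity:.2f}")  # Sparsity: 0.68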
A lot of data has been collected and made available to the public for research and benchmarking. Here's a list of high-quality data sources that you can choose from.
The best one to get started with is the MovieLens dataset collected by GroupLens Research. In particular, the MovieLens 100k dataset is a stable benchmark dataset with 100,000 ratings given by 943 users for 1,682 movies, with each user having rated at least 20 movies.
This dataset consists of many files that contain information about the movies, the users, and the ratings given by users to the movies they have watched. The ones of interest are the following:
- u.item: the list of movies
- u.data: the list of ratings given by users
The file u.data contains the ratings as a tab-separated list of user ID, item ID, rating, and timestamp. The first few lines of the file look like this:
First five rows of the MovieLens 100k data
As shown above, the file tells what rating a user gave to a particular movie. This file contains 100,000 such ratings, which will be used to predict the ratings of the movies the users haven't seen.
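If you've downloaded and extracted the archive, a minimal sketch for loading u.data with Pandas might look like this (the file path is an assumption about where you extracted it):

import pandas as pd

# u.data is tab-separated with no header row:
# user ID, item ID, rating, timestamp
ratings = pd.read_csv(
    "ml-100k/u.data",  # adjust to wherever you extracted the dataset
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
print(ratings.head())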
Steps Involved in Collaborative Filtering
To build a system that can automatically recommend items to users based on the preferences of other users, the first step is to find similar users or items. The second step is to predict the ratings of the items that haven't yet been rated by a user. So, you will need the answers to these questions:
- How do you determine which users or items are similar to one another?
- Given that you know which users are similar, how do you determine the rating that a user would give to an item based on the ratings of similar users?
- How do you measure the accuracy of the ratings you calculate?
There is no single answer to the first two questions. Collaborative filtering is a family of algorithms in which there are multiple ways to find similar users or items and multiple ways to calculate a rating based on the ratings of similar users. Depending on the choices you make, you end up with a particular type of collaborative filtering approach. This article examines the various approaches to finding similarities and predicting ratings.
One important point to keep in mind is that in an approach based purely on collaborative filtering, the similarity is not calculated using factors like the age of users, the genre of the movie, or any other data about users or items. It is calculated only on the basis of the rating (explicit or implicit) a user gives to an item. For example, two users can be considered similar if they give the same ratings to ten movies, despite a big difference in their ages.
The third question, about how to measure the accuracy of your predictions, also has multiple answers, which include error-calculation techniques that can be used in many places and not just in recommenders based on collaborative filtering.
One of the approaches to measure the accuracy of your result is the Root Mean Square Error (RMSE), in which you predict ratings for a test dataset of user-item pairs whose rating values are already known. The difference between the known value and the predicted value would be the error. Square all the error values for the test set, find the average (or mean), and then take the square root of that average to get the RMSE.
Another metric to measure the accuracy is Mean Absolute Error (MAE), in which you find the magnitude of each error by taking its absolute value and then average all the error values.
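Here's a minimal sketch of both metrics, using made-up actual and predicted ratings:

import numpy as np

# Hypothetical known ratings from a test set and the recommender's predictions
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 2.5])

errors = actual - predicted

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average the absolute values of the errors
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")  # RMSE: 0.612, MAE: 0.500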
You don't need to worry about the details of RMSE or MAE at this point, as they are readily available as part of various packages in Python, and you'll see them later in this article.
Now let's look at the different types of algorithms in the family of collaborative filtering.
Memory Based
The first category includes algorithms that are memory based, in which statistical techniques are applied to the entire dataset to calculate the predictions.
To find the rating R that a user U would give to an item I, the approach includes:
- Finding users similar to U who have rated the item I
- Calculating the rating R based on the ratings of users found in the previous step
You'll see each of them in detail in the following sections.
How to Find Similar Users on the Basis of Ratings
To understand the concept of similarity, let's create a simple dataset first.
The data includes four users A, B, C, and D, who have rated two movies. The ratings are stored in lists, and each list contains two numbers indicating the rating of each movie:
- Ratings by A are [1.0, 2.0].
- Ratings by B are [2.0, 4.0].
- Ratings by C are [2.5, 4.0].
- Ratings by D are [4.5, 5.0].
To start off with a visual clue, plot the ratings given by the users for the two movies on a graph and look for a pattern. The graph looks like this:
In the graph, each point represents a user and is plotted against the ratings they gave to the two movies.
Looking at the distance between the points seems to be a good way to estimate similarity, right? You can find the distance using the formula for Euclidean distance between two points. You can use the function available in scipy, as shown in the following program:
>>> from scipy import spatial
>>> a = [1, 2]
>>> b = [2, 4]
>>> c = [2.5, 4]
>>> d = [4.5, 5]
>>> spatial.distance.euclidean(c, a)
2.5
>>> spatial.distance.euclidean(c, b)
0.5
>>> spatial.distance.euclidean(c, d)
2.23606797749979
As shown above, you can use the scipy.spatial.distance.euclidean function to calculate the distance between two points. Using it to calculate the distance between the ratings of A, B, and D to that of C shows that in terms of distance, the ratings of C are closest to those of B.
You can see that user C is closest to B even by just looking at the graph. But out of A and D only, who is C closer to?
You could say C is closer to D in terms of distance. But looking at the rankings, it would seem that the choices of C would align more with those of A than with those of D, because both A and C like the second movie almost twice as much as they like the first movie, while D likes both movies equally.
So, what can you use to identify such patterns that Euclidean distance cannot? Can the angle between the lines joining the points to the origin be used to make a decision? You can take a look at the angle between the lines joining the origin of the graph to each of the points, as shown here:
The graph shows four lines joining each point back to the origin. The lines for A and B are coincident, making the angle between them zero.
You can consider that if the angle between the lines is increased, then the similarity decreases, and if the angle is zero, then the users are very similar.
To calculate similarity using angle, you need a function that returns a higher similarity or smaller distance for a lower angle and a lower similarity or larger distance for a higher angle. The cosine of an angle is a function that decreases from 1 to -1 as the angle increases from 0 to 180 degrees.
You can use the cosine of the angle to find the similarity between two users. The higher the angle, the lower the cosine, and thus the lower the similarity of the users. You can also get the cosine distance between the users by subtracting the cosine of the angle from 1.
scipy has a function that calculates the cosine distance of two vectors. It gives a larger value for a larger angle:
>>> from scipy import spatial
>>> a = [1, 2]
>>> b = [2, 4]
>>> c = [2.5, 4]
>>> d = [4.5, 5]
>>> spatial.distance.cosine(c, a)
0.004504527406047898
>>> spatial.distance.cosine(c, b)
0.004504527406047898
>>> spatial.distance.cosine(c, d)
0.015137225946083022
>>> spatial.distance.cosine(a, b)
0.0
The lower angle between the vectors of C and A gives a lower cosine distance value. If you want to rank user similarities in this way, use cosine distance.
Note: In the example above, only two movies are considered, which makes it easier to visualize the rating vectors in two dimensions. This is only done to make the explanation easier. Real use cases with multiple items would involve more dimensions in the rating vectors. You might also want to go into the mathematics of cosine similarity.
Notice that users A and B are considered absolutely similar in the cosine similarity metric despite having different ratings. This is actually a common occurrence in the real world, and users who rate the way user A does are what you can call tough raters. An example would be a movie critic who always gives out ratings lower than the average, but the rankings of the items on their list would be similar to those of an average rater like B.
To factor in such individual user preferences, you will need to bring all users to the same level by removing their biases. You can do this by subtracting the average rating given by a user to all items from each item rated by that user. Here's what it would look like:
- For user A, the rating vector [1, 2] has the average 1.5. Subtracting 1.5 from every rating would give you the vector [-0.5, 0.5].
- For user B, the rating vector [2, 4] has the average 3. Subtracting 3 from every rating would give you the vector [-1, 1].
By doing this, you have changed the value of the average rating given by every user to 0. Try doing the same for users C and D, and you'll see that the ratings are now adjusted to give an average of 0 for all users, which brings them all to the same level and removes their biases. You can verify this for yourself.
The cosine of the angle between the adjusted vectors is called the centred cosine. This approach is normally used when there are a lot of missing values in the vectors and you need to place a common value to fill up the missing values.
Filling up the missing values in the rating matrix with a random value could result in inaccuracy. A good choice to fill the missing values could be the average rating of each user, but the original averages of user A and B are 1.5 and 3 respectively, and filling up all the empty values of A with 1.5 and those of B with 3 would make them two very different users.
But after adjusting the values, the centred average of both users is 0, which allows you to capture the idea of the item being above or below average for both users more accurately, with all the missing values in both users' vectors having the same value 0.
Euclidean distance and cosine similarity are some of the approaches that you can use to find users similar to one another and even items similar to one another. (The function used above calculates cosine distance; to calculate cosine similarity, subtract the distance from 1.) Note that the formula for centred cosine is the same as that for the Pearson correlation coefficient. You will find that many resources and libraries on recommenders refer to the implementation of centred cosine as Pearson correlation.
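Here's a small sketch of that idea using numpy and scipy, with the users A and B from the earlier example:

import numpy as np
from scipy import spatial

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])

# Centre each vector by subtracting the user's own average rating
a_centred = a - a.mean()  # [-0.5, 0.5]
b_centred = b - b.mean()  # [-1.0, 1.0]

# Cosine similarity = 1 - cosine distance
similarity = 1 - spatial.distance.cosine(a_centred, b_centred)
print(similarity)  # approximately 1.0: A and B rank the movies identically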
How to Calculate the Ratings
After you have determined a list of users similar to a user U, you need to calculate the rating R that U would give to a certain item I. Again, just like similarity, you can do this in multiple ways.
You can predict that a user's rating R for an item I will be close to the average of the ratings given to I by the top 5 or top 10 users most similar to U. The mathematical formula for the average rating given by n users would look like this:
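Based on the description below, one way to write it (with R_u denoting the rating given by the u-th similar user) is:

R_U = \frac{\sum_{u=1}^{n} R_u}{n}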
This formula shows that the average rating given by the n similar users is equal to the sum of the ratings given by them divided by the number of similar users, which is n.
There will be situations where the n similar users that you found are not equally similar to the target user U. The top 3 of them might be very similar, and the rest might not be as similar to U as the top 3 are. In that case, you could consider an approach where the rating of the most similar user matters more than that of the second most similar user, and so on. The weighted average can help us achieve that.
In the weighted average approach, you multiply each rating by a similarity factor (which tells how similar the users are). By multiplying with the similarity factor, you add weights to the ratings. The heavier the weight, the more the rating matters.
The similarity factor, which would act as weights, should be the inverse of the distance discussed above, because less distance implies higher similarity. For example, you can subtract the cosine distance from 1 to get the cosine similarity.
With the similarity factor S for each user similar to the target user U, you can calculate the weighted average using this formula:
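One way to write it (with S_u as the similarity factor of the u-th similar user and R_u as their rating) is:

R_U = \frac{\sum_{u=1}^{n} S_u \cdot R_u}{\sum_{u=1}^{n} S_u}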
In this formula, every rating is multiplied by the similarity factor of the user who gave it. The final predicted rating by user U will be equal to the sum of the weighted ratings divided by the sum of the weights.
Note: In case you're wondering why the sum of weighted ratings is divided by the sum of the weights and not by n, consider this: in the earlier formula for the average, where you divided by n, the value of the weight was 1.
The denominator is always the sum of weights when it comes to finding averages, and in the case of the normal average, the weight being 1 means the denominator is equal to n.
With a weighted average, you give more consideration to the ratings of similar users in order of their similarity.
Now you know how to find similar users and how to calculate ratings based on their ratings. There's also a variation of collaborative filtering where you predict ratings by finding items similar to each other instead of users and calculating the ratings. You'll read about this variation in the next section.
User-Based vs Item-Based Collaborative Filtering
The technique in the examples explained above, where the rating matrix is used to find similar users based on the ratings they give, is called user-based or user-user collaborative filtering. If you use the rating matrix to find similar items based on the ratings given to them by users, then the approach is called item-based or item-item collaborative filtering.
The two approaches are mathematically quite similar, but there is a conceptual difference between them. Here's how the two compare:
- User-based: For a user U, with a set of similar users determined based on rating vectors consisting of given item ratings, the rating for an item I, which hasn't been rated, is found by picking out N users from the similarity list who have rated the item I and calculating the rating based on these N ratings.
- Item-based: For an item I, with a set of similar items determined based on rating vectors consisting of received user ratings, the rating by a user U, who hasn't rated it, is found by picking out N items from the similarity list that have been rated by U and calculating the rating based on these N ratings.
Item-based collaborative filtering was developed by Amazon. In a system where there are more users than items, item-based filtering is faster and more stable than user-based. It is effective because, usually, the average rating received by an item doesn't change as quickly as the average rating given by a user to different items. It's also known to perform better than the user-based approach when the rating matrix is sparse.
However, the item-based approach performs poorly for datasets with browsing or entertainment-related items such as MovieLens, where the recommendations it gives out seem very obvious to the target users. Such datasets see better results with matrix factorization techniques, which you'll see in the next section, or with hybrid recommenders that also take into account the content of the data, like the genre, by using content-based filtering.
You can use the library Surprise to experiment quickly with different recommender algorithms. (You'll read more about it later in this article.)
Model Based
The second category covers the model-based approaches, which involve a step to reduce or compress the large but sparse user-item matrix. A basic understanding of dimensionality reduction can be very helpful for understanding this step.
Dimensionality Reduction
In the user-item matrix, there are two dimensions:
- The number of users
- The number of items
If the matrix is mostly empty, reducing dimensions can improve the performance of the algorithm in terms of both space and time. You can use various methods, such as matrix factorization or autoencoders, to do this.
Matrix factorization can be seen as breaking down a large matrix into a product of smaller ones. This is similar to the factorization of integers, where 12 can be written as 6 x 2 or 4 x 3. In the case of matrices, a matrix A with dimensions m x n can be reduced to a product of two matrices X and Y with dimensions m x p and p x n respectively.
Note: In matrix multiplication, a matrix X can be multiplied by Y only if the number of columns in X is equal to the number of rows in Y. Therefore, the two reduced matrices have the common dimension p.
Depending on the algorithm used for dimensionality reduction, the number of reduced matrices can be more than two as well.
The reduced matrices actually represent the users and items individually. The m rows in the first matrix represent the m users, and the p columns tell you about the features or characteristics of the users. The same goes for the item matrix with n items and p characteristics. Here's an example of how matrix factorization looks:
Matrix Factorization
In the image above, the matrix is reduced into two matrices. The one on the left is the user matrix with m users, and the one on top is the item matrix with n items. The rating of 4 is reduced or factorized into:
- A user vector (2, -1)
- An item vector (2.5, 1)
The two columns in the user matrix and the two rows in the item matrix are called latent factors and are an indication of hidden characteristics of the users or the items. A possible interpretation of the factorization could look like this:
- Assume that in a user vector (u, v), u represents how much a user likes the Horror genre, and v represents how much they like the Romance genre.
- The user vector (2, -1) thus represents a user who likes horror movies and rates them positively and dislikes movies that have romance and rates them negatively.
- Assume that in an item vector (i, j), i represents how much a movie belongs to the Horror genre, and j represents how much that movie belongs to the Romance genre.
- The movie (2.5, 1) has a Horror rating of 2.5 and a Romance rating of 1. Multiplying it by the user vector using matrix multiplication rules gives you (2 * 2.5) + (-1 * 1) = 4.
- So, the movie belonged to the Horror genre, and the user could have rated it 5, but the slight inclusion of Romance caused the final rating to drop to 4.
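You can check this arithmetic with a couple of lines of numpy (a sketch of the multiplication only, not of a factorization algorithm):

import numpy as np

user = np.array([2.0, -1.0])  # user vector: (likes Horror, dislikes Romance)
item = np.array([2.5, 1.0])   # item vector: (Horror score, Romance score)

# The predicted rating is the dot product of the two vectors
rating = np.dot(user, item)
print(rating)  # 4.0 = (2 * 2.5) + (-1 * 1)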
While the explanation above simplifies how factor matrices can reveal such insights about users and items, in practice they are usually much more complex. The number of such factors can be anything from one to hundreds or even thousands. This number is one of the things that need to be optimized during the training of the model.
In the example, you had two latent factors for movie genres, but in real scenarios, these latent factors need not be analysed much. These are patterns in the data that will play their part automatically, whether you decipher their underlying meaning or not.
The number of latent factors affects the recommendations in such a way that the greater the number of factors, the more personalized the recommendations become. But too many factors can lead to overfitting in the model. Remember that overfitting happens when the model trains to fit the training data so well that it doesn't perform well with new data, which can lead to inaccurate predictions.
Algorithms for Matrix Factorization
One of the popular algorithms to factorize a matrix is the singular value decomposition (SVD) algorithm. SVD came into the limelight when matrix factorization was seen performing well in the Netflix prize competition. Other algorithms include PCA and its variations, NMF, and so on. Autoencoders can also be used for dimensionality reduction in case you want to use neural networks.
You can find the implementations of these algorithms in various libraries for Python, so you don't need to worry about the details at this point; right now, focus on the big picture. But in case you want to read more, the chapter on dimensionality reduction in the book Mining of Massive Datasets is worth a read.
Using Python to Build Recommenders
There are quite a few libraries and toolkits in Python that provide implementations of various algorithms that you can use to build a recommender. But the one that you should try out while understanding recommendation systems is Surprise.
Surprise is a Python SciKit that comes with various recommender algorithms and similarity metrics, which makes it easy to build and analyse recommenders.
Here's how to install it using pip:
$ pip install numpy
$ pip install scikit-surprise
And here's how to install it using conda:
$ conda install -c conda-forge scikit-surprise
Important: If you want to replicate the results shown in the examples, you should also consider installing Pandas.
To use Surprise, you should first know some of the basic modules and classes available in it:
- The Dataset module is used to load data from files, Pandas dataframes, or even built-in datasets available for experimentation. (MovieLens 100k is one of the built-in datasets in Surprise.) To load a dataset, some of the available methods are:
  - Dataset.load_builtin()
  - Dataset.load_from_file()
  - Dataset.load_from_df()
- The Reader class is used to parse a file containing ratings. The default format in which it accepts data is that each rating is stored in a separate line in the order user item rating. This order and the separator can be configured using parameters:
  - line_format is a string that stores the order of the data, with field names separated by a space, as in "item user rating".
  - sep is used to specify the separator between fields, such as ','.
  - rating_scale is used to specify the rating scale. The default is (1, 5).
  - skip_lines is used to indicate the number of lines to skip at the beginning of the file. The default is 0.
Here's a program that you can use to load data from a Pandas dataframe or from the built-in MovieLens 100k dataset:
# load_data.py
import pandas as pd
from surprise import Dataset
from surprise import Reader
# This is the same data that was plotted for similarity earlier
# with one new user "E" who has rated only movie 1
ratings_dict = {
"item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
"user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
"rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))
# Loads Pandas dataframe
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)
# Loads the builtin Movielens-100k data
movielens = Dataset.load_builtin('ml-100k')
In the program above, the data is stored in a dictionary, which is loaded into a Pandas dataframe and then into a Dataset object from Surprise.
Algorithms Based on K-Nearest Neighbours (k-NN)
The choice of algorithm for the recommender function depends largely on the technique you want to use. For the memory-based approaches discussed above, the algorithm that would fit the bill is centred k-NN, because the algorithm is very close to the centred cosine similarity formula explained above. It is available in Surprise as KNNWithMeans.
To find the similarity, you simply have to configure the function by passing a dictionary as an argument to the recommender function. The dictionary should have the required keys, such as the following:
- name contains the similarity metric to use. Options are cosine, msd, pearson, or pearson_baseline. The default is msd.
- user_based is a boolean that tells whether the approach will be user-based or item-based. The default is True, which means the user-based approach will be used.
- min_support is the minimum number of common items needed between users to consider them for similarity. For the item-based approach, this corresponds to the minimum number of common users between two items.
The following program configures the KNNWithMeans function:
# recommender.py
from surprise import KNNWithMeans
# To use item-based cosine similarity
sim_options = {
"name": "cosine",
"user_based": False, # Compute similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)
The recommender function in the program above is configured to use the cosine similarity and to find similar items using the item-based approach.
To try out this recommender, you need to create a Trainset from data. Trainset is built using the same data but contains more information about the data, such as the number of users and items (n_users, n_items) that are used by the algorithm. You can create it either using the entire data or a part of the data. You can also divide the data into folds, where some of the data will be used for training and some for testing.
Note: Using only one pair of training and testing data is usually not enough. When you split the original dataset into training and testing data, you should create more than one pair to allow for multiple observations with variations in the training and testing data.
Algorithms should be cross-validated using multiple folds. By using different pairs, you'll see different results given by your recommender. MovieLens 100k provides five different splits of training and testing data (u1.base, u1.test, u2.base, u2.test, and so on up to u5.base, u5.test) for a 5-fold cross-validation.
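For instance, here's a sketch of a 5-fold cross-validation using Surprise's cross_validate helper, reusing the algo and movielens objects from the snippets above:

from surprise.model_selection import cross_validate

from load_data import movielens
from recommender import algo

# Run a 5-fold cross-validation on the MovieLens 100k data,
# printing RMSE and MAE for each fold
cross_validate(algo, movielens, measures=["RMSE", "MAE"], cv=5, verbose=True)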
Here's an example to find out how the user E would rate the movie 2:
>>> from load_data import data
>>> from recommender import algo
>>> trainingSet = data.build_full_trainset()
>>> algo.fit(trainingSet)
Computing the cosine similarity matrix...
Done computing similarity matrix.
<surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7f04fec56898>
>>> prediction = algo.predict('E', 2)
>>> prediction.est
4.15
The algorithm predicted that the user E would rate the movie 4.15, which could be high enough to be shown as a recommendation.
You should try out the different k-NN based algorithms along with the different similarity options and matrix factorization algorithms available in the Surprise library. Try them out on the MovieLens dataset to see if you can beat some benchmarks. The next section covers how to use Surprise to check which parameters perform best for your data.
Tuning the Algorithm Parameters
Surprise provides a GridSearchCV class analogous to GridSearchCV from scikit-learn.
With a dict of all parameters, GridSearchCV tries all the combinations of parameters and reports the best parameters for any accuracy measure.
For example, you can check which similarity metric works best for your data in memory-based approaches:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise.model_selection import GridSearchCV
data = Dataset.load_builtin("ml-100k")
sim_options = {
"name": ["msd", "cosine"],
"min_support": [3, 4, 5],
"user_based": [False, True],
}
param_grid = {"sim_options": sim_options}
gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
The output of the above program is as follows:
0.9434791128171457
{'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': False}}
So, for the MovieLens 100k dataset, the Centered-KNN algorithm works best if you go with an item-based approach and use msd as the similarity metric with minimum support 3.
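Continuing the program above, here's a sketch of how you could reuse the winning configuration through GridSearchCV's best_estimator attribute:

# best_estimator holds an algorithm instance configured with the
# parameter combination that scored best on the given measure
best_algo = gs.best_estimator["rmse"]

# Retrain it on the full dataset before making predictions
best_algo.fit(data.build_full_trainset())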
Similarly, for model-based approaches, we can use Surprise to check which values for the following factors work best:
- n_epochs is the number of iterations of SGD, which is basically an iterative method used in statistics to minimize a function.
- lr_all is the learning rate for all parameters, which decides how much the parameters are adjusted in each iteration.
- reg_all is the regularization term for all parameters, which is a penalty term added to prevent overfitting.
Note: Keep in mind that there won't be any similarity metrics in matrix factorization algorithms, as the latent factors take care of the similarity among users or items.
The following program will check the best values for the SVD algorithm, which is a matrix factorization algorithm:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV
data = Dataset.load_builtin("ml-100k")
param_grid = {
"n_epochs": [5, 10],
"lr_all": [0.002, 0.005],
"reg_all": [0.4, 0.6]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
The output of the above program is as follows:
0.9642278631521038
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
So, for the MovieLens 100k dataset, the SVD algorithm works best if you go with 10 epochs and use a learning rate of 0.005 and 0.4 regularization.
Other matrix-factorization-based algorithms available in Surprise are SVD++ and NMF.
Following these examples, you can dive deep into all the parameters that can be used in these algorithms, and you should definitely check out the mathematics behind them. Since you don't have to worry much about implementing the algorithms initially, recommenders can be a great way to get started with machine learning and a solid base to build an application on.
When Can Collaborative Filtering Be Used?
Collaborative filtering works around the interactions that users have with items. These interactions can help find patterns that the data about the items or users itself can't. Here are some points that can help you decide if collaborative filtering can be used:
- Collaborative filtering doesn't require features about the items or users to be known. It is suited for a set of different types of items, for example, a supermarket's inventory where items of various categories can be added. In a set of similar items such as that of a bookstore, though, known features like writers and genres can be useful and might benefit from content-based or hybrid approaches.
- Collaborative filtering can help recommenders to not overspecialize in a user's profile and recommend items that are completely different from what they have seen before. If you want your recommender to not suggest a pair of sneakers to someone who just bought another similar pair of sneakers, then try to add collaborative filtering to your recommender.
Note: In addition to Surprise, other Python libraries you can check out are LightFM, a hybrid recommendation algorithm in Python, and python-recsys, a Python library for implementing a recommender system.
Although collaborative filtering is very commonly used in recommenders, some of the challenges faced while using it are the following:
- Collaborative filtering can lead to some problems like cold start for new items that are added to the list. Until someone rates them, they don't get recommended.
- Data sparsity can affect the quality of user-based recommenders and also add to the cold start problem mentioned above.
- Scaling can be a challenge for growing datasets as the complexity can become too large. Item-based recommenders are faster than user-based when the dataset is large.
- With a straightforward implementation, you might observe that the recommendations tend to be already popular, and the items from the long tail section might get ignored.
To read more about collaborative filtering, check out Item-Based Collaborative Filtering Recommendation Algorithms (the first paper published on item-based recommenders) and Using collaborative filtering to weave an information tapestry (the first use of the term collaborative filtering).
With every type of recommender algorithm having its own strengths and weaknesses, it's often a hybrid recommender that comes to the rescue. The benefits of multiple algorithms working together or in a pipeline can help you set up more accurate recommenders. In fact, the solution of the winner of the Netflix prize was also a complex mix of multiple algorithms.