Recommender Systems with Apache Spark
Introduction
In this blog post, we’ll be discussing Recommender Systems with Apache Spark, an open-source, distributed computing system for big data processing. Recommender systems have become increasingly popular with the growing amount of data online. They are algorithms that analyze large data sets and make recommendations based on the user’s preferences or behavior. In this post, we’ll go over some of the key concepts behind recommender systems and how you can build one with Apache Spark.
- Collaborative Filtering
Collaborative filtering is one of the most popular techniques used in recommender systems. It involves finding similarities between users and recommending items based on those similarities. There are two main types of collaborative filtering: user-based and item-based.
i. User-based Filtering
User-based filtering is where the system looks for similarities between users and recommends items based on the preferences of other users who are similar to the current user. Let’s look at an example of how user-based filtering works using Apache Spark.
from pyspark.mllib.recommendation import ALS, Rating
# sc is an existing SparkContext (created automatically in the pyspark shell)
data = sc.textFile("data/user-item-ratings.csv")
ratings = data.map(lambda l: l.split(','))\
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10          # number of latent factors
numIterations = 5  # number of ALS iterations
model = ALS.train(ratings, rank, numIterations)
# recommend the top n products for a given user (here: top 10 for user 1)
recommendations = model.recommendProducts(1, 10)
Here we’ve used the ALS algorithm in Spark for collaborative filtering. The data, a user-item-rating table, is read from a CSV file. We trained an ALS model on the ratings data using 10 latent factors and 5 iterations. The model can then be used to recommend products for a user by calling the recommendProducts method with the user ID and the number of products to recommend.
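As a quick sanity check, the same model can also predict a single rating with predict. The sketch below assumes the model and recommendations objects from above; the user ID 1 and product ID 10 are placeholder values that may not exist in your data.
# predict the rating user 1 would give to product 10 (placeholder IDs)
print(model.predict(1, 10))
# each entry returned by recommendProducts is a Rating(user, product, rating) object
for rec in recommendations:
    print(rec.product, rec.rating)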
ii. Item-based Filtering
Item-based filtering is where the system recommends items that are similar to the ones a user has already rated highly. In practice it often scales better than user-based filtering, because item-item similarities tend to be more stable and there are usually far fewer items than users. Let’s look at how to implement item-based filtering with Spark.
from pyspark.mllib.recommendation import ALS, Rating
import numpy as np
data = sc.textFile("data/user-item-ratings.csv")
ratings = data.map(lambda l: l.split(','))\
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 5
model = ALS.train(ratings, rank, numIterations)
# ALS has no built-in similar-items call, so we compute cosine similarity between the learned item-factor vectors (item ID 1 is a placeholder)
item_factors = dict(model.productFeatures().collect())
target = np.array(item_factors[1])
def cosine(vec):
    vec = np.array(vec)
    return float(target.dot(vec) / (np.linalg.norm(target) * np.linalg.norm(vec)))
# top 10 items most similar to item 1
recommendations = sorted(((i, cosine(f)) for i, f in item_factors.items() if i != 1),
                         key=lambda x: x[1], reverse=True)[:10]
Here, we used the same data set as in user-based filtering, but instead of recommending products to a user, we took the item-factor vectors learned by ALS (via productFeatures) and computed the cosine similarity between a chosen item and every other item. The recommendations are the top n items whose factor vectors are closest to the one with the given ID.
- Matrix Factorization
Matrix factorization is another popular method used in recommender systems. The idea is to decompose the user-item rating matrix into low-dimensional latent-factor representations of users and items, which can then be used to predict missing ratings and to find similarities between items or users.
In Spark, matrix factorization can be implemented using the ALS algorithm we saw earlier, which performs low-rank matrix factorization. In collaborative filtering, ALS factorizes the user-item rating matrix into two matrices, one for users and one for items, such that the product of those matrices approximates the original rating matrix.
from pyspark.mllib.recommendation import ALS, Rating
data = sc.textFile("data/user-item-ratings.csv")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
rank = 10
numIterations = 5
model = ALS.train(ratings, rank, numIterations)
# get the user-factor matrix and item-factor matrix
users = model.userFeatures()      # RDD of (userId, factor vector) pairs
items = model.productFeatures()   # RDD of (productId, factor vector) pairs
Here we’ve trained an ALS model just like before, but we’ve now extracted the user-factors and item-factors matrices from the model. These matrices can be used to compute similarities between users or items.
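To make the "product of the factor matrices approximates the rating matrix" idea concrete, here is a minimal sketch that rebuilds one predicted rating as the dot product of a user's factor vector and an item's factor vector. It assumes the users and items RDDs from above; user ID 1 and product ID 10 are placeholders, and collecting the factors to the driver like this only makes sense for small data sets.
import numpy as np
# look up the factor vectors for one user and one item (placeholder IDs)
user_vec = np.array(dict(users.collect())[1])
item_vec = np.array(dict(items.collect())[10])
# the predicted rating is simply the dot product of the two factor vectors
approx_rating = float(user_vec.dot(item_vec))
print(approx_rating)  # should be close to model.predict(1, 10)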
- Evaluation Metrics
Evaluation metrics are used to assess the performance of a recommender system. Two popular evaluation metrics are RMSE (Root Mean Squared Error) and MAP (Mean Average Precision).
RMSE measures how far the predicted ratings deviate from the actual ratings, averaged over all user-item pairs in the test set. The lower the RMSE, the better the system’s performance.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
# spark is an existing SparkSession; this DataFrame-based ALS lives in pyspark.ml
data = spark.read.csv("ratings.csv", header=True, inferSchema=True)
(training, test) = data.randomSplit([0.8, 0.2])
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
In this example, we’ve used the RegressionEvaluator class in Spark to compute the RMSE of an ALS model. We’ve split the data into training and test sets, and trained an ALS model on the training set. We then evaluated the model’s predictions on the test set using the RMSE metric.
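To see what the evaluator is doing, we can also compute the same number by hand from the predictions DataFrame: square the per-row errors, average them, and take the square root. This is just a cross-check of the RMSE definition, assuming the predictions DataFrame from above.
from pyspark.sql import functions as F
# mean squared error over all user-item pairs in the test set
mse = predictions.select(F.avg((F.col("rating") - F.col("prediction")) ** 2).alias("mse")).first()["mse"]
manual_rmse = mse ** 0.5  # should match the evaluator's output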
MAP measures how well the ranked list of recommendations matches the items a user actually interacted with, averaged over all users. The higher the MAP, the better the system’s performance.
from pyspark.ml.recommendation import ALS
from pyspark.mllib.evaluation import RankingMetrics
from pyspark.sql import functions as F
data = spark.read.csv("data/user-item-ratings.csv", header=True, inferSchema=True)
(training, test) = data.randomSplit([0.8, 0.2])
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)
# top-10 recommendations per user, reduced to a plain list of movie IDs
recs = model.recommendForAllUsers(10)\
    .select("userId", F.col("recommendations.movieId").alias("pred"))
# ground truth: the movies each user actually rated in the test set
truth = test.groupBy("userId").agg(F.collect_list("movieId").alias("actual"))
# RankingMetrics expects one (predicted items, relevant items) pair per user
pairs = recs.join(truth, "userId").rdd.map(lambda row: (row.pred, row.actual))
metrics = RankingMetrics(pairs)
MAP = metrics.meanAveragePrecision
In this example, we’ve used the RankingMetrics class in Spark to compute the MAP of an ALS model. We’ve split the data into training and test sets, trained an ALS model on the training set, and generated the top-10 recommendations for every user. Comparing those lists against the items each user actually rated in the test set, the meanAveragePrecision field of RankingMetrics gives us the MAP.
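The same object also exposes other ranking measures, so, assuming the metrics object from the example above, it costs nothing extra to look at precision and NDCG at a cutoff alongside MAP.
# additional ranking metrics from the same RankingMetrics object
print(metrics.precisionAt(10))  # precision of the top-10 lists
print(metrics.ndcgAt(10))       # normalized discounted cumulative gain at 10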
Conclusion
In this blog post, we covered some of the key concepts of recommender systems, how to implement them using Apache Spark, and how to evaluate their performance. Collaborative filtering and matrix factorization are two popular techniques used in recommender systems. We also looked at two evaluation metrics, RMSE and MAP, which are used to measure the performance of recommender systems. If you’re interested in learning more about Apache Spark and recommender systems, check out the links below.
Link to additional resources:
- Apache Spark documentation: https://spark.apache.org/docs/latest/index.html
- ALS: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
- Evaluation metrics: https://spark.apache.org/docs/latest/ml-tuning.html#evaluation