Instacart Market Basket Analysis Case Study
Table of Contents:
1. Business Problem
2. Dataset Overview
3. Mapping Real-World Problem As Machine Learning Problem
4. Performance Metric
5. Preprocessing
6. Exploratory Data Analysis
7. First Cut Approach
8. Feature Engineering
9. Machine Learning Modelling and Hyperparameter Tuning
10. Future Works
11. Conclusion
12. References
1. Business Problem:
Instacart is an online grocery shopping app. Back in 2017, Instacart released an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. In the associated Kaggle competition, this dataset is used to perform market basket analysis: based on a user's buying pattern, we need to find which products the user will order again.
The goal of the competition is to predict which products will be purchased in the next order, given the user's prior purchase history (i.e., the set of previous orders, the products in those orders, etc.).
2. Dataset Overview:
The dataset consists of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we have between 4 and 100 of their orders, with the sequence of products purchased in each order.
The dataset comes as six CSV files:
i. products.csv
ii. orders.csv
iii. order_products__train.csv
iv. order_products__prior.csv
v. aisles.csv
vi. departments.csv
Of these six files, products.csv, departments.csv, and aisles.csv contain information about the products and their departments and aisles. orders.csv contains the order details, such as whether an order_id belongs to the prior, train, or test set, the day of the week and hour of the day the order was placed, and a relative measure of time between orders.
The schema diagram below visualizes the relationships between the files. Multiple order_ids belong to one user_id, and users are split into train and test sets. For all users we have prior order data, and for train users we also know which products were reordered in their next order. So, using the prior orders as input and the reordered products of the train orders as output, we train and evaluate our model, and then predict which products will be reordered next time for the test users from their prior order data.
3. Mapping Real-World Problem As Machine Learning Problem:
In this problem, we predict which products will be reordered in the next order, given the user's purchase history (a set of orders, and the products purchased within each order). We could frame it as a recommendation problem, since we need to recommend the products most likely to be reordered. We could also call it a multi-label classification problem, since we need to predict zero or more reordered items for each user.
In practice, we reduce it to a binary classification problem: given a user and a product (i.e., a product from the history of products purchased by that user), our machine learning model needs to predict whether the product will be reordered in the next order.
4. Performance Metric:
In this task, we use the mean F1 score as the performance metric, because we need to avoid both false positives and false negatives, and therefore need both high precision and high recall. The metric is the average of the per-user F1 scores.
From the referenced papers, I learned that there are two approaches to optimizing the F1 score.
i. The first is Empirical Utility Maximization (EUM), the standard way to optimize the F1 score: train a classifier on the training set, then use the predicted probabilities to find the threshold that maximizes the F1 score. In this case, the optimal threshold came out to be 0.19.
ii. The second is the Decision-Theoretic Approach (DTA): first estimate a probability model, then compute the optimal predictions (in the sense of the highest expected F-measure) under that model. This method is not commonly applied to the F1 measure.
In the Kaggle discussion section, I found a kernel that implements this method. After implementing it, I got around a 2% boost in my F1 score, so it proved useful for this task.
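To make the EUM recipe concrete, here is a minimal sketch, assuming a fitted probabilistic classifier clf and validation arrays X_val, y_val (illustrative names); note that it optimizes the row-level F1, not the per-user mean F1:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(clf, X_val, y_val):
    """Scan candidate thresholds on the predicted probabilities and
    return the one that maximizes the F1 score (the EUM recipe)."""
    probs = clf.predict_proba(X_val)[:, 1]
    thresholds = np.arange(0.05, 0.51, 0.01)
    scores = [f1_score(y_val, (probs >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]
```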
5. Preprocessing:
The given dataset seems clean, but orders.csv has some missing values.
From the above diagram, we can see that the days_since_prior_order column is missing for the first order of each user. I replaced these missing values with the median of days_since_prior_order over that user's other orders, because this preserves the information about the gap between that user's orders. I used the median instead of the mean because it is less affected by outliers.
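A minimal sketch of this imputation with pandas (column names as in the Kaggle orders.csv):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")

# The first order of each user has NaN in days_since_prior_order;
# fill it with the median gap over that user's other orders.
user_median_gap = orders.groupby("user_id")["days_since_prior_order"].transform("median")
orders["days_since_prior_order"] = orders["days_since_prior_order"].fillna(user_median_gap)
```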
6. Exploratory Data Analysis:
The very first step in solving any case study in data science is to properly analyze the data. Here EDA plays an important role to explore the dataset. Proper EDA gives interesting insights into the data which in turn influences our feature engineering and model selection criterion as well.
Step 1: We load all six CSV files into pandas data frames.
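A minimal loading sketch, assuming the six CSVs are unzipped in the working directory:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
products = pd.read_csv("products.csv")
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")
order_products_prior = pd.read_csv("order_products__prior.csv")
order_products_train = pd.read_csv("order_products__train.csv")
```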
Step 2: Basic analysis of the data:
Then, I plotted PDFs and CDFs to understand the number of orders per user and the order size.
From the above plots, we can see that in the prior dataset the most common number of orders per user is 5, around 80% of users place fewer than 20 orders, and about 10% are loyal users (more than 40 orders).
From the above plots, we can also see that most orders contain 5–6 products, and 80% of orders contain fewer than 15 products.
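For reference, the two quantities behind these PDF/CDF plots can be computed from the frames loaded in Step 1 (a sketch, not the exact plotting code):

```python
# Prior orders per user and products per order.
orders_per_user = (orders[orders["eval_set"] == "prior"]
                   .groupby("user_id")["order_id"].count())
order_size = order_products_prior.groupby("order_id")["product_id"].count()

print(orders_per_user.describe())
print(order_size.describe())
```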
Step 3: Product-related analysis:
Since we need to predict reordered products, we should analyze the products to get some insights.
Here, I have drawn two bar plots showing which top 10 products get the most orders and reorders. The plots reveal an interesting pattern: the most ordered products also have a higher chance of being reordered.
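A sketch of the counts behind these two bar plots (joining the prior table with product names):

```python
# Top 10 most ordered and most reordered products in the prior set.
prior_named = order_products_prior.merge(products, on="product_id")
top_ordered = prior_named["product_name"].value_counts().head(10)
top_reordered = (prior_named.loc[prior_named["reordered"] == 1, "product_name"]
                 .value_counts().head(10))
```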
Now, let's check which departments and aisles get the highest numbers of orders and reorders.
From the above plots, we can observe that products are reordered more from the departments and aisles that receive a high number of orders.
Step 4: User-related analysis:
For each order, the dataset provides some interesting fields, such as the day of the week and hour of the day the order was placed and a relative measure of time between orders. Let's analyze the users using these fields.
First, I drew a bar plot of the hours of the day at which users purchase the most items. From the plot, most users place orders between 8 AM and 7 PM, and the reorder rate appears roughly proportional to the order rate. Now, let's see the top 3 products, departments, and aisles by reorder rate between 8 AM and 7 PM.
Next, we need to analyze on which days of the week users are most active. The following bar plot depicts buying patterns across the days of the week.
From the above plot, the maximum number of orders and reorders are placed on weekends compared to weekdays. Next, I found the top 3 products, departments, and aisles by reorder rate on weekends.
There is another feature, days_since_prior_order, which tells us the gap between a user's orders. Using this feature, we can analyze how frequently a user places an order.
In the above plot, we can observe that most users place their next order after about a week or a month. Now, let's find the top 3 products, departments, and aisles by reorder rate on weekends and on the 30th day.
From these observations, the top three products, departments, and aisles with the highest reorders are as follows:
- Products: Banana, Bag of Organic Bananas, Organic Strawberries
- Departments: produce, dairy eggs, beverages
- Aisles: fresh fruits, fresh vegetables, packaged vegetables fruits
So we can say that people most often reorder fruits, eggs, vegetables, etc.
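For reference, the counts behind the user-behaviour plots in this step can be pulled from orders like this (a sketch; column names as in the Kaggle files):

```python
# Order counts by hour of day, day of week, and days since the
# previous order.
by_hour = orders["order_hour_of_day"].value_counts().sort_index()
by_dow = orders["order_dow"].value_counts().sort_index()
by_gap = orders["days_since_prior_order"].value_counts().sort_index()
```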
Step 5: EDA after feature engineering:
Feature engineering itself is discussed in the sections below; here I only show the EDA for those features. After feature engineering, I performed a univariate analysis of each feature against the output variable to check whether the newly created features preserve any information.
I got some interesting results: for several features, the class-conditional distributions do not overlap perfectly, which means those features preserve some information even on their own.
I also took a subset of the created features and ran a bivariate analysis on them. I did not find much additional insight, but the graphs show that the two classes are partially separated, which again means these features preserve some information.
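As one example of the univariate check, a seaborn sketch, assuming a merged feature table train_df with the binary target reordered (both names are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# If the two class-conditional densities of a feature do not overlap
# perfectly, the feature carries some signal on its own.
sns.kdeplot(data=train_df, x="order_ratio_userA_itemB",
            hue="reordered", common_norm=False)
plt.show()
```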
7. First Cut Approach:
We can approach this task with a simple solution: for each user, recommend the products they have reordered most often in the past, up to the user's average basket size.
This simple approach gives a mean F1 score of 0.11. We can now compare the performance of the ML models against this baseline.
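A minimal sketch of this baseline, assuming the frames loaded earlier:

```python
# Recommend each user's most frequently reordered products, up to the
# user's average basket size.
prior = order_products_prior.merge(
    orders[orders["eval_set"] == "prior"], on="order_id")

basket_size = prior.groupby(["user_id", "order_id"])["product_id"].count()
avg_basket = basket_size.groupby("user_id").mean().round().astype(int)

reorder_counts = (prior[prior["reordered"] == 1]
                  .groupby(["user_id", "product_id"]).size()
                  .sort_values(ascending=False))

def recommend(user_id):
    """Top-k most frequently reordered products of one user (users with
    no reorders would need a 'None' fallback)."""
    k = avg_basket.loc[user_id]
    return reorder_counts.loc[user_id].head(k).index.tolist()
```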
8. Feature Engineering:
From my research, I found that there is no single feature that strongly relates to the output, so we need to create many features and check which work best. I used simple ideas (counts/ratios, aggregations, and recent activity) to featurize the data. In this task, given a user and a product, our model needs to predict whether the product will get reordered or not. So I created features that preserve information about the user, the product, and how the given user relates to the given product.
I. User-related features:
To preserve user-related information, I created the following features (a minimal sketch of a few of them follows the list).
- no_of_order_by_user: the number of orders placed by the user.
- avg_order_size_of_user: the average basket size of the user's orders.
- no_of_item_ordered_user: the total number of items ordered by the user.
- no_of_item_reordered_user: the total number of items reordered by the user.
- overall_reorder_rate_user: total number of items reordered by the user / total number of items ordered by the user.
- avg_day_between_orders_of_user: how often the user places an order.
- median_hour_of_day_user_visit: the typical hour of the day at which the user places orders.
- median_order_dow_user: similarly, the typical day of the week on which the user places orders.
- avg_reordered_rate_per_order_user: computed by first finding the reorder rate of each of the user's orders, then averaging over all prior orders.
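A minimal sketch of a few of these, using pandas named aggregation on the prior table built in the first-cut section:

```python
# Selected user features; `prior` is order_products__prior merged
# with orders, so it carries user_id and days_since_prior_order.
user_feats = prior.groupby("user_id").agg(
    no_of_order_by_user=("order_id", "nunique"),
    no_of_item_ordered_user=("product_id", "count"),
    no_of_item_reordered_user=("reordered", "sum"),
    avg_day_between_orders_of_user=("days_since_prior_order", "mean"),
)
user_feats["overall_reorder_rate_user"] = (
    user_feats["no_of_item_reordered_user"]
    / user_feats["no_of_item_ordered_user"])
```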
II. Product-related features:
To preserve product-related information, I created the following features, which might be helpful (a sketch of a few follows the list):
- no_of_times_ordered_itemA: the number of times the given item has been ordered.
- no_of_times_reordered_itemA: the number of times the given item has been reordered.
- no_of_user_ordered_itemA: the number of users who have ordered the item; it captures the item's popularity.
- avg_cart_pos_itemA: the item's average add-to-cart position across orders. A low value suggests the product is important, because users generally add important products to the cart first.
- avg_order_dow_itemA: the average day of the week on which the product is ordered.
- avg_order_hour_of_day_itemA: the average hour of the day at which the product is ordered.
- avg_days_since_prior_order_itemA: after how many days the given product typically gets ordered.
We are also given each product's aisle and department, which can influence whether a product gets ordered. So I created the features below to preserve information about the product's aisle and department.
- no_of_user_order_aisle_itemA / no_of_user_order_department_itemA: the total number of users who order from the aisle/department of the given item; these capture the popularity of the aisle/department.
- mean_add_to_cart_order_of_aisle_itemA / mean_add_to_cart_order_of_department_itemA: the average add-to-cart position of products in the aisle/department of the given item; in other words, the importance of the aisle/department.
- nb_order_aisle_itemA / nb_order_department_itemA: the total number of products ordered from the aisle/department of the given item.
- nb_reorder_aisle_itemA / nb_reorder_department_itemA: the number of products reordered from the aisle/department of the given item.
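A sketch of a few of the product features, from the same prior table:

```python
# Selected product features.
prod_feats = prior.groupby("product_id").agg(
    no_of_times_ordered_itemA=("order_id", "count"),
    no_of_times_reordered_itemA=("reordered", "sum"),
    no_of_user_ordered_itemA=("user_id", "nunique"),
    avg_cart_pos_itemA=("add_to_cart_order", "mean"),
)
```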
III. User-item related features:
From my research, I found some features that may capture the user-item relationship:
- nb_times_userA_order_itemB: how many times user A has ordered item B; a measure of user A's need for item B.
- avg_pos_cart_userA_itemB: the average cart position of item B in user A's orders; a measure of item B's importance to user A.
- nb_order_userA_itemB_last5: how many of user A's last five orders contain item B.
- median_day_since_prior_order_userA_itemB: after how many days user A typically orders item B.
- streak_userA_itemB: the maximum streak of consecutive orders in which user A ordered item B.
- mean_order_diff_userA_itemB: after how many orders user A typically reorders item B.
- nb_order_userA_not_ordered_itemB: how many orders user A has placed since the last purchase of item B.
- difference_mean_order_diff_nb_order_not_ordered_userA_itemB: the difference between the previous two features; it indicates how close user A is to ordering item B again.
- order_ratio_userA_itemB: the number of user A's orders containing item B divided by user A's total number of orders.
The following features preserve information about the department and aisle:
- no_order_department_userA_itemB / no_order_aisle_userA_itemB: the number of items user A has ordered from the department/aisle of item B.
- avg_add_to_cart_order_department_userA_itemB / avg_add_to_cart_order_aisle_userA_itemB: user A's average add-to-cart position for products in the department/aisle of item B.
First, I computed the user-related, product-related, and user-product features in three separate data frames, then merged all three into one data frame (a minimal sketch follows).
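A minimal sketch of two user x item features and the final merge (user_feats and prod_feats as sketched above):

```python
# User x item counts, then one table with all three feature blocks.
ui_feats = prior.groupby(["user_id", "product_id"]).agg(
    nb_times_userA_order_itemB=("order_id", "count"),
    avg_pos_cart_userA_itemB=("add_to_cart_order", "mean"),
).reset_index()

data = (ui_feats
        .merge(user_feats.reset_index(), on="user_id")
        .merge(prod_feats.reset_index(), on="product_id"))
```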
You can refer to the code of my feature engineering here.
9. Machine Learning Modelling and Hyperparameter Tuning:
There are many machine learning algorithms, and you never know which model will work well without applying it to the problem. So I tried a couple of algorithms and checked which performs best. I also tuned the hyperparameters of each model, because choosing the right hyperparameter set is very important.
I experimented with four machine learning models:
1. Logistic Regression
2. XGBoost
3. Random Forest
4. CatBoost
To choose the right hyperparameters, I first split the whole training data into three parts: train, validation, and test. For each hyperparameter setting, I trained the model on the train set and evaluated it on the validation set. Then, for the best hyperparameters, I used the test set to get the test score. To predict on the Kaggle test data, I took the best hyperparameter set and retrained my final model on the whole training set (a sketch follows).
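A minimal sketch of this protocol with one of the models (XGBoost here; X, y, and the small grid are illustrative):

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# 3-way split: fit on train, pick hyperparameters on validation,
# report the final unbiased score on test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

best_params, best_score = None, -1.0
for max_depth in [4, 6, 8]:
    for n_estimators in [100, 300]:
        model = XGBClassifier(max_depth=max_depth, n_estimators=n_estimators)
        model.fit(X_train, y_train)
        # Row-level F1 at the EUM threshold of 0.19, standing in for
        # the per-user mean F1 used on the leaderboard.
        preds = (model.predict_proba(X_val)[:, 1] >= 0.19).astype(int)
        score = f1_score(y_val, preds)
        if score > best_score:
            best_params, best_score = (max_depth, n_estimators), score
```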
Logistic Regression:-
The mean F1 score of Logistic Regression is 0.22305.
XGBoost:-
The mean F1 score of XGBoost is 0.40153.
Random Forest:-
The mean F1 score of Random Forest is 0.38902.
CatBoost:-
The mean F1 score of CatBoost is 0.40267.
Addition of PCA features:-
Then, to improve performance further, I added PCA features built from users' purchase patterns across the different aisles, and I tuned the number of PCA components as a hyperparameter (see the sketch below).
The mean F1 score with CatBoost on the old features plus the newly added PCA features is 0.40348.
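A sketch of how such PCA features can be built, assuming prior and products as in the earlier sketches (6 components, matching the final model):

```python
import pandas as pd
from sklearn.decomposition import PCA

# User x aisle purchase-count matrix, reduced to a few components that
# summarize each user's aisle preferences.
user_aisle = (prior.merge(products[["product_id", "aisle_id"]], on="product_id")
              .groupby(["user_id", "aisle_id"]).size()
              .unstack(fill_value=0))

pca = PCA(n_components=6)
user_pca = pd.DataFrame(pca.fit_transform(user_aisle),
                        index=user_aisle.index,
                        columns=[f"pca_{i}" for i in range(6)])
```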
Summary:
First, I trained a linear model and observed that it performed better than the first-cut approach.
Then I tried tree-based ensemble models. First, I trained the XGBoost model, a boosting-based ensemble method, and got a significant improvement over the linear model. Second, I experimented with a Random Forest model, a bagging-based ensemble, which did not perform as well as XGBoost. Next, I trained the CatBoost model, another boosting-based ensemble, and it performed the best of all the models.
To improve performance further, I added the PCA features. I also tried removing some of the less important features, but performance dropped, which means we were losing information, so I dropped that step. Finally, using the CatBoost model on the previous feature set (40 features) plus the new PCA features (6), I get the best performance of mean F1 ≈ 0.4035.
Final Kaggle Score:-
With this final score, I reached the top 7% of the Kaggle leaderboard.
10. Future Works:
- I will try other ML/DL models and tune their hyperparameters to improve performance further.
- I will also focus on the feature engineering part, engineering more features to improve performance further.
- I will also try to build one more model to predict the user's basket size (that may help improve performance further).
11. Conclusion:
This was my first self-case study in machine learning and also my first Kaggle competition submission. It is my first attempt at blogging as well, so I expect readers to be a bit generous and ignore the minor mistakes I might have made. I hope I have given you a good understanding of the problem at hand and of how I approached it. Feel free to leave comments on anything that could improve this blog; I will definitely try to make the required changes.
I have also deployed my solution here, and for the detailed code, you can refer to my GitHub.
Thanks for reading!
12. References:
- https://www.kdd.org/kdd2016/papers/files/adf0160-liuA.pdf
- https://arxiv.org/ftp/arxiv/papers/1206/1206.4625.pdf
- https://www.kaggle.com/c/instacart-market-basket-analysis/overview
- https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
- https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/37221
- https://www.appliedaicourse.com/
You can also find and connect with me on LinkedIn.