An Approach for Recommending Similar Fashion Products.
Table of Contents:
1. Business Problem
2. Mapping Real-World Problem As Machine Learning Problem
3. Dataset Overview
4. Exploratory Data Analysis and Preprocessing
5. Module 1: Pose Detection
6. Module 2: Fashion Article Detection
7. Module 3: Similar Image Recommendation
8. Future Works
1. Business Problem:
Nowadays, e-fashion influencers regularly post images, and users often want to mimic their looks. Generally, users are interested in buying the entire look of the model. So in this problem, given a full-pose image of a model, the goal of our method is to recommend similar fashion products for the entire set of fashion articles worn by the model in the full-shot image.
However, fashion product recommendation is challenging because fashion articles vary enormously in color, texture, shape, viewpoint, illumination, and style. This problem is important not only for promoting cross-sells to boost revenue, but also for improving customer experience and engagement.
2. Mapping Real-World Problem As Machine Learning Problem:
This task is challenging because, in contrast to recommending products for a single primary article (the query), we need to recommend similar products for every fashion article in the entire set worn by the model. So the main problem can be divided into the following stages:
Stage 1: Full Pose image Detection
This will ensure that the fashion image uploaded by the user is a full-shot image, so that we can recommend similar products for all the fashion articles present in it.
Stage 2: Fashion object detection
Here we will detect fashion objects present in the full shot image.
Stage 3: Recommending similar products
In this stage, we will be retrieving and recommending similar products from our database using the embedding space.
3. Dataset Overview:
For this task, I have used the publicly available Street2Shop dataset. The dataset contains 404,683 shop photos collected from 25 different online retailers and 20,357 street photos, providing a total of 39,479 clothing item matches between street and shop photos. Each street photo comes with the bounding box location of the clothing item as well as a matching shop image. We prefer this dataset because it provides both bounding boxes and street-shop matches, so this single dataset can be used in all stages.
4. Exploratory Data Analysis and Preprocessing
The Street2Shop dataset consists of 20,357 street images and 404,683 corresponding shop images. It covers 11 clothing categories: bags, belts, dresses, eyewear, footwear, hats, leggings, outerwear, pants, skirts, and tops.
The dataset provides a text file with image ids and image URLs, and JSON files with fashion product details such as product id, product type, and bounding box location.
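As a sketch, the metadata can be loaded into a per-photo lookup of products and boxes. The key names ("photo", "product", "bbox" with left/top/width/height) are assumptions based on the dataset description above, not a verified schema:

```python
import json

def load_annotations(path):
    """Group (product_id, box) pairs by photo id from a metadata JSON file.

    Assumed record shape (hypothetical):
    {"photo": 1, "product": 7, "bbox": {"left": .., "top": .., "width": .., "height": ..}}
    """
    with open(path) as f:
        records = json.load(f)
    boxes = {}
    for r in records:
        bbox = r["bbox"]
        # Convert left/top/width/height to (x1, y1, x2, y2) corners.
        box = (bbox["left"], bbox["top"],
               bbox["left"] + bbox["width"], bbox["top"] + bbox["height"])
        boxes.setdefault(r["photo"], []).append((r["product"], box))
    return boxes
```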
a) EDA For Full Pose Image Detection
As the dataset provides the image links, we first download all the street images using multiprocessing.
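The download step can be sketched with a worker pool, one process per URL batch. The URL file name and its "id url" line format are assumptions, since the exact file layout is not described here:

```python
import os
from multiprocessing import Pool
from urllib.request import urlretrieve

OUT_DIR = "street_images"  # assumed output directory

def make_path(image_id, out_dir=OUT_DIR):
    # Map an image id to its local file path.
    return os.path.join(out_dir, f"{image_id}.jpg")

def download_one(pair):
    image_id, url = pair
    path = make_path(image_id)
    if not os.path.exists(path):
        try:
            urlretrieve(url, path)
        except Exception:
            return None  # skip dead or slow links
    return path

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    # Assumed format: one "<image_id> <url>" pair per line.
    with open("photos.txt") as f:
        pairs = [line.split(maxsplit=1) for line in f if line.strip()]
    with Pool(16) as pool:
        pool.map(download_one, pairs)
```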
Below, we have plotted the distribution of object categories in the street and shop images. From the plot, we can observe that images containing dresses and footwear are more common than the other categories.
Then we analyzed the distribution of the number of products per street image, and from the plot below we can say that approximately 50% of street images contain a single fashion object.
b) EDA for Fashion object detection
First, we implemented full-pose image detection using the pre-trained PoseNet model and filtered the full-pose images from the street images. This leaves 9,133 street images for fashion object detection.
The sample image above shows that the same object sometimes has multiple bounding boxes. This is because the dataset assigns a bounding box for each matching shop image, and the same garment is often available from different shops, so a single object can carry multiple product_ids. I therefore removed redundant bounding boxes for the same object using the IoU score.
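The de-duplication step can be sketched as follows. Boxes are (x1, y1, x2, y2) tuples, and the 0.5 overlap threshold is an assumption:

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def dedupe_boxes(boxes, threshold=0.5):
    # Keep a box only if it does not heavily overlap an already-kept box.
    kept = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept
```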
Then, I analyzed the distribution of categories and of the number of products in the full-pose images.
From the above plots, we can see that the category distribution is the same as before, and that around 5,000 images contain a single object. However, full-pose images generally contain more than one fashion item, so we analyzed the full-pose images having a single object.
From the above plot, we can observe that around 3,000 full-pose images contain a single dress object. Below, we have shown sample full-pose images containing a single object.
Here, we can observe that the single-object images are not well annotated. So we removed the images containing a single object, and we are left with around 3,000 preprocessed images for the object detection task.
c) EDA for Recommending similar products
As suggested in the paper, we decided to train a triplet network for recommending similar products. So, using the image links, we retrieved around 40K triplet images.
We have shown a sample triplet image below.
5. Module 1: Pose Detection
Our problem involves recommending similar products for the whole set of fashion articles present, rather than for a single article. So we first need to ensure that the image is a full-pose image before processing it for fashion article detection, which requires estimating the positions of key body parts. In other words, given an image of a person, pose estimation maps the positions of the elbows, shoulders, knees, ankles, etc. This could also be used to predict whether the person is standing, walking, or dancing. For this task, we used a pre-trained PoseNet model to verify that a given image is a full-pose image.
What is Posenet?
PoseNet is a machine learning model that estimates human pose in real time. It can estimate either a single pose or multiple poses; I only used the single-pose estimation algorithm because, in our problem, we mostly deal with single-person images. When an input image is fed to it, we get a keypoint heatmap and keypoint offsets.
The PoseNet model returns two tensors: keypoint heatmap scores (depth 17) and keypoint offsets (depth 17×2), where the depth equals the number of keypoints. We recover the positions from these two tensors. First, for each channel, we take the x, y index of the highest-scoring cell in the heatmap. Then the x and y indices are multiplied by the output stride and added to the corresponding offset vector, which is on the same scale as the original image.
keypointPositions = heatmapPositions * outputStride + offsetVectors
Using these key points, we can predict if the image is a full pose image or not by using a simple if-else conditional clause. Below, we have provided some of the sample full pose images that we got from the Posenet model.
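The decoding formula and the full-pose check above can be sketched in numpy. The offset layout (y offsets in channels 0-16, x offsets in 17-33) follows the usual PoseNet convention, and the keypoint indices and 0.5 score threshold in the full-pose check are assumptions:

```python
import numpy as np

def decode_keypoints(heatmaps, offsets, output_stride=32):
    """heatmaps: (H, W, 17) scores; offsets: (H, W, 34) pixel offsets."""
    num_kp = heatmaps.shape[-1]
    positions, scores = [], []
    for k in range(num_kp):
        # Highest-scoring cell for this keypoint channel.
        y, x = np.unravel_index(np.argmax(heatmaps[..., k]), heatmaps.shape[:2])
        # Scale back to image coordinates and add the offset vector.
        positions.append((y * output_stride + offsets[y, x, k],
                          x * output_stride + offsets[y, x, k + num_kp]))
        scores.append(heatmaps[y, x, k])
    return positions, scores

def is_full_pose(scores, min_score=0.5):
    # Simple if-else style check: nose (0) and both ankles (15, 16)
    # must all be confidently detected for a full-shot image.
    return all(scores[i] > min_score for i in (0, 15, 16))
```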
Sample Output of Posenet Model
6. Module 2: Fashion Article Detection
Next, we need to detect different fashion articles present in the image. There are many object detection algorithms that we can use such as RCNN, Faster RCNN, YOLO. We have used YOLO-based models for fashion article detection.
YOLO achieves high accuracy while also being able to run in real time. The algorithm "looks only once" at the image, in the sense that it requires only one forward propagation pass through the neural network to make predictions.
Read more about YOLO here.
To train the custom YOLOv4 object detection model, we first load all images into Colab. Then, we clone the AlexeyAB darknet repository on the drive and create the train and test config files by making the following changes to the custom config file, which can be found in the darknet/cfg directory as yolov4-custom.cfg.
- change batch to batch=32 for training and batch=1 for testing
- change subdivisions to subdivisions=16 for training and subdivisions=1 for testing
- set the network size width=608 height=608, or any other multiple of 32
- change max_batches to classes*2000 (but not less than the number of training images and not less than 6000), e.g. max_batches=6000 if you train for 3 classes
- change steps to 80% and 90% of max_batches, e.g. steps=4800,5400
- change [filters=255] to filters=(classes + 5)*3 in the 3 [convolutional] layers before each [yolo] layer; keep in mind that it only has to be the last [convolutional] before each [yolo] layer
- change classes=80 to your number of objects in each of the 3 [yolo] layers. So if classes=1, then filters=18; if classes=2, then filters=21.
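The rules above can be captured in a small helper so the cfg values stay consistent; this is a sketch following the AlexeyAB guidelines, not code from the repository:

```python
def yolo_cfg_values(num_classes, num_train_images=0):
    # max_batches = classes*2000, but never below the training-set size or 6000.
    max_batches = max(num_classes * 2000, num_train_images, 6000)
    # steps = 80% and 90% of max_batches.
    steps = (int(max_batches * 0.8), int(max_batches * 0.9))
    # filters in the last [convolutional] before each [yolo] layer.
    filters = (num_classes + 5) * 3
    return {"max_batches": max_batches, "steps": steps, "filters": filters}
```

For our 11 fashion categories, for example, this gives max_batches=22000 and filters=48.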
Then, we need a .txt file for each image, storing the class id and bounding box of every object present in that image. We also need to create train.txt and test.txt, which store the paths of the corresponding images.
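Each line of those per-image .txt files uses the YOLO format: class id followed by the box center and size, normalized by the image dimensions. A minimal sketch of the conversion from corner coordinates:

```python
def yolo_label_line(class_id, box, img_w, img_h):
    """box is (x1, y1, x2, y2) in pixels; output is 'id cx cy w h' normalized."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2.0 / img_w   # normalized box center x
    cy = (y1 + y2) / 2.0 / img_h   # normalized box center y
    w = (x2 - x1) / float(img_w)   # normalized box width
    h = (y2 - y1) / float(img_h)   # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```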
Finally, we:
- make changes in the Makefile to enable OPENCV and GPU,
- run the make command to build darknet,
- download the pre-trained YOLOv4 convolutional weights, and
- train the custom detector.
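These build-and-train steps can be driven from a notebook cell along these lines; the file names and flags mirror the usual AlexeyAB workflow and are assumptions about this project's layout:

```python
import subprocess

def darknet_train_command(data="data/obj.data",
                          cfg="cfg/yolov4-custom.cfg",
                          weights="yolov4.conv.137"):
    # Standard `darknet detector train` invocation, with mAP tracking enabled.
    return ["./darknet", "detector", "train", data, cfg, weights,
            "-dont_show", "-map"]

if __name__ == "__main__":
    # Enable GPU and OpenCV in the Makefile, build darknet, then train.
    subprocess.run(["sed", "-i", "s/OPENCV=0/OPENCV=1/;s/GPU=0/GPU=1/", "Makefile"], check=True)
    subprocess.run(["make"], check=True)
    subprocess.run(darknet_train_command(), check=True)
```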
Outputs of YoloV4 Custom Fashion Article detection Model
After running 8000 training iterations, we get a mean Average Precision (mAP) of 87%; the higher the mAP, the better the detector. Below, we have provided some sample inference images, from which we can see that the model correctly detects all the fashion articles present in the image. Hence, we can say that the YOLOv4 model is working reasonably well.
7. Module 3: Similar Image Recommendation
Having extracted the relevant fashion articles from the full-shot look image, we now need to retrieve similar fashion products from the image database. To recommend similar products, we need a common embedding model that groups similar articles together while pushing dissimilar ones apart. For this, we make use of a triplet-based network architecture to learn our embeddings.
What is a Triplet network?
As the name suggests, a triplet network consists of three identical convolutional neural networks (CNNs) with shared weights, each of which may be regarded as a branch. To train it, one requires triplets of images (A, P, N) such that the first two (A, P) are semantically similar, while the third (N) is dissimilar to the first two. The training objective is to minimize a weighted triplet loss: the triplet loss pulls the embeddings of the anchor (A) and positive (P) closer together while pushing the negative (N) away.
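For concreteness, here is a numpy sketch of the standard (unweighted) triplet loss on embedding vectors; the paper's weighted variant and its margin value differ, so the 0.3 margin here is an assumption:

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.3):
    # Squared Euclidean distances anchor-positive and anchor-negative.
    d_ap = np.sum((a - p) ** 2, axis=-1)
    d_an = np.sum((a - n) ** 2, axis=-1)
    # Penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive is.
    return np.maximum(d_ap - d_an + margin, 0.0)
```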
To obtain the embeddings for computing image similarity, we train a ResNet-based triplet network with the weighted triplet loss mentioned in the paper. To form a triplet, the first image is a cropped fashion article taken from a street image, the second is the matching shop image of that product, and the third is randomly sampled from a different garment of the same article type. Then, we trained the triplet network as follows.
After training the above model, we observed that the loss converged slowly and started to overfit after 7-10 epochs. To train the model properly, we tried different base models, parameters, and learning rate schedulers, but these did not help much. From further analysis, we concluded that not all triplets are useful for learning. Triplets that violate the triplet condition are known as hard and semi-hard triplets, and these are the ones the model learns from; only about 30% of our triplets are of this type. To address this, we trained the first epoch on the complete set of triplets, and then for subsequent epochs we trained only on the hard and semi-hard triplets mined from the whole dataset using the previous epoch's weights. After applying this semi-hard triplet mining, the loss converges faster and gives good performance.
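The mining step described above amounts to keeping only the triplets that still produce a non-zero loss under the previous epoch's embeddings; a minimal sketch, again assuming a 0.3 margin:

```python
import numpy as np

def mine_useful_triplets(d_ap, d_an, margin=0.3):
    """Boolean mask over triplets, given anchor-positive and anchor-negative distances.

    Hard:      d_an < d_ap
    Semi-hard: d_ap <= d_an < d_ap + margin
    Easy triplets (d_an >= d_ap + margin) give zero loss, so drop them.
    """
    return d_an < d_ap + margin
```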
To retrieve similar products from the database, we used cosine similarity. Above, we have provided some sample inference images of the recommended products. Here, we observe that the triplet model with semi-hard triplet mining performs reasonably well.
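The retrieval step can be sketched as ranking the database embeddings against the query embedding by cosine similarity and returning the top-k indices:

```python
import numpy as np

def top_k_similar(query, db, k=5):
    """query: (d,) embedding; db: (n, d) matrix of product embeddings."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per database row
    return np.argsort(-sims)[:k]      # indices of the most similar items
```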
8. Future Works
- In the future, we can experiment with a conditional similarity network to improve the triplet network's performance.
- We can also implement a locality-sensitive hashing technique to retrieve similar images much faster.
I have provided the deployment video above. I hope I have given you a good understanding of the problem at hand and how I approached it. As I am new to data science and blogging, I expect readers to be a bit generous and ignore the minor mistakes I might have made. Feel free to leave comments if you think something can improve this blog; I will definitely try to make the required changes.
For the detailed code, you can refer to GitHub.
Thanks for reading!
You can also find and connect with me on LinkedIn.