The Cool Part

In a nutshell, I got to use machine learning to profile users and recommend products to them based on browsing data and user profile analysis.

The Problem (Business Requirements)

For this project, the problem I was presented with was a catalog of 30,000+ products in the same genre. How do we show users the products they want to see without making them click "Next Page" 1,000 times? Of course we could have bucketed users by demographics, incoming traffic channel, geographic location, etc. But all of that required manually defining who our customer was, something prone to error and human bias.

The Solution

The solution was two-fold: create an interactive way to gather user interests, and then use that data to recommend products more likely to convert that user.

Our Head UX Engineer did a write-up on the user story and interactive requirements [here]. I'll try to stick to the more technical side.

Overview

At a high level, we feed product attribute vectors into a program that clusters them by those attributes. When a user interacts with a few products, we find the nearest cluster and suggest products from it (layering business logic on top, like top sellers and high-margin revenue drivers) until we have enough data to profile them. Lastly, we combine product and user features (user features are generated through Collaborative Filtering, popularized by Netflix) to suggest the ideal products to the customer.
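To make the "find the nearest cluster" step concrete, here's a minimal sketch in R (a toy, not our production code): average the attribute vectors of the products the user has touched, then pick the cluster whose centroid is closest by Euclidean distance. The centroid and interaction numbers below are made up for illustration.

# Cluster centroids: one row per cluster, one column per product attribute
# (illustrative numbers, not real data)
centroids <- rbind(
  c(2.0, 7.0, 9.0, 4.0),
  c(8.5, 2.0, 1.5, 6.0),
  c(5.0, 5.0, 5.0, 5.0)
)

# Attribute vectors of the products the user has interacted with so far
interacted <- rbind(
  c(1.7, 6.8, 9.2, 4.0),
  c(2.3, 7.4, 8.8, 3.5)
)

# Collapse the user's interactions into a single "taste" vector
taste <- colMeans(interacted)

# Nearest cluster by Euclidean distance; suggest products from it
# (with business logic like top sellers layered on top)
distances <- apply(centroids, 1, function(centroid) sqrt(sum((centroid - taste)^2)))
which.min(distances)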

Tech Stack:

  • R
  • Spark
  • PHP
  • PostgreSQL

The First Step – Product Tagging

The first step was to tag products in a meaningful and easy-to-consume way. We decided on creating a vector of features that best described each product within our genre. Since we sold lingerie, our attributes became things like "Lacy", "Chest Coverage", "Sheerness", etc. Each product was assigned a vector with these attributes ranked from one to ten. To get these rankings we outsourced some data entry: several people ranked each product, and we took the mean vector for each product as the true representation. (We built a page that let our PM generate a unique link to send to freelancers, so they could tag products and she could track their progress.) In the end, each product, to our program, looked something like this:

lace_up_bustier <- c(1.7, 6.8, 9.2, 4, 2.1, 4.7, 7.9, 10, 9.2)

Step Two – Product Clustering

After all, or at least most, of our products were tagged with attributes, I set to work on clustering them. This was our first use of Spark's ML libraries. Hooking up R with Spark made it easy to load all the products into a DataFrame, run some clustering, and tweak to my heart's content, mostly just playing with the number of clusters. I exported the results to CSV and built a small tool to visualize the clusters to get an intuitive feel for which product images felt the most similar.
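For reference, the clustering itself was only a few lines. Here's a hedged sketch assuming sparklyr as the R-to-Spark bridge; products_df, the attribute column names, and the cluster count are placeholders rather than our real schema.

library(sparklyr)
library(dplyr)

# Connect to Spark and ship the tagged products up as a DataFrame
sc <- spark_connect(master = "local")
products_tbl <- copy_to(sc, products_df, "products", overwrite = TRUE)

# K-means over the attribute columns; k was the main knob I tweaked
kmeans_model <- ml_kmeans(
  products_tbl,
  ~ lacy + chest_coverage + sheerness,  # hypothetical attribute columns
  k = 12
)

# Attach each product's cluster and pull it down for the CSV/visualizer step
clustered <- ml_predict(kmeans_model, products_tbl) %>% collect()
write.csv(clustered, "product_clusters.csv", row.names = FALSE)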

https://imgs.xkcd.com/comics/machine_learning.png

Step 3 – Collaborative Filtering (Part 1)

Product features alone aren't enough to suggest products to people, since some users will like both bustiers and swimwear: two wildly different products, but not mutually exclusive interests. To solve this we need to mix in some user features. After a bunch of research (cough Googling cough), I found out about the Netflix Prize, an open contest Netflix ran for anyone who wanted to improve their recommendation algorithm. The winner used something called "Collaborative Filtering", which essentially operates on a matrix of [user_id, movie_id, rating] entries. At a high level the algorithm works like this.

We know "Person A" likes the movies Star Trek, Gattaca, and Repo! The Genetic Opera, and we know "Person B" likes Star Trek and Gattaca.

We know that “Person A” has rated the following movies like so:

{
  "Gattaca": 4.5,
  "Star Trek": 5,
  "Repo! The Genetic Opera": 4.5
}

And “Person B” has rated their movies like this:

{
  "Gattaca": 5,
  "Star Trek": 4.5
}

Intuitively, we can see that Person B has similar enough interests to Person A that we can guess they will like "Repo! The Genetic Opera". I just used the ALS (Alternating Least Squares) method of Collaborative Filtering to make sure, which, as far as I understand it, is just a truckload of matrix factorization with a hint of magic sprinkled on top.
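Here's a toy version of that step, again assuming sparklyr (a sketch, not the production code): fit ALS on the ratings above and ask it to predict Person B's rating for Repo! The Genetic Opera.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# [user_id, movie_id, rating] in long form; movie 3 is Repo! The Genetic Opera
ratings <- data.frame(
  user_id  = c(1, 1, 1, 2, 2),
  movie_id = c(1, 2, 3, 1, 2),
  rating   = c(4.5, 5, 4.5, 5, 4.5)
)
ratings_tbl <- copy_to(sc, ratings, "ratings", overwrite = TRUE)

# Alternating Least Squares matrix factorization
als_model <- ml_als(
  ratings_tbl,
  rating ~ user_id + movie_id,
  rank = 2,
  reg_param = 0.1
)

# Predict how Person B (user 2) would rate Repo! (movie 3)
unseen <- copy_to(sc, data.frame(user_id = 2, movie_id = 3), "unseen", overwrite = TRUE)
ml_predict(als_model, unseen)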

Step IV – Collaborative Filtering (Part 2)

So now we have the ability to pass in a movie and get a predicted rating for that user profile. However, ratings are explicit data: the user performed an action saying "I think this thing is worth 3 out of 5", which is a pretty solid signal. Unless the users were dishonest, in which case they deserve to have the algorithm work against them.

Anyways, in eCommerce we deal mostly with implicit data: clicks, viewing time, wishlist adds, purchases, etc. I wanted to add an optional rating feedback feature but nobody would let me. It totally doesn't keep me up at night thinking about what could have been… Ahem. Sadly, working with the implicit data directly wasn't giving me the results I wanted, so I wrote a linear function to convert implicit data into a suggested rating. It ate implicit interactions, weighted them against each other, and spit out a rating from 1 to 5 (and sometimes 6, if someone happened to buy a product 15 times or something else crazy).
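The real weights were tuned against our data, but the shape of the function looked roughly like this (the weights and the cap below are illustrative, not the actual values):

# Turn implicit interaction counts into a suggested explicit rating.
# Weights are illustrative; the real ones were tuned against our data.
suggested_rating <- function(views, seconds_viewed, wishlist_adds, purchases) {
  score <- 0.05 * views +
           0.01 * seconds_viewed +
           0.50 * wishlist_adds +
           1.50 * purchases
  min(score, 6)  # mostly lands in 1-5, but heavy buyers can spill over
}

suggested_rating(views = 12, seconds_viewed = 90, wishlist_adds = 1, purchases = 1)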

After the implicit → explicit conversion, I set about transforming all of our user interaction data into a [user_id, product_id, suggested_explicit_rating] matrix. I used all user interaction data from the past year, including inactive users; while they no longer use the site, I felt their interactions would still help the final model. This came to several million data points. I later experimented with expanding it, but the results of the final model did not change much, so I kept it to the last year of interactions.
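Assembling that matrix was just an aggregation over the raw event log. A rough dplyr sketch of the transformation, using a hypothetical local events data frame (user_id, product_id, event_type, seconds_viewed) pulled from Postgres:

library(dplyr)

rating_matrix <- events %>%
  group_by(user_id, product_id) %>%
  summarise(
    views         = sum(event_type == "view"),
    total_seconds = sum(seconds_viewed, na.rm = TRUE),
    wishlist_adds = sum(event_type == "wishlist"),
    purchases     = sum(event_type == "purchase"),
    .groups = "drop"
  ) %>%
  rowwise() %>%
  mutate(rating = suggested_rating(views, total_seconds, wishlist_adds, purchases)) %>%
  ungroup() %>%
  select(user_id, product_id, rating)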

The Fifth One – Collaborative Filtering (The Filtering!)

Lastly, I utilized Spark's ML libraries, which include a built-in ALS Collaborative Filtering method, to build the model. With it I built a service that took a user profile and a product and returned a predicted rating. On top of that service I created a user-centric API that piled on all three layers: Business Logic, Product Feature Clustering, and User Feature Collaborative Filtering.
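As a rough sketch of how the three layers stack at request time (again sparklyr-flavored, with made-up names for the candidate table and the business-rule boost column): filter candidates to the user's nearest cluster, score them with the ALS model, nudge by business rules, and return the top N.

library(sparklyr)
library(dplyr)

# candidates_tbl: Spark DataFrame of products with product_id, cluster,
# the requesting user's user_id, and a business_boost column (top seller, margin, etc.)
recommend_for_user <- function(als_model, candidates_tbl, user_cluster, n = 20) {
  candidates_tbl %>%
    filter(cluster == !!user_cluster) %>%            # product feature clustering layer
    ml_predict(als_model, .) %>%                     # collaborative filtering layer
    mutate(score = prediction + business_boost) %>%  # business logic layer
    arrange(desc(score)) %>%
    head(n) %>%
    collect()
}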
