Scaling a Recommendation Engine: 15,000 to 130M Users in 24 Months
Providing users with precise product recommendations (recs) is the creative force that drives Retention Science to keep iterating, improving and innovating. In this post, our team walks through how we iterated from a minimum viable product to a production-ready solution.
Here’s the chronology of events:
Month 1: Cold Start on a winter night
Our first task was to provide product recommendations for 15,000 customers tracked by an e-commerce client. To measure our impact, metrics such as click and redeem rates were to be compared against a baseline that was already in use.
The initial approach (we called it Rec 101) was a simple cold start model, but it proved reliable on several occasions. It served 2 important purposes: (1) no user was left without a recommendation, and (2) you could draw a connection from a user to the item through some prior modeling.
Simple Rule: Annotate the top K items (per user attribute) based on number of interactions and purchases.
It can be viewed as a simple weighted prior model that does not factor in the user’s posterior to obtain a likelihood for a user-item pair. Not only did this simple rule (represented by 14 lines of SQL) beat complicated algorithms in terms of open, click and redeem rate, but it made clients thousands of additional dollars in revenue, some of which was attributed to our approach.
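A rule like this can be sketched in a few lines of Python. This is an illustration, not the original 14 lines of SQL: the shape of the event tuples, the `EVENT_WEIGHTS` values and the function name are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical weights: a purchase counts more than a plain interaction.
EVENT_WEIGHTS = {"interaction": 1.0, "purchase": 3.0}

def top_k_by_attribute(events, k=3):
    """events: iterable of (user_attribute, item_id, event_type) tuples.

    Returns {user_attribute: [item_id, ...]} holding the k highest-scoring
    items for each attribute value.
    """
    scores = defaultdict(Counter)
    for attribute, item, event_type in events:
        scores[attribute][item] += EVENT_WEIGHTS[event_type]
    return {
        attribute: [item for item, _ in counter.most_common(k)]
        for attribute, counter in scores.items()
    }

events = [
    ("female", "scarf", "purchase"),
    ("female", "scarf", "interaction"),
    ("female", "boots", "interaction"),
    ("male", "belt", "purchase"),
    ("male", "boots", "interaction"),
]
top_items = top_k_by_attribute(events, k=1)
# top_items == {"female": ["scarf"], "male": ["belt"]}
```

Because the lookup key is a user attribute rather than a user id, any user with a known attribute value gets a list, which is what makes the rule usable as a cold start.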
Month 2: Something better than a Cold Start?
To reach a minimum viable product, 3 engineers brainstormed on the perfect rule that would help achieve better click and redeem rates. Naturally, heated discussions arose from approaching the data in several ways, ultimately uncovering some factor we believed best described the data.
The resultant rule: “Select the top items from the category that the user has already bought.”
An unholy amount of SQL later (for joins and aggregations), we had “Rec 102” in production, with runs completing in a few minutes on a modest-sized machine. What we didn’t realize at the time was that our manual exploration had uncovered a rule that would be corroborated by our future use of classical discriminative learning. The users’ purchase categories turned out to be a very strong latent factor (an intern good at SVD and linear algebra proved it 7 months later).
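The Rec 102 rule can be sketched as follows. The data shapes and the function name are hypothetical, ranking by purchase count is one possible reading of “top items”, and excluding items the user already owns is an assumption the rule itself does not state.

```python
from collections import Counter

def recommend_from_bought_categories(purchases, item_category, user, k=2):
    """purchases: list of (user_id, item_id); item_category: {item_id: category}."""
    popularity = Counter(item for _, item in purchases)
    bought = {item for buyer, item in purchases if buyer == user}
    categories = {item_category[item] for item in bought}
    candidates = [
        item
        for item, category in item_category.items()
        if category in categories and item not in bought  # assumption: skip owned items
    ]
    # Most purchased first; item id breaks ties so the order is deterministic.
    candidates.sort(key=lambda item: (-popularity[item], item))
    return candidates[:k]

item_category = {"runner": "shoes", "sneaker": "shoes", "loafer": "shoes", "tee": "shirts"}
purchases = [
    ("u1", "runner"),
    ("u2", "sneaker"),
    ("u3", "sneaker"),
    ("u2", "loafer"),
    ("u3", "tee"),
]
recs = recommend_from_bought_categories(purchases, item_category, "u1")
# recs == ["sneaker", "loafer"]
```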
Month 3: Manual Checks before send
Extensive internal tests were conducted before each send. Several users’ recs were spot-checked through a nightly internal evaluation email for QA purposes (see Fig 1.1 for a snapshot). We even created our own accounts and used our clients’ websites to give us more insight. Occasionally, predictions would be hand-curated if they looked off. While this level of manual intervention was not scalable, it was the first step toward scaling successfully.
Fig 1.1 Snapshot of our nightly internal evaluation email
Month 5: De-duplication!
During one of the internal tests before sending out recs, one of us had the following recs:
Nike All Star Size 4
Nike All Star Size 3
Reebok Air Comfort (See fig below)
No user needs to be recommended multiple sizes of the same shoe. We implemented a “jaccard-semantic duplication annotation scheme,” which identified duplicate items and categories based on their textual descriptions. Though de-duplication did not increase our target metrics, it did remove red flags and client complaints and led to a decrease in email unsubscribes. Several e-commerce companies struggle with this even today.
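The post does not spell out the scheme, so the following is only a minimal sketch of the general idea: token-set Jaccard similarity over item titles with a greedy pass through the ranked list. The tokenizer, the 0.6 threshold and the function names are assumptions.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens of a title, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dedupe(recs, threshold=0.6):
    """Keep a rec only if it is not too similar to any rec already kept."""
    kept = []
    for title in recs:
        if all(jaccard(tokens(title), tokens(other)) < threshold for other in kept):
            kept.append(title)
    return kept

recs = ["Nike All Star Size 4", "Nike All Star Size 3", "Reebok Air Comfort"]
deduped = dedupe(recs)
# deduped == ["Nike All Star Size 4", "Reebok Air Comfort"]
```

The two Nike titles share 4 of 6 distinct tokens (similarity of about 0.67), so the second is dropped, while the Reebok title shares none and is kept.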
Month 6: Visualization
So by now, a couple of rec schemes were in production, translated into some hacky Python code, but still good enough to generate a statistically significant amount of lift. Some sort of monitoring and reporting was needed before sending out recs every day. We built frequency distributions of items (some papers suggested long-tail distributions are good) and their redemption rates for each rec scheme, and consolidated them into a simple report.
Fig 1.2 One of our early recommendation visualizations, made with Python + Matplotlib: a long-tail distribution for the category rec schemes
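The numbers behind a report like this could be computed roughly as below. The `(scheme, item_id, redeemed)` record layout and the scheme names are hypothetical, and plotting is left out.

```python
from collections import Counter

def scheme_report(sends):
    """sends: list of (scheme, item_id, redeemed) records, redeemed being a bool.

    For each scheme, returns how often each item was recommended (most
    frequent first) and the share of recs that were redeemed.
    """
    report = {}
    for scheme in sorted({scheme for scheme, _, _ in sends}):
        rows = [(item, redeemed) for s, item, redeemed in sends if s == scheme]
        frequency = Counter(item for item, _ in rows)
        report[scheme] = {
            "item_frequency": frequency.most_common(),
            "redemption_rate": sum(redeemed for _, redeemed in rows) / len(rows),
        }
    return report

sends = [
    ("category", "a", True),
    ("category", "a", False),
    ("category", "b", False),
    ("cold_start", "a", True),
]
report = scheme_report(sends)
```

A scheme whose `item_frequency` list is dominated by one or two items has a short tail, which is the kind of pattern the frequency plot was meant to expose.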
Month 7: Feedback
A stable point was reached that ensured each send was reasonably effective. The impact of the algorithms on the business was measured. Unfortunately, the results of these sends were not incorporated back into the models.
This led us to tap the feedback data to figure out specifically which rec schemes were performing well, in order to gain insight into the items that interested people. A few rules were added to incorporate this information into our existing recs. It was useful in picking up trending items and filtering out items that did not interest users.
Month 8: Feature Engineering
Isn’t this supposed to be step #1 in any machine learning experiment!? Not really, this came in about the 30% mark.
Data was split into behavioral and transactional data. Behavioral data was usually high-volume, high-velocity, and noisy, while transactional was relatively low-volume, low-velocity, and clean(er).
We mined different signals into what we called user-item affinity (a given user’s preference for a given item) and normalized it in the user space (row) or the item space (column). A lot of time was spent cleaning the data and ensuring our input vector space was sane. This user-item affinity (popularly known as the UI matrix) is the input to almost any recommendation scheme, and it became helpful to have a consistent dataset upstream for all data scientists to work on.
Weighting features by the type of the user’s interaction with the item (a purchase is much more significant than a click or view) and by the recency of the interaction made them more discriminative. This was a good foundation on which to build more sophisticated algorithms.
This led us to explore multiple User-Item matrices that we would later try to factorize in order to explain the origin of the data.
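A minimal sketch of such an affinity computation, assuming an exponential recency decay and made-up interaction weights. The post only states that purchases outweigh clicks and views and that recency matters; the specific numbers, the half-life and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical values chosen for illustration only.
INTERACTION_WEIGHTS = {"view": 1.0, "click": 2.0, "purchase": 10.0}
HALF_LIFE_DAYS = 30.0  # an interaction loses half its weight every 30 days

def affinity_matrix(events):
    """events: iterable of (user, item, interaction_type, days_ago).

    Returns a sparse user -> item -> affinity mapping.
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for user, item, kind, days_ago in events:
        decay = 0.5 ** (days_ago / HALF_LIFE_DAYS)
        matrix[user][item] += INTERACTION_WEIGHTS[kind] * decay
    return matrix

def normalize_rows(matrix):
    """Normalize in the user space: each user's affinities sum to 1."""
    normalized = {}
    for user, row in matrix.items():
        total = sum(row.values())
        normalized[user] = {item: score / total for item, score in row.items()}
    return normalized

events = [
    ("u1", "shoes", "purchase", 0),
    ("u1", "hat", "click", 30),
    ("u2", "hat", "view", 60),
]
affinity = normalize_rows(affinity_matrix(events))
```

Here the purchase made today dominates the month-old click in `u1`'s row (10/11 against 1/11). Column normalization would follow the same pattern after transposing the mapping.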
Month 9: The Banana problem
“All of our users are getting bananas,” screamed one of our clients. This was the first time in 8 months that we thought of giving up. Our CTO defended us, explaining that “we weren’t wrong: everyone loves bananas, so that’s what the models picked up!”
The problem was that most users were already going to buy those bananas. Not only were we recommending the users items that they already knew about, but worse, they would seem to lose interest in any recommendations. The objective metrics used to measure recs (such as redemption rate) would be misleading in this case.
A week later, this was fixed by another rule: “Remove all items above the 99th percentile in the affinity score.”
It taught us that sometimes it’s wrong to be very correct. We learned the hard way the importance of balancing exploration and exploitation, and we added parameters into our models to allow us to make adjustments as needed.
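The percentile rule can be sketched as below, using the nearest-rank definition of a percentile (other definitions interpolate and would give a slightly different cutoff). The function name and the toy scores are assumptions; the `percentile` argument stands in for the kind of adjustable parameter described above.

```python
import math

def drop_above_percentile(scores, percentile=99.0):
    """scores: {item_id: affinity score}. Drops items scoring above the cutoff."""
    if not scores:
        return {}
    ordered = sorted(scores.values())
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of all values at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    cutoff = ordered[rank - 1]
    return {item: score for item, score in scores.items() if score <= cutoff}

scores = {f"item{i}": float(i) for i in range(1, 100)}
scores["bananas"] = 1000.0  # the one item everybody already buys
filtered = drop_above_percentile(scores)
# "bananas" is removed; the other 99 items remain
```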
We also added metrics to measure the diversity and novelty of our recommendations. These metrics turned out to be a big help a year later. Several large e-commerce companies were struggling with the same problem of not letting their users explore enough of their inventory, and we were able to recognize the problem early enough and solve it.
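The post does not define these metrics, so the sketch below uses two common textbook choices as stand-ins: catalog coverage for diversity, and mean self-information (the negative log of an item's share of purchases) for novelty. The function names and data shapes are hypothetical.

```python
import math

def catalog_coverage(recs_by_user, catalog_size):
    """Share of the catalog that appears in at least one user's recs."""
    recommended = {item for recs in recs_by_user.values() for item in recs}
    return len(recommended) / catalog_size

def mean_novelty(recs_by_user, purchase_counts):
    """Average of -log2(item's share of all purchases) over every rec sent.

    Popular items score low and rarely bought items score high. Every
    recommended item is assumed to have at least one recorded purchase.
    """
    total = sum(purchase_counts.values())
    scores = [
        -math.log2(purchase_counts[item] / total)
        for recs in recs_by_user.values()
        for item in recs
    ]
    return sum(scores) / len(scores)

purchase_counts = {"banana": 4, "kiwi": 2, "fig": 2}
all_bananas = {"u1": ["banana"], "u2": ["banana"]}
mixed = {"u1": ["banana", "kiwi"], "u2": ["banana", "fig"]}

coverage_bananas = catalog_coverage(all_bananas, catalog_size=4)
coverage_mixed = catalog_coverage(mixed, catalog_size=4)
novelty_bananas = mean_novelty(all_bananas, purchase_counts)
novelty_mixed = mean_novelty(mixed, purchase_counts)
```

Both numbers are lower for the all-bananas list than for the mixed one, which is the kind of signal a redemption rate alone would not show.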
Month 11: Exploring “off the shelf” solutions
The Netflix challenge had ended and machine learning libraries had started mushrooming. We too wanted to piggyback on open-source projects. We set out to explore platforms like Apache Mahout, Vowpal Wabbit and packages in R/Python to see if they fit our needs.
It was tough to play catch-up in this game. Several of these solutions were not easy to incorporate because of the infrastructure and exhaustive tuning these algorithms required. However, the exercise showed us where recommendation science was heading and got our creative juices flowing. In the process, we incorporated a few algorithms such as collaborative filtering, matrix factorization of our U-I matrices (SVD and ALS), and content-based filtering.
We had reached 7 different rec schemes now and they were fiercely competitive with each other!
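To give a flavor of one of these families, here is a small item-based collaborative filtering sketch in plain Python. It is not one of the production schemes and does not use any of the libraries named above; the sparse dict-of-dicts U-I layout and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[key] * b[key] for key in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def item_based_recs(ui, user, k=2):
    """ui: {user: {item: affinity}}. Scores unseen items by their similarity
    to the items the user already has an affinity for."""
    # Transpose the U-I matrix into item -> {user: affinity} columns.
    columns = {}
    for u, row in ui.items():
        for item, score in row.items():
            columns.setdefault(item, {})[u] = score
    seen = ui[user]
    scores = {
        candidate: sum(cosine(column, columns[item]) * affinity for item, affinity in seen.items())
        for candidate, column in columns.items()
        if candidate not in seen
    }
    # Highest score first; item name breaks ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]

ui = {
    "u1": {"shoes": 1.0, "socks": 1.0},
    "u2": {"shoes": 1.0, "socks": 1.0, "laces": 1.0},
    "u3": {"hat": 1.0, "scarf": 1.0},
    "u4": {"shoes": 1.0, "laces": 1.0},
}
recs = item_based_recs(ui, "u1", k=1)
# recs == ["laces"]
```

Laces wins for `u1` because the users who bought laces overlap with those who bought shoes and socks, while hat and scarf share no buyers with anything `u1` owns.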
Month 12: ‘Robust-ification’
By now we had piled up a lot of tech debt, our code was smelly and fragile, and ‘robust-ification’ was needed. During Christmas break our team realized we could incorporate data-driven techniques to better calculate hardcoded thresholds.
1. Several manual processes were automated in small cycles.
2. Emphasis was given to performance and distributed computing.
3. Code reusability was key in reducing tech debt.
4. Correlated algorithms were removed and maintainability became easier.
5. We broke our pipeline into Data Loaders, Algorithms, Reporters and Evaluators.
The recommendation models weren’t modular, and several of them had to be run by hand. In order to scale, we had to modularize them and make them reusable in different settings.
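The four-way split from item 5 can be pictured as plain functions wired together by a small runner. This is a sketch of the idea only: the stage signatures, type aliases and toy implementations are hypothetical and say nothing about how the actual codebase was organized.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Interactions = List[Tuple[str, str]]  # (user, item) pairs
Recs = Dict[str, List[str]]
Metrics = Dict[str, float]

def load_interactions() -> Interactions:
    """Data Loader: stand-in for a database or file read."""
    return [("u1", "a"), ("u2", "a"), ("u2", "b")]

def most_popular(data: Interactions) -> Recs:
    """Algorithm: recommend the single most popular item to every user."""
    top_item = Counter(item for _, item in data).most_common(1)[0][0]
    return {user: [top_item] for user, _ in data}

def coverage(recs: Recs, data: Interactions) -> Metrics:
    """Evaluator: share of the catalog that appears in at least one rec list."""
    catalog = {item for _, item in data}
    recommended = {item for items in recs.values() for item in items}
    return {"coverage": len(recommended) / len(catalog)}

def as_text(metrics: Metrics) -> str:
    """Reporter: render metrics as one line of text."""
    return ", ".join(f"{name}={value:.2f}" for name, value in sorted(metrics.items()))

def run_pipeline(
    load: Callable[[], Interactions],
    algorithm: Callable[[Interactions], Recs],
    evaluate: Callable[[Recs, Interactions], Metrics],
    report: Callable[[Metrics], str],
) -> str:
    """Chain the four stages; any stage can be swapped independently."""
    data = load()
    recs = algorithm(data)
    return report(evaluate(recs, data))

summary = run_pipeline(load_interactions, most_popular, coverage, as_text)
# summary == "coverage=0.50"
```

Swapping `most_popular` for another algorithm leaves the loader, evaluator and reporter untouched, which is the reuse the split is meant to buy.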
Month 13: Exhaustive A/B testing
For each send we needed to figure out how well each algorithm was performing, and more importantly, which type of users liked which algorithm. An extensive and unbiased A/B test platform was created for this purpose, and the results were used to improve each algorithm. Any new rec algorithm would face the wrath of this A/B test. If it didn’t perform as well as the others we wouldn’t waste time putting it into production and maintaining it in our codebase.
Fig 1.3 Our internal A/B test framework evaluating 5 different recs
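The post does not say which statistical test the platform used. As one common way to compare two algorithms' click rates on a send, here is a pooled two-proportion z-test using only the standard library; the function name and the example counts are made up.

```python
import math

def two_proportion_z_test(clicks_a, sends_a, clicks_b, sends_b):
    """Returns (z, two-sided p-value) for the difference in click rates."""
    rate_a = clicks_a / sends_a
    rate_b = clicks_b / sends_b
    pooled = (clicks_a + clicks_b) / (sends_a + sends_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (rate_a - rate_b) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical send: algorithm A gets 120 clicks from 1,000 emails, B gets 90.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
is_significant = p_value < 0.05
```

Running the same test within user segments is one way to approach the question of which type of users prefers which algorithm, though each extra segment is another comparison and the significance threshold should account for that.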
Month 15: Embracing “Big data”
The recommendation engine had scaled to about 7M users by now. A few more e-commerce companies started using our services, and we needed to generalize the algorithms. Our local Python scripts and SQL queries were quickly becoming bottlenecks. The transactional data was still easy to handle, but the behavioral data was getting large.
We needed to adapt to big data, distributed computing and cloud computing. Our team was an early adopter of Spark (0.6 beta), and although Pig and Hive were popular frameworks, we gambled on Spark. Again, it proved to be a good decision, as Spark soon became the industry leader for machine learning and big data analytics due to its elegant APIs for manipulating distributed data, growing machine learning library, and effective fault tolerance mechanisms.
We kept our most frequently accessed datasets on HDFS for speed, and moved to Amazon S3 as our secondary cloud storage. For our behavioral data we started using Kinesis, and load balancers ensured that every user’s action was captured on the phone or on the website. By utilizing this Spark + HDFS + Kinesis combination, we were able to horizontally scale our algorithms across 75 different clients.
Month 18: Redicto (A powerful parallel generalized data transformer on Spark)
We were now providing recs to close to 30 e-commerce companies, each of which required its own rules.
These rules could be as simple as:
– Male users should not get any female or neutral items
– People from the state of CA should not get products from category X
or as complex as:
– Tag all items with custom domain tags and limit users’ recs to only 1 item per tag
– Exclude certain tags if the user has purchased items with certain other tags, and only include certain tags if the user has purchased items with certain other tags.
To handle all of these custom filters, the most efficient and flexible approach was to build an internal API service from scratch. It fit the model of a service-oriented architecture and allowed us to scale the filtering and selection of recs. Hence, Redicto (link to previous blog: https://www.retentionscience.com/rs-labs-building-better-recommendation-engines-through-redicto) was born.
Machine Learning models annotated some of the metadata for these rules, and Redicto would take care of the rest. For example, NLP + Clustering was used to figure out whether an item is male-specific, female-specific or neutral based on its semantic description.
This proved to be a great differentiator a few months later, and marketers were enthusiastic that their rules were part of the recommendation engine. We worked closely with them to show the lift (if any) that these explicitly defined rules brought them.
Apart from marketers’ custom rules, we have also implemented some of our own rules:
– Expensive item filter
– Already-bought filter
– Already-recommended filter
– Item and User Gender filter
Fig 1.4 Redicto de-duping recs for a user who wants to buy shoes
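Redicto's actual interface is not described in this post, so the following is only a sketch of how rules of this kind can be expressed as composable predicates over a ranked list. The dictionary fields, rule functions and `apply_rules` helper are all hypothetical.

```python
def gender_filter(user, item):
    """Male users should not get female or neutral items."""
    return not (user["gender"] == "male" and item["gender"] in {"female", "neutral"})

def already_bought_filter(user, item):
    """Skip anything the user has purchased before."""
    return item["id"] not in user["purchased"]

def apply_rules(user, ranked_items, rules, max_per_tag=1):
    """Walk the ranked list in order, keeping items that pass every rule
    and allowing at most `max_per_tag` items for each tag."""
    kept, tag_counts = [], {}
    for item in ranked_items:
        if not all(rule(user, item) for rule in rules):
            continue
        if tag_counts.get(item["tag"], 0) >= max_per_tag:
            continue
        tag_counts[item["tag"]] = tag_counts.get(item["tag"], 0) + 1
        kept.append(item["id"])
    return kept

user = {"gender": "male", "purchased": {"belt"}}
ranked_items = [
    {"id": "dress", "gender": "female", "tag": "apparel"},
    {"id": "belt", "gender": "male", "tag": "accessory"},
    {"id": "sneaker-9", "gender": "male", "tag": "shoes"},
    {"id": "sneaker-10", "gender": "male", "tag": "shoes"},
    {"id": "tie", "gender": "male", "tag": "accessory"},
]
final_recs = apply_rules(user, ranked_items, [gender_filter, already_bought_filter])
# final_recs == ["sneaker-9", "tie"]
```

Keeping each rule as a small function of `(user, item)` means a client-specific rule can be added or removed without touching the algorithms that produced the ranking.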
Month 20: Scala Refactor and Tests
We adopted Scala as our language of choice and refactored all our procedural code into a single functional machine learning repository. Comprehensive tests were written in both the feature engineering layer and the algorithmic layer. Continuous deployment and microservices were adopted. Our process was broken into 4 layers: Data Ingestion; Feature Engineering; Modeling and Prediction; and Visualization, Feedback and Reporting.
Month 24: Rec Visualization dashboard
Several front-end gurus on our team did a fantastic job of converting our PDF reports into an interactive dashboard covering more than 50 objective metrics per algorithm. It gave us valuable insight into what each algorithm was capturing, and it helps us catch red flags and breakdowns more quickly. An alerting system was put on top of this dashboard to ensure data scientists stayed on top of these issues.
Fig 1.5 Our internal recommendation engine dashboard tracking every send in real time
Conclusion and Future Work:
Today, 14 different types of recs and almost 60 different flavors run in production. Our models have learned from sending out 3B multi-channel recs to close to 130M users over the last 3 years.
We’ve constantly tried to evolve the infrastructure so that it is modular and as reusable as possible. The same infrastructure is reused for other predictions, including user churn, timing and customer lifetime value.
Our current architecture looks like this:
Fig 1.6 Our data architecture supporting the recommendation engine.
Looking to leverage your data to generate some recommendations? Write to us at email@example.com if our stack resonates with your needs.
About The Author
Vedant Dhandhania is a Machine Learning Engineer at Retention Science. He helps predict customer behavior using advanced machine learning algorithms. His passion lies in the intersection of Signal Processing and Deep Learning.
header photo source: evolo.us