How to Approach a Data Science Problem [RS Labs]

We take on a lot of interesting problems at Retention Science. My biggest project involved building a new recommendation scheme based on item similarities, which is now part of Retention Science’s Smart Recommendations stack.

At RS, both Data Scientists and Software Developers follow the Agile Software Development methodology. Agile Software Development is a customer-centric methodology based on four core values:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

This methodology helps us cater to the changing needs of our clients quickly and evolve our product at a much faster rate. We mostly end up following a scrum-type framework, and our sprint duration is typically 15 days for the Development Team and 1 month for the Data Science Team.

In this post, I’ll go through the stages of the Data Science Development Lifecycle at Retention Science, and how practicing Data Science fits into that framework.

RS Data Science Development Lifecycle

[Figure: RS Data Science Development Lifecycle]
Before we get started, a quick explanation of the visualization above. Data Science is, at its core, an application of the scientific method. The different stages represented in the spheres above outline each step you take when developing software for data science. For my project, I went through each stage in order to build my recommendation scheme and discovered some interesting takeaways each step of the way.

Problem

I was tasked with building a new recommendation system that lists the top related items (or products) for a given item. One quick and easy way to solve this problem is to look at all customers’ order histories and construct a table with two columns: item pairs and counts (a pair is counted only when the same user has bought both items). Then, we could easily pull out the top related items.
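
As a rough illustration of the pair-count table described above, here is a minimal Python sketch. The order data and the function names pair_counts and top_related are made up for this example; this is not the RS implementation, only one simple way the idea could look.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history: user -> set of items bought (made-up data).
orders = {
    "U1": {"I1", "I2"},
    "U2": {"I1", "I2", "I3"},
    "U3": {"I2", "I3"},
}


def pair_counts(orders):
    """Count, for each unordered item pair, how many users bought both items."""
    counts = Counter()
    for items in orders.values():
        # sorted() gives every pair a canonical order, so (a, b) and (b, a)
        # are counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def top_related(counts, item, k=5):
    """Return up to k items most often bought together with `item`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return related.most_common(k)


counts = pair_counts(orders)
print(top_related(counts, "I1"))
```

Note that the table grows with the number of item pairs, which hints at why this setup is hard to scale and hard to extend with other signals.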

This approach looked simple, but didn’t turn out to be flexible, since incorporating other information (e.g. user actions on a website) is difficult in this setup. At a software company with live clients and live demands, the inflexibility was a key issue. Additionally, this approach was not scalable, given the way we execute the jobs in our system and the number of clients we have. For instance, our recommenders are required to build recommendations that scale to 15M users and 20K items!

Key takeaways from this phase:

• Get to know the system completely (including the codebase).
• Spend a decent amount of time looking through related research papers.
• Finally, clearly articulate the problem that you are trying to solve.

Analysis

As a Data Scientist, in addition to knowing the workings of the system, you also need to understand your data. Just looking at a few samples gives you a good insight into the data. Performing simple aggregates, checking for outliers, and visualizing the data (via notebooks like Zeppelin / IPython) are a few ways to do this analysis.

While performing the analysis, I found that a significant percentage of items in our items table were duplicates (over 80% in the case of one client). For example, I observed that two different item IDs both referred to the exact same food product. This issue could have significantly hurt the algorithm’s performance, since it wouldn’t make sense to recommend duplicate items.
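
The post does not say how the duplicates were detected, so the following is only a hedged sketch of one simple check: grouping item IDs by a normalized name. The rows, the normalization rule, and the function names are all assumptions for illustration.

```python
from collections import defaultdict

# Made-up rows from a hypothetical items table: (item_id, name).
items = [
    (101, "Organic Almond Butter 16oz"),
    (102, "organic almond butter  16oz "),
    (103, "Peanut Butter 16oz"),
]


def normalize(name):
    """Lowercase and collapse whitespace so trivially different names match."""
    return " ".join(name.lower().split())


def find_duplicates(items):
    """Group item IDs by normalized name; keep groups with more than one ID."""
    groups = defaultdict(list)
    for item_id, name in items:
        groups[normalize(name)].append(item_id)
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


def duplicate_rate(items):
    """Share of rows that are redundant copies of another row."""
    dupes = find_duplicates(items)
    redundant = sum(len(ids) - 1 for ids in dupes.values())
    return redundant / len(items) if items else 0.0


print(find_duplicates(items))
print(duplicate_rate(items))
```

Real catalogs usually need fuzzier matching than this (for example on SKU, brand, or description), but even a crude check like this one can surface the problem early.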

Key takeaways from this phase:

• Create a sample dataset so that you can test your “analysis code” before you run it on the complete dataset.
• Collapse the data to 1-3 dimensions, in ways you think seem meaningful.
• Try to restrict the data points to fewer than 100,000 for visualization.
• If you have multiple similar data sources (multiple clients), look at at least 3-4 of them.
• Finally, identify the steps you want to take to clean/process your data, often referred to as Data Cleaning.
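
The tip about capping the number of points for visualization can be sketched as a small helper. This is an assumption-laden example: sample_for_plot is a hypothetical name, uniform random sampling is just one reasonable choice, and the fixed seed is there only to make plots reproducible.

```python
import random


def sample_for_plot(rows, max_points=100_000, seed=0):
    """Return at most max_points rows, chosen uniformly at random."""
    if len(rows) <= max_points:
        return list(rows)
    rng = random.Random(seed)  # fixed seed so repeated runs plot the same points
    return rng.sample(rows, max_points)


# Made-up data standing in for a large table of values to plot.
all_rows = list(range(250_000))
plot_rows = sample_for_plot(all_rows)
print(len(plot_rows))
```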

Prototyping

This is the phase where you actually start designing your algorithm or building your model. It is worth noting that most newer companies (including Retention Science) have set up their infrastructure via the cloud [like Amazon Web Services]. This, coupled with the use of notebooks, has made prototyping a lot faster and easier.

The algorithm I finally came up with was very similar to Amazon’s item recommendations. In short, we calculate the similarities between pairs of items. The similarity metric can be Cosine or Jaccard Similarity. There are multiple ways to construct the item vectors that are inputs to the similarity calculation; one way is to use order history. For instance, consider three users (U1, U2 & U3), and two items (I1 & I2). Let’s say I1 was bought by U1 & U2, and I2 was bought by U2 & U3. In this case, the I1 vector is [1, 1, 0] and the I2 vector is [0, 1, 1]. The rest of the algorithm follows from that.
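
To make the example concrete, here is a minimal Python sketch that computes both metrics for the two item vectors above. The function names are chosen for this illustration, and plain Python lists are used instead of the sparse, distributed structures a production system would need.

```python
import math

# Item vectors from the example above: one slot per user (U1, U2, U3),
# 1 if that user bought the item, 0 otherwise.
i1 = [1, 1, 0]
i2 = [0, 1, 1]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Users who bought both items divided by users who bought either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


print(cosine_similarity(i1, i2))
print(jaccard_similarity(i1, i2))
```

For these vectors, only U2 bought both items, so the cosine similarity comes out at roughly 0.5 (one shared buyer over √2 · √2) and the Jaccard similarity at 1/3 (one shared buyer out of three buyers in total).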

The next big problem I encountered was scalability, which turned out to be a well-known issue in the recommender community. I came across Twitter’s all-pairs similarity algorithm, DIMSUM, which addressed exactly my requirements. After that, I had my first prototype ready!

Key takeaways from this phase:

• Do a comprehensive reading of related research papers.
• Prototype your algorithm quickly, test it, then identify and improve on its weaknesses.
• Finally, identify the pros and cons of your algorithm.

Validation (or Reporting)

Prototyping and validation/reporting go hand in hand. At Retention Science, we follow a Model-Runner-Reporter style for all our data science algorithms. That is, each model is abstracted from the code that runs it, and the code that reports its output. Since our platform caters to multiple clients, it is necessary that we keep track of production models. Hence, reporting becomes very important.

In this phase, you actually identify all the critical points of your algorithm/model and write out corresponding reporting statistics/metrics. For instance, one of the critical points in my algorithm was that the matrix should be sparse, so I went ahead and included matrix sparsity to be one of the reporting metrics.
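
As an illustration of a reporting metric kept separate from the model code, here is a small sketch of a sparsity check. The matrix, the min_sparsity threshold, and the function names are hypothetical and are not taken from the RS Model-Runner-Reporter code.

```python
def sparsity(matrix):
    """Fraction of entries that are zero in a dense list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no entries")
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def sparsity_report(matrix, min_sparsity=0.5):
    """Build a small report dict; min_sparsity is an assumed threshold."""
    value = sparsity(matrix)
    return {"sparsity": value, "is_sparse_enough": value >= min_sparsity}


# Made-up user-item matrix: rows are users, columns are items.
user_item = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
print(sparsity_report(user_item))
```

Because the reporter only reads the matrix, it can be run against any production model without touching the model itself.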

Key takeaways from this phase:

• Definitely generate and look at some sample outputs.
• Make sure that the reporting metrics cover all the critical points of the code.
• Finally, separate the reporting metrics out from the prototype code.

Development

Once you have your algorithm validated, you start writing the production code. Code on!

Key takeaways from this phase:

• Modularize the code.
• Optimize the code.
• Keep the code neat and clean.

Testing

Testing, I now believe, is the most important phase; until this challenge, I hadn’t thought that to be true. In short, this phase determines whether the code that you have written in development works as expected for every scenario.

As your code crosses 1000 lines, it becomes difficult to write foolproof code. Reasons include human errors, miscommunication, and much more. In fact, a few companies even practice test-driven development, where tests are written before you begin the development. I wrote unit tests for my classes with ScalaTest. I followed the nice test format of Given… When… Then.
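
The original tests were written with ScalaTest and are not shown here. As a rough Python analogue of the Given… When… Then format, the sketch below tests a small, hypothetical jaccard helper; the helper and the test names are assumptions made for this example.

```python
def jaccard(buyers_a, buyers_b):
    """Jaccard similarity between two sets of buyers."""
    union = buyers_a | buyers_b
    if not union:
        return 0.0
    return len(buyers_a & buyers_b) / len(union)


def test_items_with_one_shared_buyer():
    # Given two items that share one buyer out of three buyers overall
    i1_buyers = {"U1", "U2"}
    i2_buyers = {"U2", "U3"}
    # When the similarity is computed
    result = jaccard(i1_buyers, i2_buyers)
    # Then it equals shared buyers divided by all buyers
    assert result == 1 / 3


def test_items_with_no_buyers():
    # Given two items that nobody has bought
    i1_buyers, i2_buyers = set(), set()
    # When the similarity is computed
    result = jaccard(i1_buyers, i2_buyers)
    # Then it falls back to zero instead of dividing by zero
    assert result == 0.0


# Run the checks directly; a test runner such as pytest would also discover them.
test_items_with_one_shared_buyer()
test_items_with_no_buyers()
```

Writing each test as three labeled steps keeps the setup, the action, and the expectation easy to tell apart when a test fails.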

Key takeaways from this phase:

• Write at least one test for every method that you can.
• Ensure maximum code coverage.

Deployment/Maintenance

Deployment is a self-explanatory phase: the model or algorithm you’ve prototyped, developed, and tested gets sent out into the “real world” for the clients to use. Maintenance consists of the care that comes afterward, once clients start using the developed system and the problems you didn’t notice until then start popping up.

Key takeaways from the Deployment phase:

• Validate that the models are working as expected.
• Monitor/keep track of your jobs.

Key takeaways from the Maintenance phase:

• Write documentation that gives a high-level overview.
• Follow the style guide and write comments wherever necessary.
• Code with the rule that others are going to read it!

This sums up the entire RS Data Science Lifecycle, and the steps I took to complete my project. It’s always rewarding when the end product is utilized!


About the Author

Deepak Angadi is a graduate student in Computer Science at the University of Southern California.