Evolution of a data system

We’re building Beneath to make it easier to put data science into production. But what does “production” really mean and why is it so hard? In this post, we’ll show what the conventional path looks like for turning a machine learning model into a production data app by laying out the design and evolution of a simple Reddit analytics tool.

Let’s analyze the seriousness of Reddit posts

Reddit is a serious place for serious people, but sometimes subreddits become corrupted by miscreants who spread useless banter. To avoid such unpleasantries, we want to build a web app that can advise us of the seriousness of different subreddits.

For our project, we’ll use machine learning to score the seriousness of every individual Reddit post. We’ll aggregate the scores by subreddit and time, and we’ll expose the insights via an API that we can integrate with a frontend. We want our insights to update in near real-time so we’re reasonably up-to-date with the latest posts.

So that we’re clear on what the system should do, let’s pin down the API interface we’re aiming for.
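Something like a single read endpoint per subreddit would do. Here's a rough sketch of that shape; the path, parameters, and field names below are illustrative placeholders, not a finalized spec:

```python
# Rough shape of the read API (names and fields are placeholders, not a spec).
from dataclasses import dataclass

@dataclass
class SeriousnessPoint:
    subreddit: str
    hour: str               # start of the aggregation window, ISO 8601
    avg_seriousness: float  # 0.0 (pure banter) .. 1.0 (very serious)
    post_count: int

# GET /subreddits/{subreddit}/seriousness?limit=24
#   -> list[SeriousnessPoint], most recent windows first
```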

Let’s dive in.

Phase 1: building the data ingestion engine

To start, we want to extract posts from Reddit and write them into our own storage system. Our storage system will have two components: a message queue and a database.

With our storage system in place (in theory), let’s write the first scripts of our data pipeline.
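As a sketch of the first script, here's what streaming new posts into the message queue could look like. We're assuming praw for the Reddit API and pika for RabbitMQ; the library choices, queue name, and message fields are assumptions, not the final code:

```python
# Ingestion worker sketch: stream new Reddit posts and publish them to a queue.
# Assumes praw for the Reddit API and pika for RabbitMQ; credentials omitted.
import json

import pika
import praw

reddit = praw.Reddit(
    client_id="...", client_secret="...", user_agent="seriousness-ingester"
)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="reddit-posts", durable=True)

# Stream submissions as they arrive and push a compact JSON message per post.
for post in reddit.subreddit("all").stream.submissions():
    message = {
        "id": post.id,
        "subreddit": post.subreddit.display_name,
        "title": post.title,
        "text": post.selftext,
        "created_utc": post.created_utc,
    }
    channel.basic_publish(
        exchange="", routing_key="reddit-posts", body=json.dumps(message)
    )
```

The second script is a small consumer that reads from this queue and inserts the raw posts into a table in Postgres.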

We need a way to deploy and run our code in production. We like to do that with a CI/CD pipeline and a Kubernetes cluster.

We’ll use a cloud provider to provision the message queue, database and Kubernetes cluster. We prefer managed services when they’re available, so we won’t deploy the message queue or database directly on Kubernetes.

Here’s a diagram of what our system looks like so far:

Phase 1

Once all this is up and running, we need to validate that the data is flowing. An easy way to do that is to connect to our Postgres database and run a few SQL queries to check that new posts are continually being added (see the quick check sketched below). When everything looks good, we’re ready to move on.
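A minimal version of that check, assuming the consumer writes into a posts table with a created_at timestamp column (both names are assumptions):

```python
# Quick sanity check: count how many posts landed in Postgres recently.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@localhost:5432/reddit")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT count(*) FROM posts WHERE created_at > now() - interval '5 minutes'"
    )
    print("Posts ingested in the last 5 minutes:", cur.fetchone()[0])
```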

Phase 2: training the machine learning model

Now that we have the raw data in Postgres, we’re ready to develop our moneymaker, the seriousness scoring model. For this example, we’ll keep things simple and use a Jupyter notebook that pulls historical posts from the Postgres database.
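As a minimal sketch of what that notebook could contain, here's a simple TF-IDF plus logistic regression pipeline. The labeled_posts table and its label column are assumptions standing in for however we choose to label training data:

```python
# Model-training sketch for the notebook: pull historical posts from Postgres
# and fit a simple text classifier. The table and column names are illustrative.
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/reddit")
df = pd.read_sql("SELECT title, selftext, label FROM labeled_posts", engine)

model = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(df["title"] + " " + df["selftext"], df["label"])

# Persist the fitted pipeline so the scoring workers can load it later.
joblib.dump(model, "seriousness_model.joblib")
```

The important output is the serialized model artifact, which the workers in the next phase will load.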

Note that there are other ways to train a machine learning model. Fancy “MLaaS” and “MLOps” tools can help you continuously train, monitor and deploy models. If you want to integrate with one of these tools, you’ll likely connect your database to enable training, and you’ll ping an API to make an inference.

Here’s our system augmented with our ML development environment:

Phase 2

Phase 3: applying the model and aggregating the scores

Now it’s time to build the workers that will apply the model to new posts and write out the resulting seriousness scores. That’s two different scripts:

- A scoring worker that consumes new posts from the message queue and runs them through the model.
- A writer that takes the resulting seriousness scores and saves them to Postgres.
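Here's a sketch of the first of the two, assuming it loads the model artifact from the notebook and hands the scores to the writer via a second queue; the queue names, the hand-off design, and the message schema are assumptions:

```python
# Scoring worker sketch: consume raw posts, apply the model, and publish
# scores for the writer worker to persist.
import json

import joblib
import pika

model = joblib.load("seriousness_model.joblib")

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="reddit-posts", durable=True)
channel.queue_declare(queue="post-scores", durable=True)

def handle_post(ch, method, properties, body):
    post = json.loads(body)
    # predict_proba gives a 0..1 probability that we treat as the seriousness score.
    score = float(model.predict_proba([post["title"] + " " + post["text"]])[0, 1])
    ch.basic_publish(
        exchange="",
        routing_key="post-scores",
        body=json.dumps({**post, "seriousness": score}),
    )
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="reddit-posts", on_message_callback=handle_post)
channel.start_consuming()
```

The writer script is the mirror image: it consumes from the scores queue and inserts rows into a post_scores table in Postgres.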

Next up, we want to aggregate our results by subreddit and time. We’ll use dbt, which allows us to schedule periodic SQL queries. We’ll schedule two aggregating queries:

- One that computes an overall average seriousness score per subreddit.
- One that computes the average seriousness score per subreddit per time window (say, hourly).
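As an illustration of the second of those, here's the hourly aggregation run from Python. In practice the SELECT would live in a dbt model that gets rebuilt on a schedule; the table and column names carry over the assumptions from the scoring sketch:

```python
# Hourly aggregation, run from Python here purely for illustration.
import psycopg2

HOURLY_AGGREGATION = """
DROP TABLE IF EXISTS seriousness_by_subreddit_hour;
CREATE TABLE seriousness_by_subreddit_hour AS
SELECT
    subreddit,
    date_trunc('hour', created_at) AS hour,
    avg(seriousness)               AS avg_seriousness,
    count(*)                       AS post_count
FROM post_scores
GROUP BY subreddit, date_trunc('hour', created_at);
"""

conn = psycopg2.connect("postgresql://user:password@localhost:5432/reddit")
with conn, conn.cursor() as cur:
    cur.execute(HOURLY_AGGREGATION)
```

The per-subreddit query is the same shape without the hour bucket.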

With that, we have all the data that we want for our app available in Postgres. Here’s what the system looks like now:

Phase 3

Phase 4: completing a web app

Our last step is creating the interfaces for accessing our Reddit insights. We need to set up a backend API server and write our frontend code.
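As a sketch of the backend half, here's a minimal read endpoint that serves the hourly aggregates. FastAPI is an assumption here, as are the table and column names carried over from the dbt sketch:

```python
# Backend API sketch: serve the aggregated scores to the frontend.
import psycopg2
from fastapi import FastAPI

app = FastAPI()
conn = psycopg2.connect("postgresql://user:password@localhost:5432/reddit")

@app.get("/subreddits/{subreddit}/seriousness")
def seriousness(subreddit: str, limit: int = 24):
    """Return the most recent hourly seriousness scores for one subreddit."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT hour, avg_seriousness, post_count "
            "FROM seriousness_by_subreddit_hour "
            "WHERE subreddit = %s ORDER BY hour DESC LIMIT %s",
            (subreddit, limit),
        )
        rows = cur.fetchall()
    return [
        {"hour": hour.isoformat(), "avg_seriousness": avg, "post_count": count}
        for hour, avg, count in rows
    ]
```

The frontend then just fetches this endpoint and renders the scores over time.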

Deploy the API server and frontend code to Kubernetes, and we have ourselves a full-stack analytics application! Here’s what the final design looks like:

Phase 4

Improving the stack

Our Reddit analytics app is now ready to share with the world (at least on paper). We’ve set up a full stack that spans data ingest, model training, real-time predictions and aggregations, and a frontend to explore the results. It’s also a reasonably future-proof setup. We can do more real-time enrichment thanks to the message queue, and we can do more aggregations thanks to dbt.

But it has its limitations. For scalability, we’re limited by the throughput of Postgres and RabbitMQ. For latency, we’re limited by the batched nature of dbt. And there are “governance” concerns, like how we monitor data quality, how we document the data formats, how we upgrade the models, and how we secure and share data.

It can be difficult to maintain an overview and stick to best practices given all this complexity. Our goal with Beneath is to simplify the data stack. Data scientists should be able to put their work into production (by themselves!), and still spend most of their time on the data science. If you’re excited about that idea, give it a try.