Forecasting of content readers
A data science project applied to online content creation
I have written a handful of articles on Medium: my goal is to share them with as many readers as possible.
Having goals and expectations is essential to foster motivation. Exceeding expectations is a powerful source of motivation, but not meeting them can put the whole project in danger.
Based on a scientific approach, I wanted to set reasonable expectations so that I maximise my chances to exceed them, and benefit from this motivation boost to move faster towards my goal.
Before diving into technical details, have you heard about Make Your Brain Work by Amy Brann? It is an amazing book on how to make the most out of your brain.
The author analyses day-to-day business situations from a neuroscience point of view and shows how they can relate to famous scientific experiments. She also provides tips and discusses benefits of mastering our brain’s functionalities. Definitely recommended!
Note that the provided link is an Amazon affiliated link: I may earn money from this recommendation.
This series of articles describes a real-world data science project whose goal is to predict how many readers blog posts are likely to reach within a given period of time. I will use these predictions to set personal expectations, but one can imagine a business adjusting its marketing expenses according to them to reach its goals.
The project follows four main steps.
Project planning
- Scrape historical data from the Medium statistics page
- Prepare the data to make it suitable for machine learning
- Build a predictive machine learning model
- Estimate accuracy of predictions
Let’s think about a few obstacles we may find during this journey:
- Lack of historical data: predictions may carry high uncertainty if there are not many data points (i.e. days of readers data)
- Lack of predictive features: who knows how the Medium algorithm works?
Scrape the data
In machine learning, the main assumption is that the future resembles the past, so predictions are made from historical data.
Historical readers data is located inside the stats web page of Medium accounts.
With R, there are two well-known packages for scraping the web: rvest, which is part of the tidyverse, and RSelenium, which was developed by rOpenSci. Both packages can be installed from CRAN.
As a rule of thumb, I choose RSelenium when I need to interact with the website before scraping the data (navigate, click…), and rvest when basically all the data can be found inside the source code of the web page.
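To make the rvest workflow concrete, here is a minimal, self-contained sketch; the HTML snippet and the h2 selector are made up for the example and are not taken from Medium's page:

```r
library(rvest)

# A tiny HTML string stands in for a downloaded web page
html <- "<html><body><h2>Post A</h2><h2>Post B</h2></body></html>"
page <- read_html(html)

# Select every h2 node and extract its text
titles <- page %>% html_elements("h2") %>% html_text()
titles  # "Post A" "Post B"
```

In a real session, read_html would be pointed at a URL or a saved HTML file instead of a string.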
🎉 Readers data is directly accessible from the source code of the page, at least for the last 30 days. This is the best scenario, as you just need to read the source code of the page to get the data of interest. However, a click is required to display older data points, so we may need RSelenium at some point…
… but I was not granted access to the stats page (error code 401). According to this error code mapping, signing in is required before heading to the stats page…
… but Medium sends magic links inside emails for users to authenticate and sign in, so my RSelenium bot would have to check my email inbox and click this link to keep going… This is where I decided to end my web scraping journey. 😌
I have a basic knowledge of web scraping, and I would be very interested to hear from you if you know how to reach a Medium account's stats page programmatically!
Practical solution
This is one of those moments where you can turn having just a few data points into an opportunity.
I manually downloaded each stats page for the 4 months I have been writing on Medium. All the daily readers data is located in these downloaded web pages.
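Loading the downloaded pages back into R can be sketched as follows; the folder name is hypothetical, and a toy file is created first so the snippet runs on its own:

```r
library(rvest)

# Create a toy "downloaded stats page" so the example is self-contained
dir <- file.path(tempdir(), "stats_pages")
dir.create(dir, showWarnings = FALSE)
writeLines("<html><body></body></html>", file.path(dir, "stats_2020-09.html"))

# Parse every downloaded page for later wrangling
files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)
pages <- lapply(files, read_html)
length(pages)  # one parsed document per downloaded file
```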
Data quality
I performed a new extraction a few days later to make sure the readers data remained unchanged: there is no re-count or adjustment process once a day has ended.
However, the final day, for which data is incomplete, will be ignored in the rest of the study.
Data wrangling
I use rvest to navigate inside the downloaded web pages, as if I were actually visiting the stats page!
The data-tooltip attribute's value contains both the number of readers and the corresponding date. The former is straightforward to extract, whereas the latter requires a few operations to build a date matching the ISO standard.
Extracting the number of readers basically consists of extracting the leading digits from the raw text.
The month is extracted as plain text and mapped to its corresponding number (1 for January … 12 for December). The day appears at the end of the raw text, and the year is assumed to be 2020. Concatenating them all together creates a date in a convenient format.
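Putting these steps together gives a sketch like the one below; the exact wording of the tooltip text is assumed for illustration, as Medium's real format may differ:

```r
library(rvest)
library(stringr)

# A toy node carrying a data-tooltip shaped like Medium's (format assumed)
html <- '<div class="bar" data-tooltip="42 viewers on Tuesday, September 8"></div>'
tooltip <- read_html(html) %>%
  html_element("[data-tooltip]") %>%
  html_attr("data-tooltip")

# Number of readers: the leading digits of the raw text
readers <- as.integer(str_extract(tooltip, "^\\d+"))

# Date: map the month name to its number, take the trailing day, assume 2020
month_num <- match(str_extract(tooltip, paste(month.name, collapse = "|")), month.name)
day <- as.integer(str_extract(tooltip, "\\d+$"))
date <- as.Date(sprintf("2020-%02d-%02d", month_num, day))

readers  # 42
date     # "2020-09-08"
```

month.name is R's built-in vector of English month names, which makes the month-to-number mapping a one-liner with match.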
We now have readers historical data for each day:
Conclusion
It was the first step of this data science project: find and load the data.
It was not possible to use web scraping tools to collect the data automatically. If this becomes possible later, the rest of the pipeline will still be valid (as long as Medium does not modify the source code 😊).
Although we performed some wrangling operations to start working with historical readers data, many data processing tasks remain before building our first forecasting model.
The next article will focus on feature engineering to enrich this initial data set and create predictive inputs to leverage when forecasting readers. For that purpose I tracked all the actions I have performed on social networks to share my articles…
…Stay tuned!
References
- rOpenSci GitHub page for RSelenium: https://github.com/ropensci/RSelenium
- tidyverse GitHub page for rvest: https://github.com/tidyverse/rvest