Forecasting of content readers
A data science project applied to online content creation
I have written a handful of articles on Medium: my goal is to share them with as many readers as possible.
Having goals and expectations is essential to foster motivation. Exceeding expectations is a powerful source of motivation, but not meeting them can put the whole project in danger.
Based on a scientific approach, I wanted to set reasonable expectations so that I maximise my chances to exceed them, and benefit from this motivation boost to move faster towards my goal.
Before diving into technical details, have you heard about Make Your Brain Work by Amy Brann? It is an amazing book on how to make the most out of your brain.
The author analyses day-to-day business situations from a neuroscience point of view and shows how they can relate to famous scientific experiments. She also provides tips and discusses benefits of mastering our brain’s functionalities. Definitely recommended!
Note that the provided link is an Amazon affiliated link: I may earn money from this recommendation.
This series of articles describes a real-world data science project whose goal is to predict how many readers blog posts are likely to reach within a given period of time. I will use these predictions to set personal expectations, but one can imagine a business adjusting its marketing expenses according to them to reach its goals.
The project follows four main steps.
Project planning
- Scrape historical data from the Medium statistics page
- Prepare the data to make it suitable for machine learning
- Build a predictive machine learning model
- Estimate accuracy of predictions
Let’s think about a few obstacles we may find during this journey:
- Lack of historical data: predictions may carry high uncertainty if there are not many data points (i.e. days of readers data)
- Lack of predictive features: who knows how the Medium algorithm works?
Scrape the data
In machine learning, the main assumption is that the future resembles the past, so predictions are made from historical data.
Historical readers data is located inside the stats web page of Medium accounts.
With R, there are two well-known packages for scraping the web: rvest, which is part of the tidyverse, and RSelenium, which was developed by rOpenSci. Both packages can be installed from CRAN.
As a rule of thumb, I choose RSelenium when I need to interact with the website before scraping the data (navigate, click…), and rvest when basically all the data can be found inside the source code of the web page.
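To make the rvest workflow concrete, here is a minimal, self-contained sketch; the HTML snippet and the h2 selector are made up for the example and are not taken from Medium's page:

```r
library(rvest)

# A tiny HTML string stands in for a downloaded web page
html <- "<html><body><h2>Post A</h2><h2>Post B</h2></body></html>"
page <- read_html(html)

# Select every h2 node and extract its text
titles <- page %>% html_elements("h2") %>% html_text()
titles  # "Post A" "Post B"
```

In a real session, read_html would be pointed at a URL or a saved HTML file instead of a string.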
🎉 Readers data is directly accessible from the source code of the page, at least for the last 30 days. This is the best scenario, as you just need to read the source code of the page to get the data of interest. However, a click is required to display older data points, so we may need RSelenium at some point…
… but I was not granted access to the stats page (error code 401). According to this error code mapping, signing in is required before heading to the stats page…
… but Medium sends magic links inside emails for users to authenticate and sign in, so my RSelenium bot would have to check my email inbox and click this link to keep going… This is where I decided to end my web scraping journey. 😌
I have a basic knowledge of web scraping, and I would be very interested to hear from you if you know how to reach a Medium account's stats page programmatically!
Practical solution
This is one of those moments where you can turn having just a few data points into an opportunity.
I manually downloaded each stats page for the 4 months I have been writing on Medium. All the daily readers data is located in these downloaded web pages.
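Loading the downloaded pages back into R can be sketched as follows; the folder name is hypothetical, and a toy file is created first so the snippet runs on its own:

```r
library(rvest)

# Create a toy "downloaded stats page" so the example is self-contained
dir <- file.path(tempdir(), "stats_pages")
dir.create(dir, showWarnings = FALSE)
writeLines("<html><body></body></html>", file.path(dir, "stats_2020-09.html"))

# Parse every downloaded page for later wrangling
files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)
pages <- lapply(files, read_html)
length(pages)  # one parsed document per downloaded file
```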
Data quality
I performed a new extraction a few days later to make sure the readers data remained unchanged: there is no re-count or adjustment process once a day has ended.
However, the final day, for which data is incomplete, will be ignored in the rest of the study.
Data wrangling
I use rvest to navigate inside the downloaded web pages, as if I were actually visiting the stats page!
The data-tooltip attribute's value contains both the number of readers and the corresponding date. The former is straightforward to extract, whereas the latter requires a few operations to build a date matching the ISO standard.
Extracting the number of readers basically consists of extracting the leading digits from the raw text.
The month is extracted as plain text and mapped to its corresponding number (1 for January … 12 for December). The day appears at the end of the raw text, and the year is assumed to be 2020. Concatenating them all together creates a date in a convenient format.
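Putting these steps together gives a sketch like the one below; the exact wording of the tooltip text is assumed for illustration, as Medium's real format may differ:

```r
library(rvest)
library(stringr)

# A toy node carrying a data-tooltip shaped like Medium's (format assumed)
html <- '<div class="bar" data-tooltip="42 viewers on Tuesday, September 8"></div>'
tooltip <- read_html(html) %>%
  html_element("[data-tooltip]") %>%
  html_attr("data-tooltip")

# Number of readers: the leading digits of the raw text
readers <- as.integer(str_extract(tooltip, "^\\d+"))

# Date: map the month name to its number, take the trailing day, assume 2020
month_num <- match(str_extract(tooltip, paste(month.name, collapse = "|")), month.name)
day <- as.integer(str_extract(tooltip, "\\d+$"))
date <- as.Date(sprintf("2020-%02d-%02d", month_num, day))

readers  # 42
date     # "2020-09-08"
```

month.name is R's built-in vector of English month names, which makes the month-to-number mapping a one-liner with match.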
We now have readers historical data for each day:
Conclusion
It was the first step of this data science project: find and load the data.
It was not possible to use web scraping tools to collect the data automatically. If this becomes possible later, the rest of the pipeline will still be valid (as long as Medium does not modify the source code 😊).
Although we performed some wrangling operations to start working with historical readers data, many data processing tasks remain before building our first forecasting model.
The next article will focus on feature engineering to enrich this initial data set and create predictive inputs to leverage when forecasting readers. For that purpose I tracked all the actions I have performed on social networks to share my articles…
…Stay tuned!
References
- rOpenSci GitHub page for RSelenium: https://github.com/ropensci/RSelenium
- tidyverse GitHub page for rvest: https://github.com/tidyverse/rvest