Essential R packages for data science projects

Leverage the genius work of the R community in your projects!

Eric Bonucci
6 min read · Aug 15, 2020

There are more than 16,000 packages on the Comprehensive R Archive Network (CRAN), gathering many of the methods commonly used in data science projects. Time runs fast, and coding functionality yourself, even for basic tasks, can take days… Fortunately, we can leverage many packages to focus on what is essential for projects to be successful!

Quick reminder: install and use packages

The most common way is to install a package directly from CRAN using the following R command:

# this command installs tidyr package from CRAN
install.packages("tidyr")

Once the package is installed on your local machine, you don’t need to run this command again, unless you want to update the package to its latest version! If you want to check the version of a package you installed, you may use:

# returns tidyr package version
packageVersion("tidyr")

The RStudio IDE also provides a convenient way to check whether updates are available for installed packages, under Tools > Check for Package Updates…

Update all your packages in a few clicks using RStudio
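If you prefer working from the console, base R also ships equivalent commands. A minimal sketch:

# list installed packages for which a newer version is available on CRAN
old.packages()
# update all of them without prompting for confirmation
update.packages(ask = FALSE)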

Last but not least: how to use a package now that it is installed :) You may either prefix a function with the name of the package it comes from:

stringr::str_replace("Hello world!", "Hello", "Hi")

Or run the following command to load all the package’s functions at once:

# load a package: it will throw an error if package is not installed
library(stringr)

Now you’re ready to go!

If you want to learn basically everything about R package development, I highly recommend Hadley Wickham’s R Packages book (free online version).

Fetching data

Fetching data is often the starting point of a data science project: data can be located in a database, an Excel spreadsheet, a comma-separated values (csv) file… it is essential to be able to read it regardless of its format, and avoid headaches before even starting to work with the data!

  • When data is located in a .csv file or any delimited-values file

The readr package provides functions that are up to 10 times faster than base R functions to read rectangular data.

Great R packages usually have a dedicated hex sticker: https://github.com/rstudio/hex-stickers

Convenient methods exist for reading and writing standard .csv files as well as custom files with a custom values separation symbol:

# read csv data delimited using comma (,)
input_data <- readr::read_csv("./input_data.csv")
# read csv data delimited using semi-colon (;)
input_data <- readr::read_csv2("./input_data.csv")
# read txt data delimited by a custom symbol (||)
input_data <- readr::read_delim("./input_data.txt", delim = "||")
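readr also ships the matching writer functions, so you can save your results in the same formats. A small sketch (the output file name is just a placeholder):

# write a data frame to a comma-delimited csv file
readr::write_csv(input_data, "./output_data.csv")
# write using a semi-colon as delimiter
readr::write_csv2(input_data, "./output_data.csv")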

In addition to good-looking stickers, great R packages also have cheat sheets you can refer to!

  • When data is located in an Excel file

Microsoft Excel has its own file formats (.xls and .xlsx) and is very commonly used to store and edit data. The readxl package enables efficient reading of these files into R; you can even read one specific sheet:

# read a specific sheet from an Excel workbook
input_data <- readxl::read_excel("input_data.xlsx", sheet = "page2")
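If you are unsure which sheets a workbook contains, you can list them first (a small sketch, with the same placeholder file name):

# list the sheets available in the Excel workbook
readxl::excel_sheets("input_data.xlsx")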

  • When data is located in a database or in the cloud

When it comes to fetching data from databases, DBI makes it possible to connect to any server, as long as you provide the required credentials, and run SQL queries to fetch data. Because there are many different databases and ways to connect depending on your technical stack, I suggest that you refer to the complete documentation provided by RStudio to find the steps that suit your needs: Databases using R.
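As a minimal sketch, here is what a typical DBI workflow looks like, using a local SQLite file through the RSQLite driver; the driver, connection details, file and table names are placeholders you would swap for your own stack:

library(DBI)

# connect to the database (here a local SQLite file, for illustration)
con <- dbConnect(RSQLite::SQLite(), dbname = "my_database.sqlite")

# run a SQL query and fetch the result as a data frame
input_data <- dbGetQuery(con, "SELECT * FROM my_table LIMIT 100")

# always close the connection when you are done
dbDisconnect(con)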

Make sure to check if a package exists to connect to your favorite cloud services provider! For example, bigrquery enables fetching data from Google BigQuery platform.

Wrangling data

You may have noticed that a lot of the previously mentioned packages are part of the tidyverse. This collection of packages forms a powerful toolbox that you can leverage throughout your data science projects. Mastering these packages is key to becoming super efficient with R.

The pipe operator shipped with the magrittr package is a game changer: https://github.com/tidyverse/magrittr

Data wrangling is made easy using the pipe operator, whose goal is simply to pipe left-hand values into right-hand expressions:

library(magrittr)  # provides the %>% pipe operator
# without pipe operator
paste("Hello", "world!")
# with pipe operator
"Hello" %>% paste("world!")

It may not seem obvious in this example, but this is a life-changing trick when you need to perform several sequential operations to a given object, typically a data frame.

Data frames usually contain your input data, making them the R objects you probably work with the most. dplyr is a package that provides useful functions to edit, filter, rearrange or join data frames.

library(dplyr)
# mtcars is a toy data set shipped with base R
# create a column
mtcars <- mtcars %>% mutate(vehicle = "car")
# filter on a column
mtcars <- mtcars %>% filter(cyl >= 6)
# create a column AND filter on a column
mtcars <- mtcars %>%
  mutate(vehicle = "car") %>%
  filter(cyl >= 6)

Now you should understand my point about the power of the pipe operator :)
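dplyr also covers the rearranging and joining mentioned above. Here is a small sketch, where cyl_labels is a made-up lookup table used only for the example:

# rearrange rows by descending fuel consumption
mtcars %>% arrange(desc(mpg))

# join a (made-up) lookup table on the cyl column
cyl_labels <- data.frame(cyl = c(4, 6, 8),
                         engine = c("small", "medium", "large"))
mtcars %>% left_join(cyl_labels, by = "cyl")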

There is so much more to say about data wrangling that you can find entire books discussing the topic, such as Data Wrangling with R. In addition, a key work on leveraging tidyr functionalities is R for Data Science. A free online version of the latter can be found here. Please note that these are Amazon affiliate links, so I will receive a commission if you decide to buy the books.

Visualization

One of the main reasons R is a very good choice for data science projects may well be ggplot2. This package makes it easy, and even fun, to build visualizations that look good and convey a lot of information.

You may find inspiration in this Top 50 ggplot2 visualizations article: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

ggplot2 is also part of the tidyverse collection, which is why it works seamlessly with the shapes of data you typically obtain after tidyr or dplyr wrangling operations. Plotting histograms and scatter plots takes only a few lines, and many additional elements can then be layered on to enhance your plots.
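As a quick illustration, here is a scatter plot built from the mtcars data set used earlier; the aesthetics and labels are of course arbitrary choices:

library(ggplot2)

# scatter plot of weight against fuel consumption, colored by cylinder count
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Fuel consumption by weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders") +
  theme_minimal()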

Machine learning

Another very convenient package is caret, which wraps up a lot of methods typically used in machine learning workflows. From data preparation to model training and performance assessment, you will find everything you need when working on predictive analytics tasks.

I recommend reading the caret chapter about model training where this key task is discussed. Here is a very simple example of how to train a logistic regression:

library(dplyr)
# say we want to predict irises having a big petal width
# (caret expects a factor outcome for classification)
observations <- iris %>%
  mutate(y = factor(ifelse(Petal.Width >= 1.5, "big", "small"))) %>%
  select(-Petal.Width)
# set up a 10-fold cross-validation
train_control <- caret::trainControl(method = "cv",
                                     number = 10,
                                     savePredictions = TRUE,
                                     classProbs = TRUE)
# make it reproducible and train the model
set.seed(123)
model <- caret::train(y ~ .,
                      data = observations,
                      method = "glm",
                      trControl = train_control,
                      metric = "Accuracy")
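Once trained, the model object holds the cross-validated results and can be used directly for predictions. A short sketch, reusing the observations created above:

# cross-validated accuracy across the 10 folds
print(model)
# predicted classes (here on the training set, for illustration only)
predictions <- predict(model, newdata = observations)
# confusion matrix against the true labels
caret::confusionMatrix(predictions, observations$y)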

Final words

Thanks a lot for reading my very first article on Medium! I feel like there is so much more to say in each section, as I did not talk about other super useful packages such as boot, shiny, shinydashboard, pbapply… Please share your thoughts in the comments; I am very interested in feedback on what you would like to see explored in future articles.

Useful documentation and references
