Essential R packages for data science projects
There are more than 16,000 packages on the Comprehensive R Archive Network (CRAN), covering most of the methods commonly used in data science projects. Time runs fast, and it may take days to code functionality for sometimes basic tasks… Fortunately, we can leverage many packages and focus on what is essential for a project to be successful!
Quick reminder: install and use packages
The most common way is to install a package directly from CRAN using the following R command:
# this command installs the tidyr package from CRAN
install.packages("tidyr")
Once the package is installed on your local machine, you don’t need to run this command again, unless you want to update the package with its latest version! If you want to check the version of a package you installed, you may use:
# returns the installed tidyr package version
packageVersion("tidyr")
The RStudio IDE also provides a convenient way to check whether updates are available for installed packages, under Tools > Check for Package Updates…
Last but not least: how to use a package now that it is installed :) You may either prefix a function with its package name:
stringr::str_replace("Hello world!", "Hello", "Hi")
Or run the following command to load all the package’s functions at once:
# load a package: this throws an error if the package is not installed
library(stringr)
Now you’re ready to go!
If you want to learn basically everything about R package development, I highly recommend Hadley Wickham's R Packages book (free online version).
Fetching data is often the starting point of a data science project: data can be located in a database, an Excel spreadsheet, a comma-separated values (csv) file… It is essential to be able to read it regardless of its format, and to avoid headaches before even starting to work with the data!
- When data is located in a .csv file or any delimited-values file
The readr package provides functions that are up to 10 times faster than base R functions for reading rectangular data.
Convenient methods exist for reading and writing standard .csv files as well as files using a custom delimiter:
# read csv data delimited using comma (,)
input_data <- readr::read_csv("./input_data.csv")
# read csv data delimited using semi-colon (;)
input_data <- readr::read_csv2("./input_data.csv")
# read txt data delimited using another single character, e.g. a pipe (|)
input_data <- readr::read_delim("./input_data.txt", delim = "|")
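Each reader has a matching writer, so round-tripping data through disk is symmetric. Here is a minimal sketch using a toy data frame and a temporary file (the `output_data` name and the paths are illustrative):

```r
library(readr)

# a small data frame to round-trip through disk
output_data <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")

# write_csv() mirrors read_csv(): comma-separated, no row names added
write_csv(output_data, path)
# write_csv2() mirrors read_csv2(): semi-colon separated, decimal comma
write_csv2(output_data, tempfile(fileext = ".csv"))

# reading the file back returns the same rows and columns
round_trip <- read_csv(path)
```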
In addition to good-looking hex stickers, great R packages also have cheat sheets you can refer to!
- When data is located in an Excel file
Microsoft Excel has its own file formats (.xls and .xlsx) and is very commonly used to store and edit data. The readxl package enables efficient reading of these files into R; you can even read a single specific spreadsheet:
# read Excel spreadsheets
input_data <- readxl::read_excel("input_data.xlsx", sheet = "page2")
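If you don't know the sheet names upfront, readxl can list them before you pick one. The sketch below uses one of the example workbooks shipped with readxl so it runs anywhere; with your own file, pass its path instead:

```r
library(readxl)

# readxl ships example workbooks; readxl_example() returns a local path
path <- readxl_example("datasets.xlsx")

# list all spreadsheet names before deciding which one to read
sheet_names <- excel_sheets(path)

# sheets can also be selected by position instead of by name
input_data <- read_excel(path, sheet = 1)
```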
- When data is located in a database or in the cloud
When it comes to fetching data from databases, DBI makes it possible to connect to any server, as long as you provide the required credentials, and to run SQL queries to fetch data. Because there are many different databases and ways to connect depending on your technical stack, I suggest that you refer to the complete documentation provided by RStudio to find the steps that suit your needs: Databases using R.
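To give a feel for the DBI workflow without real credentials, here is a sketch against an in-memory SQLite database (it assumes the RSQLite package is installed; for a real server you would swap in a driver such as RPostgres or odbc plus your host and credentials):

```r
library(DBI)

# connect to a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy a toy data frame into the database, then query it with plain SQL
dbWriteTable(con, "mtcars", mtcars)
result <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

# always release the connection when you are done
dbDisconnect(con)
```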
Make sure to check whether a package exists to connect to your favorite cloud services provider! For example, bigrquery enables fetching data from the Google BigQuery platform.
You may have noticed that a lot of the previously mentioned packages are part of the tidyverse. This collection of packages forms a powerful toolbox that you can leverage throughout your data science projects. Mastering these packages is key to becoming super efficient with R.
Data wrangling is made easy using the pipe operator %>% (from the magrittr package, re-exported by dplyr), whose goal is simply to pipe left-hand values into right-hand expressions:
# without pipe operator
paste("Hello", "world!")
# with pipe operator
"Hello" %>% paste("world!")
It may not seem obvious in this example, but this is a life-changing trick when you need to perform several sequential operations to a given object, typically a data frame.
Data frames usually contain your input data, making them the R objects you will probably work with the most.
dplyr is a package that provides useful functions to edit, filter, rearrange or join data frames.
# mtcars is a toy data set shipped with base R
# create a column
mtcars <- mtcars %>% mutate(vehicle = "car")
# filter on a column
mtcars <- mtcars %>% filter(cyl >= 6)
# create a column AND filter on a column
mtcars <- mtcars %>%
  mutate(vehicle = "car") %>%
  filter(cyl >= 6)
Now you should understand my point about the power of the pipe operator :)
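The snippet above covers creating and filtering columns; dplyr's rearranging and joining verbs follow the same pattern. A minimal sketch, using two illustrative toy data frames (`specs` and `prices` are invented for the example):

```r
library(dplyr)

# arrange: sort rows, here by descending horsepower
sorted <- mtcars %>% arrange(desc(hp))

# join: combine two data frames on a shared key column
specs   <- data.frame(id = 1:3, model = c("A", "B", "C"))
prices  <- data.frame(id = c(1, 3), price = c(10000, 15000))

# left_join keeps every row of specs; unmatched rows get NA prices
catalog <- specs %>% left_join(prices, by = "id")
```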
There is so much more to say about data wrangling that you can find entire books discussing the topic, such as Data Wrangling with R. In addition, a key work on leveraging tidyverse functionality is R for Data Science. A free online version of the latter can be found here. Please note that these are Amazon affiliate links, so I will receive a commission if you decide to buy the books.
One of the main reasons R is a very good choice for data science projects may be ggplot2. This package makes it easy, and eventually fun, to build visualizations that look good and convey a lot of information.
ggplot2 is also part of the tidyverse collection, which is why it works perfectly with the shapes of data you typically obtain after dplyr data wrangling operations. Plotting histograms and scatter plots is quick to pick up, and many additional elements can then be layered on to enhance your plots.
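As a minimal sketch using the built-in mtcars data set, here is a scatter plot with labels and a color mapping, plus the one-liner histogram mentioned above (the aesthetic choices are illustrative, not a recommendation):

```r
library(ggplot2)

# scatter plot: weight vs. fuel consumption, colored by cylinder count
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders", title = "Heavier cars burn more fuel")

# a histogram only needs an x aesthetic
h <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# plots are regular R objects: print() (or just typing the name) draws them
```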
Another very convenient package is caret, which wraps up a lot of methods typically used in machine learning workflows. From data preparation to model training and performance assessment, you will find everything you need when working on predictive analytics tasks.
I recommend reading the caret chapter about model training, where this key task is discussed. Here is a very simple example of how to train a logistic regression:
library(dplyr)
# say we want to predict iris having a big petal width
observations <- iris %>%
  mutate(y = factor(ifelse(Petal.Width >= 1.5, "big", "small"))) %>%
  select(-Petal.Width)
# set up a 10-fold cross-validation
train_control <- caret::trainControl(method = "cv",
                                     number = 10,
                                     savePredictions = TRUE,
                                     classProbs = TRUE)
# make it reproducible and train the model
set.seed(123)
model <- caret::train(y ~ .,
                      data = observations,
                      method = "glm",
                      trControl = train_control,
                      metric = "Accuracy")
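Once a model is trained, caret also covers the assessment side. A self-contained sketch (it rebuilds the same toy task so it runs on its own; the seed value is arbitrary): because savePredictions = TRUE kept the held-out fold predictions, we can build a confusion matrix from them, and predict() returns classes or probabilities for new observations.

```r
library(caret)
library(dplyr)

# rebuild the toy task: classify iris flowers by petal width
observations <- iris %>%
  mutate(y = factor(ifelse(Petal.Width >= 1.5, "big", "small"))) %>%
  select(-Petal.Width)

set.seed(123)
model <- train(y ~ ., data = observations, method = "glm",
               trControl = trainControl(method = "cv", number = 10,
                                        savePredictions = TRUE,
                                        classProbs = TRUE))

# confusion matrix built from the held-out predictions of the 10 folds
cm <- confusionMatrix(model$pred$pred, model$pred$obs)

# class probabilities for new observations
probs <- predict(model, newdata = head(observations), type = "prob")
```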
Thanks a lot for reading my very first article on Medium! I feel like there is so much more to say in each section, as I did not talk about other super useful packages such as pbapply… Please share your thoughts in the comments; I am very interested in feedback on what you would like to see explored in future articles.