R: dplyr library

To use this library, install it once and then load it each time you start R or RStudio


install.packages("dplyr") # Do it once
library(dplyr)   # Do it every time starting script
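
When a script is shared with other people, the install step can be made conditional so that it runs only if the package is missing. The following is a minimal sketch that relies on base R's requireNamespace(), which is not used in the original text:

```r
# Install dplyr only if it is not already available, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)
```

This keeps the one-time install and the per-session load in the same script without reinstalling on every run.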

Basic functions from dplyr in R

For these examples, I will use the built-in dataset airquality, and I will usually show only the first 5 rows of the data with the function head( … , n = 5)

filter() - filter original dataset

filter() keeps only the rows that satisfy the given conditions. For example, let’s remove all rows with Temp below 80 degrees


> airT80 = filter(airquality, (Temp >= 80))
> head(airT80)
#  Ozone Solar.R Wind Temp Month Day
#1    45     252 14.9   81     5  29
#2    NA     186  9.2   84     6   4
#3    NA     220  8.6   85     6   5
#4    29     127  9.7   82     6   7
#5    NA     273  6.9   87     6   8
#6    71     291 13.8   90     6   9
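
filter() also accepts several conditions at once, and it drops rows where a condition evaluates to NA. A short sketch of both behaviours follows; the variable names hot_july and high_ozone are made up for illustration:

```r
library(dplyr)

# Several conditions separated by commas are combined with AND
hot_july <- filter(airquality, Temp >= 80, Month == 7)

# Rows where the condition is NA are dropped, so this keeps only
# rows with a recorded Ozone value above 100
high_ozone <- filter(airquality, Ozone > 100)

head(hot_july, n = 5)
head(high_ozone, n = 5)
```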

mutate() and transmute() - calculate new columns

Calculate a new column from existing columns and add it to the dataset with mutate(), or keep only the new column with transmute()

In our dataset, all temperatures are given in Fahrenheit. Therefore we will convert them to Celsius.


> airTc = mutate(airquality, TempC = round((Temp - 32) *5 / 9, digits = 1))
> head(airTc, n = 5)
#  Ozone Solar.R Wind Temp Month Day TempC
#1    41     190  7.4   67     5   1  19.4
#2    36     118  8.0   72     5   2  22.2
#3    12     149 12.6   74     5   3  23.3
#4    18     313 11.5   62     5   4  16.7
#5    NA      NA 14.3   56     5   5  13.3


> airTc1 = transmute(airquality, TempC = round((Temp - 32) *5 / 9, digits = 1))
> head(airTc1, n = 5)
#  TempC
#1    19.4
#2    22.2
#3    23.3
#4    16.7
#5    13.3
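
Two details may help here: a column created inside mutate() can be used by later arguments of the same call, and transmute() behaves like mutate() followed by select() on the new columns. The sketch below illustrates both; the Warm column and its threshold of 25 degrees are hypothetical and chosen only for the example:

```r
library(dplyr)

# A column created in mutate() can be used by later arguments in the same call
air_flag <- mutate(
  airquality,
  TempC = round((Temp - 32) * 5 / 9, digits = 1),
  Warm  = TempC >= 25   # hypothetical threshold, for illustration only
)

# transmute() keeps only the columns it is given,
# like mutate() followed by select() on the new column
only_new <- transmute(airquality, TempC = round((Temp - 32) * 5 / 9, digits = 1))
same     <- select(air_flag, TempC)

head(air_flag, n = 5)
identical(only_new$TempC, same$TempC)
```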

select() - select columns from datasets

If we need only a few columns from our dataset, we can use the select() function


> airSe = select(airTc, Ozone, Wind, TempC, Day)
> head(airSe, n = 5)
#  Ozone Wind TempC Day
#1    41  7.4  19.4   1
#2    36  8.0  22.2   2
#3    12 12.6  23.3   3
#4    18 11.5  16.7   4
#5    NA 14.3  13.3   5
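
Besides listing columns by name, select() understands ranges, negative selection and helper functions such as starts_with(). A brief sketch on the original airquality columns, with variable names invented for the example:

```r
library(dplyr)

# A range of adjacent columns, from Ozone through Wind
first_three <- select(airquality, Ozone:Wind)

# Everything except the named columns
no_date <- select(airquality, -Month, -Day)

# Columns whose names start with a given prefix
temp_cols <- select(airquality, starts_with("Temp"))

names(first_three)
names(no_date)
names(temp_cols)
```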

summarise() - generate summaries from datasets

summarise() calculates summary statistics from a dataset according to the given parameters. Before doing this, make sure that missing values are handled appropriately. In our example, we exclude NA values by passing na.rm = TRUE


> summarise(airquality, av_temp = mean(Temp, na.rm = TRUE))
#   av_temp
#1 77.88235

And now we will calculate the average temperature for every month by grouping the data by month. We will also round the average temperature to one decimal place


> summarise(group_by(airquality, Month), av_temp = round(mean(Temp, na.rm = TRUE), digits = 1))
## A tibble: 5 × 2
#  Month av_temp
#  <int>   <dbl>
#1     5    65.5
#2     6    79.1
#3     7    83.9
#4     8    84  
#5     9    76.9
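
The effect of na.rm is easier to see on a column that really contains missing values, such as Ozone. The sketch below also computes several summaries per group in one call; the output column names are made up for the example:

```r
library(dplyr)

# Ozone contains missing values, so the plain mean is NA
summarise(airquality, av_ozone = mean(Ozone))

# With na.rm = TRUE the missing values are skipped
summarise(airquality, av_ozone = mean(Ozone, na.rm = TRUE))

# Several summaries per group: n() counts rows,
# sum(is.na()) counts missing values
by_month <- summarise(
  group_by(airquality, Month),
  n_days    = n(),
  n_missing = sum(is.na(Ozone)),
  av_ozone  = round(mean(Ozone, na.rm = TRUE), digits = 1)
)
by_month
```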

arrange() for sorting data by one or more columns

In this example we will sort our dataset by Day in descending order, and within the same day we will sort the data by Month


> air_ar = arrange(airquality, desc(Day), Month)
> head(air_ar, n = 5)
#  Ozone Solar.R Wind Temp Month Day
#1    37     279  7.4   76     5  31
#2    59     254  9.2   81     7  31
#3    85     188  6.3   94     8  31
#4   115     223  5.7   79     5  30
#5    NA     138  8.0   83     6  30
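
When the sorting column contains missing values, arrange() places those rows at the end, also when desc() is used. A small sketch on the Ozone column, with a variable name invented for the example:

```r
library(dplyr)

# Highest Ozone first; rows with NA in Ozone end up at the bottom
by_ozone <- arrange(airquality, desc(Ozone))

head(by_ozone, n = 5)
tail(by_ozone, n = 5)
```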

sample_n() and sample_frac() for random sampling

For some tests, for example when an independent subset of the data is needed, we can randomly sample part of the dataset. We can take an exact number of rows with sample_n() or a fraction of the dataset with sample_frac()


> sample_n(airquality, size = 5)
#  Ozone Solar.R Wind Temp Month Day
#1    46     237  6.9   78     9  16
#2    NA     135  8.0   75     6  25
#3    16      77  7.4   82     8   3
#4    29     127  9.7   82     6   7
#5    18     224 13.8   67     9  17


> sample_frac(airquality, size = 0.05)
#  Ozone Solar.R Wind Temp Month Day
#1    21     230 10.9   75     9   9
#2    NA      NA  8.0   57     5  27
#3    46     237  6.9   78     9  16
#4    NA     194  8.6   69     5  10
#5   135     269  4.1   84     7   1
#6    78     197  5.1   92     9   2
#7    13     238 12.6   64     9  21
#8    64     253  7.4   83     7  30
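
Because the rows are drawn at random, each run returns a different sample unless the random seed is fixed. The sketch below shows this with set.seed(); the seed value 42 is arbitrary. It also shows slice_sample(), which, assuming dplyr 1.0.0 or later, supersedes both functions through its n and prop arguments:

```r
library(dplyr)

# Fixing the seed makes the random sample repeatable
set.seed(42)   # arbitrary seed value, for illustration only
s1 <- sample_n(airquality, size = 5)
set.seed(42)
s2 <- sample_n(airquality, size = 5)
identical(s1, s2)

# slice_sample() covers both cases (assumes dplyr >= 1.0.0)
set.seed(42)
by_count <- slice_sample(airquality, n = 5)
by_share <- slice_sample(airquality, prop = 0.05)
```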

count() for counting data

The count() function counts rows, optionally grouped by one or more columns. Let’s count how many records we have for each month


> count(airquality, Month)
#  Month  n
#1     5 31
#2     6 30
#3     7 31
#4     8 31
#5     9 30
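
count() is essentially shorthand for group_by() followed by summarise() with n(). The sketch below shows that equivalence together with two options of count(); the Hot column name is hypothetical:

```r
library(dplyr)

# count() gives the same numbers as group_by() plus summarise(n = n())
by_hand <- summarise(group_by(airquality, Month), n = n())

# sort = TRUE puts the largest groups first
sorted <- count(airquality, Month, sort = TRUE)

# A named expression can be counted like a column
hot_days <- count(airquality, Month, Hot = Temp >= 80)

by_hand
sorted
hot_days
```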

Pipe %>% operations in dplyr

The pipe operator simplifies standard step-by-step operations by removing intermediate datasets. Let’s calculate the average temperature for each of the selected months

Traditional way without pipe

In this approach we first select the required months, then group the data by month, and finally calculate the summary


> fData = filter(airquality, Month %in% 7:9)   # %in% keeps every row from months 7 to 9
> gData = group_by(fData, Month)
> summarise(gData, av_T = mean(Temp, na.rm = TRUE))
## A tibble: 3 × 2
#  Month  av_T
#  <int> <dbl>
#1     7  83.9
#2     8  84.0
#3     9  76.9

Pipe %>% for this task

In fact, we do not need the intermediate results, and we can omit them by using the pipe


> airquality %>% 
    filter(Month %in% 7:9) %>% 
    group_by(Month) %>% 
    summarise(av_T = mean(Temp, na.rm = TRUE))
## A tibble: 3 × 2
#  Month  av_T
#  <int> <dbl>
#1     7  83.9
#2     8  84.0
#3     9  76.9
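
When a column is matched against several values, == and %in% give different results: == compares element by element and recycles the shorter vector, while %in% tests each row against the whole set. A short sketch of the difference, with variable names invented for the example:

```r
library(dplyr)

# == recycles 7:9 along the column, so only rows that happen
# to line up with the recycled values are kept
n_eq <- nrow(filter(airquality, Month == 7:9))

# %in% tests each row against the whole set of values
n_in <- nrow(filter(airquality, Month %in% 7:9))

c(equal = n_eq, in_set = n_in)
```

Since airquality has 153 rows, a multiple of 3, the recycling in the first call happens without any warning, which makes the mistake easy to miss.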