Data Manipulation with R: Tidyverse Dplyr
R is one of the most popular languages for data manipulation. In this article, we look at how to manipulate data with tidyverse, a collection of packages that is an indispensable tool for data scientists working in R, and at what can be done with its dplyr package. Let's start.
Just as NumPy and pandas are indispensable for data manipulation in Python, tidyverse plays that role in R. To install tidyverse in RStudio, we first run install.packages("tidyverse") to download it to our environment, and then load it with library(tidyverse). In this article, we will work with the famous iris dataset. Since it ships with R, we can open it directly with View(iris).
Introduction to Data Manipulation with Package Functions
After the installation process, we will use functions one by one to introduce data manipulation with R.
iris_new <- iris %>% select(Sepal.Length , Sepal.Width)
With the select function above, we picked two columns, sepal length and sepal width. The %>% (pipe) operator passes iris to select(), and the <- operator assigns the result to a new data set. As a result, the data set called iris_new contains only two columns. Let's say we want to do the same operation on rows and select the first 10 rows. For this, we write iris %>% slice(1:10). The slice function lets us select rows by position.
Let's say we want the rows with the smallest and largest values of a given variable, and let's take the sepal length variable as an example. The code below returns the rows with the 5 smallest and the 5 largest values in that column.
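As a minimal sketch of the two ideas above, assuming dplyr is installed (it is attached by library(tidyverse)), the piped call can be compared with the equivalent direct call, and slice() can be checked by counting rows. The names piped, direct, and first_ten are illustrative only.

```r
library(dplyr)  # attached automatically by library(tidyverse)

# The pipe passes `iris` as the first argument of select(),
# so these two calls build the same two-column data frame.
piped  <- iris %>% select(Sepal.Length, Sepal.Width)
direct <- select(iris, Sepal.Length, Sepal.Width)
identical(piped, direct)

# slice() picks rows by position; here, the first 10 rows.
first_ten <- iris %>% slice(1:10)
dim(first_ten)
```

Writing the call with the pipe pays off mainly when several steps are chained, because each step then reads from left to right.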
iris %>% slice_min(order_by = Sepal.Length , n = 5)
iris %>% slice_max(order_by = Sepal.Length , n = 5)
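One detail worth knowing is how these functions treat equal values. The sketch below, which assumes dplyr is installed and uses illustrative variable names, shows the with_ties argument of slice_min(); slice_max() accepts the same argument.

```r
library(dplyr)

# Rows with the 5 smallest and 5 largest Sepal.Length values.
smallest <- iris %>% slice_min(order_by = Sepal.Length, n = 5)
largest  <- iris %>% slice_max(order_by = Sepal.Length, n = 5)

# Ties are kept by default, so asking for 3 rows can return more
# when several flowers share the same length.
ties_kept <- iris %>% slice_min(order_by = Sepal.Length, n = 3)

# with_ties = FALSE caps the result at exactly n rows.
ties_dropped <- iris %>%
  slice_min(order_by = Sepal.Length, n = 3, with_ties = FALSE)

nrow(ties_kept)
nrow(ties_dropped)
```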
Let's say we want to draw 30 random rows. To sample observations at random from the data set, we write the following code.
iris %>% slice_sample(n = 30)
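Because the rows are drawn at random, each run returns a different sample. A small sketch of how to make the draw repeatable with base R's set.seed(), assuming dplyr is installed; the seed value 42 is arbitrary and the variable names are illustrative.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, chosen only for this illustration
sample_a <- iris %>% slice_sample(n = 30)

set.seed(42)  # same seed again, so the same rows are drawn
sample_b <- iris %>% slice_sample(n = 30)

identical(sample_a, sample_b)

# prop draws a share of the rows instead of a fixed count.
tenth <- iris %>% slice_sample(prop = 0.1)
nrow(tenth)
```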
To examine the unique values of a categorical variable, we pass that variable to distinct(). For the Species variable, we write the iris %>% distinct(Species) code. The distinct function shows each value once, without repetition.
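A short sketch of distinct() and two closely related options, assuming dplyr is installed:

```r
library(dplyr)

# One row per unique species; the other columns are dropped.
iris %>% distinct(Species)

# .keep_all = TRUE keeps the first full row found for each species.
iris %>% distinct(Species, .keep_all = TRUE)

# n_distinct() counts the unique values without listing them.
n_distinct(iris$Species)
```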
When we want to sort the values, we use the arrange() function. If we want to sort the rows by the sepal width variable from smallest to largest, it is enough to write the iris %>% arrange(Sepal.Width) code. If we want to sort by two variables together, we can write iris %>% arrange(Sepal.Width ,Sepal.Length) as in the example. Because Sepal.Width is given first, the rows are sorted by sepal width first, and sepal length is used only to order rows that have the same sepal width.
Let's say we want to get basic statistical values to look at the dataset and get a quick idea. For this we use the summarise() function. If we want to see values such as the mean, median and standard deviation of the sepal length variable, we can run the code below.
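The sorting described above goes from smallest to largest. As a hedged sketch, assuming dplyr is installed, the desc() helper reverses the order for a single column; the names by_width and widest_first are illustrative.

```r
library(dplyr)

# Ascending by Sepal.Width; rows with equal widths are ordered by Sepal.Length.
by_width <- iris %>% arrange(Sepal.Width, Sepal.Length)
head(by_width, 3)

# desc() reverses the order for one column: widest sepals first.
widest_first <- iris %>% arrange(desc(Sepal.Width))
head(widest_first, 3)
```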
iris %>% summarise(Mean = mean(Sepal.Length) ,
                   Median = median(Sepal.Length) ,
                   Sd = sd(Sepal.Length))
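The result of summarise() is itself a one-row data frame, which the following sketch makes visible. It assumes dplyr is installed; the extra N column, built with dplyr's n(), and the name stats are additions for illustration.

```r
library(dplyr)

stats <- iris %>%
  summarise(
    Mean   = mean(Sepal.Length),
    Median = median(Sepal.Length),
    Sd     = sd(Sepal.Length),
    N      = n()   # number of rows that went into the summary
  )
stats

# The result is an ordinary one-row data frame,
# so a single value can be pulled out by column name.
stats$Mean
```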
When we want to perform an operation similar to an SQL query, the filter() function comes into play. Let's say we want to see the rows where sepal length is less than 5 and sepal width is greater than 3. For this, it is enough to write the following statement.
iris %>% filter(Sepal.Length < 5 , Sepal.Width > 3)
Conditions separated by a comma must all hold. If we want the rows that satisfy at least one of the conditions rather than both, we join them with the | (or) operator and write the following code.
iris %>% filter(Sepal.Length < 5 | Sepal.Width > 3)
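To see the difference between the two forms, the sketch below counts the rows each one returns. It assumes dplyr is installed, and the variable names are illustrative.

```r
library(dplyr)

both    <- iris %>% filter(Sepal.Length < 5, Sepal.Width > 3)   # comma means AND
both_op <- iris %>% filter(Sepal.Length < 5 & Sepal.Width > 3)  # same thing, written with &
either  <- iris %>% filter(Sepal.Length < 5 | Sepal.Width > 3)  # OR

identical(both, both_op)

# The OR version can never return fewer rows than the AND version.
nrow(both)
nrow(either)
```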
Finally, let's consider the mutate() function. Let's say we want to see the natural logarithm of the values in the sepal length variable rather than the values themselves. For this, we can use the mutate() function as follows. In the returned table, the Sepal.Length column is replaced by its logarithm.
iris %>% mutate(Sepal.Length = log(Sepal.Length))
If we want the logarithm in a new column of the table and want to keep the original variable unchanged, we assign it to a new column name as follows.
iris %>% mutate(Sepal.LengthLog = log(Sepal.Length))
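A brief check of what mutate() does and does not change, as a sketch that assumes dplyr is installed; iris_log is an illustrative name.

```r
library(dplyr)

iris_log <- iris %>% mutate(Sepal.LengthLog = log(Sepal.Length))

# log() is the natural logarithm, so exp() recovers the original values.
all.equal(exp(iris_log$Sepal.LengthLog), iris$Sepal.Length)

# mutate() returns a new data frame; `iris` itself still has 5 columns.
ncol(iris)
ncol(iris_log)
```

To keep the new column for later steps, the result has to be assigned to a name, as done with iris_log here.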