Tidyverse R: Data Analysis

What is Tidyverse

A collection of R packages: To handle/manipulate/clean and visualize datasets. A few examples of tidyverse packages:

dplyr: provides functions mostly for data manipulation/data wrangling
tidyr: provides functions mostly for reshaping data
readr: provides functions mostly for reading data from a variety of sources
ggplot2: provides functions mostly for creating data visualizations

Why Tidyverse?

Why Tidyverse

Consistent format: Makes it easy to learn and understand
Efficient: Efficient workflow for data wrangling tasks. The packages work together seamlessly
Active & warm community: A large, active & supportive community of Tidyverse users and developers.

Examples & applications of data wrangling tasks?

Examples & applications of data wrangling tasks

Cleaning and formatting data. Functions such as:
- Selecting relevant columns
- Filtering observations
- Summarizing
Merging/joining data

Setting up

Tidyverse

Loading the tidyverse package will load all of its core packages

# install.packages("tidyverse")
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Today’s data

Data description

Amazon Sales Dataset: From Kaggle

Reading data

The readr package provides functions for reading a variety of data formats (eg: cvs & txt). Can be loaded using tidyverse.
The readxl package provides functions for reading Microsoft Excel files. Required to load readxl explicitly, because it is not a core tidyverse package loaded via library(tidyverse).

Loading a csv file

library(tidyverse)

## Data read
df <- read_csv("data/amazon.csv")

Basic data manipulation

Pipe operator

Pipe operator: |>
- Helps to write more concise and readable code.
- Works by chaining together functions, so that the output of one function is passed as the input to the next function.

Select columns

Use the select() function. The select() function takes a vector of column names as its argument. To select a column:

df |> 
  select(product_id)

Select multiple columns

Specifying each of the column:

# Select the product_name and rating columns

df |>  
  select(product_name, rating)

Select multiple columns

To select a range of columns, simply use :

# Select columns from category to discount_percentage

df |> 
  select(category:discount_percentage)

Select multiple columns

To exclude specific columns and select all others, use -

# Select all columns except product_name, category, about_product and review_content

df |> 
  select(-c(product_name, category, about_product, review_content))

Filter

The filter() function, filtering rows based on a logical condition.

What we will cover today:

Operators ==, !=, \<, \>, \<=, \>=:
- Equality (==)
- Inequality (!=)
- Less than (<)
- Greater than (>)
- Less than or equal to (<=)
- Greater than or equal to (>=)

Filter: Equality

# Show only ratings of 5 in the rating column

df |> 
  filter(rating == 5)

Filter: Inequality

Filter the rating column to show only ratings that are not 5

df |> 
  filter(rating != 5)

Filter: Less than

Filter the rating column to show only ratings that are less than 4

df |> 
  filter(rating < 4)

Filter: Greater than

Filter the rating column to show only ratings that are greater than 4

df |> 
  filter(rating > 4)

Filter: Less than or equal to

Your turn:

To filter the rating column to show only ratings of 4 and below

df |> 
  filter(rating <= 4)

Filter: Greater than or equal to

Your turn:

Filter the rating column to show only ratings of 4 and above

df |> 
  filter(rating >= 4)

Exercise Part 1

Save a new tibble with:
- Selected only category, actual_price and rating columns
- Filtered only actual_price of ₹1,000

Solution for: Exercise Part 1

df_filtered <- df |> 
  select(category, actual_price, rating) |> 
  filter(actual_price == "₹1,000")

Mutate

The mutate() function, creating new variables or changing existing ones.

What we will cover today:

Creating a new variable
Modifying a variable

Mutate: Creating a new variable

Creating a new column called source

df |> 
  mutate(source = "amazon")

Mutate: Creating a new variable. Conditional Variable

Applying a condition using if_else

if_else(condition, value_true, value_true)

condition: Condition to evaluate.
value_true: The value if the condition is TRUE.
value_false: The value if the condition is FALSE

Mutate: Creating a new variable. Conditional Variable

df |> 
  mutate(rating_remark = if_else(rating > 4,
                                 "Rating is good",
                                 "Rating is bad"))

Mutate: Creating a new variable

Applying condition(s) using case_when

case_when(condition1 ~ value1,
          condition2 ~ value2,
          condition3 ~ value3,
          ....)

Mutate: Creating a new variable. Multiple conditions

Now, let’s add another condition using is_na() to handle cases where the rating is missing:

The first condition checks if the rating is greater than 4 and assigns the value “Rating is good” accordingly.
The second condition checks if the rating is 4 or less and assigns the value “Rating is bad”.
The third condition uses is.na() to check if the rating is missing and assigns the value “Missing rating”.

Mutate: Creating a new variable. Multiple conditions (Cont)

df |> 
  mutate(rating_remark = case_when(rating > 4 ~ "Rating is good",
                                   rating <= 4 ~ "Rating is bad",
                                   is.na(rating) ~ "Missing rating"))

Mutate: Modifying a variable

Parsing Numbers with parse_number() in mutate()

Parsing numbers from actual_price column
parse_number() can be used to extract numeric values from character strings
parse_number() will convert character strings to numeric values, removing any non-numeric characters.
actual_price’s original format:

Mutate: Modifying a variable (Cont)

To modify the actual_price column

df |> 
  mutate(actual_price = parse_number(actual_price))