Tidyverse R: Data Analysis

What is Tidyverse


A collection of R packages: To handle/manipulate/clean and visualize datasets. A few examples of tidyverse packages:

  • dplyr: provides functions mostly for data manipulation/data wrangling

  • tidyr: provides functions mostly for reshaping data

  • readr: provides functions mostly for reading data from a variety of sources

  • ggplot2: provides functions mostly for creating data visualizations

Why Tidyverse?

Why Tidyverse


  • Consistent format: Makes it easy to learn and understand
  • Efficient: Efficient workflow for data wrangling tasks. The packages work together seamlessly
  • Active & warm community: A large, active & supportive community of Tidyverse users and developers.

Examples & applications of data wrangling tasks?

Examples & applications of data wrangling tasks


  • Cleaning and formatting data. Functions such as:

    • Selecting relevant columns

    • Filtering observations

    • Summarizing

  • Merging/joining data

Setting up

Tidyverse

Loading the tidyverse package will load all of its core packages

# install.packages("tidyverse")
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Today’s data

Data description

Amazon Sales Dataset: From Kaggle

Reading data

Reading data


  • The readr package provides functions for reading a variety of data formats (eg: cvs & txt). Can be loaded using tidyverse.

  • The readxl package provides functions for reading Microsoft Excel files. Required to load readxl explicitly, because it is not a core tidyverse package loaded via library(tidyverse).

Loading a csv file





library(tidyverse)

## Data read
df <- read_csv("data/amazon.csv")

Basic data manipulation

Pipe operator


  • Pipe operator: |>
    • Helps to write more concise and readable code.
    • Works by chaining together functions, so that the output of one function is passed as the input to the next function.

Select columns

Use the select() function. The select() function takes a vector of column names as its argument. To select a column:


df |> 
  select(product_id)

Select multiple columns

Specifying each of the column:

# Select the product_name and rating columns

df |>  
  select(product_name, rating)

Select multiple columns

To select a range of columns, simply use :

# Select columns from category to discount_percentage

df |> 
  select(category:discount_percentage)


Select multiple columns

To exclude specific columns and select all others, use -

# Select all columns except product_name, category, about_product and review_content

df |> 
  select(-c(product_name, category, about_product, review_content))


Filter

The filter() function, filtering rows based on a logical condition.

What we will cover today:

  • Operators ==, !=, \<, \>, \<=, \>=:

    • Equality (==)

    • Inequality (!=)

    • Less than (<)

    • Greater than (>)

    • Less than or equal to (<=)

    • Greater than or equal to (>=)

Filter: Equality

# Show only ratings of 5 in the rating column

df |> 
  filter(rating == 5)


Filter: Inequality

Filter the rating column to show only ratings that are not 5

df |> 
  filter(rating != 5)


Filter: Less than

Filter the rating column to show only ratings that are less than 4

df |> 
  filter(rating < 4)


Filter: Greater than

Filter the rating column to show only ratings that are greater than 4

df |> 
  filter(rating > 4)


Filter: Less than or equal to

Your turn:

To filter the rating column to show only ratings of 4 and below

df |> 
  filter(rating <= 4)


Filter: Greater than or equal to

Your turn:

Filter the rating column to show only ratings of 4 and above

df |> 
  filter(rating >= 4)


Exercise Part 1

  • Save a new tibble with:

    • Selected only category, actual_price and rating columns

    • Filtered only actual_price of ₹1,000

Solution for: Exercise Part 1

df_filtered <- df |> 
  select(category, actual_price, rating) |> 
  filter(actual_price == "₹1,000")


Mutate

The mutate() function, creating new variables or changing existing ones.

What we will cover today:

  • Creating a new variable
  • Modifying a variable

Mutate: Creating a new variable

Creating a new column called source

df |> 
  mutate(source = "amazon")


Mutate: Creating a new variable. Conditional Variable

Applying a condition using if_else

if_else(condition, value_true, value_true)
  • condition: Condition to evaluate.
  • value_true: The value if the condition is TRUE.
  • value_false: The value if the condition is FALSE

Mutate: Creating a new variable. Conditional Variable


df |> 
  mutate(rating_remark = if_else(rating > 4,
                                 "Rating is good",
                                 "Rating is bad"))


Mutate: Creating a new variable

Applying condition(s) using case_when

case_when(condition1 ~ value1,
          condition2 ~ value2,
          condition3 ~ value3,
          ....)

Mutate: Creating a new variable. Multiple conditions


Now, let’s add another condition using is_na() to handle cases where the rating is missing:

  • The first condition checks if the rating is greater than 4 and assigns the value “Rating is good” accordingly.

  • The second condition checks if the rating is 4 or less and assigns the value “Rating is bad”.

  • The third condition uses is.na() to check if the rating is missing and assigns the value “Missing rating”.

Mutate: Creating a new variable. Multiple conditions (Cont)

df |> 
  mutate(rating_remark = case_when(rating > 4 ~ "Rating is good",
                                   rating <= 4 ~ "Rating is bad",
                                   is.na(rating) ~ "Missing rating"))


Mutate: Modifying a variable

Parsing Numbers with parse_number() in mutate()

  • Parsing numbers from actual_price column

  • parse_number() can be used to extract numeric values from character strings

  • parse_number() will convert character strings to numeric values, removing any non-numeric characters.

  • actual_price’s original format:

Mutate: Modifying a variable (Cont)

To modify the actual_price column

df |> 
  mutate(actual_price = parse_number(actual_price))