The readr package provides functions for reading a variety of delimited text formats (e.g. csv and txt). It is a core tidyverse package, so library(tidyverse) loads it.
The readxl package provides functions for reading Microsoft Excel files. You must load it explicitly with library(readxl), because it is not a core tidyverse package and is not attached by library(tidyverse).
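As a minimal sketch of the explicit load, the following reads the first sheet of a sample workbook that ships with readxl. The workbook is used purely for illustration and is not part of the data used in these slides; the names xlsx_path and excel_df are made up for this example.

```r
library(readxl)  # not attached by library(tidyverse), so load it explicitly

# Path to a sample workbook bundled with readxl (illustration only)
xlsx_path <- readxl_example("datasets.xlsx")

# Read the first sheet into a tibble
excel_df <- read_excel(xlsx_path)
excel_df
```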
Loading a csv file
library(tidyverse)
## Data read
df <- read_csv("data/amazon.csv")
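If data/amazon.csv is not at hand, the same call can be tried on a throwaway file. This is a sketch only: the rows below are invented, and toy_df and csv_path are hypothetical names. The show_col_types argument assumes readr 2.0 or later.

```r
library(readr)

# Write a tiny CSV to a temporary file (invented rows, not the Amazon data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("product_id,category,rating",
    "P001,Electronics,4.2",
    "P002,Home,3.9"),
  csv_path
)

# read_csv() returns a tibble and guesses each column's type
toy_df <- read_csv(csv_path, show_col_types = FALSE)
toy_df
```

Here product_id and category are read as character columns and rating as a numeric column.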
Basic data manipulation
Pipe operator
Pipe operator: |>
It helps you write more concise and readable code.
It works by chaining functions together, so that the output of one function is passed as the first argument of the next function.
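To make the chaining concrete, here is a small sketch that computes the same value twice, once with nested calls and once with the pipe. The ratings vector is invented for the example, and the native pipe requires R 4.1 or later.

```r
# Invented values, for illustration only
ratings <- c(4.2, 3.9, 5.0, 4.5)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(ratings)), 2)

# The same computation with the pipe reads left to right;
# x |> round(2) is equivalent to round(x, 2)
piped <- ratings |> sqrt() |> mean() |> round(2)

identical(nested, piped)  # TRUE
```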
Select columns
Use the select() function. It takes the data frame as its first argument, followed by the unquoted names of the columns to keep. To select a single column:
df |> select(product_id)
Select multiple columns
Specify each column by name:
# Select the product_name and rating columns
df |> select(product_name, rating)
Select multiple columns
To select a range of adjacent columns, put : between the first and last column names:
# Select columns from category to discount_percentage
df |> select(category:discount_percentage)
Select multiple columns
To exclude specific columns and select all others, use -
# Select all columns except product_name, category, about_product and review_content
df |> select(-c(product_name, category, about_product, review_content))
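The three forms of select() can be compared side by side on a small table. The sketch below uses an invented tibble (toy_df) that borrows a few column names from the examples above; it is not the Amazon data, and names() is used only to show which columns survive.

```r
library(dplyr)

# Invented toy data, for illustration only
toy_df <- tibble(
  product_id       = c("P001", "P002"),
  product_name     = c("Cable", "Lamp"),
  category         = c("Electronics", "Home"),
  discounted_price = c(199, 899),
  rating           = c(4.2, 3.9)
)

# Named columns
by_name <- toy_df |> select(product_name, rating)
names(by_name)   # "product_name" "rating"

# A range, following the column order in the data
by_range <- toy_df |> select(category:rating)
names(by_range)  # "category" "discounted_price" "rating"

# Everything except the listed columns
by_drop <- toy_df |> select(-c(product_name, category))
names(by_drop)   # "product_id" "discounted_price" "rating"
```

Note that a range depends on the order of the columns in the data, so reordering the columns changes what category:rating returns.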
Filter
The filter() function keeps only the rows that satisfy a logical condition.
What we will cover today:
Operators ==, !=, <, >, <=, >=:
Equality (==)
Inequality (!=)
Less than (<)
Greater than (>)
Less than or equal to (<=)
Greater than or equal to (>=)
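Each of these operators compares a vector element by element and returns a logical vector, which is what filter() uses to decide which rows to keep. A short sketch on an invented ratings vector:

```r
# Invented values, for illustration only
ratings <- c(3.5, 4.0, 5.0)

ratings == 5   # FALSE FALSE  TRUE
ratings != 5   #  TRUE  TRUE FALSE
ratings <  4   #  TRUE FALSE FALSE
ratings >  4   # FALSE FALSE  TRUE
ratings <= 4   #  TRUE  TRUE FALSE
ratings >= 4   # FALSE  TRUE  TRUE
```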
Filter: Equality
# Show only ratings of 5 in the rating column
df |> filter(rating == 5)
Filter: Inequality
Filter the rating column to show only ratings that are not 5
df |> filter(rating != 5)
Filter: Less than
Filter the rating column to show only ratings that are less than 4
df |> filter(rating < 4)
Filter: Greater than
Filter the rating column to show only ratings that are greater than 4
df |> filter(rating > 4)
Filter: Less than or equal to
Your turn:
Filter the rating column to show only ratings of 4 and below
df |> filter(rating <= 4)
Filter: Greater than or equal to
Your turn:
Filter the rating column to show only ratings of 4 and above
df |> filter(rating >= 4)
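The filters above can be checked on a table small enough to verify by eye. This is a sketch on an invented tibble (toy_df and high_rated are hypothetical names); it also shows that filter() returns a new tibble rather than modifying the original, so the result has to be assigned to be kept.

```r
library(dplyr)

# Invented toy data, for illustration only
toy_df <- tibble(
  product_id = c("P001", "P002", "P003", "P004"),
  rating     = c(3.5, 4.0, 4.5, 5.0)
)

toy_df |> filter(rating == 5) |> nrow()  # 1
toy_df |> filter(rating < 4) |> nrow()   # 1
toy_df |> filter(rating >= 4) |> nrow()  # 3

# Assign the result to keep it; toy_df itself is unchanged
high_rated <- toy_df |> filter(rating >= 4)
high_rated$product_id  # "P002" "P003" "P004"
```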
Exercise Part 1
Save a new tibble with:
Only the category, actual_price and rating columns selected