Intro to Data Science

Lab 3 – Data Wrangling II

A Guide to Your Process

Scheduling

Learning Objectives

Practice

Supporting Information

Class Discussion

Today’s Plan

Muddiest Point Review
Intro to the Pipe
Groupwise Summarization with dplyr
Discuss Function Tutorial Assignment

Today’s Learning Objectives

After today’s session you will be able to:

Use the pipe operator in your code
Perform group summarization with dplyr functions

Muddiest Point Review

Recurring topics from most recent MPs:

What other topic(s) would you like to review?

Pipe Operator (`%>%`)

Allows chaining together multiple operations

Product of each function passed to next function

new_data <- old_data %>%
            first_fxn() %>%
            another_fxn() %>%
            etc()

Same workflow requires fewer objects
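
To make the chaining idea concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The `values` vector is made up for illustration, and the sketch assumes the magrittr package is installed.

```r
# Assumes magrittr is installed (dplyr also makes %>% available)
library(magrittr)

# Hypothetical example values
values <- c(4, 9, 16)

# Nested version: must be read from the inside out
nested <- round(sum(sqrt(values)), digits = 1)

# Piped version: reads top to bottom, one step per line
piped <- values %>%
  sqrt() %>%
  sum() %>%
  round(digits = 1)

# Both approaches give the same answer
identical(nested, piped)
```

The pipe places whatever is on its left into the first argument of the function on its right, which is why `round(digits = 1)` only needs its second argument spelled out.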

Pipe Operator Example

Without Pipe

# Load dplyr (provides filter, mutate, and select)
library(dplyr)

# Load data
df_v1 <- read.csv("butterfly.csv")

# Subset to only one treatment
df_v2 <- filter(df_v1, treatment == "cows")

# Add together caterpillars and adult butterflies
df_v3 <- mutate(df_v2, monarch.tot = monarch.bfly + monarch.larva)

# Keep only the total monarch column
df_v4 <- select(df_v3, monarch.tot)

With Pipe

# Load dplyr (also makes the %>% operator available)
library(dplyr)

# Load data
df_v1 <- read.csv("butterfly.csv")

# Do all needed wrangling
df_done <- df_v1 %>%
      # Subset to only one treatment
      filter(treatment == "cows") %>%
      # Add together caterpillars and adult butterflies
      mutate(monarch.tot = monarch.bfly + monarch.larva) %>%
      # Keep only the total monarch column
      select(monarch.tot)

Why Named “Pipe”?

René Magritte – The Treachery of Images (1929)

photo of Rene Magritte

copy of 'the treachery of images', a famous painting of a pipe with the words 'this is not a pipe' written in French beneath the image

hex logo for magrittr R package

Practice: Pipe

hex logo for magrittr R package

Install and load the magrittr package

Return to your 3-step wrangling of “minnow.csv” from Lecture #3
- Filter “minnow.csv” to only Stonerollers and Chubs
- Convert depth & diameter to meters (from cm)
- Pare down columns to only species and depth/diameter in meters

Copy these lines and edit them to use the %>%
- Does this have the same end result as the non-pipe lines?

Groupwise Summarization

Summarizing within groups is a common operation
- Average barnacle number at several tidal heights
- Variation in reported customer satisfaction within demographic groups

dplyr offers three functions to accomplish this
1. group_by
2. summarize
3. ungroup

Summarization Syntax

group_by has similar structure to select
- Wants column names separated by commas

summarize has similar structure to mutate
- E.g., new_column = function(old_column)

ungroup needs no arguments when used in a pipe!

Relevant Helper Functions

To summarize you’ll need to use functions that calculate summary values

Take an average with mean
- Has na.rm argument that determines whether missing values are included (na.rm = TRUE drops them; the default, FALSE, returns NA if any value is missing)

Find standard deviation with sd
- Common measurement to use as error bars in a graph

Find the smallest or largest number with min and max
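
As a quick illustration of these helper functions, the sketch below uses a made-up vector (`counts`) containing one missing value to show what `na.rm` changes.

```r
# Hypothetical measurements with one missing value
counts <- c(4, 7, NA, 10)

# With the default (na.rm = FALSE) the missing value is included
mean_with_na <- mean(counts)                # NA

# Dropping the missing value before calculating
mean_no_na <- mean(counts, na.rm = TRUE)    # 7
sd_no_na   <- sd(counts, na.rm = TRUE)      # 3
min_no_na  <- min(counts, na.rm = TRUE)     # 4
max_no_na  <- max(counts, na.rm = TRUE)     # 10
```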

Summarization + Pipes

Let’s check out an example:

# Take data
data %>%
    # 1. Group by treatment
    group_by(treatment) %>%
    # 2. Calculate average and standard deviation
    summarize(mean_val = mean(response, na.rm = TRUE),
              sd_val = sd(response, na.rm = TRUE)) %>%
    # 3. Ungroup
    ungroup()

Summarization Warnings

Summarizing simplifies dataframes!
- After summarizing, you’ll have one row per combination of grouping columns

Summarizing drops columns unless:
1. Column is named in group_by
2. Column is created by summarize

If you don’t want to lose a column, it needs to meet one of those criteria
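
The sketch below shows both warnings in action. The `toy` data frame and its column names are invented for illustration, and the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: two treatments plus extra site and observer columns
toy <- data.frame(
  treatment = c("cows", "cows", "fire", "fire"),
  site      = c("A", "B", "A", "B"),
  observer  = c("Ana", "Ben", "Ana", "Ben"),
  response  = c(2, 4, 10, 20)
)

toy_summary <- toy %>%
  # Only treatment is named in group_by
  group_by(treatment) %>%
  # Only mean_val is created by summarize
  summarize(mean_val = mean(response, na.rm = TRUE)) %>%
  ungroup()

# One row per treatment; site, observer, and response are gone
dim(toy_summary)
names(toy_summary)
```

To keep `site` in the result, it would need to be added to `group_by`, which would also change the summary to one row per treatment-site combination.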

Practice: Summarizing

hex logo for magrittr R package hex logo for dplyr R package hex logo for palmerpenguins R package

Using the “penguins” data in the palmerpenguins package, answer the following questions:

What is the average bill length in millimeters for each species of penguin?

Which island has the smallest individual penguin?
- Hint: use body mass

Which species at which island has the longest flippers for female penguins?
- Hint: remember you can use filter before or after summarize!

Temperature Check

How are you Feeling?

Comic-style graph depicting someone's emotional state as they debug code (from initial struggle and defeat to eventual triumph)

Function Tutorial: Learning Objectives

After completing this assignment you will be able to:

Explain the proper syntax and use of R functions
Communicate effectively to an audience of interested non-specialists
Apply feedback on an assignment to a successful revision
Reflect on the process of revising a presentation based on constructively critical feedback

Function Tutorial: FAQ

Tutorial should be an R Markdown with plain text and code chunks
- Write tutorials for your classmates for three functions from packages on CRAN

You’ll present your tutorials for 5-10 minutes in Lab #5
- Get peer feedback then & implement changes before submitting draft 2

Submit & present revised tutorials during Lab #7

Function Tutorial: Points

Draft 1 = 30 pts (12% of course grade)
- Overall report – 6 pts
- Function tutorial (x3) – 8 pts each

Draft 2 = 40 pts (16% of course grade)
- Overall report – 6 pts
- Function tutorial (x3) – 8 pts each
- Revision response – 3 pts
- Edits from draft 1 based on peer feedback – 7 pts

Optional Draft 3 = 40 pts
- If submitted, score replaces draft 2
- Score can only improve (no way draft 3 reduces total points earned)

Picking Functions

Everyone must pick three different functions
- This way no two people present tutorials on the same function
- Unfortunately, this means that if someone picks a function before you, they “claim” it

My plan to do this equitably is as follows:
1. Randomize student order and each person picks one function
2. Second function picked in reverse of that order (i.e., if you were last to pick in round 1, you’re first in round 2)
3. Re-randomize student order for third function

Sound fair? If not, what’s a good alternative?

Nick’s Recommended Packages

hex logo for the stringr R package hex logo for the dndR R package hex logo for the lterpalettefinder R package

stringr – Simple, Consistent Wrappers for Common String Operations

dndR – Dungeons & Dragons Functions for Players and Dungeon Masters

lterpalettefinder – Extract Color Palettes from Photos and Pick Official LTER Palettes

hex logo for the supportR R package hex logo for the vegan R package

supportR – Support Functions for Wrangling and Visualization

vegan – Community Ecology Package

Forbidden Packages (Sorry!)

hex logo for the dplyr R package hex logo for the tidyr R package hex logo for the ggplot2 R package

dplyr – A Grammar of Data Manipulation
- Reason: we cover a lot of this in class

tidyr – Tidy Messy Data
- Actually only 2 forbidden functions: pivot_longer & pivot_wider
- Others are okay to use!
- Reason: we just covered both in class

ggplot2 – Create Elegant Data Visualizations Using the Grammar of Graphics
- Reason: we cover a lot of this in class (see week 6) and its functions use a really different syntax from what is used by other packages

Assignment Q & A

What questions do you have about this assignment?
- No such thing as a “dumb” question, so ask away!

Feeling good about next steps?

Exploring CRAN Packages

Visit cran.r-project.org

Click “Packages” on the left sidebar
- Approx. 2/3 down sidebar items

Click “Table of available packages, sorted by name”

Scroll through and look for one with a cool name / title!

Practice: Exploring CRAN

Explore available packages / functions

Select 7-10 functions so you have alternates (if needed)

We will pick functions during next lecture (Lecture #4)

Temperature Check

How are you Feeling?

Comic-style graph depicting someone's emotional state as they debug code (from initial struggle and defeat to eventual triumph)

Upcoming Due Dates

Due before lecture

(By midnight)

Homework #3
Pick 7-10 possible functions for Function Tutorial assignment
- Remember, they must be from CRAN packages!

Due before lab

(By midnight)

Muddiest Point #4

Bonus: Data Shape

Bonus Learning Objectives

After this bonus session you will be able to:

Reshape data from long to wide format
- And vice versa

Data “Shape”

Data with rows/columns has a shape
- Shape refers to whether observations are in the rows or the columns

“Wide” data has observations as columns
- E.g., Each column is a different species’ count

“Long” data has observations as rows
- E.g., The columns are “species” and “count”
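
To see the difference, here is the same made-up set of counts written out by hand in both shapes. The site and species names are invented for illustration, and only base R is used.

```r
# Wide: one column per species (hypothetical counts)
wide <- data.frame(
  site    = c("A", "B"),
  monarch = c(3, 0),
  viceroy = c(1, 5)
)

# Long: the same counts with one row per site-species combination
long <- data.frame(
  site    = c("A", "A", "B", "B"),
  species = c("monarch", "viceroy", "monarch", "viceroy"),
  count   = c(3, 1, 0, 5)
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```

Both objects hold the same four counts; only the arrangement differs.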

Data Shape Visual

Long Data

Wide Data

Cartoon of a long table of data where there is a column with either fire or cow emojis, a column with one of three different butterfly emojis, and a third column with just '#' signs in every row

Cartoon of a wide table of data where there is one row for cows and one row for fire and the columns are dedicated to each of the three butterfly types

Two arrows--one facing left and the other right--between the two tables. The arrow pointing from wide to long is labeled 'pivot longer' and the opposite arrow is labeled 'pivot wider'

Reshaping Longer

Change from wide to long format with tidyr::pivot_longer
- Has 4 key arguments

data = the wide data to pivot

cols = the columns to pivot
- Can select which columns to pivot OR which to not
- Include: cols = colD:colX
- Exclude: cols = -(colA:colC)

Reshaping Longer Continued

names_to = name of new column to hold old column names
- Must be in quotes

values_to = name of new column to hold values
- Also in quotes

Example (for syntax):

df_long <- pivot_longer(data = my_df,
                        cols = hydrogen:uranium,
                        names_to = "element",
                        values_to = "measurement")
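
For a version that can be run as-is, the sketch below builds a small made-up `my_df` (the `sample` column and all values are assumptions for illustration) and pivots it with the four arguments described above. It assumes tidyr is installed.

```r
library(tidyr)

# Hypothetical wide data: one column per element
my_df <- data.frame(
  sample   = c("s1", "s2"),
  hydrogen = c(1.2, 0.8),
  helium   = c(0.1, 0.3),
  uranium  = c(0.0, 0.5)
)

# Pivot every element column; sample is left alone
df_long <- pivot_longer(data = my_df,
                        cols = hydrogen:uranium,
                        names_to = "element",
                        values_to = "measurement")

# 2 samples x 3 elements gives 6 rows; columns are sample, element, measurement
dim(df_long)
```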

Reshaping Longer Visual

Diagram showing how the 'tidyr' package allows users to pivot data into long format (from wide format) using the 'pivot_longer' function

Practice: `pivot_longer`

hex logo for the tidyr R package

Download “bees.csv” and load it into R with read.csv
- Check its structure! What columns are there?

Pivot the data so that you are left with three columns:
- “year”, “bee_group”, and “bee_abundance”

Check your work! What are the dimensions of the resulting dataframe?

Should be 32 rows by 3 columns

Reshaping Wider

Change from long to wide format with tidyr::pivot_wider
- Also has 4 key arguments

data = the long data to pivot

names_from = name of the column to turn into new column names
- Must be unquoted

Reshaping Wider Continued

values_from = name of column to make into new column values
- Also unquoted

values_fill = value to fill if value is missing in original data
- Technically optional but good practice to include explicitly

Example:

df_wide <- pivot_wider(data = my_df,
                       names_from = fruit,
                       values_from = size,
                       values_fill = NA)
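
A runnable sketch of the same call follows. The `my_df` data frame, including the `tree` column, is made up for illustration and deliberately lacks a pear measurement for one tree so that `values_fill` has something to do. It assumes tidyr is installed.

```r
library(tidyr)

# Hypothetical long data; note there is no "pear" row for tree t2
my_df <- data.frame(
  tree  = c("t1", "t1", "t2"),
  fruit = c("apple", "pear", "apple"),
  size  = c(7.5, 6.0, 8.1)
)

# Each fruit becomes its own column, filled with the matching size
df_wide <- pivot_wider(data = my_df,
                       names_from = fruit,
                       values_from = size,
                       values_fill = NA)

# One row per tree; the missing pear for t2 is filled with NA
df_wide
```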

Reshaping Wider Visual

Diagram showing how the 'tidyr' package allows users to pivot data into wide format (from long format) using the 'pivot_wider' function

Practice: `pivot_wider`

hex logo for the tidyr R package

Take the data object you pivoted to long format in the prior practice block

Pivot it back to wide format with pivot_wider!

Check your work!
- Does it look like the original object you loaded with read.csv?

Practice: Wrangling!

hex logo for the magrittr R package hex logo for the tidyr R package hex logo for the dplyr R package hex logo for the palmerpenguins R package

Beginning with the “penguins” data do the following operations:

Keep only data on female penguins
- No male penguins and no individuals where sex is not known

Calculate average bill depth within species and island

Reshape to wide format so that each island is a column
- Note that if an island doesn’t have a given species it should have NA (not 0)

Check your work! What are the dimensions of the resulting dataframe?

Should be 6 rows by 5 columns
