Intro to Data Science

Lab 3 – Data Wrangling (P2)

A Guide to Your Process

Scheduling

Learning Objectives

Practice

Supporting Information

Class Discussion

Today’s Plan

  • Muddiest Point Review
  • Intro to the Pipe
  • Groupwise Summarization with dplyr
  • Discuss Function Tutorial Assignment

Today’s Learning Objectives

After today’s session you will be able to:

  • Use the pipe operator in your code
  • Perform group summarization with dplyr functions

Muddiest Point Review

  • Recurring topics from most recent MPs:


  • What other topic(s) would you like to review?

Pipe Operator (%>%)

  • Allows chaining together multiple operations


  • Product of each function passed to next function
new_data <- old_data %>%
            first_fxn() %>%
            another_fxn() %>%
            etc()


  • Same workflow requires fewer objects

Pipe Operator Example

Without Pipe

# Load the dplyr package (provides filter, mutate, and select)
library(dplyr)

# Load data
df_v1 <- read.csv("butterfly.csv")

# Subset to only one treatment
df_v2 <- filter(df_v1, treatment == "cows")

# Add together caterpillars and adult butterflies
df_v3 <- mutate(df_v2, monarch.tot = monarch.bfly + monarch.larva)

# Keep only the total monarch column
df_v4 <- select(df_v3, monarch.tot)

With Pipe

# Load data
df_v1 <- read.csv("butterfly.csv")

# Do all needed wrangling
df_done <- df_v1 %>%
      # Subset to only one treatment
      filter(treatment == "cows") %>%
      # Add together caterpillars and adult butterflies
      mutate(monarch.tot = monarch.bfly + monarch.larva) %>% 
      # Keep only the total monarch column
      select(monarch.tot)
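
The butterfly example depends on a file you may not have on hand. As a rough, runnable sketch of the same idea, the code below swaps in R's built-in mtcars data (my substitution, not part of the original example). The object names and the wt_kg column are made up for illustration.

```r
# Load dplyr (this also makes the %>% operator available)
library(dplyr)

# Without the pipe: one intermediate object per step
cars_v1 <- filter(mtcars, cyl == 4)
cars_v2 <- mutate(cars_v1, wt_kg = wt * 1000 * 0.4536)
cars_v3 <- select(cars_v2, mpg, wt_kg)

# With the pipe: same three steps, no intermediate objects
cars_done <- mtcars %>%
  # Subset to only four-cylinder cars
  filter(cyl == 4) %>%
  # Convert weight (recorded in 1000s of pounds) to kilograms
  mutate(wt_kg = wt * 1000 * 0.4536) %>%
  # Keep only the two columns of interest
  select(mpg, wt_kg)

# Check that both approaches give the same result
identical(cars_v3, cars_done)
```

Both versions call the same functions in the same order, so the final objects should match; the pipe only changes how the steps are written.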

Why Named “Pipe”?

René Magritte – The Treachery of Images (1929)

photo of René Magritte

copy of 'the treachery of images', a famous painting of a pipe with the words 'this is not a pipe' written in French beneath the image

hex logo for magrittr R package

Practice: Pipe

hex logo for magrittr R package

  1. Install and load the magrittr package


  2. Return to your 3-step wrangling of “minnow.csv” from Lecture #3
    • Filter “minnow.csv” to only Stonerollers and Chubs
    • Convert depth & diameter to meters (from cm)
    • Pare down columns to only species and depth/diameter in meters


  3. Copy these lines and edit them to use the %>%
    • Does this have the same end result as the non-pipe lines?

Groupwise Summarization

  • Summarizing within groups is a common operation
    • Average barnacle number at several tidal heights
    • Variation in reported customer satisfaction within demographic groups


  • dplyr offers three functions to accomplish this
    1. group_by
    2. summarize
    3. ungroup

Summarization Syntax

  • group_by has similar structure to select
    • Wants column names separated by commas


  • summarize has similar structure to mutate
    • E.g., new_column = function(old_column)


  • ungroup needs no arguments when used in a pipe!

Relevant Helper Functions

  • To summarize you’ll need to use functions that calculate summary values


  • Take an average with mean
    • Has na.rm argument: na.rm = TRUE drops missing values (NA) before averaging; the default (FALSE) returns NA if any value is missing


  • Find standard deviation with sd
    • Common measurement to use as error bars in a graph


  • Find the smallest or largest number with min and max
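
A small sketch of these helper functions on a made-up vector (the numbers are invented for illustration), showing what the na.rm argument changes:

```r
# A small numeric vector with one missing value (made-up numbers)
x <- c(2, 4, NA, 6)

# With the default na.rm = FALSE, a single NA makes the result NA
mean(x)                # NA

# na.rm = TRUE removes missing values before calculating
mean(x, na.rm = TRUE)  # 4
sd(x, na.rm = TRUE)    # 2
min(x, na.rm = TRUE)   # 2
max(x, na.rm = TRUE)   # 6
```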

Summarization + Pipes

Let’s check out an example:


# Take data
data %>%
    # 1. Group by treatment
    group_by(treatment) %>%
    # 2. Calculate average and standard deviation
    summarize(mean_val = mean(response, na.rm = TRUE),
              sd_val = sd(response, na.rm = TRUE)) %>%
    # 3. Ungroup
    ungroup()
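
The example above uses placeholder names (data, treatment, response). For something you can run as-is, here is a sketch of the same three steps on R's built-in mtcars data, which is my substitution: cyl stands in for the grouping column and mpg for the response.

```r
library(dplyr)

# Take the built-in mtcars data
car_summary <- mtcars %>%
  # 1. Group by number of cylinders
  group_by(cyl) %>%
  # 2. Calculate average and standard deviation of miles per gallon
  summarize(mean_mpg = mean(mpg, na.rm = TRUE),
            sd_mpg = sd(mpg, na.rm = TRUE)) %>%
  # 3. Ungroup
  ungroup()

# One row per value of cyl, one column per grouping/summary column
car_summary
dim(car_summary)  # 3 rows, 3 columns
```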

Summarization Warnings

  1. Summarizing simplifies dataframes!
    • After summarizing, you’ll have one row per combination of grouping columns


  2. Summarizing drops columns unless:
    1. Column is named in group_by
    2. Column is created by summarize


  • If you don’t want to lose a column, it needs to meet one of those criteria
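
Both warnings can be seen in a small sketch. The plots data frame and all its values are made up for illustration.

```r
library(dplyr)

# Made-up data: 6 rows, 4 columns
plots <- data.frame(
  treatment = c("cows", "cows", "cows", "fire", "fire", "fire"),
  site = c("A", "B", "C", "A", "B", "C"),
  observer = c("Ann", "Bo", "Ann", "Bo", "Ann", "Bo"),
  count = c(3, 5, 4, 8, 6, 10)
)

# Group by treatment only: 'site' and 'observer' are dropped
by_trt <- plots %>%
  group_by(treatment) %>%
  summarize(mean_count = mean(count)) %>%
  ungroup()

dim(by_trt)    # 2 rows, 2 columns
names(by_trt)  # "treatment" "mean_count"

# To keep 'site', name it in group_by
# (dplyr may print a message about grouping here; ungroup() clears the leftover grouping)
by_trt_site <- plots %>%
  group_by(treatment, site) %>%
  summarize(mean_count = mean(count)) %>%
  ungroup()

dim(by_trt_site)  # 6 rows, 3 columns
```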

Practice: Summarizing

hex logo for magrittr R package hex logo for dplyr R package hex logo for palmerpenguins R package

  • Using the “penguins” data in the palmerpenguins package, answer the following questions:


  1. What is the average bill length in millimeters for each species of penguin?
  2. Which island has the smallest individual penguin?
    • Hint: use body mass
  3. Which species at which island has the longest flippers for female penguins?
    • Hint: remember you can use filter before or after summarize!

Temperature Check

How are you Feeling?

Comic-style graph depicting someone's emotional state as they debug code (from initial struggle and defeat to eventual triumph)

Function Tutorial: Learning Objectives

After completing this assignment you will be able to:


  • Explain the proper syntax and use of R functions
  • Communicate effectively to an audience of interested non-specialists
  • Apply feedback on an assignment to a successful revision
  • Reflect on the process of revising a presentation based on constructively critical feedback

Function Tutorial: FAQ

  • Tutorial should be an R Markdown with plain text and code chunks
    • Write tutorials for your classmates for three functions from packages on CRAN


  • You’ll present your tutorials for 5-10 minutes in Lab #5
    • Get peer feedback then & implement changes before submitting draft 2


  • Submit & present revised tutorials during Lab #7

Function Tutorial: Points

  • Draft 1 = 30 pts (12% of course grade)
    • Overall report – 6 pts
    • Function tutorial (x3) – 8 pts each
  • Draft 2 = 40 pts (16% of course grade)
    • Overall report – 6 pts
    • Function tutorial (x3) – 8 pts each
    • Revision response – 3 pts
    • Edits to draft 1 based on peer feedback – 7 pts
  • Optional Draft 3 = 40 pts
    • If submitted, score replaces draft 2
    • Score can only improve (no way draft 3 reduces total points earned)

Picking Functions

  • Everyone must pick three different functions
    • This way no two people present tutorials on the same function
    • Unfortunately, this means that if someone picks before you, they “claim” that function


  • My plan to do this equitably is as follows:
    1. Randomize student order and each person picks one function
    2. Second function picked in reverse of that order (i.e., if you were last to pick in round 1, you’re first in round 2)
    3. Re-randomize student order for third function


  • Sound fair? If not, what’s a good alternative?

Forbidden Packages (Sorry!)

hex logo for the dplyr R package hex logo for the tidyr R package hex logo for the ggplot2 R package

  • dplyr – A Grammar of Data Manipulation
    • Reason: we cover a lot of this in class


  • tidyr – Tidy Messy Data
    • Actually only 2 forbidden functions: pivot_longer & pivot_wider
    • Others are okay to use!
    • Reason: we just covered both in class


  • ggplot2 – Create Elegant Data Visualizations Using the Grammar of Graphics
    • Reason: we cover a lot of this in class (see week 6) and its functions use a really different syntax from what is used by other packages

Assignment Q & A

  • What questions do you have about this assignment?
    • No such thing as a “dumb” question, so ask away!


  • Feeling good about next steps?

Exploring CRAN Packages


  • On the CRAN home page, click “Packages” in the left sidebar
    • Approx. 2/3 down sidebar items


  • Click “Table of available packages, sorted by name”


  • Scroll through and look for one with a cool name / title!

Practice: Exploring CRAN

  • Explore available packages / functions


  • Select 7-10 functions so you have alternates (if needed)


  • We will pick functions during next lecture (Lecture #4)

Temperature Check

How are you Feeling?

Comic-style graph depicting someone's emotional state as they debug code (from initial struggle and defeat to eventual triumph)

Upcoming Due Dates

Due before lecture

(By midnight)

  • Homework #3
  • Pick 7-10 possible functions for Function Tutorial assignment
    • Remember, they must be from CRAN packages!

Due before lab

(By midnight)

  • Muddiest Point #4

Bonus: Data Shape

Bonus Learning Objectives

After this bonus session you will be able to:

  • Reshape data from long to wide format
    • And vice versa

Data “Shape”

  • Data with rows/columns has a shape
    • Shape refers to whether observations are in the rows or the columns


  • “Wide” data has observations as columns
    • E.g., Each column is a different species’ count


  • “Long” data has observations as rows
    • E.g., The columns are “species” and “count”
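
As a rough illustration, here are the same six made-up butterfly counts stored in each shape. The species names and numbers are invented.

```r
# "Wide": one column per species
wide_df <- data.frame(
  treatment = c("cows", "fire"),
  monarch = c(4, 7),
  swallowtail = c(2, 0),
  skipper = c(9, 5)
)

# "Long": one row per treatment-by-species observation
long_df <- data.frame(
  treatment = rep(c("cows", "fire"), each = 3),
  species = rep(c("monarch", "swallowtail", "skipper"), times = 2),
  count = c(4, 2, 9, 7, 0, 5)
)

dim(wide_df)  # 2 rows, 4 columns
dim(long_df)  # 6 rows, 3 columns
```

Both objects hold the same information; only the arrangement differs.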

Data Shape Visual


Long Data

Wide Data

Cartoon of a long table of data where there is a column with either fire or cow emojis, a column with one of three different butterfly emojis, and a third column with just '#' signs in every row

Cartoon of a wide table of data where there is one row for cows and one row for fire and the columns are dedicated to each of the three butterfly types

Two arrows--one facing left and the other right--between the two tables. The arrow point from wide to long is labeled 'pivot longer' and the opposite arrow is labeled 'pivot wider'

Reshaping Longer

  • Change from wide to long format with tidyr::pivot_longer
    • Has 4 key arguments


  1. data = the wide data to pivot


  2. cols = the columns to pivot
    • Can select which columns to pivot OR which to leave out
    • Include: cols = colD:colX
    • Exclude: cols = -(colA:colC)

Reshaping Longer Continued

  3. names_to = name of new column to hold old column names
    • Must be in quotes


  4. values_to = name of new column to hold values
    • Also in quotes


  • Example (for syntax):
df_long <- pivot_longer(data = my_df,
                        cols = hydrogen:uranium,
                        names_to = "element",
                        values_to = "measurement")
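
To try the syntax yourself, here is a minimal runnable sketch. The my_df object, its sample column, and all values are made up for illustration.

```r
library(tidyr)

# Made-up wide data: one column per element
my_df <- data.frame(
  sample = c("s1", "s2"),
  hydrogen = c(1.2, 0.8),
  helium = c(0.3, 0.5),
  uranium = c(0.01, 0.02)
)

# Include: name the range of columns to pivot
df_long <- pivot_longer(data = my_df,
                        cols = hydrogen:uranium,
                        names_to = "element",
                        values_to = "measurement")

# Exclude: name the column to leave alone instead
df_long2 <- pivot_longer(data = my_df,
                         cols = -sample,
                         names_to = "element",
                         values_to = "measurement")

dim(df_long)  # 6 rows, 3 columns
```

Both calls pivot the same three columns, so they should produce the same long data.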

Reshaping Longer Visual

Diagram showing how the 'tidyr' package allows users to pivot data into long format (from wide format) using the 'pivot_longer' function

Practice: pivot_longer

hex logo for the tidyr R package

  1. Download the “bees.csv” and load it into R with read.csv
    • Check its structure! What columns are there?


  2. Pivot the data so that you are left with three columns:
    • “year”, “bee_group”, and “bee_abundance”


  3. Check your work! What are the dimensions of the resulting dataframe?
    • Should be 32 rows by 3 columns

Reshaping Wider

  • Change from long to wide format with tidyr::pivot_wider
    • Also has 4 key arguments


  1. data = the long data to pivot


  2. names_from = name of the column to turn into new column names
    • Usually written unquoted (a quoted name also works)

Reshaping Wider Continued

  3. values_from = name of column to make into new column values
    • Also usually unquoted


  4. values_fill = value to use where a row/column combination is absent from the original data
    • Technically optional but good practice to include explicitly


  • Example:
df_wide <- pivot_wider(data = my_df,
                       names_from = fruit,
                       values_from = size,
                       values_fill = NA)
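
A minimal runnable sketch of the same call. The my_df object, its basket column, and all values are made up for illustration. Column names are given unquoted here; the quoted form behaves the same.

```r
library(tidyr)

# Made-up long data; note there is no "pear" row for basket "b2"
my_df <- data.frame(
  basket = c("b1", "b1", "b2"),
  fruit = c("apple", "pear", "apple"),
  size = c(7.1, 6.4, 8.0)
)

df_wide <- pivot_wider(data = my_df,
                       names_from = fruit,
                       values_from = size,
                       values_fill = NA)

# One row per basket, one column per fruit
df_wide
dim(df_wide)  # 2 rows, 3 columns
```

Because basket "b2" has no pear measurement, that cell is filled with the values_fill value (NA here).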

Reshaping Wider Visual

Diagram showing how the 'tidyr' package allows users to pivot data into wide format (from long format) using the 'pivot_wider' function

Practice: pivot_wider

hex logo for the tidyr R package

  1. Take the data object you pivoted to long format in the prior practice block


  2. Pivot it back to wide format with pivot_wider!


  3. Check your work!
    • Does it look like the original object you loaded with read.csv?

Practice: Wrangling!

hex logo for the magrittr R package hex logo for the tidyr R package hex logo for the dplyr R package hex logo for the palmerpenguins R package

  • Beginning with the “penguins” data do the following operations:
  1. Keep only data on female penguins
    • No male penguins and no individuals where sex is not known
  2. Calculate average bill depth within species and island
  3. Reshape to wide format so that each island is a column
    • Note that if an island doesn’t have a given species it should have NA (not 0)
  4. Check your work! What are the dimensions of the resulting dataframe?
    • Should be 3 rows by 4 columns