Intro to Data Science

Lecture 3 – R Markdowns & Data Wrangling (P1)

A Guide to Your Process

Scheduling

Learning Objectives

Practice

Supporting Information

Class Discussion

Today’s Plan

  • R Markdown Files
  • Loading Data
  • Working with Data
  • Tidyverse - dplyr

Today’s Learning Objectives

After today’s session you will be able to:

  • Define the three major components of RMarkdown files (.Rmd)
  • Write code to load external data into R
  • Explore data with base R tools
  • Manipulate data with dplyr

GitHub Review

  • Great work last week!


  • In the next week or two (i.e., before the second half of the course) I would like you to:
    • Make another practice repository (or maybe 2!)


  • What questions do you have about this?
    • Does this feel reasonable to you?

RMarkdown Files Intro

RMarkdown (Rmd) files have three sections:

  1. Metadata (YAML)
    • Controls formatting of document


  2. Plain Text
    • Technically written in markdown (a text-formatting language)


  3. Code chunks
    • Essentially mini R scripts within the larger file!

RMarkdown Analogy

Images: R logo alongside a chocolate bar; rmarkdown hex logo alongside a chocolate chip cookie

Rmds Part 1: Metadata

  • Document formatting metadata is called YAML
    • Originally “Yet Another Markup Language” (now officially “YAML Ain’t Markup Language”)


  • Defines document header information & formatting
    • Title, Author, Date
    • File output type


  • Output options:
    • HTML = like a webpage but outputs as a file rather than a living website
    • PDF

Rmds Part 2: Plain Text

  • Write text just like you would in MS Word / etc.


  • But, there is no toolbar with buttons for doing formatting


  • Instead, markdown syntax is required to apply formatting

Markdown Syntax

  • Your function tutorials have four required markdown styles:


  1. # = headings
    • More # = smaller heading
  2. _text_ = italics
  3. **text** = bold
  4. [text](link) = hyperlinked text


Other format options here: markdownguide.org/basic-syntax

Rmds Part 3: Code Chunks

Let’s look at the structure of an example code chunk

Screen capture of a code chunk from an Rmarkdown file where the echo option is set to false, the chunk is named 'pressure', and the `plot` function is used on an object also named 'pressure'

Screen capture of the same code chunk but with colored boxes annotating the chunk start and end, the code language, the chunk name, the chunk options, and the 'run this chunk' button


Note that chunk start must be formatted like:

- ```{language chunk_name, option_1, option_2, ...}
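To see how the three parts of an Rmd fit together, here is a minimal sketch in base R that writes a tiny .Rmd file to a temporary folder. The file name, title, and text are made up for illustration; the chunk mirrors the 'pressure' example from the screen capture. The three-backtick fence is built with strrep only so that it does not collide with the formatting of this example.

```r
# Build a three-backtick fence without typing literal backticks
fence <- strrep("`", 3)

rmd_lines <- c(
  # 1. Metadata (YAML), wrapped in lines of three dashes
  "---",
  "title: \"My First Report\"",  # hypothetical title
  "output: html_document",
  "---",
  "",
  # 2. Plain text, formatted with markdown syntax
  "# Pressure Data",
  "",
  "Here is some _italic_ and some **bold** text.",
  "",
  # 3. Code chunk: language, then chunk name, then options
  paste0(fence, "{r pressure, echo=FALSE}"),
  "plot(pressure)",
  fence
)

# Write the file to a temporary folder (hypothetical file name)
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, con = rmd_path)

# Read it back to see exactly what was written
cat(readLines(rmd_path), sep = "\n")
```

Opening a file like this in RStudio and clicking “Knit” (or calling rmarkdown::render on it, assuming the rmarkdown package is installed) should produce an HTML file.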

Code Chunks Options

Let’s check out three crucial code chunk options!

Option             Code included?   Outputs included?   Messages included?
include = FALSE    No               No                  No
echo = FALSE       No               Yes                 Yes
message = FALSE    Yes              Yes                 No

  • For a full list of options see here

Practice: RMarkdown Files

  • Install the rmarkdown package
    • Remember to use the install.packages function


  1. Create a new RMarkdown file!
    • File → New File → R Markdown…
    • In the resulting pop-up, skip to the bottom and click “OK”


  2. Look at (1) YAML, (2) plain text, and (3) code chunks
    • Take notes on anything that jumps out at you


  3. Click the “Knit” button

Temperature Check

How are you Feeling?

Comic-style graph depicting someone's emotional state as they debug code (from initial struggle and defeat to eventual triumph)

Loading Data

  • Function to use depends on file type
    • CSV = read.csv
    • MS Excel = readxl::read_excel


  • Need to assign data to an object to use it later!


  • For example:
my_data <- read.csv(file = "my_data_file.csv")
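
Because "my_data_file.csv" above is only a placeholder, here is a self-contained sketch of the same pattern. It first writes a tiny made-up CSV to a temporary folder and then reads it back; the column names and values are invented for illustration.

```r
# Create a small, made-up CSV so the example can run anywhere
csv_path <- file.path(tempdir(), "my_data_file.csv")
writeLines(
  c("site,species,count",
    "A,Chub,3",
    "B,Stoneroller,5"),
  con = csv_path
)

# Read the CSV and assign it to an object so it can be used later
my_data <- read.csv(file = csv_path)

# Check the structure: column names, types, and first few values
str(my_data)

# Excel files follow the same pattern with the readxl package, e.g.:
# my_data <- readxl::read_excel(path = "my_data_file.xlsx")
```

Without the assignment arrow (<-), R would print the data to the console and then forget it.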

Download Example Data

  1. From the course page, download “minnow.csv”


  2. Move “minnow.csv” from your “Downloads” folder to your RStudio Project folder


  3. Make a new script for today’s lecture

Practice: Load Data

hex logo for dplyr R package

  • Now, use read.csv to read “minnow.csv” into R
    • Remember to assign it to an object!


  • First thing after reading in data: check structure!
    • Can use str or dplyr::glimpse


  • What do you see?

Exploring Data with Base R

Two ways in base R to access data:


  1. Bracket notation (works similarly to vectors)


  2. Dollar sign ($) notation

Bracket Notation

  • The syntax is: data[row number, column number]


  • Let’s look at some example cases
# Get first column
my_df[,1]

# Get first row
my_df[1,]

# Get the value in the tenth row and fourth column
my_df[10, 4]


  • Note that concatenation works here too!
    • my_df[c(1, 2, 3), 1] would get rows 1 through 3 of column 1
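
The cases above can be tried on a small data frame. This is only a sketch: the my_df object below and its columns are hypothetical, made up so the code runs on its own.

```r
# Hypothetical data frame for illustration
my_df <- data.frame(
  species = c("oak", "maple", "pine", "birch"),
  height_m = c(12.5, 9.1, 20.3, 7.8),
  count = c(4, 7, 2, 5)
)

# First column (all rows); a single column comes back as a vector
my_df[, 1]

# First row (all columns); this comes back as a one-row data frame
my_df[1, ]

# Value in the third row and second column
my_df[3, 2]

# Rows 1 through 3 of column 1, using c() or the shorthand 1:3
my_df[c(1, 2, 3), 1]
my_df[1:3, 1]
```

Leaving one side of the comma empty means "all rows" or "all columns", which is why the comma is still needed.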

Dollar Sign Notation

  • The syntax is: data$column


  • Let’s look at an example
# Get the column titled "species"
my_df$species


  • Note that this does not work for rows!
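
As a quick sketch of dollar sign notation in use (the data frame below is hypothetical and made up for illustration):

```r
# Hypothetical data frame for illustration
my_df <- data.frame(
  species = c("oak", "maple", "pine"),
  count = c(4, 7, 2)
)

# Pull out one column by name (returns a vector)
my_df$species

# The result is an ordinary vector, so functions and brackets work on it
mean(my_df$count)
my_df$species[2]

# To get a row, use bracket notation instead
my_df[2, ]
```

The two notations combine well: the dollar sign picks the column, and brackets pick positions within it.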

Practice: Base R Data Exploration

  1. Using bracket notation:
    • Access the 7th row of the minnow data
    • Access the 5th column of the minnow data
    • What is the value in the 21st row and 3rd column?


  2. Using dollar sign notation:
    • Check the “diameter” column
    • Look at the “species” column

Tidyverse Background

  • Ecosystem of inter-related packages & functions
    • Very human-readable
    • Extremely popular & commonly-used

Hex logos for the tidyverse, dplyr, ggplot2, tidyr, purrr, readr, tibble, and magrittr R packages

dplyr Part 1: filter

  • Remember our discussion of conditionals last week?
    • Operators include: == (equal to), | (or), and & (and)


  • Subset using conditionals with filter
    • dplyr::filter is the Tidyverse counterpart to the base R subset function


# Subset to only butterfly milkweed records
milkweed <- filter(.data = flowers, species == "Asclepias tuberosa")


  • Can use filter instead of subset just to live fully in the Tidyverse
    • Just a style choice, so your call!
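
A runnable sketch of filter with each of the conditional operators, assuming the dplyr package is installed. The flowers data frame below is hypothetical: its rows and the height_cm column are invented for illustration.

```r
library(dplyr)

# Hypothetical data for illustration
flowers <- data.frame(
  species = c("Asclepias tuberosa", "Echinacea purpurea",
              "Asclepias tuberosa", "Rudbeckia hirta"),
  height_cm = c(60, 90, 45, 70)
)

# One condition: only butterfly milkweed records
milkweed <- filter(.data = flowers, species == "Asclepias tuberosa")

# Base R equivalent using subset
milkweed_base <- subset(x = flowers, subset = species == "Asclepias tuberosa")

# "Or" (|): keep rows that match either species
two_species <- filter(.data = flowers,
                      species == "Asclepias tuberosa" | species == "Rudbeckia hirta")

# "And" (&): both conditions must be true
tall_milkweed <- filter(.data = flowers,
                        species == "Asclepias tuberosa" & height_cm > 50)

# Compare how many rows each version keeps
nrow(milkweed)
nrow(two_species)
nrow(tall_milkweed)
```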

dplyr Part 2: mutate

  • Make new columns with mutate


  • Can create multiple columns at the same time
df_v2 <- mutate(.data = df_v1, new1 = old1 + 2,
                               new2 = old2 * 10,
                               new3 = new1 / new2)


  • Has optional .after argument to specify where you want the new column
df_v2 <- mutate(.data = df_v1,
                weight_lb = weight_kg * 2.2,
                .after = weight_kg)
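
To try mutate end to end, here is a sketch with a made-up df_v1 (the animal and age_yr columns and the weight_oz conversion are assumptions added for illustration). It assumes dplyr version 1.0.0 or later, where the .after argument is available.

```r
library(dplyr)

# Hypothetical starting data
df_v1 <- data.frame(
  animal = c("fox", "hare", "owl"),
  weight_kg = c(6, 4, 2),
  age_yr = c(3, 1, 5)
)

# Create two columns at once; the second uses the first,
# and both are placed right after weight_kg
df_v2 <- mutate(.data = df_v1,
                weight_lb = weight_kg * 2.2,
                weight_oz = weight_lb * 16,
                .after = weight_kg)

# Check the new column order
names(df_v2)
```

Columns are created in the order they are written, which is why weight_oz can use weight_lb within the same call.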

Column Naming Aside

  • Avoid spaces or hyphens (-) in column names
    • R reads a space as a separator and a hyphen as a minus sign, so these names must be wrapped in backticks every time you use them
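
A short sketch of what this looks like in practice, using base R only. The column names and values are made up, and check.names = FALSE is used here only to force R to keep the awkward name.

```r
# A column name with a space (R would otherwise convert it to nest.diameter)
bad <- data.frame(`nest diameter` = c(1.2, 0.8), check.names = FALSE)

# Every use of the column needs backticks
bad$`nest diameter`

# An underscore avoids the problem entirely
good <- data.frame(nest_diameter = c(1.2, 0.8))
good$nest_diameter
```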

Comic depicting multiple case options used in coding as the things they're named after

dplyr Part 3: select

  • Pick columns to keep or remove with select


  • Name the columns you want to keep, or put a minus sign (-) in front of the ones to remove
# Keep only species information and count columns
df_v3a <- select(.data = df_v2, species, count)

# Remove the weight column
df_v3b <- select(.data = df_v2, -weight_kg)


  • Notice that column names are not in quotes
    • This is one of the special properties of the Tidyverse
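
Here is a runnable sketch of both uses of select, assuming dplyr is installed. The df_v2 object below is a hypothetical stand-in with invented values.

```r
library(dplyr)

# Hypothetical data for illustration
df_v2 <- data.frame(
  species = c("fox", "hare"),
  count = c(2, 9),
  weight_kg = c(6, 4)
)

# Keep columns by naming them (no quotes needed)
df_v3a <- select(.data = df_v2, species, count)

# Drop a column with a minus sign; everything else is kept
df_v3b <- select(.data = df_v2, -weight_kg)

# With this particular data, both approaches leave the same columns
names(df_v3a)
names(df_v3b)
```

Naming the columns to keep is usually shorter when you want only a few, while the minus sign is shorter when you are dropping only a few.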

Practice: Wrangling with dplyr

dplyr R package hex logo

  1. Filter the minnow data to only cases where the species is Stoneroller or Chub
  2. For that subset, make new columns where river depth and fish nest diameter are in meters
  3. Next, keep only the transect, species, diameter in meters, and depth in meters columns
    • There are two ways of doing this; can you identify them both?


  4. Check your work! What are the dimensions of the final data object?
    • Should be 14 rows and 4 columns

Temperature Check

How are you Feeling?

Comic-style graph depicting someone's emotional state as they debug code (from initial struggle and defeat to eventual triumph)

Upcoming Due Dates

Due before lab

(By midnight)

  • Muddiest Point #3

Due before lecture

(By midnight)

  • Homework #3
  • Pick 7-10 possible functions for Function Tutorial assignment
    • Visit: cran.r-project.org
    • Click “Packages” in left sidebar
    • Click “Table of available packages, sorted by name”
    • Your possible functions must be from these packages!