library(tidyverse)
Subsetting Data
Library Loading
Begin by loading any needed libraries.
Load the tidyverse
meta-package.
Load the pandas
and os
libraries.
# Load needed libraries
import os
import pandas as pd
Conditionals
Often when we work with data we want to retrieve only the rows that meet some condition(s). These conditions can be defined in code by using “relational operators”. Fortunately, the fundamental relational operators are the same between Python and R!
Conditional statements always return “boolean” [BOO-lee-un] values. These are type boolean and class logical in Python and R respectively.
In R TRUE
and FALSE
are booleans. Note that their abbreviations T
and F
are also accepted though they must be capitalized.
# Assess whether a number equals itself and assign to an object
<- 20 == 20
bool_r
# See what that object contains
bool_r
[1] TRUE
# And check class
class(bool_r)
[1] "logical"
In Python True
or False
are booleans. Note that they must be capitalized in this way to register as the correct type (i.e., only first letter capitalized).
# Assess whether a number equals itself and assign to an object
= 20 == 20
bool_py
# See what that object contains
bool_py
True
# Also check type
type(bool_py)
<class 'bool'>
In addition to asking whether something ‘is exactly equal to’ something else (==
), we can also ask whether two values are not equal (!=
) or pose greater/less than conditionals (>
/<
).
# Ask whether a number is less than or equal to a particular value
7 <= 5
[1] FALSE
# Ask whether a number is less than or equal to a particular value
7 <= 5
False
Subsetting with One Condition
We can leverage these conditional values to return only rows of a data table that meet criteria that are valuable to us. Let’s begin by loading the external dataset derived from the lterdatasampler
R package (see here) that we used in the previous section.
Read in vertebrate data CSV and check out the first few rows.
# Load data
<- utils::read.csv(file = file.path("data", "verts.csv"))
vert_r
# Check out first few rows
head(vert_r, n = 3)
year sitecode section reach pass unitnum unittype vert_index pitnumber
1 1987 MACKCC-L CC L 1 1 R 1 NA
2 1987 MACKCC-L CC L 1 1 R 2 NA
3 1987 MACKCC-L CC L 1 1 R 3 NA
species length_1_mm length_2_mm weight_g clip sampledate notes
1 Cutthroat trout 58 NA 1.75 NONE 1987-10-07
2 Cutthroat trout 61 NA 1.95 NONE 1987-10-07
3 Cutthroat trout 89 NA 5.60 NONE 1987-10-07
Recall from our file path module that the file.path
function accounts for computer operating system differences.
Read in vertebrate data CSV (remember we must namespace the read_csv
function) and check out the first few rows.
# Load data
= pd.read_csv(os.path.join("data", "verts.csv"))
vert_py
# Check out first few rows
3) vert_py.head(
year sitecode section reach ... weight_g clip sampledate notes
0 1987 MACKCC-L CC L ... 1.75 NONE 1987-10-07 NaN
1 1987 MACKCC-L CC L ... 1.95 NONE 1987-10-07 NaN
2 1987 MACKCC-L CC L ... 5.60 NONE 1987-10-07 NaN
[3 rows x 16 columns]
Recall from our file path module that the join
function accounts for computer operating system differences.
Now that we have some data we can use conditional statements to actually subset it! We’ll use the len
and nrow
functions in Python and R respectively to demonstrate that we successfully subset. Either will show whether we’ve successfully removed the rows that don’t meet our criteria.
Let’s subset to only one particular site of vertebrate data. Note that base R does include a subset
function but we’ll use the equivalent filter
function from the dplyr
package.
# Subset to only site "MACKOG-U"
<- dplyr::filter(vert_r, sitecode == "MACKOG-U")
r_sub1
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_sub1))
Subsetting changed this to 5726
Let’s subset to only one particular site of vertebrate data.
# Subset to only site "MACKOG-U"
= vert_py[vert_py.sitecode == "MACKOG-U"]
py_sub1
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_sub1))
Subsetting changed this to 5726
Subsetting with Multiple Conditions
If desired we can also specify multiple criteria to subset by. We must decide whether all criteria must be met or if any criterion being met is sufficient to retain the row. If we want all (in either language) we need to separate each criterion with an ampersand (&
). If instead we want any (in either language) we need to separate each criterion with a pipe (|
); note that this is not the same as a “pipe operator” which we’ll discuss in detail later.
If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a &
between conditions.
# Subset to only data from after 1990 before 1995 (inclusive)
<- dplyr::filter(vert_r, year >= 1990 & year <= 1995)
r_sub2
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_sub2))
Subsetting changed this to 4161
If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a &
between conditions. Note that when we specify multiple criteria in Python we must wrap each conditional in parentheses.
# Subset to only data from after 1990 before 1995 (inclusive)
= vert_py[(vert_py.year >= 1990) & (vert_py.year <= 1995)]
py_sub2
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_sub2))
Subsetting changed this to 4161
Subsetting with Masks
We can define “boolean masks” when we want to leverage a helper function that tests for specific crtieria. For instance we can use the isin
method or the %in%
operator (in Python and R respectively) to ask whether the value in a given row matches any of a set of options. This is much simpler than writing out one “or” criterion for every option individually.
Let’s subset to only salamander species in the dataset using a mask. In R, that means providing the column name to the left of the %in%
operator and giving a vector of options to the right.
# Make the subset
<- dplyr::filter(vert_r, species %in% c("Coastal giant salamander", "Cascade torrent salamander"))
r_mask1
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_mask1))
Subsetting changed this to 11773
Let’s subset to only salamander species in the dataset using a mask. In Python, isin
is a method so it is appended to the right of the column it accesses (separated by a period) and the options must be provided as a list (i.e., wrapped in square brackets and separated by commas).
# Make the subset
= vert_py[vert_py.species.isin(["Coastal giant salamander", "Cascade torrent salamander"])]
py_mask1
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_mask1))
Subsetting changed this to 11773
Negating Criteria
Sometimes it is helpful to negate criteria and subset to only conditions that don’t meet our criteria. This works with either simpler conditional statements or with masks! The symbol we use to do this negation differs between the two languages but in either it is placed to the left of the criterion it is negating.
In the below example we’ll use the isnull
function or the is.na
function to see whether a column contains missing values and invert it to return only rows where the specified column is not missing values.
In R we negate conditions by adding an exclamation point (!
) to the left of the relevant criterion.
# Subset to only species that *have* a recorded value in the "length_2_mm" column
<- dplyr::filter(vert_r, !is.na(length_2_mm))
r_mask2
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_mask2))
Subsetting changed this to 12560
In Python we negate conditions by adding a tilde (~
) to the left of the relevant criterion.
# Subset to only species that *have* a recorded value in the "length_2_mm" column
= vert_py[~pd.isnull(vert_py.length_2_mm)]
py_mask2
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_mask2))
Subsetting changed this to 12560