Subsetting Data

Library Loading

Begin by loading any needed libraries.

Load the tidyverse meta-package.

library(tidyverse)

Load the pandas and os libraries.

# Load needed libraries
import os
import pandas as pd

Conditionals

Often when we work with data we want to retrieve only the rows that meet some condition(s). These conditions can be defined in code by using “relational operators”. Fortunately, the fundamental relational operators are the same between Python and R!

Conditional statements always return “boolean” [BOO-lee-un] values. These are type boolean and class logical in Python and R respectively.

In R TRUE and FALSE are booleans. Note that their abbreviations T and F are also accepted though they must be capitalized.

# Assess whether a number equals itself and assign to an object
bool_r <- 20 == 20

# See what that object contains
bool_r
[1] TRUE
# And check class
class(bool_r)
[1] "logical"

In Python True or False are booleans. Note that they must be capitalized in this way to register as the correct type (i.e., only first letter capitalized).

# Assess whether a number equals itself and assign to an object
bool_py = 20 == 20

# See what that object contains
bool_py
True
# Also check type
type(bool_py)
<class 'bool'>

In addition to asking whether something ‘is exactly equal to’ something else (==), we can also ask whether two values are not equal (!=) or pose greater/less than conditionals (>/<).

# Ask whether a number is less than or equal to a particular value
7 <= 5
[1] FALSE
# Ask whether a number is less than or equal to a particular value
7 <= 5
False

Subsetting with One Condition

We can leverage these conditional values to return only rows of a data table that meet criteria that are valuable to us. Let’s begin by loading the external dataset derived from the lterdatasampler R package (see here) that we used in the previous section.

Read in vertebrate data CSV and check out the first few rows.

# Load data
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))

# Check out first few rows
head(vert_r, n = 3)
  year sitecode section reach pass unitnum unittype vert_index pitnumber
1 1987 MACKCC-L      CC     L    1       1        R          1        NA
2 1987 MACKCC-L      CC     L    1       1        R          2        NA
3 1987 MACKCC-L      CC     L    1       1        R          3        NA
          species length_1_mm length_2_mm weight_g clip sampledate notes
1 Cutthroat trout          58          NA     1.75 NONE 1987-10-07      
2 Cutthroat trout          61          NA     1.95 NONE 1987-10-07      
3 Cutthroat trout          89          NA     5.60 NONE 1987-10-07      

Recall from our file path module that the file.path function accounts for computer operating system differences.

Read in vertebrate data CSV (remember we must namespace the read_csv function) and check out the first few rows.

# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))

# Check out first few rows
vert_py.head(3)
   year  sitecode section reach  ...  weight_g  clip  sampledate  notes
0  1987  MACKCC-L      CC     L  ...      1.75  NONE  1987-10-07    NaN
1  1987  MACKCC-L      CC     L  ...      1.95  NONE  1987-10-07    NaN
2  1987  MACKCC-L      CC     L  ...      5.60  NONE  1987-10-07    NaN

[3 rows x 16 columns]

Recall from our file path module that the join function accounts for computer operating system differences.

Now that we have some data we can use conditional statements to actually subset it! We’ll use the len and nrow functions in Python and R respectively to demonstrate that we successfully subset. Either will show whether we’ve successfully removed the rows that don’t meet our criteria.

Let’s subset to only one particular site of vertebrate data. Note that base R does include a subset function but we’ll use the equivalent filter function from the dplyr package.

# Subset to only site "MACKOG-U"
r_sub1 <- dplyr::filter(vert_r, sitecode == "MACKOG-U")

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_sub1))
Subsetting changed this to 5726

Let’s subset to only one particular site of vertebrate data.

# Subset to only site "MACKOG-U"
py_sub1 = vert_py[vert_py.sitecode == "MACKOG-U"]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_sub1))
Subsetting changed this to 5726

Subsetting with Multiple Conditions

If desired we can also specify multiple criteria to subset by. We must decide whether all criteria must be met or if any criterion being met is sufficient to retain the row. If we want all (in either language) we need to separate each criterion with an ampersand (&). If instead we want any (in either language) we need to separate each criterion with a pipe (|); note that this is not the same as a “pipe operator” which we’ll discuss in detail later.

If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a & between conditions.

# Subset to only data from after 1990 before 1995 (inclusive)
r_sub2 <- dplyr::filter(vert_r, year >= 1990 & year <= 1995)

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_sub2))
Subsetting changed this to 4161

If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a & between conditions. Note that when we specify multiple criteria in Python we must wrap each conditional in parentheses.

# Subset to only data from after 1990 before 1995 (inclusive)
py_sub2 = vert_py[(vert_py.year >= 1990) & (vert_py.year <= 1995)]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_sub2))
Subsetting changed this to 4161

Subsetting with Masks

We can define “boolean masks” when we want to leverage a helper function that tests for specific crtieria. For instance we can use the isin method or the %in% operator (in Python and R respectively) to ask whether the value in a given row matches any of a set of options. This is much simpler than writing out one “or” criterion for every option individually.

Let’s subset to only salamander species in the dataset using a mask. In R, that means providing the column name to the left of the %in% operator and giving a vector of options to the right.

# Make the subset
r_mask1 <- dplyr::filter(vert_r, species %in% c("Coastal giant salamander", "Cascade torrent salamander"))

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_mask1))
Subsetting changed this to 11773

Let’s subset to only salamander species in the dataset using a mask. In Python, isin is a method so it is appended to the right of the column it accesses (separated by a period) and the options must be provided as a list (i.e., wrapped in square brackets and separated by commas).

# Make the subset
py_mask1 = vert_py[vert_py.species.isin(["Coastal giant salamander", "Cascade torrent salamander"])]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_mask1))
Subsetting changed this to 11773

Negating Criteria

Sometimes it is helpful to negate criteria and subset to only conditions that don’t meet our criteria. This works with either simpler conditional statements or with masks! The symbol we use to do this negation differs between the two languages but in either it is placed to the left of the criterion it is negating.

In the below example we’ll use the isnull function or the is.na function to see whether a column contains missing values and invert it to return only rows where the specified column is not missing values.

In R we negate conditions by adding an exclamation point (!) to the left of the relevant criterion.

# Subset to only species that *have* a recorded value in the "length_2_mm" column
r_mask2 <- dplyr::filter(vert_r, !is.na(length_2_mm))

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
Before subsetting the data had 32209 rows.
message("Subsetting changed this to ", nrow(r_mask2))
Subsetting changed this to 12560

In Python we negate conditions by adding a tilde (~) to the left of the relevant criterion.

# Subset to only species that *have* a recorded value in the "length_2_mm" column
py_mask2 = vert_py[~pd.isnull(vert_py.length_2_mm)]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
Before subsetting the data had 32209 rows.
print("Subsetting changed this to", len(py_mask2))
Subsetting changed this to 12560