Summarizing Data

Library & Data Loading

Begin by loading any needed libraries and reading in an external data file for use in downstream examples.

Load the tidyverse meta-package as well as our vertebrate data.

# Load needed library
library(tidyverse)

# Load data
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))

# Check out first few rows
head(vert_r, n = 5)
  year sitecode section reach pass unitnum unittype vert_index pitnumber
1 1987 MACKCC-L      CC     L    1       1        R          1        NA
2 1987 MACKCC-L      CC     L    1       1        R          2        NA
3 1987 MACKCC-L      CC     L    1       1        R          3        NA
4 1987 MACKCC-L      CC     L    1       1        R          4        NA
5 1987 MACKCC-L      CC     L    1       1        R          5        NA
          species length_1_mm length_2_mm weight_g clip sampledate notes
1 Cutthroat trout          58          NA     1.75 NONE 1987-10-07      
2 Cutthroat trout          61          NA     1.95 NONE 1987-10-07      
3 Cutthroat trout          89          NA     5.60 NONE 1987-10-07      
4 Cutthroat trout          58          NA     2.15 NONE 1987-10-07      
5 Cutthroat trout          93          NA     6.90 NONE 1987-10-07      

Load the pandas and os libraries as well as our vertebrate data.

# Load needed libraries
import os
import pandas as pd

# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))

# Check out first few rows
vert_py.head()
   year  sitecode section reach  ...  weight_g  clip  sampledate  notes
0  1987  MACKCC-L      CC     L  ...      1.75  NONE  1987-10-07    NaN
1  1987  MACKCC-L      CC     L  ...      1.95  NONE  1987-10-07    NaN
2  1987  MACKCC-L      CC     L  ...      5.60  NONE  1987-10-07    NaN
3  1987  MACKCC-L      CC     L  ...      2.15  NONE  1987-10-07    NaN
4  1987  MACKCC-L      CC     L  ...      6.90  NONE  1987-10-07    NaN

[5 rows x 16 columns]

Defining Groups

Now that we have data, we can use it to demonstrate groupwise summarization! Both Python and R support defining the grouping structure in one step and then doing the actual summarization in subsequent steps.

Let’s suppose that we want to calculate average weight and length within each species and year.

In R, we use the group_by function (from dplyr) to define our grouping variables. Note that this changes the class from data.frame to tibble (a special dataframe-like class defined by the tidyverse in part due to this operation).

group_by–like other functions in the tidyverse–allows for ‘tidy select’ column names where column names are provided without quotes.

# Define grouping variables
vert_r_grp <- dplyr::group_by(.data = vert_r, year, species)

# Check class of resulting object
class(vert_r_grp)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

In Python, we can use the groupby method (available to all variables of type DataFrame). This method does accept multiple column labels but they must be provided as a list (i.e., wrapped in square brackets) and the labels themselves must be wrapped in quotes.

# Define grouping variables
vert_py_grp = vert_py.groupby(["year", "species"])

# Check type of resulting variable
type(vert_py_grp)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

Groupwise Summarization

Once we’ve established the columns for which we want to calculate summary statistics we can move on to the tools that actually allow summarization steps to take place. Note that summarization implicitly means that we should lose rows because we should only have one row per combination of grouping column content combination. Adding columns is not necessarily summarization if it doesn’t shed rows.

R allows for multiple summary metrics to be calculate simultaneously within dplyr’s summarize function.

# Calculate average weight and length
vert_r_summary <- dplyr::summarize(.data = vert_r_grp,
                 weight_g = mean(weight_g, na.rm = T),
                 length_1_mm = mean(length_1_mm, na.rm = T),
                 length_2_mm = mean(length_2_mm, na.rm = T))

# Check out the first bit of that
head(vert_r_summary, n = 3)
# A tibble: 3 × 5
# Groups:   year [3]
   year species         weight_g length_1_mm length_2_mm
  <int> <chr>              <dbl>       <dbl>       <dbl>
1  1987 Cutthroat trout     8.96        90.2         NaN
2  1988 Cutthroat trout    16.1        115.          NaN
3  1989 Cutthroat trout    14.3        107.          NaN

When summarizing data in Python we need to calculate each metric separately and then combine them. Apologies that this uses the concat function in pandas that we have not previously covered but you can see a greater discussion of these functions/methods in the “Join” section.

Note the use of two modes of indexing a particular column.

# Calculate average weight and length
mean_wt = vert_py_grp["weight_g"].mean()
mean_ln1 = vert_py_grp.length_1_mm.mean()
mean_ln2 = vert_py_grp["length_2_mm"].mean()

# Combine them
vert_py_summary = pd.concat([mean_wt, mean_ln1, mean_ln2], axis = 1)

# Do some small index reformatting steps
vert_py_summary = vert_py_summary.reset_index().rename_axis(mapper = None, axis = 1)

# Check out that
vert_py_summary.head(3)
   year          species   weight_g  length_1_mm  length_2_mm
0  1987  Cutthroat trout   8.959867    90.170813          NaN
1  1988  Cutthroat trout  16.073510   115.417219          NaN
2  1989  Cutthroat trout  14.308766   107.038961          NaN

The index re-setting is needed for Python just to make sure the columns and rows are indexed as they should be.