Load the tidyverse meta-package as well as our vertebrate data.
# Load needed librarylibrary(tidyverse)# Load datavert_r <- utils::read.csv(file =file.path("data", "verts.csv"))# Check out first few rowshead(vert_r, n =5)
year sitecode section reach pass unitnum unittype vert_index pitnumber
1 1987 MACKCC-L CC L 1 1 R 1 NA
2 1987 MACKCC-L CC L 1 1 R 2 NA
3 1987 MACKCC-L CC L 1 1 R 3 NA
4 1987 MACKCC-L CC L 1 1 R 4 NA
5 1987 MACKCC-L CC L 1 1 R 5 NA
species length_1_mm length_2_mm weight_g clip sampledate notes
1 Cutthroat trout 58 NA 1.75 NONE 1987-10-07
2 Cutthroat trout 61 NA 1.95 NONE 1987-10-07
3 Cutthroat trout 89 NA 5.60 NONE 1987-10-07
4 Cutthroat trout 58 NA 2.15 NONE 1987-10-07
5 Cutthroat trout 93 NA 6.90 NONE 1987-10-07
Load the pandas and os libraries as well as our vertebrate data.
# Load needed librariesimport osimport pandas as pd# Load datavert_py = pd.read_csv(os.path.join("data", "verts.csv"))# Check out first few rowsvert_py.head()
year sitecode section reach ... weight_g clip sampledate notes
0 1987 MACKCC-L CC L ... 1.75 NONE 1987-10-07 NaN
1 1987 MACKCC-L CC L ... 1.95 NONE 1987-10-07 NaN
2 1987 MACKCC-L CC L ... 5.60 NONE 1987-10-07 NaN
3 1987 MACKCC-L CC L ... 2.15 NONE 1987-10-07 NaN
4 1987 MACKCC-L CC L ... 6.90 NONE 1987-10-07 NaN
[5 rows x 16 columns]
Defining Groups
Now that we have data, we can use it to demonstrate groupwise summarization! Both Python and R support defining the grouping structure in one step and then doing the actual summarization in subsequent steps.
Let’s suppose that we want to calculate average weight and length within each species and year.
In R, we use the group_by function (from dplyr) to define our grouping variables. Note that this changes the class from data.frame to tibble (a special dataframe-like class defined by the tidyverse in part due to this operation).
group_by–like other functions in the tidyverse–allows for ‘tidy select’ column names where column names are provided without quotes.
# Define grouping variablesvert_r_grp <- dplyr::group_by(.data = vert_r, year, species)# Check class of resulting objectclass(vert_r_grp)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
In Python, we can use the groupby method (available to all variables of type DataFrame). This method does accept multiple column labels but they must be provided as a list (i.e., wrapped in square brackets) and the labels themselves must be wrapped in quotes.
# Define grouping variablesvert_py_grp = vert_py.groupby(["year", "species"])# Check type of resulting variabletype(vert_py_grp)
Once we’ve established the columns for which we want to calculate summary statistics we can move on to the tools that actually allow summarization steps to take place. Note that summarization implicitly means that we should lose rows because we should only have one row per combination of grouping column content combination. Adding columns is not necessarily summarization if it doesn’t shed rows.
R allows for multiple summary metrics to be calculate simultaneously within dplyr’s summarize function.
# Calculate average weight and lengthvert_r_summary <- dplyr::summarize(.data = vert_r_grp,weight_g =mean(weight_g, na.rm = T),length_1_mm =mean(length_1_mm, na.rm = T),length_2_mm =mean(length_2_mm, na.rm = T))# Check out the first bit of thathead(vert_r_summary, n =3)
# A tibble: 3 × 5
# Groups: year [3]
year species weight_g length_1_mm length_2_mm
<int> <chr> <dbl> <dbl> <dbl>
1 1987 Cutthroat trout 8.96 90.2 NaN
2 1988 Cutthroat trout 16.1 115. NaN
3 1989 Cutthroat trout 14.3 107. NaN
When summarizing data in Python we need to calculate each metric separately and then combine them. Apologies that this uses the concat function in pandas that we have not previously covered but you can see a greater discussion of these functions/methods in the “Join” section.
Note the use of two modes of indexing a particular column.
# Calculate average weight and lengthmean_wt = vert_py_grp["weight_g"].mean()mean_ln1 = vert_py_grp.length_1_mm.mean()mean_ln2 = vert_py_grp["length_2_mm"].mean()# Combine themvert_py_summary = pd.concat([mean_wt, mean_ln1, mean_ln2], axis =1)# Do some small index reformatting stepsvert_py_summary = vert_py_summary.reset_index().rename_axis(mapper =None, axis =1)# Check out thatvert_py_summary.head(3)
year species weight_g length_1_mm length_2_mm
0 1987 Cutthroat trout 8.959867 90.170813 NaN
1 1988 Cutthroat trout 16.073510 115.417219 NaN
2 1989 Cutthroat trout 14.308766 107.038961 NaN
The index re-setting is needed for Python just to make sure the columns and rows are indexed as they should be.