# Load needed library
library(tidyverse)
Joining Data
Library Loading
Begin by loading any needed libraries.
Load the tidyverse
meta-package.
Load the pandas
and os
libraries.
# Load needed libraries
import os
import pandas as pd
Combining DataFrames
Sometimes we collected related data and store them in separate files. This necessitates integrating the two datasets later on for statistics and/or visualization. If the two datasets that are sampled at very different frequencies (e.g., annual temperature values and daily insect counts), trying to include both in a single file results in duplicating the less granular data many times. This is not ideal. Fortunately, scripted languages provide several methods for combining data easily and appropriately so that they can be used together despite being stored separately.
To illustrate some of these methods we’ll load some new simulated data on lichen coverage to use instead of the vertebrate data we’ve used in past modules.
Load the vertebrate data first.
# Load data
<- read.csv(file = file.path("data", "verts.csv"))
vert_r
# Check out first few rows
head(vert_r, n = 2)
year sitecode section reach pass unitnum unittype vert_index pitnumber
1 1987 MACKCC-L CC L 1 1 R 1 NA
2 1987 MACKCC-L CC L 1 1 R 2 NA
species length_1_mm length_2_mm weight_g clip sampledate notes
1 Cutthroat trout 58 NA 1.75 NONE 1987-10-07
2 Cutthroat trout 61 NA 1.95 NONE 1987-10-07
Then load the dataset with lichen community composition on trees.
# Load data
<- read.csv(file = file.path("data", "tree_lichen.csv"))
lich
# Check out rows rows
head(lich, n = 2)
tree lichen_foliose lichen_fruticose lichen_crustose
1 Tree_A 1.00 0.9 0.95
2 Tree_B 0.35 1.0 0.00
And finally load the data that includes distance from the nearest road for some of the same trees for which we have lichen data.
# Load data
<- read.csv(file = file.path("data", "tree_road.csv"))
road
# Check out rows
head(road, n = 2)
tree_name dist_to_road_m
1 Tree_A 13
2 Tree_C 10
Load the vertebrate data first.
# Load data
= pd.read_csv(os.path.join("data", "verts.csv"))
vert_py
# Check out first few rows
2) vert_py.head(
year sitecode section reach ... weight_g clip sampledate notes
0 1987 MACKCC-L CC L ... 1.75 NONE 1987-10-07 NaN
1 1987 MACKCC-L CC L ... 1.95 NONE 1987-10-07 NaN
[2 rows x 16 columns]
Then load the dataset with lichen community composition on trees.
# Load data
= pd.read_csv(os.path.join("data", "tree_lichen.csv"))
lich
# Check out rows
2) lich.head(
tree lichen_foliose lichen_fruticose lichen_crustose
0 Tree_A 1.00 0.9 0.95
1 Tree_B 0.35 1.0 0.00
And finally load the data that includes distance from the nearest road for some of the same trees for which we have lichen data.
# Load data
= pd.read_csv(os.path.join("data", "tree_road.csv"))
road
# Check out rows
2) road.head(
tree_name dist_to_road_m
0 Tree_A 13
1 Tree_C 10
Concatenating Data
The simplest way of combining data in either Python or R is called “concatenation”. This involves–essentially–pasting rows or columns of separate data variables/objects together.
We’ll need to modify our vertebrate data somewhat in order to demonstrate the two modes (horizontal or vertical) of concatenating DataFrames/data.frames.
Concatenating data (either horizontally or vertically) in this way is deeply risky in that it can easily result in creating rows or columns that appear to relate to one another but in actuality do not. However, it is still worthwhile to cover how this can be done.
Split the first and last two rows of the vertebrate data into separate objects. Note that in each we’ll want to retain all columns.
# Split the two datasets
<- vert_r[1:2, ]
vert_r_top <- vert_r[(nrow(vert_r) - 1):nrow(vert_r), ]
vert_r_bottom
# Look at one
vert_r_bottom
year sitecode section reach pass unitnum unittype vert_index pitnumber
32208 2019 MACKOG-U OG U 2 16 C 25 1043583
32209 2019 MACKOG-U OG U 2 16 C 26 1043500
species length_1_mm length_2_mm weight_g clip sampledate
32208 Coastal giant salamander 74 131 14.3 NONE 2019-09-05
32209 Coastal giant salamander 73 128 11.6 NONE 2019-09-05
notes
32208
32209 Terrestrial
Split the first and last two rows of the vertebrate data into separate variables. Note that in each we’ll want to retain all columns.
# Split the two datasets
= vert_py.iloc[[0, 1], :]
vert_py_top = vert_py.iloc[[-1, -2], :]
vert_py_bottom
# Look at one
vert_py_bottom
year sitecode section reach ... weight_g clip sampledate notes
32208 2019 MACKOG-U OG U ... 11.6 NONE 2019-09-05 Terrestrial
32207 2019 MACKOG-U OG U ... 14.3 NONE 2019-09-05 NaN
[2 rows x 16 columns]
Vertical Concatenation
Vertical concatenation (i.e., concatenating by stacking rows on top of each other) is one option for concatenation. This is much more common than horizontal concatenation.
R can perform this operation with the rbind
function (from base R) but the bind_rows
function (from the dplyr
package) is preferable because it checks for matching column names and–if necessary–reorders columns to match between the two data objects.
# Combine vertically
<- dplyr::bind_rows(vert_r_top, vert_r_bottom)
vert_r_vertical
# Check shape before and after to demonstrate it worked
message("There were two rows each before concatenation and ", nrow(vert_r_vertical), " after")
There were two rows each before concatenation and 4 after
Python does horizontal concatenation with the concat
function from the pandas
library. This function does horizontal and vertical concatenation and uses the axis
argument to determine which is done. Setting axis
to 0
performs vertical concatenation.
# Combine vertically
= pd.concat([vert_py_top, vert_py_bottom], axis = 0)
vert_py_vertical
# Check shape before and after to demonstrate it worked
print("There were two rows each before concatenation and", len(vert_py_vertical), "after")
There were two rows each before concatenation and 4 after
Horizontal Concatenation
Horizontal concatenation (i.e., concatenating by putting columns next to one another) is the other option for concatenation. Note that it assumes row orders are consistent and won’t perform any check for improper row combinations. Also, both languages will create duplicate column labels/names in our example because both data variables/objects have the same column labels/names.
R does horizontal concatenation with the cbind
function from base R.
# Combine horizontally
<- cbind(vert_r_top, vert_r_bottom)
vert_r_horiz
# Check columns to show they were added
names(vert_r_horiz)
[1] "year" "sitecode" "section" "reach" "pass"
[6] "unitnum" "unittype" "vert_index" "pitnumber" "species"
[11] "length_1_mm" "length_2_mm" "weight_g" "clip" "sampledate"
[16] "notes" "year" "sitecode" "section" "reach"
[21] "pass" "unitnum" "unittype" "vert_index" "pitnumber"
[26] "species" "length_1_mm" "length_2_mm" "weight_g" "clip"
[31] "sampledate" "notes"
Setting the axis
argument to 1
is how the concat
function is switched to horizontal concatenation.
# Combine horizontally
= pd.concat([vert_py_top, vert_py_bottom], axis = 1)
vert_py_horiz
# Check columns to show they were added
vert_py_horiz.columns
Index(['year', 'sitecode', 'section', 'reach', 'pass', 'unitnum', 'unittype',
'vert_index', 'pitnumber', 'species', 'length_1_mm', 'length_2_mm',
'weight_g', 'clip', 'sampledate', 'notes', 'year', 'sitecode',
'section', 'reach', 'pass', 'unitnum', 'unittype', 'vert_index',
'pitnumber', 'species', 'length_1_mm', 'length_2_mm', 'weight_g',
'clip', 'sampledate', 'notes'],
dtype='object')
Joins
While concatentation is simple and effective in some cases, a less risky mode of combining separate data objects is to use “joins”. Joins allow two data variables/objects to be combined in an appropriate order based on a column found in both data variables/objects. This shared column (or column_s_) are known as “join keys”.
This module makes use of simulated data on lichen coverage on various trees and the distance of those trees to the nearest road. Lichen wasn’t surveyed on all of the trees and not all trees that did have lichen measured had their distance to the nearest road recorded. Vertical concatenation is inappropriate because the column labels/names and horizontal concatenation is incorrect because we want to make sure we only combine data from the same tree across the two datasets.
There are several variants of joins that each function slightly differently so we’ll discuss each below. Note that all joins only combine two data variables/objects at a time and they are always referred to as the “left” (first) and “right” (second).
Left / Right Joins
Left joins combine two data variables/objects by their join key(s) but only keep rows that are found in the “left”. A right join performs the same operation but prioritizes rows in the second data variable/object. Note that functionally you can make a left join into a right join by switching which data variable/object you assign to “left” versus “right”.
The left_join
function (from the dplyr
package) performs–as its name suggests–left joins. The right_join
function does the same for right joins. Both have an x
and a y
argument for the left and right data objects respectively.
The by
argument expects a vector of join keys to by which to join. Note that if the join key column names differ–as is the case with our example data–you can specify that by using the syntax "left key" = "right key"
.
# Do a left join
<- dplyr::left_join(x = lich, y = road, by = c("tree" = "tree_name"))
left_r
# Check that out
head(left_r, n = 5)
tree lichen_foliose lichen_fruticose lichen_crustose dist_to_road_m
1 Tree_A 1.00 0.9 0.95 13
2 Tree_B 0.35 1.0 0.00 NA
3 Tree_C 0.20 0.0 0.05 10
4 Tree_D 0.55 0.9 0.85 23
5 Tree_E 0.85 0.9 0.25 20
See how “tree_B” has no distance to road listed? That is because it lacks a row in the right data object!
All Python joins are performed with the merge
function (from the pandas
library) and the type of join is specified by the how
argument. Which data variable is left and right is specified by arguments of the same name.
The on
argument can be used when the join key(s) labels match but otherwise a left_on
and right_on
argument are provided to specify the join key’s label in the left and right data variable respectively.
# Do a left join
= pd.merge(left = lich, right = road, how = "left",
left_py = "tree", right_on = "tree_name")
left_on
# Check that out
5) left_py.head(
tree lichen_foliose ... tree_name dist_to_road_m
0 Tree_A 1.00 ... Tree_A 13.0
1 Tree_B 0.35 ... NaN NaN
2 Tree_C 0.20 ... Tree_C 10.0
3 Tree_D 0.55 ... Tree_D 23.0
4 Tree_E 0.85 ... Tree_E 20.0
[5 rows x 6 columns]
See how “tree_B” has no distance to road listed? That is because it lacks a row in the right data variable!
Inner Joins
An inner join keeps only the rows that can be found in both join keys. This is very useful in situations where only “complete” data (i.e., no missing values) is required.
In R, we can do an inner join with dplyr
’s inner_join
function. It has the same arguments as the left_join
function.
# Inner join the two datasets
<- dplyr::inner_join(x = lich, y = road, by = c("tree" = "tree_name"))
in_r
# Check that out
head(in_r, n = 5)
tree lichen_foliose lichen_fruticose lichen_crustose dist_to_road_m
1 Tree_A 1.00 0.90 0.95 13
2 Tree_C 0.20 0.00 0.05 10
3 Tree_D 0.55 0.90 0.85 23
4 Tree_E 0.85 0.90 0.25 20
5 Tree_L 0.75 0.35 1.00 17
In Python, an inner join is actually the default behavior of the merge
function. Therefore we can exclude the how
argument entirely!
# Inner join the two datasets
= pd.merge(left = lich, right = road,
in_py = "tree", right_on = "tree_name")
left_on
# Check that out
5) in_py.head(
tree lichen_foliose ... tree_name dist_to_road_m
0 Tree_A 1.00 ... Tree_A 13
1 Tree_C 0.20 ... Tree_C 10
2 Tree_D 0.55 ... Tree_D 23
3 Tree_E 0.85 ... Tree_E 20
4 Tree_L 0.75 ... Tree_L 17
[5 rows x 6 columns]
Full Joins
A full join (sometimes known as an “outer join” or a “full outer join”) combines all rows in both left and right data variables/objects regardless of whether the join key entries are found in both.
R uses the full_join
function (from dplyr
) and the same arguments as the other types of join that we’ve already discussed.
# Do a full join
<- dplyr::full_join(x = lich, y = road, by = c("tree" = "tree_name"))
full_r
# Check that out
head(full_r, n = 5)
tree lichen_foliose lichen_fruticose lichen_crustose dist_to_road_m
1 Tree_A 1.00 0.9 0.95 13
2 Tree_B 0.35 1.0 0.00 NA
3 Tree_C 0.20 0.0 0.05 10
4 Tree_D 0.55 0.9 0.85 23
5 Tree_E 0.85 0.9 0.25 20
Python refers to full joins as “outer” joins so the how
argument of the merge
function needs to be set to “outer”.
# Do a full join
= pd.merge(left = lich, right = road, how = "outer",
full_py = "tree", right_on = "tree_name")
left_on
# Check that out
5) full_py.head(
tree lichen_foliose ... tree_name dist_to_road_m
0 Tree_A 1.00 ... Tree_A 13.0
1 Tree_B 0.35 ... NaN NaN
2 Tree_C 0.20 ... Tree_C 10.0
3 Tree_D 0.55 ... Tree_D 23.0
4 Tree_E 0.85 ... Tree_E 20.0
[5 rows x 6 columns]