Reshaping Data

Library & Data Loading

Begin by loading any needed libraries and reading in an external data file for use in downstream examples.

Load the tidyverse meta-package as well as our simulated lichen data.

# Load needed library
library(tidyverse)

# Load data
lich_r <- read.csv(file = file.path("data", "tree_lichen.csv"))

# Check out first few rows
head(lich_r, n = 3)
    tree lichen_foliose lichen_fruticose lichen_crustose
1 Tree_A           1.00              0.9            0.95
2 Tree_B           0.35              1.0            0.00
3 Tree_C           0.20              0.0            0.05

Load the pandas and os libraries as well as our simulated lichen data.

# Load needed libraries
import os
import pandas as pd

# Load data
lich_py = pd.read_csv(os.path.join("data", "tree_lichen.csv"))

# Check out first few rows
lich_py.head(3)
     tree  lichen_foliose  lichen_fruticose  lichen_crustose
0  Tree_A            1.00               0.9             0.95
1  Tree_B            0.35               1.0             0.00
2  Tree_C            0.20               0.0             0.05

Understanding Data Shape

All tabular data can be said to have a “shape” that is either wide or long. This is often used to simply indicate whether there are more columns than there are rows but it is perhaps more accurate to say that wide data have variables in separate columns while long data usually has a column of variable names and a column of values. This distinction is important because you can have a data table with more rows than columns that is still in wide format (e.g., the simulated example lichen community data we just loaded).

When wrangling data we may find it necessary to change the data from one shape to the other–this process is what is meant by “reshaping” (or sometimes “pivoting”). An example of each type of reshaping is included below.

Reshaping Longer

Our lichen data are currently in wide format so let’s reshape into long format.

R uses the pivot_longer function from the tidyr package for this operation. We need to specify the data table to reshape, the columns to collapse (to the cols argument), as well as the new column names for both the old column names and the values they contained. Fortunately the names_to and values_to arguments are reasonably intuitively named.

# Reshape longer
long_r <- tidyr::pivot_longer(data = lich_r,
                              cols = c(lichen_foliose, lichen_fruticose, lichen_crustose),
                              names_to = "lichen_types",
                              values_to = "percent_cover")

# Check out the first few rows of that
head(long_r, n = 3)
# A tibble: 3 × 3
  tree   lichen_types     percent_cover
  <chr>  <chr>                    <dbl>
1 Tree_A lichen_foliose            1   
2 Tree_A lichen_fruticose          0.9 
3 Tree_A lichen_crustose           0.95

Note that the column names to reshape must be a vector.

Python uses the melt function from the pandas library for this operation. We need to specify the DataFrame to reshape, the columns to exclude from collapsing (with the id_vars argument), as well as the new column names for both the old column names and the values they contained. Fortunately the var_name and value_name arguments are reasonably intuitively named.

# Reshape longer
long_py = pd.melt(frame = lich_py,
                  id_vars = ["tree"],
                  var_name = "lichen_types",
                  value_name = "percent_cover")

# Check out the first few rows of that
long_py.head(3)
     tree    lichen_types  percent_cover
0  Tree_A  lichen_foliose           1.00
1  Tree_B  lichen_foliose           0.35
2  Tree_C  lichen_foliose           0.20

Note that the column names use as IDs must be a list (or similar object type).

Reshaping Wider

The opposite operation–reshaping wider–is also readily available in either Python or R. We’ll reshape the data we made into long format above to demonsrate re-reshaping back into wide format.

R uses the pivot_wider function (also from the tidyr package) to do this type of reshaping operation. Reshaping wider requires fewer arguments as only the columns to reshape need to be specified.

# Reshape (back) into wide format
wide_r <- tidyr::pivot_wider(data = long_r,
                             names_from = "lichen_types",
                             values_from = "percent_cover")

# Check out the first few rows
head(wide_r, n = 2)
# A tibble: 2 × 4
  tree   lichen_foliose lichen_fruticose lichen_crustose
  <chr>           <dbl>            <dbl>           <dbl>
1 Tree_A           1                 0.9            0.95
2 Tree_B           0.35              1              0   

We can use the pandas function pivot_table to reshape back into wide format. Unlike R, we do need to specify the index column label(s) that are not being reshaped.

# Reshape (back) into wide format
wide_py = pd.pivot_table(data = long_py,
                         index = ["tree"], 
                         columns = "lichen_types",
                         values = "percent_cover")
                         
# Do some small index reformatting steps
wide_py = wide_py.reset_index().rename_axis(mapper = None, axis = 1)

# Check out the first few rows
wide_py.head(2)
     tree  lichen_crustose  lichen_foliose  lichen_fruticose
0  Tree_A             0.95            1.00               0.9
1  Tree_B             0.00            0.35               1.0

The reset index step is needed to get the index to look like it did before we melted it.