Starting with Data
Library Loading
Begin by loading any needed libraries.
Load the tidyverse meta-package.
# Load needed library
library(tidyverse)
Load the pandas and os libraries.
# Load needed libraries
import os
import pandas as pd
Data Import
We can now load an external dataset derived from the lterdatasampler R package (see here) with both R and Python. This relatively simple operation is also a nice chance to showcase how ‘namespacing’ (i.e., indicating which package a given function comes from) differs between the two languages.
Namespacing in R is accomplished by doing package_name::function_name and is optional (though, in my opinion, good practice!). Note that we use the assignment operator (<-) to assign the contents of the CSV to an object.
# Read in vertebrate data CSV
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))
Recall from our file path module that the file.path function accounts for computer operating system differences.
In Python, namespacing is required and is done via package_name.function_name. Note that we use the assignment operator (=) to assign the contents of the CSV to a variable.
# Read in vertebrate data CSV
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
Recall from our file path module that the os.path.join function accounts for computer operating system differences.
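As a quick sketch of why that matters: os.path.join stitches the pieces together with whatever separator the current operating system expects (the output below assumes macOS/Linux; Windows would use a backslash instead).
# Build an OS-appropriate file path
os.path.join("data", "verts.csv")
'data/verts.csv'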
Tabular Data Type/Class
Data stored in CSVs (and similar data formats like Microsoft Excel, etc.) has a unique type/class that differs from some of the categories we covered in the “Fundamentals” section.
In R, such data are class data.frame.
# Check class of a data object
class(vert_r)
[1] "data.frame"
In Python, such data are type DataFrame and this variable type is defined by the pandas library. This is the standard type returned by the pandas read_csv function.
# Check type of a data object
type(vert_py)
<class 'pandas.core.frame.DataFrame'>
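If we need that check inside a script rather than just printed at the console, Python's built-in isinstance function works too; a minimal sketch:
# Programmatically confirm that the variable is a pandas DataFrame
isinstance(vert_py, pd.DataFrame)
True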
Making Heads or Tails of Data
Checking the ‘head’ or ‘tail’ of the data (i.e., the first or last few rows of the data respectively) is a nice way of getting a sense for the general format of the dataframe being assessed.
In R, we use the head or tail function to return the first or last rows respectively.
# Check out head
utils::head(vert_r, n = 2)
Note that we’re using the optional n argument to specify the number of rows to return.
year sitecode section reach pass unitnum unittype vert_index pitnumber
1 1987 MACKCC-L CC L 1 1 R 1 NA
2 1987 MACKCC-L CC L 1 1 R 2 NA
species length_1_mm length_2_mm weight_g clip sampledate notes
1 Cutthroat trout 58 NA 1.75 NONE 1987-10-07
2 Cutthroat trout 61 NA 1.95 NONE 1987-10-07
# Check out tail
utils::tail(vert_r, n = 3)
year sitecode section reach pass unitnum unittype vert_index pitnumber
32207 2019 MACKOG-U OG U 2 16 C 24 1043547
32208 2019 MACKOG-U OG U 2 16 C 25 1043583
32209 2019 MACKOG-U OG U 2 16 C 26 1043500
species length_1_mm length_2_mm weight_g clip sampledate
32207 Coastal giant salamander 67 120 9.6 NONE 2019-09-05
32208 Coastal giant salamander 74 131 14.3 NONE 2019-09-05
32209 Coastal giant salamander 73 128 11.6 NONE 2019-09-05
notes
32207
32208
32209 Terrestrial
In Python, we use the head or tail method to return the first or last rows respectively. Note that these methods are only available to variables of type DataFrame. All methods are appended to the end of the variable of the appropriate type, separated by a period.
# Check out head
vert_py.head(3)
year sitecode section reach ... weight_g clip sampledate notes
0 1987 MACKCC-L CC L ... 1.75 NONE 1987-10-07 NaN
1 1987 MACKCC-L CC L ... 1.95 NONE 1987-10-07 NaN
2 1987 MACKCC-L CC L ... 5.60 NONE 1987-10-07 NaN
[3 rows x 16 columns]
# Check out tail
vert_py.tail(2)
year sitecode section reach ... weight_g clip sampledate notes
32207 2019 MACKOG-U OG U ... 14.3 NONE 2019-09-05 NaN
32208 2019 MACKOG-U OG U ... 11.6 NONE 2019-09-05 Terrestrial
[2 rows x 16 columns]
Data Structure
While it is nice to know the type/class of the data table generally, we often need to know the type/class of the columns within those data tables. Our operations are typically aimed at modifying particular columns, and a prerequisite to that is knowing each column’s type/class so that we know what actions are available to us.
R uses the str function to assess data structure. Structure includes the dimensions of the data (i.e., number of rows and columns) as well as the class of the data (data.frame) and the class of each column.
# Check data structure
utils::str(vert_r)
'data.frame': 32209 obs. of 16 variables:
$ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
$ sitecode : chr "MACKCC-L" "MACKCC-L" "MACKCC-L" "MACKCC-L" ...
$ section : chr "CC" "CC" "CC" "CC" ...
$ reach : chr "L" "L" "L" "L" ...
$ pass : int 1 1 1 1 1 1 1 1 1 1 ...
$ unitnum : num 1 1 1 1 1 1 1 1 1 1 ...
$ unittype : chr "R" "R" "R" "R" ...
$ vert_index : int 1 2 3 4 5 6 7 8 9 10 ...
$ pitnumber : int NA NA NA NA NA NA NA NA NA NA ...
$ species : chr "Cutthroat trout" "Cutthroat trout" "Cutthroat trout" "Cutthroat trout" ...
$ length_1_mm: int 58 61 89 58 93 86 107 131 103 117 ...
$ length_2_mm: int NA NA NA NA NA NA NA NA NA NA ...
$ weight_g : num 1.75 1.95 5.6 2.15 6.9 5.9 10.5 20.6 9.55 13 ...
$ clip : chr "NONE" "NONE" "NONE" "NONE" ...
$ sampledate : chr "1987-10-07" "1987-10-07" "1987-10-07" "1987-10-07" ...
$ notes : chr "" "" "" "" ...
Be careful not to confuse this with the Python function str that coerces values to type “string”!
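For instance, a quick sketch of what Python's str actually does:
# Python's str() coerces a value to a string (nothing to do with R's str())
str(2019)
'2019'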
When we want to know the type of each column in a Python DataFrame, we can use the dtypes “attribute”. Attributes are akin to methods but they completely lack arguments that might modify their behavior; as a consequence they are extremely precisely defined. The dtypes attribute returns the type of each column.
vert_py.dtypes
year int64
sitecode object
section object
reach object
pass int64
unitnum float64
unittype object
vert_index int64
pitnumber float64
species object
length_1_mm float64
length_2_mm float64
weight_g float64
clip object
sampledate object
notes object
dtype: object
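Because dtypes is itself indexable, we can also pull out the type of a single column; a small sketch (column name taken from the output above):
# Check the type of just one column
vert_py["weight_g"].dtype
dtype('float64')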
Data Summaries
We often want to begin our exploration of a given dataset by getting a summary of each column. This is rarely what we actually need for statistics or visualization but it is a nice high-level way of getting a sense for the composition of the data.
R provides the summary function to summarize all columns in a dataset. Note that it is not terribly informative for columns of class character.
# Get summary of data
summary(vert_r)
year sitecode section reach
Min. :1987 Length:32209 Length:32209 Length:32209
1st Qu.:1998 Class :character Class :character Class :character
Median :2006 Mode :character Mode :character Mode :character
Mean :2005
3rd Qu.:2012
Max. :2019
pass unitnum unittype vert_index
Min. :1.000 Min. : 1.000 Length:32209 Min. : 1.00
1st Qu.:1.000 1st Qu.: 3.000 Class :character 1st Qu.: 5.00
Median :1.000 Median : 7.000 Mode :character Median : 13.00
Mean :1.224 Mean : 7.696 Mean : 20.17
3rd Qu.:1.000 3rd Qu.:11.000 3rd Qu.: 27.00
Max. :2.000 Max. :20.000 Max. :147.00
pitnumber species length_1_mm length_2_mm
Min. : 62048 Length:32209 Min. : 19.00 Min. : 28.0
1st Qu.:13713632 Class :character 1st Qu.: 47.00 1st Qu.: 77.0
Median :18570447 Mode :character Median : 63.00 Median : 98.0
Mean :16286432 Mean : 73.83 Mean :100.5
3rd Qu.:19132429 3rd Qu.: 97.00 3rd Qu.:119.0
Max. :28180046 Max. :253.00 Max. :284.0
NA's :26574 NA's :17 NA's :19649
weight_g clip sampledate notes
Min. : 0.090 Length:32209 Length:32209 Length:32209
1st Qu.: 1.510 Class :character Class :character Class :character
Median : 6.050 Mode :character Mode :character Mode :character
Mean : 8.903
3rd Qu.: 11.660
Max. :134.590
NA's :13268
Python has the describe method for getting similar information about each column of a DataFrame.
# Describe data
vert_py.describe()
year pass ... length_2_mm weight_g
count 32209.000000 32209.000000 ... 12560.000000 18941.000000
mean 2004.917601 1.223664 ... 100.485191 8.902859
std 8.572474 0.416706 ... 34.736955 10.676276
min 1987.000000 1.000000 ... 28.000000 0.090000
25% 1998.000000 1.000000 ... 77.000000 1.510000
50% 2006.000000 1.000000 ... 98.000000 6.050000
75% 2012.000000 1.000000 ... 119.000000 11.660000
max 2019.000000 2.000000 ... 284.000000 134.590000
[8 rows x 8 columns]
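Note that describe only summarizes numeric columns by default, which is why the character columns are missing above. As a sketch, the include argument can broaden that (the exact counts will of course depend on the data):
# Summarize only the character-type ('object') columns
vert_py.describe(include=["object"])
# Or summarize every column regardless of type
vert_py.describe(include="all")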
Python also offers the info method to identify the count of non-null entries in each column as well as the type of each column.
# Check info of data
vert_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32209 entries, 0 to 32208
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 32209 non-null int64
1 sitecode 32209 non-null object
2 section 32209 non-null object
3 reach 32209 non-null object
4 pass 32209 non-null int64
5 unitnum 32209 non-null float64
6 unittype 31599 non-null object
7 vert_index 32209 non-null int64
8 pitnumber 5635 non-null float64
9 species 32206 non-null object
10 length_1_mm 32192 non-null float64
11 length_2_mm 12560 non-null float64
12 weight_g 18941 non-null float64
13 clip 32209 non-null object
14 sampledate 32209 non-null object
15 notes 3174 non-null object
dtypes: float64(5), int64(3), object(8)
memory usage: 3.9+ MB
Identifying Columns
If we want to skip these steps and just identify the full set of column labels/names, both languages offer a small operation for pulling that information out.
R has the names function to quickly return a vector of the column names in a given data.frame object.
# Check column names
names(vert_r)
[1] "year" "sitecode" "section" "reach" "pass"
[6] "unitnum" "unittype" "vert_index" "pitnumber" "species"
[11] "length_1_mm" "length_2_mm" "weight_g" "clip" "sampledate"
[16] "notes"
Python has the columns attribute for returning the column labels in a given DataFrame variable.
# Check column labels
vert_py.columns
Index(['year', 'sitecode', 'section', 'reach', 'pass', 'unitnum', 'unittype',
'vert_index', 'pitnumber', 'species', 'length_1_mm', 'length_2_mm',
'weight_g', 'clip', 'sampledate', 'notes'],
dtype='object')
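The result is a pandas Index rather than a plain Python list; if a regular list is more convenient, it can be converted, as sketched here:
# Convert the Index of column labels to a plain Python list
list(vert_py.columns)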
Accessing Column(s)
We may want to access a particular column or set of columns. There are several approaches we might use to access a particular column in either Python or R. They are:
- Indexing the column by its location
- Indexing the column by its label/name
- Indexing the column with an operator
In all of the following examples we’ll use the head method/function to simplify the output.
If we know the order of the columns we can use the same syntax as when we indexed a vector.
# Use column indexing
head(vert_r[1])
year
1 1987
2 1987
3 1987
4 1987
5 1987
6 1987
Note that if we also want to account for rows we would put the desired column number on the right and the desired row number on the left.
# Use row/column indexing
head(vert_r[, 1])
[1] 1987 1987 1987 1987 1987 1987
If preferred, we could instead use the column name in place of the number.
head(vert_r["year"])
year
1 1987
2 1987
3 1987
4 1987
5 1987
6 1987
In R, the column operator is a $ and we place that character between the object and column names (e.g., data_name$column_name).
head(vert_r$year)
[1] 1987 1987 1987 1987 1987 1987
If we know the index position of the desired column we can use the iloc method (short for “integer location”). Note that the iloc method uses square brackets instead of parentheses (typical methods use parentheses). We must include a colon with either nothing or the start/stop bounds (see “Slicing” in “Fundamentals”). A colon without specifying bounds returns all rows.
# Use row/column indexing
vert_py.iloc[:, 0].head()
0 1987
1 1987
2 1987
3 1987
4 1987
Name: year, dtype: int64
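If we only wanted some of the rows, we could supply start/stop bounds to that colon instead; a small sketch (output assumes the same verts.csv data):
# Use slicing bounds to return just the first three rows of the first column
vert_py.iloc[0:3, 0]
0    1987
1    1987
2    1987
Name: year, dtype: int64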
If preferred, we could instead use the column label in place of the index number.
"year"].head() vert_py[
0 1987
1 1987
2 1987
3 1987
4 1987
Name: year, dtype: int64
In Python, the column operator is a . and we place that character between the variable name and column label (e.g., data_name.column_name).
vert_py.year.head()
0 1987
1 1987
2 1987
3 1987
4 1987
Name: year, dtype: int64
When we want to access multiple columns we can still use either index positions or column labels/names but we cannot use the column operator.
We’ll continue to use the head method/function to simplify outputs.
In R we could use a concatenated vector of index positions to specify particular columns by their positions.
# Use multiple column index positions
head(vert_r[, c(1:2, 13)])
year sitecode weight_g
1 1987 MACKCC-L 1.75
2 1987 MACKCC-L 1.95
3 1987 MACKCC-L 5.60
4 1987 MACKCC-L 2.15
5 1987 MACKCC-L 6.90
6 1987 MACKCC-L 5.90
Instead we could use a vector of column names for extra precision.
# Use multiple column names
head(vert_r[, c("year", "sitecode", "weight_g")])
year sitecode weight_g
1 1987 MACKCC-L 1.75
2 1987 MACKCC-L 1.95
3 1987 MACKCC-L 5.60
4 1987 MACKCC-L 2.15
5 1987 MACKCC-L 6.90
6 1987 MACKCC-L 5.90
In Python we could use the iloc method for multiple columns and simply supply a list of the column index positions in which we are interested. Note that this method is different from many of the other methods we’ve covered so far because it uses square brackets instead of parentheses.
# Use multiple column index positions
vert_py.iloc[:, [0, 1, 12]].head()
year sitecode weight_g
0 1987 MACKCC-L 1.75
1 1987 MACKCC-L 1.95
2 1987 MACKCC-L 5.60
3 1987 MACKCC-L 2.15
4 1987 MACKCC-L 6.90
Or we could use the loc method and supply a list of column labels. Note that the loc method also uses square brackets instead of parentheses and requires a colon to the left of the comma in order to return all rows.
"year", "sitecode", "weight_g"]].head() vert_py.loc[:, [
year sitecode weight_g
0 1987 MACKCC-L 1.75
1 1987 MACKCC-L 1.95
2 1987 MACKCC-L 5.60
3 1987 MACKCC-L 2.15
4 1987 MACKCC-L 6.90
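As with iloc, we can also supply bounds to the row position of loc. One subtlety worth noting (sketched below, assuming the same data): loc slices by label and includes both endpoints, so 0:2 returns three rows rather than two.
# Label-based slicing: rows 0 through 2 (inclusive) and two columns
vert_py.loc[0:2, ["year", "weight_g"]]
   year  weight_g
0  1987      1.75
1  1987      1.95
2  1987      5.60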