SQL

Overview

So far, we’ve been able to use Python or R on CSV files without issue. However, when the data that we are working with become very large (e.g., millions of rows) it can be very inefficient to rely on your computer’s internal memory to do operations. Database SQL (Structured Query Language) is another programming language specifically built to work with larger databases where users “query” just the component of the data that they need for a given operation. This minimizes the amount of data that must be held in memory at any given time.

A caveat before we begin: Database SQL is a fully fledged language in its own right that carries a suite of verbs and syntax conventions. We’ll need to engage with these a little in our examples but this website is not meant as a tutorial in this language. Note that R users may have some Database SQL intuition already because Database SQL shares many of its verbs with the dplyr package.

Library Loading

Begin by loading the needed libraries.

Load the DBI package.

# Load needed libraries
library(DBI)

Load the pandas, os, and sqlite3 packages.

# Load needed libraries
import os
import pandas as pd
import sqlite3

Connecting to Databases

SQLite operations require us to connect to a database, perform desired operations, and then close the connection when we are finished. A good introduction to this is showing how one can list the data tables available in a given database; this is vital information for doing any “real” queries as we need to know the name of the table from which we want to extract data.

In R we can open the connection with the dbConnect function, identify the tables in the database with the dbListTables function and–eventually–close the connection with the dbDisconnect function (all of which are in the DBI package).

# Open the connection
cxn <- DBI::dbConnect(RSQLite::SQLite(), file.path("data", "mammals.sqlite"))

# Identify tables in the database
DBI::dbListTables(conn = cxn)
[1] "plots"   "species" "surveys"
# Close the connection
DBI::dbDisconnect(conn = cxn)

In Python we can open the connection with the connect function (from the sqlite3 library), identify the tables in the database with the read_sql_query function (from pandas) and–eventually–close the connection with the close method.

# Open the connection
cxn = sqlite3.connect(os.path.join("data", "mammals.sqlite"))

# Query some columns in the 'surveys' object
pd.read_sql_query(sql = "SELECT name FROM sqlite_master WHERE type = 'table'", con = cxn)
      name
0  surveys
1  species
2    plots

# Close the connection
cxn.close()

Extracting Data

Connecting and listing the available data in a given database is all well and good but really what we want to do is extract some of the data for later use. To do this we still need to open (and eventually close!) the connection but between those two steps we can use Database SQL syntax to actually perform a query. Note that it is a good practice to fully capitalize Database SQL verbs (e.g., SELECT) to help differentiate them from column / data table names.

At a minimum a query must involve SELECT to dictate which columns to keep (accepts either column names or a * for ‘all columns’) and FROM to indicate which table in the database we want to query.

# Open the connection
cxn <- DBI::dbConnect(RSQLite::SQLite(), file.path("data", "mammals.sqlite"))

# Query some columns in the 'surveys' object
query_r <- DBI::dbGetQuery(conn = cxn, statement = "SELECT year, species_id, plot_id FROM surveys")

# Close the connection
DBI::dbDisconnect(conn = cxn)

Note that we can still use the query data object even when we have closed our database connection!

# Check out the first few rows of that
head(query_r, n = 4)
  year species_id plot_id
1 1977         NL       2
2 1977         NL       3
3 1977         DM       2
4 1977         DM       7
# Open the connection
cxn = sqlite3.connect(os.path.join("data", "mammals.sqlite"))

# Query some columns in the 'surveys' object
query_py = pd.read_sql_query(sql = "SELECT year, species_id, plot_id FROM surveys", con = cxn)

# Close the connection
cxn.close()

Note that we can still use the query data object even when we have closed our database connection!

# Check out the first few rows of that
query_py.head(4)
   year species_id  plot_id
0  1977         NL        2
1  1977         NL        3
2  1977         DM        2
3  1977         DM        7