# Load needed libraries
library(DBI)
SQL
Overview
So far, we’ve been able to use Python or R on CSV files without issue. However, when the data that we are working with become very large (e.g., millions of rows) it can be very inefficient to rely on your computer’s internal memory to do operations. SQL (Structured Query Language) is another programming language specifically built to work with larger databases where users “query” just the component of the data that they need for a given operation. This minimizes the amount of data that must be held in memory at any given time.
A caveat before we begin: SQL is a fully fledged language in its own right that carries a suite of verbs and syntax conventions. We’ll need to engage with these a little in our examples but this website is not meant as a tutorial in this language. Note that R users may have some SQL intuition already because SQL shares many of its verbs with the dplyr
package.
Library Loading
Begin by loading the needed libraries.
Load the DBI
package.
Load the pandas
, os
, and sqlite3
packages.
# Load needed libraries
import os
import pandas as pd
import sqlite3
Connecting to Databases
SQLite operations require us to connect to a database, perform desired operations, and then close the connection when we are finished. A good introduction to this is showing how one can list the data tables available in a given database; this is vital information for doing any “real” queries as we need to know the name of the table from which we want to extract data.
In R we can open the connection with the dbConnect
function, identify the tables in the database with the dbListTables
function and–eventually–close the connection with the dbDisconnect
function (all of which are in the DBI
package).
# Open the connection
<- DBI::dbConnect(RSQLite::SQLite(), file.path("data", "mammals.sqlite"))
cxn
# Identify tables in the database
::dbListTables(conn = cxn) DBI
[1] "plots" "species" "surveys"
# Close the connection
::dbDisconnect(conn = cxn) DBI
In Python we can open the connection with the connect
function (from the sqlite3
library), identify the tables in the database with the read_sql_query
function (from pandas
) and–eventually–close the connection with the close
method.
# Open the connection
= sqlite3.connect(os.path.join("data", "mammals.sqlite"))
cxn
# Query some columns in the 'surveys' object
= "SELECT name FROM sqlite_master WHERE type = 'table'", con = cxn) pd.read_sql_query(sql
name
0 surveys
1 species
2 plots
# Close the connection
cxn.close()
Extracting Data
Connecting and listing the available data in a given database is all well and good but really what we want to do is extract some of the data for later use. To do this we still need to open (and eventually close!) the connection but between those two steps we can use SQL syntax to actually perform a query. Note that it is a good practice to fully capitalize SQL verbs (e.g., SELECT
) to help differentiate them from column / data table names.
At a minimum a query must involve SELECT
to dictate which columns to keep (accepts either column names or a *
for ‘all columns’) and FROM
to indicate which table in the database we want to query.
# Open the connection
<- DBI::dbConnect(RSQLite::SQLite(), file.path("data", "mammals.sqlite"))
cxn
# Query some columns in the 'surveys' object
<- DBI::dbGetQuery(conn = cxn, statement = "SELECT year, species_id, plot_id FROM surveys")
query_r
# Close the connection
::dbDisconnect(conn = cxn) DBI
Note that we can still use the query data object even when we have closed our database connection!
# Check out the first few rows of that
head(query_r, n = 4)
year species_id plot_id
1 1977 NL 2
2 1977 NL 3
3 1977 DM 2
4 1977 DM 7
# Open the connection
= sqlite3.connect(os.path.join("data", "mammals.sqlite"))
cxn
# Query some columns in the 'surveys' object
= pd.read_sql_query(sql = "SELECT year, species_id, plot_id FROM surveys", con = cxn)
query_py
# Close the connection
cxn.close()
Note that we can still use the query data object even when we have closed our database connection!
# Check out the first few rows of that
4) query_py.head(
year species_id plot_id
0 1977 NL 2
1 1977 NL 3
2 1977 DM 2
3 1977 DM 7