# Load needed library
import os
import glob
Directory & File Management
Overview
One thing we have not discussed that is incredibly important to any scripted language (Python and R very much included) is computer directory and file management. Code often interacts with external data files that must be imported at the start of a workflow, and it often preserves a record of its operation as an exported product at the end of the workflow. These concepts have more to do with a fundamental understanding of how your computer stores information than with any particular coding language, but they are still worth discussing here.
Crucial Vocabulary
There are a few vocabulary terms we need to introduce before we can dive into the code side of file and directory management. Fortunately, these are terms that apply to computers generally so we do not have to deal with Python and R using different names for the same concept.
- Directory – A folder on your computer (typically containing other folders and/or files)
- Working Directory – The folder your code is “looking at” for a given project
  - By default this is the folder in which the code file itself can be found, though you can set it elsewhere (this is sometimes risky)
- Root – The single folder that contains everything else on your computer
- Absolute Path – The names of each directory beginning at the root and ending at a particular file/folder
- Relative Path – The names of each directory beginning at your working directory and ending at a particular file/folder
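To make the difference between absolute and relative paths concrete, here is a minimal Python sketch. The folder and file names ("data" and "verts.csv") are used only for illustration, and the file does not need to exist for these checks to run.

```python
import os

# A relative path: interpreted starting from the working directory
relative_path = os.path.join("data", "verts.csv")

# Convert it to an absolute path (prepends the current working directory)
absolute_path = os.path.abspath(relative_path)

# Check which kind of path each one is
print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True

# Recover the relative path by measuring from the working directory
recovered = os.path.relpath(absolute_path, start=os.getcwd())
print(recovered == relative_path)  # True
```

The absolute path printed on your computer will differ from anyone else's, while the relative path stays the same for everyone who shares the project folder.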
Library Loading
We’ll need to quickly load any needed libraries before getting into the “actual” coding.
R does not require any libraries to perform these operations!
We’ll need the os library and the glob library in Python to do these operations.
The Working Directory
As defined above, the working directory is the primary folder with which your code can easily interact. This includes folders within your working directory but does not include folders “above”/outside of it. It is good practice to use different folders for different projects so that each project has its own working directory and you don’t run the risk of scripts/data from one project accidentally interacting with those from another project. For R users, the RStudio “R Project” functionality guarantees that you have a different working directory for each R project.
Both languages provide a quick way of checking what your current working directory is to ensure you’re not accidentally in the wrong one. Note that either approach will return the absolute path to your working directory which will differ among users/computers.
Base R includes the getwd function to display your current working directory.
# Check current working directory
getwd()
[1] "/Users/lyon/Documents/personal/lyon_bilingualism"
Python has a getcwd function (from the os library) to display your current working directory.
# Check current working directory
os.getcwd()
'/Users/lyon/Documents/personal/lyon_bilingualism'
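If you do need to set the working directory elsewhere, one cautious pattern is to remember where you started and return there afterwards. The sketch below is only an illustration under that assumption: it uses a temporary folder (created with the standard tempfile library) as a stand-in for another project, and the variable names are hypothetical.

```python
import os
import tempfile

# Remember where we started so we can return afterwards
original_wd = os.getcwd()

# Create a throwaway folder to stand in for another project
with tempfile.TemporaryDirectory() as scratch_dir:
    # Point the working directory at the throwaway folder
    os.chdir(scratch_dir)

    # realpath resolves symbolic links so the two paths are comparable
    moved = os.path.realpath(os.getcwd()) == os.path.realpath(scratch_dir)
    print(moved)  # True

    # Return to the starting folder before the temporary folder is deleted
    os.chdir(original_wd)

# Confirm we are back where we began
restored = os.getcwd() == original_wd
print(restored)  # True
```

Restoring the original working directory keeps later relative paths in the same script pointing where you expect.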
Operating Systems & File Paths
One minor (but often frustrating) hurdle for collaborative coding occurs when group members use different operating systems (i.e., macOS, Windows, etc.). When defining file paths (absolute or relative), different operating systems use either a slash (/) or a backslash (\) to separate directories in that path. Unfortunately for us, a path written with one type of slash is not guaranteed to be interpreted correctly by an OS that uses the other. This means that any code that reads in external data or exports its outputs needs to account for this OS-level difference every time a file path is defined. However, the developers of Python and R know this pain themselves and have given us straightforward tools to handle this.
R includes a file.path function that inserts a separator R can interpret on your computer’s OS between each directory name.
# Build a faux file path
file.path("path", "to", "my", "data")
[1] "path/to/my/data"
The os library in Python includes a join function that automatically detects your computer’s OS and inserts the correct type of slash between each directory name.
# Build a faux file path
os.path.join("path", "to", "my", "data")
'path/to/my/data'
Note that the needed function is in the path module of the os library.
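To see what each operating system's separator looks like without switching computers, the sketch below leans on two standard-library modules, ntpath (Windows-style rules) and posixpath (macOS/Linux-style rules). These are brought in purely for illustration; in everyday code you would call os.path.join and let Python pick the right one.

```python
import os
import ntpath      # Windows-style path rules
import posixpath   # macOS/Linux-style path rules

# The separator used by the OS this code is running on
print(os.sep)

# Build the same faux file path under each set of rules
windows_style = ntpath.join("path", "to", "my", "data")
posix_style = posixpath.join("path", "to", "my", "data")
print(windows_style)  # path\to\my\data
print(posix_style)    # path/to/my/data

# os.path.join matches whichever style belongs to the current OS
native = os.path.join("path", "to", "my", "data")
print(native == os.sep.join(["path", "to", "my", "data"]))  # True
```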
Using these tools allows us to code collaboratively even when not all group members use the same operating system! Note that these tools do not solve the problem of absolute paths, because group members will almost always have different paths from the root to a particular directory. Because of this, it is best to use relative paths for maximum reproducibility.
Finding Files
Once you’ve confirmed your working directory and figured out how to account for OS idiosyncrasies, it’s time to actually look at the files you have available! In the command line, the ls command lists all files in a particular folder. Fortunately, both Python and R contain tools for doing this as well.
We can use the R dir function to identify all files in a particular folder (note that by default this does not list the contents of folders inside of the specified folder).
# List files in the "data" folder for this website
dir(path = file.path("data"))
[1] "elevation.tif"
[2] "mammals.sqlite"
[3] "nc.dbf"
[4] "nc.prj"
[5] "nc.shp"
[6] "nc.shx"
[7] "README.md"
[8] "SRTMGL3_NC.003_SRTMGL3_DEM_doy2000042_aid0001.tif"
[9] "tree_lichen.csv"
[10] "tree_road.csv"
[11] "verts.csv"
We can use the glob function (from the library of the same name) to identify all files in a particular folder.
# List files in the "data" folder for this website
glob.glob(pathname = os.path.join("data", "*"))
['data/nc.prj', 'data/SRTMGL3_NC.003_SRTMGL3_DEM_doy2000042_aid0001.tif', 'data/elevation.tif', 'data/README.md', 'data/tree_lichen.csv', 'data/nc.shx', 'data/verts.csv', 'data/nc.shp', 'data/tree_road.csv', 'data/nc.dbf', 'data/mammals.sqlite']
Note that we need to use the wildcard asterisk (*) to identify everything in the “data” folder.
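The wildcard can also be combined with part of a file name to narrow the search. The sketch below is a self-contained illustration: it builds a temporary stand-in for a "data" folder (using the standard tempfile library) containing a few empty files, so it can run anywhere. The variable names are hypothetical, and sorted is used because glob does not promise any particular order.

```python
import glob
import os
import tempfile

# Build a throwaway "data" folder containing a few empty files
with tempfile.TemporaryDirectory() as project_dir:
    data_dir = os.path.join(project_dir, "data")
    os.mkdir(data_dir)
    for name in ["tree_lichen.csv", "tree_road.csv", "README.md"]:
        open(os.path.join(data_dir, name), "w").close()

    # "*" matches everything; "*.csv" matches only names ending in ".csv"
    everything = sorted(glob.glob(os.path.join(data_dir, "*")))
    csv_only = sorted(glob.glob(os.path.join(data_dir, "*.csv")))

    # Keep just the file names so the output is the same on any computer
    all_names = [os.path.basename(path) for path in everything]
    csv_names = [os.path.basename(path) for path in csv_only]

print(all_names)  # ['README.md', 'tree_lichen.csv', 'tree_road.csv']
print(csv_names)  # ['tree_lichen.csv', 'tree_road.csv']
```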