Workflow Automation

Loops

Often we want to perform some set of operations repeatedly across a known number of iterations. For example, maybe we want to subset a given data file into a separate variable/object by month of data collection and export the resulting file as a CSV. We could simply copy/paste our ‘subset and export’ code as many times as needed but this can be error-prone. Also, it is cumbersome to manually update all copies of the relevant code when you identify a possible improvement.

One code solution to this is to automate the workflow using for loops (casually referred to more simply as just “loops”). The syntax of Python and R is very similar for loops, likely because this is such a fundamental operation in any coding language!

In R, make a simple object to demonstrate loops.

# Make a vector of animal types
zoo_r <- c("lion", "tiger", "crocodile", "vulture", "hippo")

# Check that out
zoo_r
[1] "lion"      "tiger"     "crocodile" "vulture"   "hippo"    

In Python, make a simple variable to demonstrate loops.

# Make a list of animal types
zoo_py = ["lion", "tiger", "crocodile", "vulture", "hippo"]

# Check that out
zoo_py
['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

With this simple variable/object in-hand we can now demonstrate the core facets of loops.

Fundamental Components

Loops (in either language) require a few core components in order to work properly:

  1. for statement – defines the start of the loop-definition component
  2. “Loop variable/object” – essentially a placeholder variable/object whose value will change with each iteration of the loop
  3. in statement – relates the loop variable/object to the list/vector to iterate across
  4. list/vector to iterate across – set of values to iterate across
  5. Actual workflow! – operations to perform on each iteration of the loop

To see this syntax in action we’ll use a simple loop that prints each animal type in the list/vector we created above.

In R, the for statement requires parentheses around the loop object, the in statement, and the vector to iterate across. The operation(s) performed in each iteration must be wrapped in curly braces ({...}).

When the code reaches the closing curly brace it returns to the top of the workflow and begins again with the next element of the provided vector.

# For each animal in the zoo
for(animal in zoo_r){
  
  # Print its name
  print(animal)
  
}
[1] "lion"
[1] "tiger"
[1] "crocodile"
[1] "vulture"
[1] "hippo"

Note that when we are done the loop object still exists and is set to the last element of the vector we iterated across.

# Check current value of `animal` object
animal
[1] "hippo"

In Python, the for statement, loop variable, in statement, and list to iterate across do not use parentheses but the end of the line requires a colon (:). The operation(s) performed in each iteration must be indented one level. Four spaces per level is the common convention, but any consistent indentation works (the examples here use two spaces).

When the code reaches the end of the indented lines it returns to the top of the workflow and begins again with the next item of the provided list.

# For each animal in the zoo
for animal in zoo_py:
  # Print its name
  print(animal)
lion
tiger
crocodile
vulture
hippo

Note that when we are done the loop variable still exists and is set to the last item of the list we iterated across.

# Check current value of `animal` variable
animal
'hippo'
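
Returning to the motivating example from the start of this section, the sketch below uses the same loop pattern to split a small table by month and build one CSV per month. The records, the column names, and the `csv_by_month` dictionary are all hypothetical stand-ins, and the CSV text is kept in memory (rather than written to disk) so the example runs on its own.

```python
import csv
import io

# Hypothetical records standing in for a real data file
records = [
    {"month": "Jan", "site": "A", "count": 3},
    {"month": "Jan", "site": "B", "count": 5},
    {"month": "Feb", "site": "A", "count": 2},
    {"month": "Mar", "site": "B", "count": 7},
]

# Months to iterate across
months = ["Jan", "Feb", "Mar"]

# Empty dictionary to hold one CSV (as text) per month
csv_by_month = {}

# For each month
for month in months:

    # Subset to only the rows from that month
    subset = []
    for row in records:
        if row["month"] == month:
            subset.append(row)

    # Write that subset as CSV text
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["month", "site", "count"])
    writer.writeheader()
    writer.writerows(subset)

    # Store the result under the month's name
    csv_by_month[month] = buffer.getvalue()

# Check one of the results
print(csv_by_month["Jan"])
```

To export real files instead, the in-memory buffer could be swapped for a file opened with `open(..., "w", newline="")`, with one file name per month.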

Loops & Conditionals

We can also build conditional statements into a loop to create a loop that can flexibly handle different outcomes. We have discussed conditional operators elsewhere so we’ll only explain the parts of loop conditionals that we haven’t already discussed. To demonstrate, we can loop across a set of numbers and use conditionals to print whether the values are greater/less than or equal to zero.

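In the example below we’ll use three new statements: if, else if (spelled elif in Python), and else. Each statement only performs its operation when its condition is met (i.e., returns True/TRUE); else has no condition of its own and runs when none of the prior conditions are met.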

In R, these three statements all have similar syntax to the for statement in that they evaluate something in parentheses and then perform some operation(s) in curly braces. They do differ slightly in context however:

  • if can only be used first (or in cases where there is only if and else)
  • else if can only be used after if (or after another else if) and allows for specifying another condition.
  • else can only be used at the end; catches only cases that don’t meet any of the prior conditions

# Loop across numbers
for(j in c(-2, -1, 0, 1, 2)){
  
  # If less than 0
  if(j < 0){
    print(paste(j, "is negative"))
  }
  
  # If greater than 0
  else if(j > 0){
    print(paste(j, "is positive"))
  }
  
  # If neither of those, then it must be 0!
  else {
    print(paste(j, "is zero!"))
  }
}
[1] "-2 is negative"
[1] "-1 is negative"
[1] "0 is zero!"
[1] "1 is positive"
[1] "2 is positive"

Note that to get the message to print correctly we needed to wrap a paste function in print to assemble multiple things into a single object.

In Python, these three statements all have similar syntax to the for statement in that they evaluate something before a colon and then perform some operation(s) after that colon. They do differ slightly in context however:

  • if can only be used first (or in cases where there is only if and else)
  • elif can only be used after if (or after another elif) and allows for specifying another condition.
  • else can only be used at the end; catches only cases that don’t meet any of the prior conditions

# Loop across numbers
for k in [-2, -1, 0, 1, 2]:
  
  # If less than 0
  if k < 0: 
    print(str(k) + " is negative")
    
  # If greater than 0
  elif k > 0:
    print(str(k) + " is positive")
  
  # If neither of those, then it must be 0!
  else:
    print(str(k) + " is zero!")
-2 is negative
-1 is negative
0 is zero!
1 is positive
2 is positive

Note that to get the message to print correctly we needed to coerce the loop variable into type string (using the str function).
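
As an alternative to `str` plus `+`, Python's formatted string literals ("f-strings") convert values to text inside the string itself. The sketch below repeats the same loop with f-strings and collects the messages in a list before printing them; the `messages` and `label` names are only for illustration.

```python
# Empty list to collect one message per number
messages = []

# Loop across numbers
for k in [-2, -1, 0, 1, 2]:

    # Pick the right label for this number
    if k < 0:
        label = "negative"
    elif k > 0:
        label = "positive"
    else:
        label = "zero!"

    # The f-string converts `k` to text for us
    messages.append(f"{k} is {label}")

# Print the collected messages
for message in messages:
    print(message)
```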

“Custom” Functions

Loops are a really powerful tool but they are limited in some ways. Sometimes we want to perform the same task in every project but only once in each. Such an operation is certainly “repeated” but not really in the same context in which a loop makes sense. We can create reusable, modular code to fit these circumstances by writing our own custom functions (“custom” in the sense that we write them ourselves rather than load them from a particular library).

Let’s write a simple function in both languages that simply multiplies two arguments by one another and returns the result.

Generating a function in R shares some syntax elements with loops and conditional statements! In this case we use the function function to preserve our work as a function, then provide any needed arguments in parentheses, and end with curly braces with the operation(s) performed by the function inside. If the function produces something that we want to give back to the user, we can make that explicit with the return function (without it, R returns the value of the last evaluated expression).

# Multiplication function
mult_r <- function(p, q){
  
  # Multiply the two values
  result_r <- p * q
  
  # Return that
  return(result_r)
}

# Once defined, we can invoke the function like we would any other
mult_r(p = 2, q = 5)
[1] 10

Generating a function in Python shares some syntax elements with loops and conditional statements! In this case we use the def statement then provide the name and (in parentheses) any needed arguments for our new function, followed by a colon and an indented body. If the function produces something that we want to give back to the user, we need to specify that by using the return statement.

# Multiplication function
def mult_py(n, i):
  # Add docstrings for later use (see below)
  """
  Multiply two values by one another.
  
  n -- First value to multiply
  i -- Second value to multiply
  """
  
  # Multiply the two values
  result_py = n * i
  
  # Return that
  return result_py

# Once defined, we can invoke the function like we would any other
mult_py(n = 2, i = 5)
10
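
Custom functions and loops complement each other: once an operation is wrapped in a function, a loop can apply it to many inputs. A minimal sketch, re-defining `mult_py` so the block runs on its own; the `pairs` and `products` names are hypothetical.

```python
# Multiplication function (same behavior as above)
def mult_py(n, i):
    """Multiply two values by one another."""
    return n * i

# Hypothetical pairs of values to multiply
pairs = [(2, 5), (3, 4), (10, 0.5)]

# Empty list to hold the results
products = []

# For each pair, unpack its two values into `first` and `second`
for first, second in pairs:

    # Call the function and keep the result
    products.append(mult_py(n = first, i = second))

# Check that out
print(products)
```

If the multiplication ever needs to change, only the function body is edited and every call inside the loop picks up the change.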

Function Documentation

One component of custom functions to be aware of is their somewhat variable documentation. “Official” functions tend to be really well documented but custom functions have no required documentation. However, there are some best practices that we can try to follow ourselves to make life as easy as possible for people trying to intuit our functions’ purposes (including ourselves in the future!).

R contains no native mode of specifying function documentation! While there are tools to formalize this when functions are part of a formal package (see roxygen2 formatting), our custom functions cannot include formal help documentation. That said, it is still good practice to include plain-language comment lines that describe the function’s operations even when they will only be visible where the function is defined.

Note that the docstring package for R simulates Python-style docstrings for R functions but is not part of “base” R.

Python custom functions allow us to specify triple-quoted ("""...""") documentation of function purpose/arguments known as “docstrings”. When this is supplied, we can use the help function (or, in IPython or Jupyter, append a ? after the function name) to print whatever documentation was included in the function when it was defined.

# Check custom function documentation
help(mult_py)
Help on function mult_py in module __main__:

mult_py(n, i)
    Multiply two values by one another.
    
    n -- First value to multiply
    i -- Second value to multiply
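
Beyond `help`, the docstring is stored on the function itself, so it can also be read programmatically. The sketch below (re-defining `mult_py` so it is self-contained) reads the `__doc__` attribute and then uses the standard library's `inspect.getdoc`, which tidies up the indentation and surrounding blank lines. The `raw_doc` and `clean_doc` names are only for illustration.

```python
import inspect

# Multiplication function with a docstring
def mult_py(n, i):
    """
    Multiply two values by one another.

    n -- First value to multiply
    i -- Second value to multiply
    """
    return n * i

# Docstring as stored on the function object
raw_doc = mult_py.__doc__

# Docstring with indentation and surrounding blank lines cleaned up
clean_doc = inspect.getdoc(mult_py)

# Check that out
print(clean_doc)
```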

Function Defaults

Often a given argument will be set to the same value in most uses. In cases like this, we can define that value as the default of the argument, which allows users to not specify that argument at all. When users do specify something for that argument, it overrides the default behavior. All functions (and Python methods) with “optional” arguments are using defaults behind the scenes to make those arguments optional.

We can define these defaults when we first create a function! Let’s make a simple division function that divides the first argument by the second and sets the default of the second argument to 2.

Write and demonstrate the simple division function in R.

# Define function
div_r <- function(p, q = 2){
  
  # Do division
  result_r <- p / q
  
  # Return that
  return(result_r)
}

# Test this function
div_r(p = 10)
[1] 5

Use the function again but set the second argument ourselves.

# Specify the second argument
div_r(p = 10, q = 10)
[1] 1

Write and demonstrate the simple division function in Python.

# Define function
def div_py(n, i = 2):
  # Write function documentation
  """
  Divide the first value by the second
  
  n -- Numerator
  i -- Denominator
  """
  
  # Do division
  result_py = n / i
  
  # Return that
  return result_py

# Use the function with the default
div_py(n = 10)
5.0

Use the function again but set the second argument ourselves.

# Specify the second argument
div_py(n = 10, i = 10)
1.0
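
The Python results above print as `5.0` and `1.0` because the `/` operator always returns a floating-point number in Python 3, even when both inputs are whole numbers. The sketch below re-defines `div_py` and also shows that arguments can be passed by position rather than by name. The floor-division operator `//` is included only for comparison and is not part of the original function; the `half`, `fifth`, and `floored` names are hypothetical.

```python
# Division function with a default for the second argument
def div_py(n, i = 2):
    """Divide the first value by the second."""
    return n / i

# Rely on the default for `i`
half = div_py(n = 10)

# Pass both arguments by position instead of by name
fifth = div_py(10, 50)

# Floor division keeps whole-number inputs as integers
floored = 10 // 2

# Check those out
print(half, fifth, floored)
```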

Functions & Conditionals

Just like loops, we can build conditional statements into our functions to make them more flexible and broadly useful. Let’s combine this with setting default values to demonstrate this effectively.

Let’s make a simple addition function and set both arguments to default to NULL. NULL is an R constant that allows us to create an object without assigning any value to it.

Note that we’re also using the is.null function in our conditional in order to easily assess whether the argument has been left to its default (i.e., set to NULL) or defined.

# Define addition function
add_r <- function(p = NULL, q = NULL){
  
  # If first argument is missing, set it to 2
  if(is.null(p)){
    p <- 2
  }
  
  # Do the same for the second argument
  if(is.null(q)){
    q <- 2
  }
  
  # Sum the two arguments
  result_r <- p + q
  
  # Return that
  return(result_r)
}

Now let’s use the function without specifying either argument.

# Use the function
add_r()
[1] 4

Let’s make a simple addition function and set both arguments to default to None. None is a Python constant that allows us to create a variable without assigning any value to it.

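Note that we’re also using the is operator in our conditional. It tests whether two names refer to the very same object; when comparing against None it gives the same result as == here, and is None is the conventional way to write this check.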

# Define addition function
def add_py(n = None, i = None):
  # Add documentation
  """Add two values (`n` and `i`)"""

  # If first argument is missing, set it to 2
  if n is None:
    n = 2

  # Do the same for the second argument
  if i is None:
    i = 2
  
  # Sum the two arguments
  result_py = n + i
  
  # Return that
  return result_py

Now let’s use the function without specifying either argument.

# Use the function
add_py()
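
Checking `is None` (rather than simply testing whether the argument is "falsy") matters when a legitimate value such as `0` might be supplied. The sketch below re-defines `add_py` and contrasts it with a hypothetical `add_falsy` variant, written only to illustrate the pitfall.

```python
# Addition function that checks for `None` explicitly
def add_py(n = None, i = None):
    """Add two values (`n` and `i`), each defaulting to 2."""
    if n is None:
        n = 2
    if i is None:
        i = 2
    return n + i

# Hypothetical variant that tests truthiness instead of `is None`
def add_falsy(n = None, i = None):
    """Like `add_py`, but treats any falsy value as missing."""
    if not n:
        n = 2
    if not i:
        i = 2
    return n + i

# Supplying only one argument leaves the other at its default
print(add_py(n = 5))

# A real zero is respected by the `is None` check...
print(add_py(n = 0, i = 0))

# ...but is silently replaced by the truthiness check
print(add_falsy(n = 0, i = 0))
```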