Text Methods

Overview

Data stored as text (i.e., string/object or character) is notoriously typo-prone and often requires extensive quality control checks throughout the data tidying process. Below is a–non exhaustive–set of common text methods that may prove valuable to people interested in dealing with text data in either Python or R.

Find & Replace

The bulk of text tidying often boils down to finding unwanted strings/characters and replacing them with desired variants. This operation is how computers handle fixing typos.

We can use the R function gsub to find and replace (part of) a character object.

# Make a vector of characters
text_r <- c("he1lo", "HELLO", "bye")

# Find and replace the number "1" with an "L"
fix1_r <- gsub(pattern = "1", replacement = "l", x = text_r)

# Print the result
print(fix1_r)
[1] "hello" "HELLO" "bye"  

We can use the Python method replace to find and replace (part of) a string/object variable.

# Make a list of strings
text_py = ["he1lo", "HELLO", "bye"]

# Find and replace the number "1" with an "L"
fix1_py = [item.replace("1", "l") for item in text_py]

# Print the result
print(fix1_py)
['hello', 'HELLO', 'bye']

Note that we have to loop across our list to do this operation in this language.

Casing

Text casing (i.e., either UPPERCASE or lowercase) is also a frequent source of issues in code as many scripted operations are sensitive to text case. We can coerce to upper or lowercase as needed though with relative ease.

We can use the tolower or toupper functions to coerce all text into lower or uppercase respectively.

# Coerce everything to lowercase
fix2_r <- tolower(x = fix1_r)

# Print the result
print(fix2_r)
[1] "hello" "hello" "bye"  

We can use the lower or upper methods to coerce all text into lower or uppercase respectively.

# Coerce everything to lowercase
fix2_py = [item.lower() for item in fix1_py]

# Print the result
print(fix2_py)
['hello', 'hello', 'bye']