# Make a vector of characters
text_r <- c("he1lo", "HELLO", "bye")
# Find and replace the number "1" with the letter "l"
fix1_r <- gsub(pattern = "1", replacement = "l", x = text_r)
# Print the result
print(fix1_r)
[1] "hello" "HELLO" "bye"
Data stored as text (i.e., string/object in Python or character in R) is notoriously typo-prone and often requires extensive quality control checks throughout the data tidying process. Below is a non-exhaustive set of common text methods that may prove valuable to people interested in dealing with text data in either Python or R.
The bulk of text tidying often boils down to finding unwanted strings/characters and replacing them with desired variants. In code, this find-and-replace operation is how typos are typically fixed.
We can use the R function gsub to find and replace (part of) a character object.
We can use the Python method replace to find and replace (part of) a string/object variable.
# Make a list of strings
text_py = ["he1lo", "HELLO", "bye"]
# Find and replace the number "1" with the letter "l"
fix1_py = [item.replace("1", "l") for item in text_py]
# Print the result
print(fix1_py)
['hello', 'HELLO', 'bye']
Note that in Python we have to loop across our list (here with a list comprehension) to do this operation, because replace is a string method that acts on one string at a time.
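If the same fix is needed in several places, the loop can be wrapped in a small helper. The sketch below is only one possible way to do this: the function name replace_in_all is hypothetical (it is not part of Python or any library) and it relies solely on the built-in string method replace, which matches literal text rather than patterns.

```python
def replace_in_all(strings, old, new):
    """Return a new list with every occurrence of `old` replaced by `new`.

    `replace_in_all` is a hypothetical helper name used only for illustration.
    """
    # str.replace works on a single string, so apply it to each item in turn
    return [item.replace(old, new) for item in strings]


# Same example data as above
text_py = ["he1lo", "HELLO", "bye"]

# Find and replace the number "1" with the letter "l"
fix1_py = replace_in_all(text_py, "1", "l")

# Print the result
print(fix1_py)
```

The helper returns a new list and leaves the original untouched, which makes it easy to compare the text before and after the fix.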
Text casing (i.e., UPPERCASE versus lowercase) is another frequent source of issues in code because many scripted operations are case-sensitive. Fortunately, coercing text to upper or lowercase as needed is relatively easy.
We can use the R functions tolower or toupper to coerce all text into lower or uppercase, respectively.
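For comparison, a minimal Python sketch of the same coercion is shown here. It assumes the built-in string methods lower and upper are acceptable counterparts to tolower and toupper, and it reuses the example list from earlier; the variable names lower_py and upper_py are illustrative only.

```python
# Make a list of strings
text_py = ["he1lo", "HELLO", "bye"]

# str.lower and str.upper act on one string at a time, so loop over the list
lower_py = [item.lower() for item in text_py]
upper_py = [item.upper() for item in text_py]

# Print the results
print(lower_py)
print(upper_py)

# Comparisons are case-sensitive, so coercing case first avoids missed matches
print("HELLO" == "hello")
print("HELLO".lower() == "hello")
```

As with replace, these methods work on a single string, so the list comprehension is what applies them to every element.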