clean_strings prepares strings for name matching, either on its
own or within tier_match (see the Using-tier-match vignette).
It has several options that control how the strings are cleaned.
Here’s the example vector of names we’ll be using:
name_vec <- corp_data1[, Company]
name_vec
#>  [1] "Walmart"            "Bershire Hataway"   "Apple"             
#>  [4] "Exxon Mobile"       "McKesson "          "UnitedHealth Group"
#>  [7] "CVS Health"         "General Motors"     "AT&T"              
#> [10] "Ford Motor Company"
First, we can use the basic string cleaning defaults:
clean_strings(name_vec)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "general motors"     "atandt"            
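#> [10] "ford motor company"
Without any additional arguments, clean_strings does the
following, as the output above shows: it lowercases every name, trims trailing
whitespace (‘McKesson ’ becomes ‘mckesson’), and replaces special characters
with words (‘AT&T’ becomes ‘atandt’).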
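As a rough sketch of those defaults in base R (the helper `basic_clean` is hypothetical and is not how fedmatch implements clean_strings; it only mimics the behaviour visible in the output above):

```r
# Hypothetical helper, not part of fedmatch: approximates the default
# cleaning steps seen in the output above, using base R only.
basic_clean <- function(x) {
  x <- tolower(x)                          # lowercase everything
  x <- gsub("&", "and", x, fixed = TRUE)   # spell out the ampersand
  x <- gsub("[[:punct:]]", " ", x)         # turn remaining punctuation into spaces
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  trimws(x)                                # drop leading/trailing whitespace
}

basic_clean(c("McKesson ", "AT&T", "Exxon Mobile"))
#> [1] "mckesson"     "atandt"       "exxon mobile"
```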
Then, we have a few different options we can use.
sp_char_words is a data.frame with 2 columns: the first
column holds the symbols to replace, and the second holds their replacements.
fedmatch has a built-in set of symbols:
print(sp_char_words)
#>    character replacement
#>       <char>      <char>
#> 1:       \\&         and
#> 2:       \\$      dollar
#> 3:       \\%     percent
#> 4:       \\@          at
But you can use any data.frame you’d like, to make whatever replacements you’d like:
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#>  [1] "walmart"                            "bershire hataway"                  
#>  [3] "apple"                              "exxapplen mapplebile"              
#>  [5] "mckessapplen"                       "unitedhealth grappleup"            
#>  [7] "cvs health"                         "general mappletapplers"            
#>  [9] "at t"                               "fapplerd mappletappler capplempany"
common_words is similar, but it respects word boundaries
(so a standalone ‘Corp’ can become ‘Corporation’ without also rewriting
the ‘Corp’ inside a longer word). fedmatch has a built-in set of 54 words and
their replacements, corporate_words:
print(corporate_words[1:5])
#>      abbr     long.names
#>    <char>         <char>
#> 1:  accep     acceptance
#> 2:   amer        america
#> 3:  assoc     associates
#> 4:     cl company listed
#> 5:  cmnty      community
But you can use whatever words you’d like:
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
                                                              replacement = c("bananas", "oranges")))
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "bananas motors"     "atandt"            
#> [10] "ford motor company"
(‘bananas motors’ sounds like a lovely place to work.) Note that the ‘almart’ in ‘walmart’ didn’t get replaced, because common_words respects word boundaries.
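To illustrate what respecting word boundaries means, here is a minimal base R sketch. The helper `replace_whole_word` is hypothetical and is not fedmatch's implementation; it relies on the regular-expression anchor `\b`, which only matches at the edge of a word.

```r
# Hypothetical helper, not part of fedmatch: replace `word` only when it
# appears as a whole word, using the \b word-boundary anchor.
replace_whole_word <- function(x, word, replacement) {
  gsub(paste0("\\b", word, "\\b"), replacement, x, perl = TRUE)
}

x <- c("walmart", "general motors")
x <- replace_whole_word(x, "general", "bananas")  # whole word: replaced
x <- replace_whole_word(x, "almart", "oranges")   # inside "walmart": untouched
x
#> [1] "walmart"        "bananas motors"
```

Dropping the two `\\b` anchors would turn ‘walmart’ into ‘woranges’, which is the behaviour sp_char_words showed earlier with ‘o’ and ‘apple’.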
You can also use a related function, word_frequency, to
look for the most common strings in your data.
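The idea can be sketched in base R as follows. This is not fedmatch's word_frequency, whose arguments are not shown here; it is only an illustration of counting words across cleaned names, and the three names in `names_clean` are made-up example data.

```r
# Illustrative only: count how often each word appears across cleaned names.
# `names_clean` is made-up example data, not taken from corp_data1.
names_clean <- c("general motors", "ford motor company", "general electric")

words <- unlist(strsplit(names_clean, " ", fixed = TRUE))  # one element per word
word_counts <- sort(table(words), decreasing = TRUE)       # most frequent first

names(word_counts)[1]
#> [1] "general"
```

Words that come out on top of such a count are natural candidates for a custom common_words table.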
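remove_words is a boolean that lets you simply remove the words in ‘common_words’ rather than replacing them, and remove_char is a character vector that specifies a set of characters to remove. In the first example below, each removed ‘a’ or ‘c’ leaves a space behind.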
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#>  [1] "w lm rt"                           "bershire h t w y"                 
#>  [3] "pple"                              "exxapplen mapplebile"             
#>  [5] "m kessapplen"                      "unitedhe lth grappleup"           
#>  [7] "vs he lth"                         "gener l mappletapplers"           
#>  [9] "t t"                               "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
                                                              replacement = c("bananas", "oranges")),
              remove_words = TRUE)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "motors"             "atandt"            
#> [10] "ford motor"
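Both kinds of removal can be approximated in base R. The sketch below is not fedmatch's implementation; `squish`, `drop_words`, and `drop_chars` are hypothetical names, and the regular expressions are just one way to reproduce the outputs shown above.

```r
# Hypothetical helpers, not part of fedmatch.
# Collapse repeated whitespace and trim the ends.
squish <- function(s) trimws(gsub("[[:space:]]+", " ", s))

# Remove whole words (word boundaries respected), then tidy whitespace.
drop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  squish(gsub(pattern, "", x, perl = TRUE))
}

# Replace each listed character with a space, then tidy whitespace.
drop_chars <- function(x, chars) {
  pattern <- paste0("[", paste(chars, collapse = ""), "]")
  squish(gsub(pattern, " ", x))
}

drop_words(c("general motors", "ford motor company"), c("general", "company"))
#> [1] "motors"     "ford motor"
drop_chars(c("walmart", "apple"), c("a", "c"))
#> [1] "w lm rt" "pple"
```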