Wednesday, 14 September 2016

R - Data cleaning - Cannot get FindReplace function to work as intended

I have a large dataframe with a column that has thousands of different location (city) names, and I need to simplify/clean it.

After struggling for a while with regular expressions and loops, I found the DataCombine package and its FindReplace function, which is meant to do what I want, but I can't get it to work.

So I have:

   UserId          Location
1   USR_1             Paris
2   USR_2            London
3   USR_3           Londres
4   USR_4           Neuilly
5   USR_5            Berlin
6   USR_6    London Chelsea
7   USR_7 Berlin Schoenfeld
8   USR_8          Paris-20
9   USR_9           Neuilly
10 USR_10     Friedrischain

The cleaning is just a replacement, e.g. "London Chelsea" should become "London", "Brooklyn" should become "New York City", and "Paris 20e" and "Paris-14" should become "Paris". Going further, I would like everything that contains the pattern "Paris" to be replaced by "Paris" (something like LIKE "Paris%" in SQL).
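To make the intended rule concrete, here is a minimal base R sketch of the behaviour I am after. It does not use DataCombine. The names `lookup`, `location` and `clean_location` are made up for illustration, and it treats each pattern as a plain substring (closer to LIKE "%Paris%" than LIKE "Paris%"), which is an assumption about what I want rather than a statement of how FindReplace works.

```r
# Hypothetical lookup table: substring to search for -> canonical city name
lookup <- data.frame(
  is        = c("Paris", "London", "Berlin", "Neuilly", "Friedr"),
  should_be = c("Paris", "London", "Berlin", "Paris", "Berlin"),
  stringsAsFactors = FALSE
)

# Same locations as in the example data
location <- c("Paris", "London", "Londres", "Neuilly", " Berlin",
              "London Chelsea", "Berlin Schoenfeld", "Paris-20",
              "Neuilly", "Friedrischain")

# clean_location() is an illustrative helper, not part of any package.
# For each row of the lookup table, every element of x that contains the
# pattern is replaced entirely by the canonical name.
clean_location <- function(x, lookup) {
  out <- x
  for (i in seq_len(nrow(lookup))) {
    # Match against the original values so one replacement cannot trigger another
    hit <- grepl(lookup$is[i], x, fixed = TRUE)
    out[hit] <- lookup$should_be[i]
  }
  out
}

cleaned <- clean_location(location, lookup)
print(cleaned)
```

With this table, "Londres" would stay as it is because no pattern matches it, so it would need its own row in the lookup table.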

# Data for testing
library(DataCombine)
library(dplyr) # data_frame() comes from dplyr, not DataCombine
user_test <- data_frame(x = paste("USR", as.character(1:10), sep = "_"), y = c("Paris", "London", "Londres", "Neuilly", " Berlin", "London Chelsea", "Berlin Schoenfeld", "Paris-20", "Neuilly", "Friedrischain"))
colnames(user_test) <- c("UserId","Location")
user_test <- as.data.frame(user_test) # Not sure why this is needed, but without it the object does not have the plain data.frame class
should_be <- data_frame(c("Paris", "London", "Berlin", "Neuilly", "Friedr"), c("Paris", "London", "Berlin", "Paris", "Berlin"))
colnames(should_be) <- c("is","should_be")

# Calling the function
FindReplace(data = user_test, Var = "Location", replaceData = should_be, from = "is", to = "should_be", exact = FALSE, vector = FALSE)

And the function returns this:

   UserId          Location
1   USR_1             Paris
2   USR_2            London
3   USR_3           Londres
4   USR_4           Neuilly
5   USR_5            Berlin
6   USR_6    London Chelsea
7   USR_7 Berlin Schoenfeld
8   USR_8          Paris-20
9   USR_9           Neuilly
10 USR_10     Friedrischain

Not cleaned at all.

Any ideas on why?

Thanks
