I have a large dataframe with a column that has thousands of different location (city) names, and I need to simplify/clean it.
After struggling with regex and loops for quite a while, I found the DataCombine package and its FindReplace function, which is meant to do what I want, but I can't get it to work.
So I have:
   UserId          Location
1   USR_1             Paris
2   USR_2            London
3   USR_3           Londres
4   USR_4           Neuilly
5   USR_5            Berlin
6   USR_6    London Chelsea
7   USR_7 Berlin Schoenfeld
8   USR_8          Paris-20
9   USR_9           Neuilly
10 USR_10     Friedrischain
The cleaning is just a replacement: for example, "London Chelsea" should become "London", "Brooklyn" should become "New York City", and "Paris 20e" and "Paris-14" should both become "Paris". Going further, I would like every value that matches the pattern "Paris" to be replaced by "Paris" (similar to LIKE "Paris%" in SQL).
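To make the intended result concrete, here is a rough base R sketch of the prefix-style replacement I have in mind. It does not use DataCombine at all. The helper name clean_location, the lookup data frame and the choice to trim whitespace first are assumptions for illustration only, and values with no matching prefix (such as "Londres") are simply left unchanged.

```r
# Sample locations, same values as in the test data of this question.
locations <- c("Paris", "London", "Londres", "Neuilly", " Berlin",
               "London Chelsea", "Berlin Schoenfeld", "Paris-20",
               "Neuilly", "Friedrischain")

# Lookup table: prefix to search for ("is") and cleaned name ("should_be").
lookup <- data.frame(
  is        = c("Paris", "London", "Berlin", "Neuilly", "Friedr"),
  should_be = c("Paris", "London", "Berlin", "Paris", "Berlin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of DataCombine): replaces every value that
# starts with a known prefix, roughly like LIKE 'prefix%' in SQL.
clean_location <- function(x, lookup) {
  x <- trimws(x)   # assumption: stray leading/trailing spaces should be ignored
  out <- x
  for (i in seq_len(nrow(lookup))) {
    # Match against the trimmed original so replacements cannot chain.
    hit <- startsWith(x, lookup$is[i])
    out[hit] <- lookup$should_be[i]
  }
  out
}

cleaned <- clean_location(locations, lookup)
print(cleaned)
```

With this sketch, "London Chelsea", "Paris-20" and "Friedrischain" collapse to "London", "Paris" and "Berlin" respectively, which is the behaviour I expected from FindReplace with exact = FALSE.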
# Data for testing
library(DataCombine)
library(dplyr) # data_frame() comes from dplyr/tibble, not from DataCombine
user_test <- data_frame(x = paste("USR", as.character(1:10), sep = "_"), y = c("Paris", "London", "Londres", "Neuilly", " Berlin", "London Chelsea", "Berlin Schoenfeld", "Paris-20", "Neuilly", "Friedrischain"))
colnames(user_test) <- c("UserId","Location")
user_test <- as.data.frame(user_test) # Not sure why this is needed, but without it the object does not have the plain data.frame class
should_be <- data_frame(c("Paris", "London", "Berlin", "Neuilly", "Friedr"), c("Paris", "London", "Berlin", "Paris", "Berlin"))
colnames(should_be) <- c("is","should_be")
# Calling the function
FindReplace(data = user_test, Var = "Location", replaceData = should_be, from = "is", to = "should_be", exact = FALSE, vector = FALSE)
And the function returns this:
   UserId          Location
1   USR_1             Paris
2   USR_2            London
3   USR_3           Londres
4   USR_4           Neuilly
5   USR_5            Berlin
6   USR_6    London Chelsea
7   USR_7 Berlin Schoenfeld
8   USR_8          Paris-20
9   USR_9           Neuilly
10 USR_10     Friedrischain
Not cleaned at all.
Any ideas on why?
Thanks