Wednesday, 24 March 2021

Extracting certain word(s) before a specific pattern, while excluding specified patterns, in R

Using R, I want to extract building, plaza, or mansion names from addresses. Each name comes immediately before the word "building", "mansion", or "plaza". Here is an example:

addresses <- c("big fake plaza, 12 this street,district, city",
               "Green mansion, district, city",
               "Block 7 of orange building  district, city",
               "98 main street block a blue plaza, city",
               "blue red mansion, 46 pearl street, city")

What I want to get is:

"big fake" "Green" "orange" "blue" "blue red"

The code I am currently using is:

library(stringr)
str_extract(addresses, "[[a-z]]*\\s*[[a-z]+]*\\s*(?=(building|mansion|plaza))")

Sometimes the name is two words, sometimes one. However, because the address format varies, an 'a' or 'of' is sometimes extracted as well. How can I keep extracting the two-word building names but exclude the 'a' or 'of'?

Thanks in advance
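
One possible approach, offered as a sketch rather than a definitive answer: match one or two words directly before the keyword, and put a negative lookahead in front of each word so that 'a' and 'of' can never be one of them. The names `stop_words`, `keywords`, `pattern` and `building_names` are made up for this illustration, and the sketch assumes the names contain only letters and that a case-insensitive match is acceptable.

```r
library(stringr)

addresses <- c("big fake plaza, 12 this street,district, city",
               "Green mansion, district, city",
               "Block 7 of orange building  district, city",
               "98 main street block a blue plaza, city",
               "blue red mansion, 46 pearl street, city")

# Hypothetical lists for this sketch; extend them as the data requires.
stop_words <- c("a", "of")
keywords   <- c("building", "mansion", "plaza")

# One or two words, none of which is a stop word, followed by whitespace,
# and then (lookahead only, not captured) one of the keywords.
pattern <- paste0(
  "\\b(?:(?!(?:", paste(stop_words, collapse = "|"), ")\\b)[a-z]+\\s+){1,2}",
  "(?=(?:", paste(keywords, collapse = "|"), ")\\b)"
)

# The match keeps the whitespace before the keyword, so trim it off.
building_names <- str_trim(
  str_extract(addresses, regex(pattern, ignore_case = TRUE))
)
print(building_names)
```

On the five sample addresses this should give "big fake" "Green" "orange" "blue" "blue red". The `{1,2}` quantifier caps the name at two words, so a longer name would come back as its last two words only; that limit is an assumption taken from the examples in the question.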
