When I use the RegexTokenizer from pyspark.ml.feature to tokenize the Sentence column of my DataFrame and extract all the word characters, I get the opposite of what the Python re package returns for the same sentence and pattern. Here is the sample code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder \
    .master("local") \
    .appName("Word list") \
    .getOrCreate()

df = spark.createDataFrame(
    data=[["Hi there, I have a question about RegexTokenizer, Could you please help me..."]],
    schema=["Sentence"])

regexTokenizer = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w")
df = regexTokenizer.transform(df)
df.first()['letters']
This gives the following output:
[' ', ', ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' ', '...']
On the other hand, if I use the re module on the same sentence with the same pattern to match the letters:
import re

sentence = "Hi there, I have a question about RegexTokenizer, could you please help me..."
letters_list = re.findall("\\w", sentence)
print(letters_list)
I get the output I expect, matching the regular expression pattern:
['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', 'h', 'a', 'v', 'e', 'a',
'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'a', 'b', 'o', 'u', 't',
'R', 'e', 'g', 'e', 'x', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'e',
'r', 'c', 'o', 'u', 'l', 'd', 'y', 'o', 'u', 'p', 'l', 'e', 'a',
's', 'e', 'h', 'e', 'l', 'p', 'm', 'e']
I also found that I need to use \W instead of \w in PySpark to get around this. Why is there this difference? Or have I misunderstood the usage of the pattern argument in RegexTokenizer?
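For completeness, this is what I mean by the \W workaround. The names df2 and regexTokenizerW are just for illustration, and the input DataFrame is rebuilt so the earlier transform (which already added a letters column) does not get in the way:

# same sentence as before, only the pattern is inverted to \W
df2 = spark.createDataFrame(
    data=[["Hi there, I have a question about RegexTokenizer, Could you please help me..."]],
    schema=["Sentence"])

regexTokenizerW = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\W")
regexTokenizerW.transform(df2).first()['letters']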