Wednesday, November 30, 2016

How to tokenize Spanish text with NLTK or Pattern-es

Basically, the issue I am having is with separating indirect and direct object pronouns from the verbs they are attached to.

For example, 'aprenderlo' should ideally be tokenized as two separate tokens ('aprender' + 'lo'), and 'dimelo' as three ('di' + 'me' + 'lo'). I have tried a variety of taggers in both libraries, and so far none of them has produced the results I want. I am sure this must be a common problem, though. Any ideas?
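
To make the desired output concrete, here is a rough rule-based sketch of the kind of splitting I mean. It does not use NLTK or Pattern at all, only the standard library, and it would run on tokens coming out of whatever tokenizer is used. All names in it (`split_clitics`, `CLITICS`, `IMPERATIVES`, and so on) are made up for illustration, and the imperative list is a tiny hand-picked assumption, not a real lexicon.

```python
# Hypothetical sketch: peel Spanish clitic pronouns off the end of a token.
# Assumption: a stem counts as a verb if it looks like an infinitive or
# gerund, or appears in a small hand-made list of short imperatives.

CLITICS = ("nos", "los", "las", "les", "me", "te", "se", "lo", "la", "le", "os")
VERB_ENDINGS = ("ar", "er", "ir", "ando", "iendo", "yendo")
IMPERATIVES = {"di", "da", "pon", "haz", "ven", "ten", "sal", "ve"}  # illustrative only

# Only acute accents are removed, so that 'ñ' is left alone.
ACCENTS = str.maketrans("áéíóú", "aeiou")


def is_verb_like(stem):
    """Very rough check that a stem could be a verb form."""
    return stem in IMPERATIVES or stem.endswith(VERB_ENDINGS)


def _peel(word, remaining):
    """Return (stem, clitics) if clitics can be removed from word, else None."""
    if remaining == 0:
        return None
    for clitic in CLITICS:
        stem = word[:-len(clitic)]
        if not word.endswith(clitic) or not stem:
            continue
        if is_verb_like(stem):
            return stem, [clitic]
        # The stem may itself end in another clitic ('dime' + 'lo').
        deeper = _peel(stem, remaining - 1)
        if deeper is not None:
            return deeper[0], deeper[1] + [clitic]
    return None


def split_clitics(token, max_clitics=2):
    """Split a token into [verb, clitic, ...]; return [token] if no split applies."""
    word = token.lower().translate(ACCENTS)
    result = _peel(word, max_clitics)
    if result is None:
        return [token]
    stem, clitics = result
    return [stem] + clitics


for example in ("aprenderlo", "dimelo", "hablar"):
    print(example, split_clitics(example))
```

A purely suffix-based rule like this will overgenerate (the noun 'mirlo' would be split as 'mir' + 'lo', for instance), which is why I was hoping one of the taggers could supply the part-of-speech information needed to restrict it to real verb forms.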
