Friday, October 26, 2018

Creating a tokenizer using the split method

I'm trying to create a simple tokenizer that splits on whitespace, lowercases tokens, removes all non-alphabetic characters, and keeps only terms with 3 or more characters. I wrote this code; it already handles lowercase letters, non-alphabetic characters, and the 3-or-more-characters rule. However, I want to use the split method, and I don't know how. Please suggest something.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class main {

    public static final String EXAMPLE_TEST = "This Mariana John bar Barr "
            + "12364 FFFFF aaaa a s d f g.";

    public static void main(String[] args) {
        // Matches one whitespace character followed by 3 to 20 lowercase letters.
        Pattern pattern = Pattern.compile("(\\s[a-z]{3,20})");
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);
        // Print the position and text of every match.
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
    }
}
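
One possible way to do this with split is sketched below. It is only an illustration, not a definitive answer: the class name SplitTokenizer and the method tokenize are made up for this example, and "alphabetic" is assumed to mean the ASCII letters a to z, so accented letters would be stripped as well. The idea is to split the text on runs of whitespace with String.split, lowercase each piece, delete everything that is not a letter, and keep the piece only if at least 3 characters remain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration; the name is not from the question.
class SplitTokenizer {

    // Splits on whitespace, lowercases, strips non-letters, keeps tokens of length >= 3.
    // Assumption: "alphabetic" means ASCII a-z only.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // trim() avoids an empty first element when the text starts with whitespace.
        for (String raw : text.trim().split("\\s+")) {
            String cleaned = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (cleaned.length() >= 3) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String example = "This Mariana John bar Barr "
                + "12364 FFFFF aaaa a s d f g.";
        System.out.println(tokenize(example));
    }
}
```

With the example string from the question this prints [this, mariana, john, bar, barr, fffff, aaaa]. Unlike the regex in the question, which only matches words that are already lowercase, this version lowercases first, so "This" and "FFFFF" are kept as "this" and "fffff".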
