I'm trying to create a simple tokenizer that splits on whitespace, lowercases tokens, removes all non-alphabetic characters, and keeps only terms with 3 or more characters. I wrote the code below, and it already works for lowercase letters, non-alphabetic characters, and keeping only terms of 3 or more characters. However, I want to use the split method, and I don't know how. Please suggest something.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class main {
    public static final String EXAMPLE_TEST = "This Mariana John bar Barr "
            + "12364 FFFFF aaaa a s d f g.";

    public static void main(String[] args) {
        // Matches one whitespace character followed by 3 to 20 lowercase letters.
        Pattern pattern = Pattern.compile("(\\s[a-z]{3,20})");
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);
        // Print the position and text of every match found in the input.
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
    }
}
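
One possible way to do the same thing with the split method is sketched below. This is only a minimal sketch, not a definitive implementation: the class name SplitTokenizer and the method name tokenize are made up for illustration, and it assumes that "non-alphabetic" means anything outside the ASCII letters a-z once the token has been lowercased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
final class SplitTokenizer {

    // Splits on whitespace, lowercases each token, strips non-letters,
    // and keeps only tokens with 3 or more characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // "\\s+" treats any run of whitespace as a single separator;
        // trim() avoids an empty first element when the text starts with a space.
        for (String raw : text.trim().split("\\s+")) {
            // Assumption: only ASCII letters a-z count as alphabetic.
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (token.length() >= 3) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String example = "This Mariana John bar Barr 12364 FFFFF aaaa a s d f g.";
        // prints [this, mariana, john, bar, barr, fffff, aaaa]
        System.out.println(tokenize(example));
    }
}
```

Unlike the regex-matching version, this lowercases each token before filtering, so words such as "This" and "FFFFF" are kept as "this" and "fffff" rather than skipped.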