Stop And Stem

After looking at the results of my brief foray into sentiment analysis of tweets a couple of weeks ago, and reading about the problem, it became clear that pre-processing may well help clean up the data and improve training. The goal is to reduce the number of possible features. Put simply, there are too many different words, and a lot of them are too noisy!

There are various techniques to do this, such as removing stop words ("and", "the" etc., words that don't add to the sentiment), and stemming, which reduces the variants of the same word (e.g. plurals and other endings) to the same token.

In Java the Lucene libraries help a great deal here. Here's how to remove stop words using Lucene's StopFilter:

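    // Split the text into tokens
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41,
            new StringReader("I've got a brand new combine harvester, and I'm giving you the key"));

    // Chain the filters: StandardFilter first, then drop English stop words
    final StandardFilter standardFilter = new StandardFilter(Version.LUCENE_41, tokenizer);
    final StopFilter stopFilter = new StopFilter(Version.LUCENE_41, standardFilter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

    // Attributes are shared along the chain, so this reflects the filtered tokens
    final CharTermAttribute charTermAttribute = stopFilter.addAttribute(CharTermAttribute.class);

    stopFilter.reset(); // the TokenStream contract requires reset() before incrementToken()
    while (stopFilter.incrementToken()) {
        final String token = charTermAttribute.toString();
        System.out.println("token: " + token);
    }
    stopFilter.end();
    stopFilter.close();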

This will give you the following output:

    token: I've
    token: got
    token: brand
    token: new
    token: combine
    token: harvester
    token: I'm
    token: giving
    token: you
    token: key

Note that this assumes that the language is English; you'll have to find your own list of stop words for other languages. This example also passes the tokens through the StandardFilter, which normalises the output of the StandardTokenizer; the tokenizer itself does the splitting on word boundaries. If you need things like email addresses kept as single tokens, Lucene's UAX29URLEmailTokenizer is worth a look.
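To make it clearer what the stop filter is doing, here's a rough plain-Java sketch with no Lucene involved: stop word removal is essentially a set lookup per token. The class name `SimpleStopWords`, the method `removeStopWords` and the tiny word list are all my own made-up examples, not Lucene's actual stop set, and the lower-casing is an extra assumption (the Lucene chain above doesn't lower-case).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SimpleStopWords {
    // Hypothetical, hand-picked list for illustration only; not Lucene's set
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "and", "the", "of", "to"));

    // Returns the tokens that are not in the given stop word set
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        final List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            // Compare in lower case so "The" and "the" are treated alike
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        final List<String> tokens = Arrays.asList(
                "I've", "got", "a", "brand", "new", "combine", "harvester");
        System.out.println(removeStopWords(tokens, STOP_WORDS));
    }
}
```

For another language you would pass in a different set; the filtering logic stays the same.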

Stemming can also be achieved with the help of Lucene, via the PorterStemmer:

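    // Snowball's Porter stemmer, bundled with Lucene (org.tartarus.snowball.ext.PorterStemmer)
    final PorterStemmer stemmer = new PorterStemmer();

    // Assumed input for illustration: "weakness" stems to "weak"
    stemmer.setCurrent("weakness");
    stemmer.stem();

    final String current = stemmer.getCurrent();

    System.out.println("current: " + current);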

This will print out:

    current: weak

Again, this is for English only.

Some more ideas to clean up the data: removing @usernames, excessive punctuation!!! and characters repeated too many times (eg "cooool"). Armed with these I'll attempt my sentiment training again.
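Here's a quick sketch of how those extra clean-up steps might look with plain regular expressions. The class `TweetCleaner`, its patterns and the choice to squash repeated letters down to two (so "cooool" becomes "cool") are my own assumptions, not something taken from a library, and the rules are deliberately crude: for instance, the username pattern would also chew up part of an email address.

```java
import java.util.regex.Pattern;

final class TweetCleaner {
    // @usernames: an @ followed by one or more word characters
    private static final Pattern USERNAME = Pattern.compile("@\\w+");
    // The same punctuation mark repeated, e.g. "!!!" or "..."
    private static final Pattern REPEATED_PUNCTUATION = Pattern.compile("([!?.,])\\1+");
    // The same letter three or more times in a row, e.g. the "oooo" in "cooool"
    private static final Pattern REPEATED_LETTER = Pattern.compile("(\\p{L})\\1{2,}");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String clean(String tweet) {
        String result = USERNAME.matcher(tweet).replaceAll("");
        // Keep a single punctuation mark
        result = REPEATED_PUNCTUATION.matcher(result).replaceAll("$1");
        // Keep two letters, since doubled letters are common in real words
        result = REPEATED_LETTER.matcher(result).replaceAll("$1$1");
        // Tidy up the gaps left behind by the removals
        return WHITESPACE.matcher(result).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(TweetCleaner.clean("@bob this is cooool!!!"));
    }
}
```

Running `main` turns "@bob this is cooool!!!" into "this is cool!". The cleaned text could then be fed into the tokenizer, stop filter and stemmer as before.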

Posted on April 8, 2013 and filed under dev.