Google Prediction Revisited

A few weeks ago I made a brief foray into the Google Prediction API, and I resolved to revisit it armed with some techniques for cleaning up the training data.

Briefly, I've cleaned up the data to remove stop words, Twitter orthography (e.g. 'RT', but not #hashtags), @usernames, and links.
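
A minimal sketch of that cleanup in Python (the stop-word list and the exact regexes here are illustrative choices, not necessarily the exact pipeline):

  import re

  # Illustrative stop-word set; a real run would use a fuller list
  # (e.g. NLTK's English stop words).
  STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to"}

  def clean_tweet(text):
      text = re.sub(r"https?://\S+", "", text)  # strip links
      text = re.sub(r"@\w+:?", "", text)        # strip @usernames
      text = re.sub(r"\bRT\b", "", text)        # strip the retweet marker
      # Drop stop words but keep everything else, including #hashtags.
      words = [w for w in text.split() if w.lower() not in STOP_WORDS]
      return " ".join(words)

  print(clean_tweet("RT @alice: this is the best ice cream http://t.co/xyz #yum"))
  # -> best ice cream #yum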

The results are more positive. Gone are the dubious classifications in the 0.33333 to 0.66666... range.
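
Each query below went to the trained model through the Prediction API's trainedmodels.predict method. A rough sketch of that call in Python (the model id and token are placeholders, and requests stands in for whichever HTTP client you prefer):

  import requests

  MODEL_ID = "sentiment-model"  # placeholder
  ACCESS_TOKEN = "ya29..."      # placeholder OAuth 2.0 bearer token

  def predict(text):
      url = ("https://www.googleapis.com/prediction/v1.5/"
             "trainedmodels/%s/predict" % MODEL_ID)
      body = {"input": {"csvInstance": [text]}}
      resp = requests.post(
          url,
          json=body,
          headers={"Authorization": "Bearer " + ACCESS_TOKEN},
      )
      resp.raise_for_status()
      return resp.json()["outputMulti"]

Here are some examples, each showing the query followed by the model's outputMulti response: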

I just love ice cream

  [
   {
    "label": "positive",
    "score": 0.531505
   },
   {
    "label": "negative",
    "score": 0.11532
   },
   {
    "label": "neutral",
    "score": 0.353176
   }
  ]

this is relevant to my interests

  [
   {
    "label": "positive",
    "score": 0.117893
   },
   {
    "label": "negative",
    "score": 0.333303
   },
   {
    "label": "neutral",
    "score": 0.548804
   }
  ]

I absolutely hate this rubbish

  [
   {
    "label": "positive",
    "score": 0.07385
   },
   {
    "label": "negative",
    "score": 0.737656
   },
   {
    "label": "neutral",
    "score": 0.188494
   }
  ]

I have, of course, cherry-picked these examples. The training data is still heavily skewed towards neutral examples, and this shows in some queries:

I'm incredibly happy

  [
   {
    "label": "positive",
    "score": 0.150057
   },
   {
    "label": "negative",
    "score": 0.186907
   },
   {
    "label": "neutral",
    "score": 0.663037
   }
  ]
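
One obvious refinement is to rebalance the classes before retraining. A minimal sketch, assuming the training file uses the Prediction API's CSV layout with the label in the first column (the file names are placeholders):

  import csv
  import random
  from collections import Counter

  with open("training.csv") as f:
      rows = list(csv.reader(f))

  counts = Counter(row[0] for row in rows)
  target = min(counts.values())  # size of the smallest class

  # Downsample every class to the size of the smallest one.
  random.seed(0)
  balanced = []
  for label in counts:
      subset = [row for row in rows if row[0] == label]
      balanced.extend(random.sample(subset, target))
  random.shuffle(balanced)

  with open("training-balanced.csv", "w", newline="") as f:
      csv.writer(f).writerows(balanced)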

But on the whole this is much more useful than my previous experiment, and I'll continue to refine the processing and try to get some more training data.
