The Google Prediction API

In a previous post I explored some sample sentiment training data available from Sanders. Now let's try using it in the Google Prediction API.

The API lets you upload a set of training data. It will then create a model which you can interrogate. Training data is stored in Google Cloud Storage, and the API is accessible via REST, secured by OAuth in the usual Google style.

To get a good idea of what's involved I recommend reading the Hello Prediction! tutorial. I pretty much followed their example, except instead of detecting the language I used it to detect sentiment.

I had to convert that training data into a form suitable for the API. In this case that just means a CSV file with the label in the first column and the text in the second, like so:

"positive","I love the whole world and everything in it"
"negative","You guys suck"
"neutral","Cheese is a kind of dairy product"
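
As a rough illustration of producing that layout, here is a minimal Python sketch. It is not the script I used; the `rows` list and the `to_training_csv` helper are hypothetical names invented for the example, and it assumes the examples are already in memory as (label, text) pairs. Quoting every field keeps commas and quotes inside the tweets from breaking the columns.

```python
import csv
import io

# Hypothetical (label, text) pairs; these mirror the sample rows above.
rows = [
    ("positive", "I love the whole world and everything in it"),
    ("negative", "You guys suck"),
    ("neutral", "Cheese is a kind of dairy product"),
]


def to_training_csv(examples):
    """Render (label, text) pairs as CSV, label first, every field quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for label, text in examples:
        writer.writerow([label, text])
    return buffer.getvalue()


training_csv = to_training_csv(rows)
print(training_csv, end="")
```

The resulting text is what would then be uploaded to Google Cloud Storage as the training file.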

After following the steps described in the tutorial I was then in a position to query the model. Here's the prediction for an actual example taken from the positive data set:

{
 "kind": "prediction#output",
 "id": "my_model_id",
 "selfLink": "https://www.googleapis.com/prediction/v1.5/trainedmodels/my_model_id/predict",
 "outputLabel": "positive",
 "outputMulti": [
  {
   "label": "positive",
   "score": 0.666667
  },
  {
   "label": "negative",
   "score": 0
  },
  {
   "label": "neutral",
   "score": 0.333333
  }
 ]
}
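
To make the exchange a bit more concrete, here is a small Python sketch of building the query and reading the reply. It sends nothing over the network and leaves out the OAuth step entirely. The URL pattern is taken from the selfLink above; the request body shape (`input.csvInstance`) is my assumption about what the v1.5 predict call expects, so check it against the tutorial. The helper names `build_predict_request` and `ranked_labels` are made up for this example.

```python
import json

# URL prefix copied from the selfLink in the response above.
BASE_URL = "https://www.googleapis.com/prediction/v1.5/trainedmodels"


def build_predict_request(model_id, text):
    """Return the URL and JSON body for a predict call (nothing is sent)."""
    url = f"{BASE_URL}/{model_id}/predict"
    # Assumed body shape: one text feature passed as a single csvInstance entry.
    body = json.dumps({"input": {"csvInstance": [text]}})
    return url, body


def ranked_labels(response_text):
    """Parse a prediction response and sort the labels by descending score."""
    response = json.loads(response_text)
    scores = [(item["label"], float(item["score"])) for item in response["outputMulti"]]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# The scores from the response shown above, trimmed to the relevant fields.
sample_response = """
{
 "outputLabel": "positive",
 "outputMulti": [
  {"label": "positive", "score": 0.666667},
  {"label": "negative", "score": 0},
  {"label": "neutral", "score": 0.333333}
 ]
}
"""

url, body = build_predict_request("my_model_id", "I love the whole world")
for label, score in ranked_labels(sample_response):
    print(label, score)
```

Sorting the outputMulti entries rather than reading only outputLabel makes it easy to see how close the runner-up category was.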

Note that it doesn't give a unanimous positive vote, although it clearly chooses positive as the most likely category. I suspect this is because there is a lot more neutral data in the training set than either positive or negative, so there is always a tendency to treat things as neutral. That bias is arguably a useful quality where borderline cases are involved.
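
One way to check that suspicion is to count how the labels are distributed in the training file. The sketch below is only an illustration: `label_shares` is a hypothetical helper, and `toy_csv` is a made-up four-row stand-in, not the real Sanders data or its real proportions.

```python
import csv
import io
from collections import Counter


def label_shares(csv_text):
    """Return each label's share of the rows, reading the label from column one."""
    counts = Counter(row[0] for row in csv.reader(io.StringIO(csv_text)) if row)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up rows purely to exercise the function; not real training data.
toy_csv = (
    '"neutral","Cheese is a kind of dairy product"\n'
    '"neutral","The meeting is at noon"\n'
    '"positive","I love the whole world and everything in it"\n'
    '"negative","You guys suck"\n'
)

shares = label_shares(toy_csv)
for label, share in sorted(shares.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{label}: {share:.2f}")
```

If neutral dominates the real file in the same way, that would at least be consistent with the model leaning towards neutral.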

The other thing worth noting is the suspicious-looking 2/3 and 1/3 score values themselves. Every query I tried produced this same 2/3 and 1/3 split, never any other numbers. I don't know what causes this.

I need to spend some more time with this model, and probably get some more training data. One thing I will say is that it's both easy to use and fast. In Java terms, the google-api-java-client covers a lot of ground here. In future posts I will write more about developing with the Prediction API and how well it performs.

Posted on March 22, 2013 and filed under dev.