How to get character encoding correct on Google App Engine | MacGyver Development

I've been having endless trouble trying to force a particular encoding for some content on the Google App Engine. It was complicated by my Mac's insistence on MacRoman, but even when forcing a file encoding of UTF-8 my web pages would still show up with funny ?s all over the shop.

The Spring CharacterEncodingFilter described in the linked blog post did the trick. Can you think of any other way of doing this without a filter?
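For reference, the filter set-up is just a few lines in web.xml - something like this minimal sketch (the filter name and URL pattern are whatever suits your app):

<filter>
    <filter-name>characterEncodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>characterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>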

Source: http://macgyverdev.blogspot.co.uk/2011/09/...

Everything announced at the Google I/O 2013 keynote in one handy list

Google has completed its mammoth 3-hour I/O 2013 keynote, and many announcements were made. We’ve compiled a handy list so you can catch up and make sure you haven’t missed anything.

 A very handy list from The Next Web about all the exciting announcements made at Google I/O 2013 - well worth checking out. I'm particularly interested in exploring the Compute Engine, more of which later.

Source: http://thenextweb.com/insider/2013/05/15/e...

Simulating the High Replication Datastore Locally

Recently I was trying to add transactional support to certain batch processes in a Google App Engine app, using Objectify, and I was getting strange errors. In particular it would tell me that it

can't operate on multiple entity groups in a single transaction

To my knowledge, I wasn't trying to operate on multiple entity groups. The problem turned out to be that the local dev datastore doesn't simulate the eventual consistency of the live datastore properly.

However, it's possible to turn on a simulation of this behaviour in your local app by following these instructions, ie by passing this JVM flag:

-Ddatastore.default_high_rep_job_policy_unapplied_job_pct=1

The 1 on the end is the percentage of datastore writes that will be left unapplied, ie how much eventual consistency you want to see in your local datastore. In practice any number greater than zero is enough to get Objectify transactions working properly locally in these situations.

If you are using Maven to run your apps using the maven-gae-plugin, then you can configure this option in your pom.xml as follows:

        <plugin>
            <groupId>net.kindleit</groupId>
            <artifactId>maven-gae-plugin</artifactId>
            <version>0.9.6</version>
            <configuration>
                <jvmFlags>
                    <jvmFlag>-Ddatastore.default_high_rep_job_policy_unapplied_job_pct=1</jvmFlag>
                </jvmFlags>
            </configuration>
            <dependencies>
                <dependency>
                    <groupId>net.kindleit</groupId>
                    <artifactId>gae-runtime</artifactId>
                    <version>${gae-runtime.version}</version>
                    <type>pom</type>
                </dependency>
            </dependencies>
        </plugin>

Google Prediction API in the App Engine

I've now integrated the Google Prediction API into a Google App Engine project, in order to supply sentiment prediction at runtime. I wanted to use a service account to access the model via the Google API client libraries for Java. This has proven trickier than I first imagined, but the code is ultimately straightforward.

Some caveats

Use your service account, not the APIs Explorer

I originally set up my model and queried it using the APIs Explorer. Unfortunately I didn't realise that although I was using the same Google account to configure access to the API from the app (see below) as I was using to train the model, the one can't see the other. In other words, the service account and the Google account are separate, and they can't see each other's data. The upshot of this is that I have to train my model programmatically using the service account, if I want to query it programmatically too.

If you set up your Google APIs Console project from a Google Apps account, don't

The problem is that your Google APIs Console project needs to grant access to your Google App Engine "Service Account Name" - see under the Application Settings for your app; it will be of the form something@appspot.gserviceaccount.com. Unfortunately you need to add this to your Team for the Google APIs Console project, and it won't let you if you're logged in to your Google Apps account. For example, I'm logged in under extropy.net, and it complains if I try to add an address that doesn't belong to that domain. I couldn't find any mention of this problem in the documentation anywhere.

The solution is to create a new Google APIs Console project under a regular Gmail account. It seems you can then add any email address, including appspot ones.

If you have a custom domain for your app that matches your Google APIs Console account you may be able to ignore this, because the domain names will match, but I'm not able to confirm that.

The Code

In the end the code is quite simple, although the documentation is misleading and in a number of cases out of date. This is what I found that worked...

Following the examples from the Google API client libraries I created a utility class like so:

static final HttpTransport HTTP_TRANSPORT = new UrlFetchTransport();

static final String MODEL_ID = "your_model_id";
static final String STORAGE_DATA_LOCATION = "path_to_your_training_data.csv";

static final String API_KEY = "the_key_from_the_apis_console";

/**
 * Global instance of the JSON factory.
 */
static final JsonFactory JSON_FACTORY = new JacksonFactory();
public static final String APPLICATION_NAME = "Grokmood";

public static Prediction getPrediction() throws Exception {

    AppIdentityCredential credential =
            new AppIdentityCredential(Arrays.asList(PredictionScopes.PREDICTION));

    Prediction prediction = new Prediction.Builder(
            HTTP_TRANSPORT, JSON_FACTORY, credential).setApplicationName(APPLICATION_NAME)
            .build();

    return prediction;
}

This will create a Prediction object for you. It uses the AppIdentityCredential to access your service account. I found the documentation for this somewhat scarce.

To train the model call this method:

public static void train(Prediction prediction) throws IOException {

    Training training = new Training();
    training.setId(MODEL_ID);
    training.setStorageDataLocation(STORAGE_DATA_LOCATION);

    prediction.trainedmodels().insert(training).setKey(API_KEY).execute();
}

And you can query it like so:

public static String predict(Prediction prediction, String text) throws IOException {
    Input input = new Input();
    Input.InputInput inputInput = new Input.InputInput();
    inputInput.setCsvInstance(Collections.<Object>singletonList(text));
    input.setInput(inputInput);
    Output output = prediction.trainedmodels().predict(MODEL_ID, input).setKey(API_KEY).setId(MODEL_ID).execute();
    return output.getOutputLabel();
}

This last method will simply return positive, negative or neutral for my sentiment model.

When training the model, don't forget that this takes some time - the method just kicks off the training process and returns immediately. For an example of how to wait until the model is finished, take a look at PredictionSample.java. I just kicked off training and came back a quarter of an hour later. Remember that you can't see the status except by querying from the service account - one could add a similar method using trainedmodels().get() to review the training status too, as sketched below.
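For what it's worth, such a status check might look roughly like this (a sketch meant to sit in the same utility class, reusing the MODEL_ID and API_KEY constants above; I haven't confirmed every possible status string beyond "DONE"):

public static boolean isTrainingComplete(Prediction prediction) throws IOException {
    // Fetch the model's metadata; the training status should read "DONE" once training has finished.
    Training training = prediction.trainedmodels().get(MODEL_ID).setKey(API_KEY).execute();
    return "DONE".equals(training.getTrainingStatus());
}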

One last caveat - none of the above will work locally. This is another difference between the local and production Google App Engine environments. Once it's been deployed the app will correctly identify itself, but there's no way to do that when running on your own machine. You will either have to fake your API responses locally or use a different authentication method - one could use OAuth to authenticate after logging in with a Google account. You'd then have two different means of authenticating, one for local and one for production...

Unseasonable

I've been studying D3.js and the Wunderground API for another project and decided to learn about both by creating this little app called Unseasonable.

It takes your current location, looks up the current conditions and historical temperature data and shows a little bar chart in red or blue depending on whether the current temperature is above or below the mean. It's not terribly informative since the historical temperature range is daily, not hourly, so you're likely to be below average at night and above average in the middle of the day, but it was a good learning exercise.

The app is deployed on Google App Engine and makes as much use of the Memcache as possible. If you want to see the D3 code for this chart just look at the source of the page - it's heavily based on this example.

Note: if you want to use the app you need to give it permission to use your location. If you don't, you won't see anything of interest! The Wunderground API doesn't have historical (or even current) data for all locations, and there is no error checking or reporting or any means of knowing what the app is doing - you'll either see the chart or you won't.

Unseasonable

Google Prediction Revisited

A few weeks ago I had a brief foray with the Google Prediction API, and I resolved to revisit it armed with some techniques for cleaning up the training data.

Briefly, I've cleaned up the data to remove stop words, twitter orthography (eg 'RT' but not #hashtags), @usernames and links.

The results are more positive. Gone are the dubious classifications in the 0.66666... to 0.33333 range. Here are some examples, showing the query in italics and the response from the model below it:

I just love ice cream

  {
   "label": "positive",
   "score": 0.531505
  },
  {
   "label": "negative",
   "score": 0.11532
  },
  {
   "label": "neutral",
   "score": 0.353176
  }

this is relevant to my interests

  {
   "label": "positive",
   "score": 0.117893
  },
  {
   "label": "negative",
   "score": 0.333303
  },
  {
   "label": "neutral",
   "score": 0.548804
  }

I absolutely hate this rubbish

  {
   "label": "positive",
   "score": 0.07385
  },
  {
   "label": "negative",
   "score": 0.737656
  },
  {
   "label": "neutral",
   "score": 0.188494
  }

I have of course cherry-picked these examples. The training data is still heavily burdened with neutral examples, and this shows in some queries:

I'm incredibly happy

  {
   "label": "positive",
   "score": 0.150057
  },
  {
   "label": "negative",
   "score": 0.186907
  },
  {
   "label": "neutral",
   "score": 0.663037
  }

But on the whole this is much more useful than my previous experiment, and I'll continue to refine the processing and try to get some more training data.

Stop And Stem

After looking at the results of my brief foray into sentiment analysis of tweets a couple of weeks ago, and reading about the problem, it became clear that pre-processing may well help clean up the data and improve training. The goal is to reduce the number of possible features. Put simply, there are too many different words, and a lot of them are too noisy!

There are various techniques to do this, such as removing stop words ("and", "the" etc., words that don't add to the sentiment), and stemming to reduce variants of the same word (eg plurals and other endings) to the same token.

In Java the Lucene libraries help a great deal here. Here's how to remove stop words using Lucene's StopFilter:

    // Tokenize the input, then run it through the standard filter and the English stop word filter.
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41,
            new StringReader("I've got a brand new combine harvester, and I'm giving you the key"));

    final StandardFilter standardFilter = new StandardFilter(Version.LUCENE_41, tokenizer);
    final StopFilter stopFilter = new StopFilter(Version.LUCENE_41, standardFilter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

    final CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    stopFilter.reset();
    while (stopFilter.incrementToken()) {
        final String token = charTermAttribute.toString();
        System.out.println("token: " + token);
    }

This will give you the following output:

token: I've
token: got
token: brand
token: new
token: combine
token: harvester
token: I'm
token: giving
token: you
token: key

Note that this assumes that the language is English; you'll have to find your own list of stop words for other languages. This example also uses the StandardFilter, which is also useful for tokenization - it recognises things like email addresses so that they are tokenized correctly.

Stemming can also be achieved with the help of Lucene, via the PorterStemmer:

    final PorterStemmer stemmer = new PorterStemmer();

    stemmer.setCurrent("weakness");

    stemmer.stem();

    final String current = stemmer.getCurrent();

    System.out.println("current: " + current);

This will print out:

    current: weak

Again this is for English only.

Some more ideas to clean up the data: removing @usernames, excessive punctuation!!! and characters repeated too many times (eg "cooool"). Armed with these I'll attempt my sentiment training again.
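To make that concrete, here's a minimal sketch of the kind of regex clean-up I have in mind - the patterns are illustrative assumptions rather than the exact ones I'll end up using:

public static String cleanTweet(String tweet) {
    return tweet
            .replaceAll("@\\w+", " ")           // strip @usernames
            .replaceAll("https?://\\S+", " ")   // strip links
            .replaceAll("([!?.])\\1+", "$1")    // collapse excessive punctuation!!!
            .replaceAll("(\\w)\\1{2,}", "$1$1") // "cooool" -> "cool"
            .replaceAll("\\s+", " ")            // tidy up whitespace
            .trim();
}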

Monitoring quotas on Google App Engine

One of my periodic chores with the Google App Engine is monitoring the quotas, particularly in my apps without billing enabled. Unfortunately Google provides no programmatic way of doing this, and it doesn't look likely that it will. There is a QuotaService, but that isn't well documented and only shows quota use during a request.

However, one can report on quota exceptions that occur using the LogService. With this it's possible to find all exceptions within the last hour, say, that involved an OverQuotaException, like so:

    final LogService logService = LogServiceFactory.getLogService();

    // Only interested in application logs at ERROR level or above...
    LogQuery query = LogQuery.Builder.withDefaults();
    query.includeAppLogs(true);
    query.minLogLevel(LogService.LogLevel.ERROR);

    // ...from the last hour.
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.MINUTE, -60);

    query.startTimeMillis(cal.getTimeInMillis());

    final Iterable<RequestLogs> requestLogsIterable = logService.fetch(query);

    int quotaFailures = 0;

    for (RequestLogs requestLog : requestLogsIterable) {

        LOGGER.info(requestLog.toString());

        // Count every log line that mentions an OverQuotaException.
        for (AppLogLine appLogLine : requestLog.getAppLogLines()) {

            if (appLogLine.getLogMessage().contains("OverQuotaException")) {
                quotaFailures++;
            }
        }
    }

I can use the total number of quota exceptions within the last hour to create a healthcheck servlet (sketched below), which can be queried by an automated monitor (I use ServerMojo to ping this URL once an hour).
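A minimal sketch of such a servlet might look like this - the class name and threshold are my own choices, and countOverQuotaExceptionsInLastHour() is a hypothetical wrapper around the LogService code above:

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class QuotaHealthCheckServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        int quotaFailures = countOverQuotaExceptionsInLastHour();

        if (quotaFailures > 0) {
            // Anything other than a 200 tells the external monitor that something is wrong.
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
                    quotaFailures + " OverQuotaException(s) in the last hour");
        } else {
            resp.setContentType("text/plain");
            resp.getWriter().println("OK");
        }
    }

    private int countOverQuotaExceptionsInLastHour() {
        // Hypothetical wrapper: run the LogService query shown above and return quotaFailures.
        return 0;
    }
}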

Of course, this doesn't warn you that you're about to go over quota, but it's given me a good handle on how the app fares over the course of a day.

One warning: LogService querying is subject to its own quota. During my early experiments I managed to get the date range wrong, and blew my LogService read quota in one hit! YMMV.

Links

I've started collecting useful links on all manner of subjects here. I hope you find these helpful. I will keep them up to date.

Objectify and Google Guice

I've been working over several Google App Engine Java apps recently to introduce Google Guice and Objectify to them. Guice is a lightweight dependency injection framework, and Objectify is a superb replacement for JDO/JPA in your Java GAE projects.

Google Guice lets you bind interfaces to implementations and annotate dependencies for injection, eg:

public interface MyService...

public class ClientCode {

    private MyService myService;

    @Inject
    public void setMyService(MyService myService) {
        this.myService = myService;
    }

}

If you're familiar with Spring then you'll find this a doddle. There's no XML in sight - Guice concentrates pretty much only on dependency injection, and the Java-based configuration classes one uses instead of XML seem perfectly adequate for this.
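For example, a minimal configuration module might look like this (a sketch; MyServiceImpl is a hypothetical implementation of MyService):

import com.google.inject.AbstractModule;

public class MyModule extends AbstractModule {

    @Override
    protected void configure() {
        // Bind the interface to a concrete implementation; Guice injects it wherever @Inject asks for MyService.
        bind(MyService.class).to(MyServiceImpl.class);
    }
}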

It also works nicely with Objectify, a data access API for the App Engine. Take a look at the examples; they are extremely straightforward:

@Entity
class Car {
    @Id String vin; // Can be Long, long, or String
    String color;
}

ofy().save().entity(new Car("123123", "red")).now();
Car c = ofy().load().type(Car.class).id("123123").get();
ofy().delete().entity(c);

There's an Objectify servlet filter, somewhat similar in purpose to open-session-in-view filters, which can easily be set up in a couple of lines in Guice.
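The couple of lines I mean are roughly these (a sketch assuming Objectify 4's ObjectifyFilter and the guice-servlet extension):

import com.google.inject.Singleton;
import com.google.inject.servlet.ServletModule;
import com.googlecode.objectify.ObjectifyFilter;

public class MyServletModule extends ServletModule {

    @Override
    protected void configureServlets() {
        // Let Guice manage the filter instance and run every request through it.
        bind(ObjectifyFilter.class).in(Singleton.class);
        filter("/*").through(ObjectifyFilter.class);
    }
}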

Moreover, I can now write pretty concise DAO and service classes that are easily testable, something I've been sorely missing.

The Google Prediction API

In a previous post I explored some sample sentiment training data available from Sanders. Now let's try using it in the Google Prediction API.

The API lets you upload a set of training data. It will then create a model which you can interrogate. Training data is stored in Google Cloud Storage, and the API is accessible via REST, secured by OAuth in the usual Google style.

To get a good idea of what's involved I recommend reading the Hello Prediction! tutorial. I pretty much followed their example, except instead of detecting the language I used it to detect sentiment.

I had to refine my aforementioned training data to be in a form suitable for the API. In this case that just means it has to be a CSV file like so:

"positive","I love the whole world and everything in it"
"negative","You guys suck"
"neutral","Cheese is a kind of dairy product"

After following the steps described in the tutorial I was then in a position to query the model. Here's the prediction for an actual example taken from the positive data set:

{
 "kind": "prediction#output",
 "id": "my_model_id",
 "selfLink": "https://www.googleapis.com/prediction/v1.5/trainedmodels/my_model_id/predict",
 "outputLabel": "positive",
 "outputMulti": [
  {
   "label": "positive",
   "score": 0.666667
  },
  {
   "label": "negative",
   "score": 0
  },
  {
   "label": "neutral",
   "score": 0.333333
  }
 ]
}

Note that it doesn't give a unanimous positive vote, although it clearly chooses positive as the most likely category. I suspect this is because there is a lot more neutral data in the training set than either positive or negative, so that there is always a tendency to treat things as neutral. This is a useful quality where borderline cases are involved.

The other thing worth noting is the suspicious looking 2/3 and 1/3 score values themselves. Playing around with different queries always shows this 1/3 to 2/3 split, never any other numbers. I don't know what the cause of this is.

I need to spend some more time with this model, and probably get some more training data. One thing I will say is that it's both easy to use and fast. In Java terms the google-api-java-client covers a lot of ground here. I will post more on developing with the Prediction API, and on how well it performs, in future posts.

Googomi

One of the great things about Google App Engine is that, if you stay inside the box, so to speak, many things are a doddle. So much so that I was able to create this new app, Googomi, in a day or two, most of which involved fiddling with and learning about the Google+ API.

The Googomi app is a very simple beast with only one purpose: it will take your public Google+ stream and turn it into an RSS feed.

I've put a modicum of processing into it, so that it should correctly guess the most appropriate title for each RSS item, eg choosing the annotation, or the remote URL's title, where appropriate.

I personally had a use case for this (apart from learning about various Google APIs) whereby I wanted to export Google+ posts to other services automatically. For example, with this I can post from Google+ to Buffer and then beyond automatically.

Google App Engine and the Google+ API

I've been playing with what the Google+ API has to offer and I've found it quite easy to integrate into my Google App Engine apps using the google-api-java-client.

I initially followed the Quick start for Java tutorial with regard to creating the OAuth tokens and so forth, but the google-api-java-client has some good tutorials regarding making the actual OAuth calls. See for example this section about how to make the calls from a Google App Engine app. The library handles all the plumbing for you.

I only had to make one amendment to their example. I found that the refresh token wasn't being returned along with the access token after it was granted. However, this was simply fixed by adding a call to setApprovalPrompt("force") on the GoogleAuthorizationCodeFlow.Builder, like so:

public static GoogleAuthorizationCodeFlow newFlow() throws IOException {
    return new GoogleAuthorizationCodeFlow.Builder(HTTP_TRANSPORT, JSON_FACTORY,
            getClientCredential(), Collections.singleton(PlusScopes.ME.getUri()))
            .setCredentialStore(new AppEngineCredentialStore())
            .setAccessType("offline")
            .setApprovalPrompt("force")
            .build();
}

Twitter Sentiment Data

I've been delving into some twitter sentiment analysis and have been casting about for some useful training data. I've found various sources but few have any neutral data, which I think is important for any training as a sort of control.

One useful source is Sanders Analytics, which provides a set of tweet ids and a script to download the actual tweets from those ids (Twitter's terms & conditions do not allow the tweets themselves to be distributed).

This script takes a couple of days to download all the tweets because it has to honour Twitter's API limits.

I found one issue in the script which is easily fixed. It could cope with the presence of "error" in the response, but not "errors", eg:

{"errors":[{"message":"Sorry, that page does not exist","code":34}]}

The simple fix is to add this to the parse_tweet_json function, after the error check:

if 'errors' in tweet_json:
    raise RuntimeError('errors in downloaded tweet')

When the script finishes it will produce a file called full-corpus.csv. Now the final data has this format:

"apple","positive","126360398885687296","Tue Oct 18 18:14:01 +0000 2011","a tweet of some sort"

That is, the subject, the sentiment, the tweet id, the date and the tweet content.

The subject is what the tweet is about. This is important: the sentiment refers to the subject (in this case "apple") and not to anything else in the tweet content.

Regardless, for my purposes I do actually need the tweet content without the subject. This can be simply achieved using grep and awk. Eg to extract the neutral tweets:

grep "\"neutral\"" full-corpus.csv | awk -F"\",\"" '{print $5}' | cut -d "\"" -f1

The output of this will just be the tweets themselves.

Updating to GAE 1.7.5

Today I updated a maven-based Google App Engine app from 1.7.4 to 1.7.5. As before, it didn't turn out as straightforward as I expected (maybe I should stop expecting this).

Once I'd installed 1.7.5 and set gae.version to 1.7.5 the build failed yet again - the issue this time boiled down to this error:

Could not find artifact net.kindleit:maven-gae-parent:pom:0.9.6-SNAPSHOT

As usual I turned to stackoverflow for help, where several others have had the same problem.

The key for me was to specify 1.7.5.1 for the GAE runtime version.

<gae.version>1.7.5</gae.version>
<gae-runtime.version>1.7.5.1</gae-runtime.version>

<dependency>
    <groupId>net.kindleit</groupId>
    <artifactId>gae-runtime</artifactId>
    <version>${gae-runtime.version}</version>
    <type>pom</type>
</dependency>   

<plugin>
    <groupId>net.kindleit</groupId>
    <artifactId>maven-gae-plugin</artifactId>
    <version>0.9.5</version>
    <dependencies>
        <dependency>
            <groupId>net.kindleit</groupId>
            <artifactId>gae-runtime</artifactId>
            <version>${gae-runtime.version}</version>
            <type>pom</type>
        </dependency>
    </dependencies>
</plugin>

I continue to use 1.7.5 for other dependencies, eg appengine-api-stubs. I have no idea about the whys and wherefores of this inconsistency, I'm afraid.

Facebook Apps in Heroku

A couple of years ago or so Heroku and Facebook teamed up to make creating Facebook apps a doddle. Indeed one can do so with a few clicks from the app creation centre in Facebook if you already have a Heroku account.

Here are pretty comprehensive instructions from Heroku on how to do this, and I can attest that it all works well.

I've added to this setup with a staging instance for team testing purposes using the facility Heroku has for managing different environments by pushing to different remotes. See this handy guide for full details.

To create a staging app with a remote called staging:

heroku create --remote staging

And to add Facebook app credentials for the staging version of your app just do:

heroku config:add FACEBOOK_APP_ID=123456 FACEBOOK_SECRET=789102323etc --remote staging

Managing Javascript Resources In Maven

One of the fiddly steps in setting up and maintaining a web app is managing all the various JavaScript libraries your pages use. But it's quite easy to manage resources like jQuery in Maven thanks to WebJars. Here's how to use it in Dropwizard.

If you take a look at WebJars you'll see all sorts of supported libraries. I'll use jQuery in this example.

Adding jQuery to Dropwizard

First add your jQuery dependency in your pom:

<dependency>
    <groupId>org.webjars</groupId>
    <artifactId>jquery</artifactId>
    <version>1.9.0</version>
</dependency>

Now add an AssetBundle in your Dropwizard service class:

@Override
public void initialize(Bootstrap<StreamWebAppConfiguration> bootstrap) {
    bootstrap.setName("webjars-demo");
    ... other assets ...
    bootstrap.addBundle(new AssetsBundle("/META-INF/resources/webjars", "/webjars"));
}

This will map the path "/webjars" to that jar resource - which will contain the jQuery js files in our example.

Now you can reference them in your HTML pages:

<script src="/webjars/jquery/1.9.0/jquery.min.js"></script>

And that's that. But you can go one step further. You can remove references to the version number in your pages by using the dropwizard-webjars-resource library.

Dropwizard Webjars Resource

To do this add another maven dependency:

<dependency>
    <groupId>com.bazaarvoice.dropwizard</groupId>
    <artifactId>dropwizard-webjars-resource</artifactId>
    <version>0.2.0</version>
</dependency>

In your service class remove the aforementioned AssetBundle and instead add a WebJarResource to your run method:

environment.addResource(new WebJarResource());

This will handle all the asset mapping (which is why an AssetBundle is no longer required). Now in your pages you can reference:

<script src="/webjars/jquery/jquery.min.js"></script>

i.e. without the version number. Simple! If you need to update your whole site to the next version of jQuery, just update the pom.

Streaming Twitter with Twitter4J

Twitter4J is an excellent Java library for all sorts of Twitter work. I've been using it recently to connect to the "garden hose", ie Twitter's streaming API. Here's how to follow a particular user with it.

You can load this into your project via Maven:

<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>3.0.3</version>
</dependency>
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-stream</artifactId>
  <version>3.0.3</version>
</dependency>

Now you can construct your TwitterStream class:

ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true)
.setOAuthConsumerKey("******************")
.setOAuthConsumerSecret("************************************")
.setOAuthAccessToken("************************************")
.setOAuthAccessTokenSecret("************************************");
TwitterStreamFactory twitterStreamFactory = new TwitterStreamFactory(cb.build());
TwitterStream twitterStream = twitterStreamFactory.getInstance();

Of course you'd put your own OAuth tokens etc. here.

To listen to a particular user you can use a FilterQuery object:

  FilterQuery filterQuery = new FilterQuery();
  filterQuery.follow(new long[] {1234567890L}); // placeholder - use the numeric id of the account you want to follow

The follow method takes an array of numeric user ids to follow.

To track the user you need to add a listener and attach this filter:

  twitterStream.addListener(new MyStatusListener());
  twitterStream.filter(filterQuery);

Now the MyStatusListener class merely implements StatusListener. The important method we implement here is onStatus. For our purposes we just print the statuses out:

  public void onStatus(Status status) {
          System.err.println("status: " + status);
  }

We don't need to do anything else in our StatusListener implementation for our current purpose.
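For completeness, a minimal MyStatusListener might look something like this (assuming Twitter4J 3.0.x, whose StatusListener interface also requires the no-op callbacks below):

import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;

public class MyStatusListener implements StatusListener {

    public void onStatus(Status status) {
        // Just print each incoming status for now.
        System.err.println("status: " + status);
    }

    public void onException(Exception ex) {
        ex.printStackTrace();
    }

    // The remaining callbacks are required by the interface but not needed for this example.
    public void onDeletionNotice(StatusDeletionNotice statusDeletionNotice) { }
    public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }
    public void onScrubGeo(long userId, long upToStatusId) { }
    public void onStallWarning(StallWarning warning) { }
}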

If you execute this code and let it run - Twitter4J will start a thread for you - you will see the results coming in, eg:

  status: StatusJSONImpl{createdAt=Mon Feb 18 15:51:26 GMT 2013, id=303532182444584960, text='RT @stephenfry: Oh no, ...
  status: StatusJSONImpl{createdAt=Mon Feb 18 15:51:30 GMT 2013, id=303532201159557123, text='RT @stephenfry: Oh no, ...
  etc...

This Week's Reading

This week has had some ups and downs development-wise. I've found myself stymied by serious differences between the local development and production versions of Google App Engine, in particular in the way backend services work. More on this in a later post.

In the meantime I've found some interesting articles on a subject I'm particularly interested in, the mapping and analysis of tweets!

Mapping Twitter sentiment is a bitch is an excellent post-mortem on the troubles that crop up when displaying twitter sentiment in maps. The video at the end is particularly good.

How I scraped and stored over 3 million tweets is a follow-up to the above post, which deals more with the architecture required to gather and store all those tweets.

Finally here's a somewhat tricksy crossword, where the clues are regular expressions...