Twitter data fun

29 November 2011

I made a map of my followers on Twitter. This is not entirely straightforward, as most Twitter users don't attach geo coordinates to their tweets or profiles. Luckily, many people leave something sensible in the location field of their profile (e.g. 'Amsterdam' or 'London, UK'). You can match this field against a Lucene index of all the cities in the world, which I happen to have. I was able to place 15 out of my grand total of 19 followers on the map.

Followers of @fzk:

Some time ago, someone asked me to go find out how it is possible to obtain substantial amounts of relevant social network data (from legit sources, for money). That sounded like fun, so I went ahead and Googled. My main source of interest for now is Twitter, because of two things: 1) you have to start somewhere and 2) it's real-time in nature, which I think is interesting. For example, when you are going on a nice weekend trip, you tweet about it when leaving, whereas the pictures only show up on Facebook on Monday morning. This could be an important difference for some use cases.

Twitter is a global data source. However, you usually use it to attack a local problem, like when shouting to everyone at the same conference where the beer is at or that your slides are online. This works well, as long as you know exactly who you want to reach or follow, using a hashtag or a user's screen name. As a company, you often try to attack harder problems, like listening to a group of people who are potentially interested in your products. Given the probability that you are not Google or Facebook, that group usually isn't made up of everybody on the planet with an internet connection. And your specific audience typically doesn't exactly come with its own hashtag. Now what?

First of all, you need to obtain Twitter data. One option is to go through the Twitter API, but this is limited. It's mostly meant for use by client applications for a specific user or, at best, harvesting search results through the streaming API. There is the Twitter sample stream, but that gives you just a random 1% of all tweets. Altogether, Twitter's own API is not meant for gathering a large volume of data. It has rate limiting and limits the number of concurrent things you can do from a single host.

If you need real data, there are apparently only two places you can go: Gnip and DataSift. Gnip advertises data streams from lots of different social media, whereas DataSift only has Twitter and one or two other offerings, yet both companies appear to emphasize Twitter. A quick comparison based on their websites reveals that DataSift is a lot more transparent, both in pricing and API, whereas Gnip really wants you to contact them in order to talk about the thing you already wanted to get online. DataSift is more of a platform that allows you to get data in real time. That said, both of them deliver a filtered Twitter stream based on a filter that you get to define yourself. DataSift has its own language for filtering. It allows you to create filters using this language and 'compile' them using their REST API. Once compiled, you can get streaming results using that filter, which works really nicely. I am guessing Gnip could work in a similar way, but it's not apparent from their website.

(Update! I have taken a better look at DataSift and think it is currently the best offering that exists for obtaining Twitter data. Also as a platform it's very impressive, allowing you to apply expensive filters with lots of predicates in real time. There is a nice post about them and their platform on the High Scalability blog.)

Now, here's the problem. I want to have all the Twitter messages by people that matter to me. But "people that matter to me" is not a predicate in the filter language. The things that you can filter on fall roughly into three categories: content properties, user metadata properties, reach properties.

These filtering options mainly allow you to filter based on properties of users or tweets (content, language, etc.). That's an obvious strategy, but I think there is not enough value in content alone, as long as you don't have the option to also use properties of the network graph as a means of filtering. The reach property that you can use is the closest thing here. It is based on the so-called Klout Score of a user. The Klout Score of a user is based on how many unique people that user reaches when shouting on the internet and how likely it is that those people will amplify the message. This is nice, because you can filter on influential people, but the problem remains that this is global, not local.

If you look at the map of my followers, you can see a pattern: the geography of my reach. Below I have the same map for my friend Age (@agemooij). His reach is far greater than mine (with 200+ followers), but also, the geo pattern for his reach is noticeably different.

Followers of @agemooij:

Of course the map is just there because people like maps. The real information is in the pie chart, which is basically a feature vector of the geographic reach (hover your mouse over the chart to see data). My method for extracting coordinates from location fields is far from sophisticated and will produce erratic results every now and then, so it is probably safer to ignore any country under 1% in the distribution. That said, Age has a portion of his followers in the US, India and UK. This may be a pattern that you're looking for. It would be nice to be able to filter based on users with a geographic reach similar to Age's. Also, it should be feasible to do this technically.
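To make the "feature vector" idea a bit more concrete, here is a minimal sketch in Python. It is not part of the twitterfun code. It assumes you already have one country code per located follower, and it uses cosine similarity as one possible way to compare two reach vectors; that choice, the function names and the sample data are all mine.

```python
from collections import Counter
from math import sqrt


def reach_vector(countries, min_share=0.01):
    """Turn a list of country codes into {country: share of followers}.

    Countries at or below min_share (1% by default) are dropped as noise.
    """
    counts = Counter(countries)
    total = sum(counts.values())
    if total == 0:
        return {}
    shares = {country: n / total for country, n in counts.items()}
    return {country: s for country, s in shares.items() if s > min_share}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample data, not real follower counts.
    mostly_local = reach_vector(["NL"] * 95 + ["BE"] * 5)
    more_global = reach_vector(["NL"] * 60 + ["US"] * 25 + ["IN"] * 15)
    print(cosine_similarity(mostly_local, more_global))
```

A filter on "geographic reach similar to Age's" could then be a threshold on this similarity, although picking a sensible threshold would take some experimenting.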

As a reference, I also created the map for the official, verified account of the prime minister of The Netherlands. His name is Mark. Mark has tens of thousands of followers, so I had to sample because of rate limiting, but I got 2000+ locations, so it is likely to be representative. You see Mark's geographic reach consists of only NL, after discarding every category of 1% or lower. Mark's Klout Score will likely be high, but if you want to look at content that crosses the Atlantic, you're better off with Age.

Followers of @MinPres:

While content-based filtering lets you easily do simple things, like getting all tweets that mention your company name or product name, it isn't very helpful if you want to look at sentiment amongst specific groups. It would be nice to have a filter that lets you define predicates like 'people with a mostly Western European network and a Klout Score > 30 and a noticeable political interest'. The first two predicates should be doable. Now the last part of that query is a hard one. Perhaps looking at the links that people share can give us some insight into interests. I will give it a try soon and if it works out to something visible, that'll be another post.

Making the map

On to making the map. All the code that you need is here: https://github.com/friso/twitterfun. So clone that. Building the map consists of three steps, one of which is manual (I did it only three times, so when I have to make another one, I'll automate the whole process).

  1. Extract location fields from a user's followers
  2. Turn locations into geo coordinates
  3. Make the HTML file with the map and pie chart

For step 1 I use a piece of Python, which is here. It takes two arguments: a screen name and a file name. The first is the screen name of the user of interest. The second is a file where the script will write the list of locations it found in the follower list. The script talks to the Twitter REST API for collecting the required data. It will first look up the user's follower IDs and then request the user profiles for each of the IDs. It looks up 30 users at a time, because the API method is limited to some number of users (at least 30) per request. For users with lots of followers, you will run into rate limiting before it finishes. The script will throw an error in that case, because of unexpected content in the response.
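The batching part of that script boils down to something like the sketch below. This is not the actual script from the repository. The `lookup_users` argument is a hypothetical stand-in for whatever function does the real API call, and I am assuming it returns one dict per profile with a `location` key.

```python
def chunks(items, size=30):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_locations(follower_ids, lookup_users, batch_size=30):
    """Collect the non-empty location fields for a list of follower IDs.

    lookup_users is a hypothetical callable: it takes a list of user IDs
    and returns a list of profile dicts (assumed to have a "location" key).
    """
    locations = []
    for batch in chunks(follower_ids, batch_size):
        for profile in lookup_users(batch):
            location = (profile.get("location") or "").strip()
            if location:
                locations.append(location)
    return locations


if __name__ == "__main__":
    # Fake lookup for demonstration only; no network calls are made.
    def fake_lookup(ids):
        return [{"location": "Amsterdam" if i % 2 == 0 else ""} for i in ids]

    print(len(collect_locations(list(range(65)), fake_lookup)))
```

Keeping the API call behind a plain function argument also makes it easy to add a sleep or a retry around it when rate limiting kicks in.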

Step 2 is a Java program. It has several command line options, as you can see by browsing the main method. You should run it like:

java [java opts] com.xebia.locations.finder.LocationFinder -i <index location> -f <input file> -v

The Java opts need to put all the required jars on the classpath. This is at the very least Lucene and Commons CLI, but there's probably more. You're better off creating a project in your IDE for it and running it from there, or adding a run thingie to the Maven pom.xml. The index location is the location of a Lucene index containing all the cities in the world and their population. I will show you in a minute how to get that. The input file is the file that came out of the Python script. The -v option tells the program to output records in JSON format, which comes in handy later on.

Next to the LocationFinder class, there is an IndexBuilder class. The nice people at MaxMind publish a database containing all the cities in the world and their population as a text file. The IndexBuilder class can read this file and turn it into a Lucene index at a specified location. Run like this:

java [java opts] com.xebia.locations.locationfinder.IndexBuilder -i <index output dir> -d <maxmind db file>

It will build the index in the directory specified by <index output dir> and use the MaxMind database file <maxmind db file>. It expects text, so extract the gzipped version.
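If you don't have gunzip at hand, extracting the file can also be done with a few lines of Python. This is just a convenience sketch; the file names in the usage example are placeholders, not the actual name of the MaxMind download.

```python
import gzip
import shutil


def gunzip(src_path, dst_path):
    """Decompress a gzipped file to dst_path without loading it all in memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    import sys

    # Usage (placeholder names): python gunzip.py cities.txt.gz cities.txt
    gunzip(sys.argv[1], sys.argv[2])
```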

The location finder tool tokenizes the location as entered by the user and does a search against the entire index for anything that has similarity to that location. It will fetch at most 350 results and then rank the results based on exact matches for city, country and, in the case of the US, also state. Additionally, it will rank the city with the largest population a bit higher as well. This makes sure that 'Amsterdam' matches Amsterdam in The Netherlands more than any of the four places named Amsterdam that exist in the US. It will choose the highest ranking result, if it ranks above a certain threshold.
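The re-ranking idea can be sketched in a few lines of Python. To be clear, this is an illustration and not a port of the Java code: the weights, the threshold, the logarithmic population boost and the comma-only tokenizer are all assumptions of mine, and the candidates are plain dicts instead of Lucene search results. The population figures are rough, made-up numbers.

```python
import math


def tokenize(location):
    """Simplified tokenizer: split on commas, so 'New York, NY' -> {'new york', 'ny'}."""
    return {part.strip().lower() for part in location.split(",") if part.strip()}


def score(candidate, tokens):
    """Score one candidate city against the tokens of a location field."""
    points = 0.0
    if candidate["city"].lower() in tokens:
        points += 2.0  # exact city match (assumed weight)
    country = candidate["country"].lower()
    if country in tokens:
        points += 1.0  # exact country match (assumed weight)
    state = (candidate.get("state") or "").lower()
    if country == "us" and state in tokens:
        points += 1.0  # state only counts for US cities
    # Small boost for bigger cities; the log keeps huge cities from dominating.
    points += 0.1 * math.log10(candidate.get("population", 0) + 1)
    return points


def best_match(location, candidates, threshold=2.0):
    """Return the highest scoring candidate, or None if it scores below threshold."""
    tokens = tokenize(location)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(c, tokens))
    return best if score(best, tokens) >= threshold else None


if __name__ == "__main__":
    candidates = [
        {"city": "Amsterdam", "country": "NL", "population": 750000},
        {"city": "Amsterdam", "country": "US", "state": "NY", "population": 18000},
    ]
    print(best_match("Amsterdam", candidates)["country"])      # the Dutch one wins
    print(best_match("Amsterdam, NY", candidates)["country"])  # the state match wins
```

With these weights, a location that matches no city name at all stays below the threshold and is simply left off the map.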

Step 3 is a manual step. The location finder outputs a bunch of JSON objects (one per line). Turn that list of objects into an array of objects by adding a comma at the end of each line and putting square brackets around it. Then paste the result into the HTML template over here, replacing the array that's already there in the "var locations". That gives you the desired HTML with a map and the pie chart. Also, Google Maps requires an API key these days, so you'll have to provide that at line 11 of the HTML file.
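The comma-and-brackets part is easy enough to script. Here is a minimal sketch in Python that turns the one-object-per-line output into a single JSON array. It makes no assumptions about the fields inside the objects; the file name in the usage line is a placeholder.

```python
import json


def json_lines_to_array(lines):
    """Parse one JSON object per line and return them as one JSON array string."""
    records = [json.loads(line) for line in lines if line.strip()]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    import sys

    # Usage (placeholder name): python to_array.py locations.json
    with open(sys.argv[1]) as f:
        print(json_lines_to_array(f))
```

The printed array can be pasted into the template as the value of "var locations". Parsing each line first also means a broken line fails loudly here instead of silently breaking the map.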

I am not going to do a line-by-line walk-through of the code in this text, because those are boring. You can check out and read the code for yourself. Enjoy!