Incremental dump of a mysql database

I've just written a couple of simple ruby classes to help dump a mysql database without locking tables for extended periods.

In particular, some of my tables contain lots of date-based data which is not changed after the day it was created. The classes can be told to dump such data one day at a time, and won't dump a day that has already been dumped.
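The gist of the incremental logic is simple: keep a record of which days have already been dumped, and skip them on the next run. Here's a minimal sketch of the idea - the class and method names are hypothetical, not the actual API from the project;

```ruby
require "date"
require "set"

# Tracks which days have already been dumped, so no day is dumped twice.
class IncrementalDumper
  def initialize
    @dumped = Set.new
  end

  # Returns the subset of days that still need dumping, oldest first,
  # and records them as dumped.
  def days_to_dump(days)
    pending = days.reject { |d| @dumped.include?(d) }.sort
    pending.each { |d| @dumped.add(d) }
    pending
  end
end

dumper = IncrementalDumper.new
today = Date.new(2009, 11, 12)

# First run: both of the previous two days need dumping.
first_run = dumper.days_to_dump([today - 2, today - 1])

# Next run: only the newly-arrived day is dumped.
second_run = dumper.days_to_dump([today - 2, today - 1, today])
```

In the real classes, the "already dumped" record would of course be persisted between runs rather than held in memory.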

Check it out on github;


I hope people find it useful.



Introduction to Hive on Amazon Elastic Map Reduce


Scaling an application usually involves adding processing nodes. This means you end up with valuable data (e.g. server logs) existing on multiple different machines. Very often, we want to mine those logs for useful information.

One way to do this would be to put all the logs in one place and run some kind of computation over all the data. This is relatively simple, but it really doesn't scale. Pretty soon, you reach the point where the machine which is analysing a day's worth of log data is taking more than a day to do it.

Another way is to let each processing node analyse its own logs, and then have some way to collate the analysis from each node to find the overall answer you're looking for. This is a more practical solution for the long term, but it still has a couple of problems;

1. A processing node should be processing - i.e. doing the task for which it was created. Ideally, you wouldn't want to bog it down with other responsibilities. If you do, you probably need to make the node more powerful than it needs to be in order to carry out its primary task, so you're paying for extra capacity just in case you want to run a log analysis task.

2. In the case of log data in particular, keeping the logs on the node which created them generally means you have to keep the node around too. This makes it awkward to remove nodes if you want to scale down, or replace nodes with better nodes, because you have to worry about copying the logs off the node and keeping them somewhere.

It would be nice if we could have each node push its logs into something like Amazon S3 for storage, and spin up a distributed computing task whenever we want to run some analysis. Amazon Elastic Map Reduce (EMR) is designed to work in exactly this way, but the learning curve for writing map/reduce job flows is pretty steep - particularly if you're used to writing simple scripts to get useful information out of log data.

As of October 1st 2009, Amazon EMR supports Apache Hive, which makes things a lot easier.

What is Hive?

The proper answer is here.

The way I think of Hive is that it lets you pretend that a whole mess of semi-structured log files are actually big database tables, and then helps you run SQL-like queries over those tables. All this without having to actually insert the data into any kind of table, and without having to know how to write distributed map/reduce tasks.

Using Hive with Amazon EMR

This is a very basic introduction to working with Hive on Amazon EMR. Very basic because I've only just started looking into this myself.

You will need to be signed up for Amazon Web Services, including S3 and Elastic Map Reduce.

I'm going to go through part of an exercise from the Cloudera Introduction to Hive, which I strongly recommend working through. That training exercise uses a Cloudera VMWare virtual appliance running Hive. Here is how I did a similar task using Hive on Amazon EMR.

For this exercise, we're going to take a data file consisting of words and the frequency of occurrence of those words within the complete works of William Shakespeare. The file consists of a number of lines like this;

25848   the
23031 I
19671 and
18038 to
16700 of
14170 a
12702 you
11297 my
10797 in

The first value is an integer saying how many times the word occurs, then a tab character, then the word. This file is generated by an earlier exercise in one of the Cloudera Hadoop tutorials. If you don't feel like running through those exercises, just generate a file containing a bunch of numbers and words, separated by a tab character, and use that.
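For example, a few lines of Ruby will produce a usable stand-in file (the words and counts here are invented, not real Shakespeare frequencies);

```ruby
# Write a small tab-delimited word-frequency file like the one above.
freqs = [[25848, "the"], [23031, "I"], [19671, "and"], [18038, "to"]]

File.open("shakespeare.txt", "w") do |f|
  freqs.each { |count, word| f.puts "#{count}\t#{word}" }
end
```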

Upload the data to S3

Before we can analyse the data, we need it to be available in S3. Create a bucket called "hive-emr", and upload your data file into it using the key "tables/wordfreqs/whatever". In my case, I have the tab-delimited text file in;

s3://hive-emr/tables/wordfreqs/shakespeare.txt
NB: The S3 path "hive-emr/tables/wordfreqs" is going to be our Hive table. If you're unfamiliar with S3, "hive-emr" is the name of our bucket, and 'tables/wordfreqs/shakespeare.txt' is the key whose value is the contents of our "shakespeare.txt" file.

Everything in the 'directory' "tables/wordfreqs/" (which isn't really a directory, but we can pretend it is) must be parseable as data for our table, so don't put any other types of file in there. You could, if you wanted, have more than one tab-delimited text file though, and all of the data in all of those files would become records in the same Hive table.

It's also important not to have any underscores in the S3 bucket or key. S3 will happily let you create and upload files to buckets/keys with underscores, but you'll get an S3 URI error when you try to create the table in Hive.

I'm using s3sync to upload the data files, but you can use anything you want provided you get the data into S3 with the correct bucket and key name.

Generating an EC2 Key Pair

We need a key pair to enable us to SSH onto our Hive cluster, when we've started it. If you don't have a suitable key pair already, sign in to the Amazon Web Services console and go to the Amazon EC2 tab. Near the bottom of the left-hand column, use the "Key Pairs" function to generate a key pair and save the secret key to your local machine.

Be aware of the "Region" you're using - key pairs will only work for servers of the same region. I'm using "EU-West", but it doesn't matter which you use, as long as you're consistent.

Starting Hive

Sign in to the Amazon Web Services console and go to the Amazon Elastic MapReduce tab (you won't see the tab if you haven't signed up to the service, so make sure you do that first).

Click "Create New Job Flow". Make sure you're using the same region you used when you generated your key pair.

Give the job flow a name, and select "Hive Program" for the job type.

On the next screen, choose "Start an Interactive Hive Session".

On the next screen, we choose the number and size of the machines we want to comprise our cluster. In real life use, using a lot of big machines will make things go faster. For the purpose of this exercise, one small instance will do. We're not doing anything heavyweight here, and we only have one data file, so there isn't much point spending the extra money to run lots of large machines.

Select the key pair you generated earlier, and start the job flow. Don't forget to terminate the job flow when you've finished, otherwise you'll be paying to keep an idle cluster going.

Now we have to wait for the cluster to start up and reach the point where it's ready to do some work. This usually takes a few minutes.

When the job flow status is "WAITING", click on the job flow and scroll down in the lower pane to get the "Master Public DNS Name" assigned to your cluster so that we can SSH to it.

From a terminal window, ssh onto your cluster like this;

ssh -i key/hive.pem hadoop@ec2-79-125-30-42.eu-west-1.compute.amazonaws.com

Replace key/hive.pem with the location and filename of the secret key you created and saved earlier.

Replace "ec2-79-125-30-42.eu-west-1.compute.amazonaws.com" with the Master Public DNS Name of your cluster. The username 'hadoop' is required.

You should now have a terminal prompt on the master node of your cluster.

Type "hive" to get to the hive console. This is an interactive shell that works in a similar way to the "mysql" command-line client.

Creating a table

We're almost ready to start querying our data. First, we have to tell Hive where it is, and what kind of data is contained in our file.

Type these lines into the hive shell;

hive> create external table wordfreqs (freq int, word string)
> row format delimited fields terminated by '\t'
> stored as textfile
> location 's3://hive-emr/tables/wordfreqs';
Time taken: 1.29 seconds

Note that we didn't need to put "shakespeare.txt" as part of the location. Hive will look at the location we gave it and, provided all the "files" in that "directory" have the right kind of contents (lines consisting of an integer, a tab character and a string), all of their contents will be accessible in the 'wordfreqs' table.

Now that we've told Hive how to find and parse our data, we can start asking questions in almost the same way as we would do if it were in a mysql table.

Try this;

hive> select * from wordfreqs limit 5;
25848 the
23031 I
19671 and
18038 to
16700 of
Time taken: 4.868 seconds

So far, so good - even though that's a long time to take for a very simple query. Let's try something a little more interesting;

hive> select count(word) from wordfreqs;

Here is the output I got from this;

Total MapReduce jobs = 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_200911121319_0001, Tracking URL = http://ip-10-227-111-150.eu-west-1.compute.internal:9100/jobdetails.jsp?jobid=job_200911121319_0001
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=ip-10-227-111-150.eu-west-1.compute.internal:9001 -kill job_200911121319_0001
2009-11-12 01:37:33,004 map = 0%, reduce =0%
2009-11-12 01:37:46,653 map = 50%, reduce =0%
2009-11-12 01:37:47,665 map = 100%, reduce =0%
2009-11-12 01:37:55,709 map = 100%, reduce =100%
Ended Job = job_200911121319_0001
Time taken: 28.455 seconds

All of that is Hive translating our SQL-like query into MapReduce jobs that are then farmed out to our cluster. Since we're using a single, small instance, and since we only have one data file, there isn't any parallelisation happening, and the whole thing runs quite slowly. But, in principle, we could have terabytes of data files in our S3 bucket, and be using more and much larger machines in our cluster. Under those circumstances, we should see major gains from using Hive.

FYI, the reason the first query didn't have this kind of output is that Hive is smart enough to figure out that no MapReduce trickery is necessary for this request - it can just read a few lines from the file to satisfy the query.

This has been a very quick and simple introduction to Hive on Amazon EMR. I hope you found it useful. I plan to go into more advanced, and hopefully more useful, territory in future posts.

PS: Don't forget to terminate your Hive cluster!


Setting up a remote git project

I often find myself working on a quick hack, using a local git repository. Eventually, whatever I've hacked up becomes something I need to keep, and I want it in my remote git repository, with the local copy tracking the remote.

After setting this up manually several times, I finally got around to scripting it.


Assuming you've got a local git repo called 'myproject', and your current working directory is something like '/home/me/projects/myproject', then running this script will create a directory called 'myproject.git' on your remote git server, push your code to the remote repo and set the local copy to track it.

Don't forget to edit the script first to set the correct server name and main git directory, below which all your projects live.

The script assumes you're using SSH as the transport layer for git.
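The idea behind the script can be sketched in a few lines of shell. The server name ("gitserver") and base path ("/var/git") here are hypothetical placeholders - edit them for your setup. In this sketch the commands are echoed rather than executed, so you can see what would run; drop the echos to run them for real;

```shell
#!/bin/sh
# Hypothetical values - change these to match your git server.
GIT_SERVER=gitserver
GIT_ROOT=/var/git

# Name the remote repo after the current directory, e.g. "myproject".
PROJECT=$(basename "$PWD")

# Create a bare repo on the server, push to it, and track it locally.
echo ssh "$GIT_SERVER" "git init --bare $GIT_ROOT/$PROJECT.git"
echo git remote add origin "$GIT_SERVER:$GIT_ROOT/$PROJECT.git"
echo git push -u origin master
```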


OCaml for the impatient - part 2, reading standard input

Now that "Hello, world" is out of the way, let's look at the next step in writing our log processing script. We want to be able to read lines from standard input.
OCaml has an "input_line" function, which takes a channel as a parameter. Standard input is available without doing any extra work as the channel 'stdin'. So, to read a line of text from standard input, we just need to call;

input_line stdin

In OCaml, parameters to functions are not enclosed with braces. There are plenty of places you do need to use braces, but surrounding parameters is not one of them.
To do something useful with our line of text, we'll need to assign it to a variable. OCaml uses the "let" keyword for that, but we'll also need to declare a scope for our variable, using "in". So, the code we want is something like this;

let line = input_line stdin in
... a block of code ...

To read all the lines of text from standard input, until we run out, we'll need a loop of some kind. OCaml does allow us to write code in an imperative style, so we can just use a while loop. While loops are pretty basic in OCaml (and in functional languages in general), because you're meant to do much cleverer things with recursion.
Our loop will need to terminate when we run out of lines to read. The simplest way to do that in OCaml is to catch the "End_of_file" exception. I'm not a big fan of using exceptions for normal control flow, but we can live with it for now.
So, a simple program to read lines from standard input and echo them to standard output might look like this;

try
  while true do
    let line = input_line stdin in
    Printf.printf "%s\n" line
  done;
  None
with
  End_of_file -> None

There are a few points to note here. The semi-colon after "done" is necessary to tell OCaml that it should evaluate everything before the semi-colon first, and then evaluate the stuff after it. Without the semi-colon, you'll get a syntax error. It needs to be ";" and not ";;" because we're not terminating a block of code.
We're using "End_of_file -> None" to discard the exception we get when "input_line" tries to read a line that isn't there. "None" is a bit like "nil" in Ruby or "undef" in Perl.
The "None" at the end of the block is required to keep the return type consistent. OCaml, like Perl or Ruby, returns whatever is the last thing evaluated in the block. OCaml requires that the try block return the same type of value as we will return if we catch an exception and end up in the with block. If you try running the code without the "None" before with, you'll get an error saying "This expression has type 'a option but is here used with type unit" (OCaml error messages are translated from French, so they're a little idiosyncratic).
The type "unit" is the empty type, like void in Java. Our with block is returning "None", so it's return type is unit, and the try block must return the same type.
If we change the with to say;

End_of_file -> "whatever"

Then the error becomes "This expression has type string but is here used with type 'a option". So, we can make it go away by replacing the earlier None with any string constant (like "hello" - try it).
The last thing we're going to do is to take our inline "Printf.printf" statement and turn it into a function call, so that we can do something more interesting with line later.
In OCaml, functions are values we can assign to variables. So, to define a function, we use the same let statement as we used to define line. Here's a function to print out our line;

let out = Printf.printf "%s\n";;

Notice that we terminated the statement without specifying what is supposed to be printed. If you type the code above into the interactive ocaml interpreter, you get this;

# let out = Printf.printf "%s\n";;
val out : string -> unit = <fun>

That's saying "the value out is a function which takes a single string and doesn't return anything". OCaml decided we were defining a function because we didn't specify all the arguments. If we had, it would have simply evaluated it and assigned the result to 'out'.
Now, we can simplify our program a little;

let out = Printf.printf "%s\n";;

try
  while true do
    let line = input_line stdin in
    out line
  done;
  None
with
  End_of_file -> None

Try running the program like this "ls | ocaml foo.ml", or by compiling it as shown in part 1.
So far, we haven't done anything very useful overall, but we've covered reading from standard input and writing to standard output, looping over all the available input, assigning each line to a variable and calling a function with that variable.
In part 3, we'll actually do something!


OCaml for the impatient - part 1, "Hello, world!"

After an inspiring presentation by Tom Stuart at LRUG earlier this week, I decided to have a go at learning a functional programming language. I picked OCaml, partly on recommendation from Tom, although he's since changed his advice to recommend Clojure because of its "smart and comprehensive treatment of mutable state".

I found quite a few introductions to OCaml, but many of them seem quite theoretical and a bit dry. Plus, I'm really impatient. I tend to learn best by just diving in and trying to perform a real-world task. So, I decided to start by writing a basic filter to take apache access log lines and spit out some fields in CSV form.

I know this is not necessarily the best task for OCaml, and I'm sure my coding style is missing the point and doing many things the 'wrong' way. But, at least it's a practical way to get my hands dirty with the language.

First of all, let's do the traditional "Hello, world!" exercise;

1. Installing OCaml

Lots of instructions for various platforms here;


I used the INRIA binary on Mac OS X.

2. Hello World

A basic hello world program in OCaml. Using a text editor (preferably vim, but I've heard that other programs sort of work), put the following into a file called "hello.ml"

Printf.printf "Hello, world!\n";;

Some points to note;

1. The 'printf' function comes from the 'Printf' module. This module is available to all programs, so there's no need to do anything special to gain access to it.

2. ";;" denotes the end of a chunk of code.

You can run this code interactively using the "ocaml toplevel"

$ ocaml
Objective Caml version 3.10.1

# Printf.printf "Hello, world!\n";;   <--- type this and press enter
Hello, world!
- : unit = ()
#                                     <--- Ctrl-D to exit
$

To run the program from the command-line there are several options.

Option 1

$ ocaml hello.ml

This is the easiest way, using the ocaml interpreter.

Option 2

$ ocamlc hello.ml -o hello
$ ./hello

This creates an executable "hello" file of OCaml bytecode, which should run on any machine with the OCaml bytecode interpreter, "ocamlrun", installed. It also creates a "hello.cmo" and a "hello.cmi" file, which are OCaml object and interface files, respectively. These seem to be an intermediate step, since "hello" runs just fine if you delete them.

Option 3

$ ocamlopt hello.ml -o hello
$ ./hello

This time, "hello" is a compiled binary which can run standalone. There will also be the "hello.cmi" file, as well as "hello.cmx" (OCaml *native* object file), and "hello.o" (object file for your OS).

The interactive toplevel is great for noodling around, and option 1 is what I spend most of my time doing. Option 3 is what I will use if and when I write something that I want to use in production.

So, that's "Hello, world" out of the way. In the next part, we'll look at reading from standard input and writing to standard output.

OCaml for the impatient - part 2, reading standard input


Quick DB check script without loading Rails

I use Zabbix to monitor my servers. It works by having a central zabbix server make calls to zabbix-agent processes on the monitored servers. The zabbix-agent can invoke a user-created script on the server and return whatever one-line output the script provides.

I needed a quick script to check the server to see if a set of database records exist, matching a given set of conditions. In this case, I'm checking to see if some log data representing yesterday's web traffic has been imported correctly, on the monitored server.

To start with, I used something like this;

#!/usr/bin/env ruby

ENV['RAILS_ENV'] = 'production'

require File.dirname(__FILE__) + '/../config/environment'

day = 1.day.ago.to_date
hostname = `hostname`.chomp

puts LogEntry.all(:conditions => {:day => day, :hostname => hostname}, :limit => 1).size

There could be a whole lot of log_entries records for yesterday, for this hostname. All I really want is to confirm that the importer script ran OK, so checking for the existence of one record that matches these criteria is sufficient and much faster than doing an aggregate query like a count.

This script outputs a '1' if everything is OK, and a '0' if not.

Hooking up a zabbix agent check to the script is easy enough, but whenever the server tries to call it, the check times out. The problem is that loading the entire Rails stack takes too long. Besides, it's overkill when all I want to do is run a single database query.

So, with help from here I changed it to this;

#!/usr/bin/env ruby

RAILS_ROOT = File.dirname(__FILE__) + '/..'
require 'rubygems'
require RAILS_ROOT + '/vendor/rails/activerecord/lib/active_record'
yaml_file = RAILS_ROOT + '/config/database.yml'
db = YAML::load(File.read(yaml_file))['production']
ActiveRecord::Base.establish_connection(db)

class LogEntry < ActiveRecord::Base; end

hostname = `hostname`.chomp
yesterday = 1.day.ago.to_date

puts LogEntry.all(:conditions => {:day => yesterday, :hostname => hostname}, :limit => 1).size

If you've got Rails installed as a gem on your server, you don't need to jump through all the 'RAILS_ROOT' hoops. But, I vendor everything, so I need to tell ruby where to find active_record.

Now the script is fast enough to respond to zabbix checks. If I needed it to be even faster, I could use lower-level ruby database code. But, this is quick enough, and importing ActiveRecord gives us the "1.day.ago" stuff too, keeping the code clean.


Setting up Ubuntu 9.04 for Ruby on Rails development

I've just installed Ubuntu 9.04 "Jaunty Jackalope", and so far I'm really impressed. It's slick and fast, with some really nice little touches (e.g. my laptop wakes up when I open the lid - I used to have to hit the power button to wake it up from suspend).

Here are the steps I went through to set it up for Rails development;

  • Launch a terminal and become root
  • aptitude install build-essential
This gives you a c compiler and a bunch of other stuff essential to allow you to build other packages (hence the name)
  • aptitude install git-core
I use git for version control. Of course, there are a bunch of subversion and other packages available too. The version of git packaged in Ubuntu 9.04 is good enough for me.
  • aptitude install phpunit php5-curl
I do some PHP development too. You can ignore this and the php mysql package later on, if you don't.
  • aptitude install ttf-mscorefonts-installer
This makes fonts look a lot nicer in firefox.
  • aptitude install rails vim-rails ruby1.8-dev
This will install rails 2.1.0. A later version would be nicer, but all my projects have their own version of rails in the vendor directory, so this is good enough.

I use gvim as my editor. You can skip the vim-rails if you don't.

The "ruby1.8-dev" package is important. Without this, you'll get failures loading "mkmf" whenever you try to install gems.
  • aptitude install mysql-server-5.0 php5-mysql mysql-client-5.0 libmysqlclient15-dev
Ubuntu 9.04 comes with mysql 5.1, but a lot of my projects are still using 5.0, so I want to stick to that for now.

"libmysqlclient15-dev" is required when we want to switch to a native mysql library for ruby. If you don't do this, you'll get deprecation warnings because the non-native, pure ruby library is going to be dropped soon.
  • gem install ruby-debug hpricot mongrel
So far, those have been the only gems I needed to install to get to the point where my specs would run.

Happy installing, and please add a comment if you've got any suggestions.


Private web browsing via SOCKS proxy

So, Big Brother is now requiring ISPs to keep a log of all your emails, internet phone calls and web browsing activity. This is, of course, to keep us safe from terrorists. How we stay safe from Big Brother is less clear.

Anyway, with this in mind, here is a quick guide to using a SOCKS proxy to keep your web browsing private by routing all your activity via an encrypted ssh tunnel to a server elsewhere.

When you browse the internet normally, e.g. to www.google.com, connections go like this;

Your machine -> www.google.com

Your ISP provides the connection in between, so they can log everything for those lovely government people. When you use a SOCKS proxy, it works like this;

Your machine (1)-> Your proxy (2)-> www.google.com

Connection (1) still goes via your ISP, but it's only a connection from your machine to your proxy, so they can't log where you're browsing to (in this case, www.google.com). All they can see is that you made a connection from your machine to your proxy. The connection is via SSH, so they have no way of knowing what information is travelling up and down the pipe.

Connection (2) is from your proxy to www.google.com Whichever ISP connects your proxy to the Internet can log your browsing activity. This is why it's important to have your proxy in a country where they won't do that.

Let's get started. You will need;

  • A server, located somewhere with a less Stalinist government. I use a virtual server in the US, which is quite ironic, really. This machine needs to be running an SSH server.
  • You need to be able to ssh onto this server from the machine you want to browse from. I've tried this from Mac OSX and Linux machines, where it's easy enough. If you're unlucky enough to be using Windows, it's probably possible to do this via cygwin.

  • Optionally, the "SwitchProxy" add-on for Firefox makes it easy to switch from browsing via your SSH tunnel to browsing normally.

First of all, you need to have things set up so that you can access your remote server via SSH. i.e. you can start a terminal window and type "ssh yourserver" and you've got a terminal session on your remote server. If you can't do this, you're screwed, and Big Brother knows about all the nasty websites you visit.

Now, open up a terminal and type;

ssh -CND 9999 yourserver

"-N" tells SSH that we don't want to run any commands on the remote box (such as a login shell).

"-D 9999" says "listen on local port 9999 and forward whatever connections come in to that port, using the appropriate protocol, via the remote server. So, if you use local port 9999 to try to connect to www.google.com on port 80, you actually end up connecting to your remote server via SSH and then *from there* to port 80 on www.google.com

Currently, the -D option to SSH only supports running as a SOCKS proxy, but since that's all we need, that's fine.

"-C" compresses all data sent over the SSH tunnel. Browsing via a remote proxy server is a bit slower than browsing direct, so compression may help speed things up a bit.

You will need to have a terminal window open, running this command, whenever you browse the net. There are clever ways to make this automatic, but I prefer to keep things simple.
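Once the tunnel is up, you can check it from another terminal with curl, independently of your browser settings (assuming you have curl installed);

```shell
# Fetch a page through the local SOCKS proxy. Using --socks5-hostname
# means DNS lookups also happen at the remote end, so your ISP doesn't
# see them either.
curl --socks5-hostname localhost:9999 http://www.google.com/
```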

Now we just need to tell our web browser to connect via our new SOCKS proxy, instead of going straight to the target website.

In Firefox, use the Edit menu and select;

Edit -> Preferences -> Advanced -> Network -> Settings

Configure a SOCKS host on localhost, port 9999. You can use more or less any port you like, but any port number below 1024 can only be forwarded by root, so it's easier to use a higher number. Whatever number you used in the "ssh -CND ..." command is the port you need to put here.

For any sites where you don't want to use the SOCKS proxy, add them to the "No proxy for" section. For example, if you use BBC iPlayer, it won't work via a US host, so you need to not use the proxy for bbc.co.uk

There are similar options for other browsers such as Safari and, if you really have to use it, Internet Explorer.

Once you've configured your proxy settings, you're good to go. Try browsing to a website and it should be going via your SSH tunnel. To confirm, try closing the terminal window and refreshing the page - you should see an error saying, "Proxy Server Refused Connection". Restart the tunnel and refresh again and it should work.

The SwitchProxy add-on for Firefox makes it really easy to switch proxy configurations - hence the name - via a new entry in your "Tools" menu.

Happy private web-browsing.


Redundancy using Nginx for failover

I've got a website running on two servers, for redundancy, with an nginx load-balancer in front of them. If one server goes down, all requests will be sent to the other using this stanza in the nginx config;

upstream webservers {
server; # web1
server; # web2
}

This uses a round-robin system to send alternate requests to alternate servers. That's great for spreading the load around, but not exactly what I want.

The webservers are not heavily loaded, but they do run some quite complex reports against the database, the results of which are cached on the webserver's local filesystem. The cache is expired periodically.

Using a round-robin algorithm, we end up calculating each report twice per period - i.e. once on each webserver. What I really want is for all requests to go to web1, so that it's using its cache as efficiently as possible. web2 should only start receiving traffic if web1 goes down.

I don't know of a way to make nginx do exactly this, but you can use the 'weight' parameter to fake it, like this;

upstream webservers {
server weight=100000; # web1
server; # web2
}

Now, only one in every 100,000 web requests will go to web2, so the cache on web1 is being utilised to a much greater extent. If web1 goes down, web2 gets all the traffic.
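Depending on your nginx version, there may also be a cleaner way: the "backup" parameter marks an upstream server that only receives requests when the non-backup servers are unavailable. I haven't tested this against the setup described here, and the server addresses below are placeholders, so treat it as a pointer rather than a verified config;

```nginx
upstream webservers {
server 10.0.0.1; # web1
server 10.0.0.2 backup; # web2 - only used if web1 is down
}
```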

Redundancy plus efficient caching.


Rewired State

I went to the Rewired State hack day yesterday.

A brilliant time was had by all. Photos here, video here.

Our hack was What's On Their Minds? (if that URL doesn't work, try this link).