Nothing at all to do with computers, this time.
Some people at LRUG asked about the circus show I'm in. So, here are the details.
Posted by digitalronin at 12:00 pm
[UPDATE] There seems to be a bug in Snow Leopard where changes made to the AirPort interface's MTU do not persist - the value just jumps back to 1500 immediately. To make the change stick, you need to do this in a terminal window;
$ sudo ifconfig en1 mtu 576
(Assuming en1 is your AirPort interface, and 576 is the MTU value you want)
Sadly, the change will not persist through a reboot, so you'll need to run the command whenever you boot up your Mac.
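Since the command has to be re-run after every boot, it may be worth wrapping it in a small script. What follows is only a sketch in Python: the function names and the range check are my own assumptions (68 is the smallest datagram size IPv4 requires every hop to forward, 1500 is the standard Ethernet value mentioned above), and the command it builds is the same ifconfig call shown above.

```python
import subprocess

MIN_MTU = 68     # assumption: smallest MTU that IPv4 allows
MAX_MTU = 1500   # the "Standard" Ethernet value


def mtu_command(interface, mtu):
    """Build the ifconfig command for the given interface and MTU."""
    if not MIN_MTU <= mtu <= MAX_MTU:
        raise ValueError("MTU %d is outside %d..%d" % (mtu, MIN_MTU, MAX_MTU))
    return ["sudo", "ifconfig", interface, "mtu", str(mtu)]


def apply_mtu(interface="en1", mtu=576):
    """Run the command; raises CalledProcessError if ifconfig fails."""
    subprocess.run(mtu_command(interface, mtu), check=True)
```

Calling apply_mtu() with no arguments runs the same "sudo ifconfig en1 mtu 576" as above, so it will prompt for your password.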
I'm not a big fan of being spied on, so I've signed up to IPredator, the VPN service run by the people who brought us The Pirate Bay.
Setting it up on OS X is pretty trivial, and I haven't noticed much, if any, difference in network speeds. But, I did find myself unable to access certain websites. Notably my Basecamp site, which is quite inconvenient. I've got HTTPS turned on, and when using my VPN connection, I couldn't load the login page - it failed as soon as it tried to do the TLS handshake. Turn off the VPN, and everything is just fine. Some other sites behaved in a similar way - even some sites that weren't using HTTPS (although digging into the page content suggests that it is HTTPS elements on the page that were causing the browser to hang).
37Signals support, sadly, were no help in this case. But, I did find this page which suggested changing the MTU (Maximum Transmission Unit) setting of my TCP/IP connection.
That page talks about how to do this on Windo$e, but I found that the problem went away if I made the corresponding change on OS X. So, I'm posting this here in case anyone else finds it helpful.
First of all, in "Network Preferences", select your main network connection (I'm using wired ethernet), and click "Advanced".
In the advanced options, click "Ethernet" and look for the MTU setting. Change it from "Standard (1500)" to "Custom", and enter a lower value. In my case, I'm using 576.
Click "OK", click "Apply" and then disconnect and reconnect your VPN connection.
That's it. I can now access my Basecamp site via my IPredator VPN connection. I hope this helps someone else.
Have a happy & private Internet!
Posted by digitalronin at 8:36 pm
Packt publishing asked me to review "Zabbix 1.8 Network Monitoring" (Disclosure: They gave me a copy of the book. Other than that, I have not been compensated for this review in any way).
I've been using Zabbix for a few years to monitor a variety of systems. I like it for its small (initial) learning curve and the out-of-the-box graphs and reports. But, the kindest word I can use to describe the provided documentation is "minimal". So, I was pleased to see a book that tries to address this problem.
At just under 400 pages, there's a lot of material. The book takes a tutorial approach, going from installation through to fairly advanced monitoring ideas including distributed monitoring with Zabbix proxies.
The book's target audience is sysadmins (aka "devops") and developers who have to do their own monitoring. So, the level of assumed technical ability is quite high. As usual with technical books, this can lead to over-explanation of some steps and under-explanation in others. But, I found the tone and level of detail struck quite a nice balance. YMMV, of course.
As well as covering the detail of the steps you need to take to set up Zabbix and start monitoring, there are good explanations of the high-level architecture of how the various Zabbix components fit together and communicate, which I found very helpful.
The author, Rihards Olups, works with Zabbix SIA (who develop Zabbix), and his in-depth knowledge of the system is obvious throughout the book.
For example, I was pleased to see a section about digging into the underlying database Zabbix uses to store configuration and historical values. Zabbix's PHP front-end is functional, but can be a bit quirky. IMO, it's ripe for replacement/augmentation with something a bit slicker - maybe a Rack application using some Flash-based graphing tools. Having a guide through the database structure will make it a lot easier to create something like that.
Also nice are some tips about using the Unix command-line tools to help troubleshoot monitoring problems, e.g. using telnet to connect to the zabbix-agent process to confirm it's listening correctly for connections from the server. This is basic stuff, but a worrying number of developers (and even some sysadmins) I've met don't seem to have much awareness of the toolbox Unix provides.
The section on how to upgrade Zabbix and patch its database is also a nice inclusion. A quick glance at the Zabbix support forums shows that that is often a problem area, so I like the fact that this has been addressed upfront. Using Zabbix to monitor its own health is also covered, and the section contains lots of advice that would have saved me at least one weekend spent rebuilding my monitoring from scratch when an earlier version of Zabbix decided to eat its database, one day!
As you can probably tell, I like this book a lot, which is just as well since it appears to be the only book on Zabbix available right now.
There are a few nitpicks, perhaps. The step-by-step tutorial approach makes the book far more suited to being read through rather than used as a reference, but since this is the first book available (AFAIK), I think that's probably a good choice. I would have liked to see Ubuntu covered in the installation section (the author covers compilation/installation on SUSE and Slackware), although the book's target audience shouldn't have any trouble adapting the instructions for their distro of choice. The writing and grammar are a little quirky, and a glossary would have been useful. But, these are really minor points.
Overall, I think the book has a lot to offer anyone who is using, or thinking of using Zabbix as their monitoring solution. If I hadn't been given a copy, I'd probably buy it.
Posted by digitalronin at 11:01 am
I've just written a couple of simple Ruby classes to help dump a MySQL database without locking tables for extended periods.
In particular, some of my tables contain lots of date-based data which is not changed after the day it was created. The classes can be told to dump such data one day at a time, and won't dump a day that has already been dumped.
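To give a flavour of the day-at-a-time idea, here is a rough sketch. It is written in Python rather than Ruby and is not the code from the repository: the function names, the "created_at" style date column and the choice of mysqldump options are all assumptions made for illustration. Note that --single-transaction only avoids locking for transactional tables such as InnoDB.

```python
from datetime import date, timedelta


def days_to_dump(first_day, last_day, already_dumped):
    """Yield each day from first_day to last_day that has no dump yet."""
    day = first_day
    while day <= last_day:
        if day not in already_dumped:
            yield day
        day += timedelta(days=1)


def dump_command(database, table, date_column, day):
    """Build a mysqldump command that exports only the rows for one day."""
    next_day = day + timedelta(days=1)
    where = "%s >= '%s' AND %s < '%s'" % (
        date_column, day.isoformat(), date_column, next_day.isoformat())
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot, no long table locks (InnoDB)
        "--no-create-info",      # rows only, no CREATE TABLE statement
        "--where=" + where,
        database,
        table,
    ]
```

Feeding the days from days_to_dump into dump_command gives one short dump per missing day, instead of one long dump that holds up the whole table.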
Check it out on github;
I hope people find it useful.
Posted by digitalronin at 3:59 pm
Scaling an application usually involves adding processing nodes. This means you end up with valuable data (e.g. server logs) existing on multiple different machines. Very often, we want to mine those logs for useful information.
One way to do this would be to put all the logs in one place and run some kind of computation over all the data. This is relatively simple, but it really doesn't scale. Pretty soon, you reach the point where the machine which is analysing a day's worth of log data is taking more than a day to do it.
Another way is to let each processing node analyse its own logs, and then have some way to collate the analysis from each node to find the overall answer you're looking for. This is a more practical solution for the long term, but it still has a couple of problems;
1. A processing node should be processing - i.e. doing the task for which it was created. Ideally, you wouldn't want to bog it down with other responsibilities. If you do, you probably need to make the node more powerful than it needs to be in order to carry out its primary task, so you're paying for extra capacity just in case you want to run a log analysis task.
2. In the case of log data in particular, keeping the logs on the node which created them generally means you have to keep the node around too. This makes it awkward to remove nodes if you want to scale down, or replace nodes with better nodes, because you have to worry about copying the logs off the node and keeping them somewhere.
It would be nice if we could have each node push its logs into something like Amazon S3 for storage, and spin up a distributed computing task whenever we want to run some analysis. Amazon Elastic Map Reduce (EMR) is designed to work in exactly this way, but the learning curve for writing map/reduce job flows is pretty steep - particularly if you're used to writing simple scripts to get useful information out of log data.
As of October 1st 2009, Amazon EMR supports Apache Hive, which makes things a lot easier.
What is Hive?
The proper answer is here.
The way I think of Hive is that it lets you pretend that a whole mess of semi-structured log files are actually big database tables, and then helps you run SQL-like queries over those tables. All this without having to actually insert the data into any kind of table, and without having to know how to write distributed map/reduce tasks.
Using Hive with Amazon EMR
This is a very basic introduction to working with Hive on Amazon EMR. Very basic because I've only just started looking into this myself.
You will need to be signed up for Amazon Web Services, including S3 and Elastic Map Reduce.
I'm going to go through part of an exercise from the Cloudera Introduction to Hive, which I strongly recommend working through. That training exercise uses a Cloudera VMware virtual appliance running Hive. Here is how I did a similar task using Hive on Amazon EMR.
For this exercise, we're going to take a data file consisting of words and the frequency of occurrence of those words within the complete works of William Shakespeare. The file consists of a number of lines like this;
The first value is an integer saying how many times the word occurs, then a tab character, then the word. This file is generated by an earlier exercise in one of the Cloudera Hadoop tutorials. If you don't feel like running through those exercises, just generate a file containing a bunch of numbers and words, separated by a tab character, and use that.
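If you want to generate a stand-in file yourself, something like this sketch would do. It is Python, not part of the Cloudera exercise, and the function names and the crude word-splitting rule are my own assumptions; all that matters is that each output line is a count, a tab and a word.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def to_tsv_lines(counts):
    """Format the counts as 'frequency<TAB>word' lines, sorted by word."""
    return ["%d\t%s" % (freq, word) for word, freq in sorted(counts.items())]


if __name__ == "__main__":
    for line in to_tsv_lines(word_frequencies("To be, or not to be")):
        print(line)
```

Run against "To be, or not to be", that prints four lines: 2 and "be", 1 and "not", 1 and "or", 2 and "to", each pair separated by a tab.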
Upload the data to S3
Before we can analyse the data, we need it to be available in S3. Create a bucket called "hive-emr", and upload your data file into it using the key "tables/wordfreqs/whatever". In my case, I have the tab-delimited text file in;
NB: The S3 path "hive-emr/tables/wordfreqs" is going to be our Hive table. If you're unfamiliar with S3, "hive-emr" is the name of our bucket, and 'tables/wordfreqs/shakespeare.txt' is the key whose value is the contents of our "shakespeare.txt" file.
Everything in the 'directory' "tables/wordfreqs/" (which isn't really a directory, but we can pretend it is) must be parseable as data for our table, so don't put any other types of file in there. You could, if you wanted, have more than one tab-delimited text file though, and all of the data in all of those files would become records in the same Hive table.
It's also important not to have any underscores in the S3 bucket or key. S3 will happily let you create and upload files to buckets/keys with underscores, but you'll get an S3 URI error when you try to create the table in Hive.
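A quick check before uploading can save a wasted cluster start. This is just a sketch in Python with a made-up function name; it encodes nothing more than the underscore rule described above and builds the location string that Hive will be given later.

```python
def hive_location(bucket, prefix):
    """Return the s3:// location for a Hive table, rejecting underscores."""
    for part in (bucket, prefix):
        if "_" in part:
            raise ValueError("underscore in %r will upset Hive" % part)
    return "s3://%s/%s" % (bucket, prefix.strip("/"))


if __name__ == "__main__":
    print(hive_location("hive-emr", "tables/wordfreqs"))
    # prints s3://hive-emr/tables/wordfreqs
```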
I'm using s3sync to upload the data files, but you can use anything you want provided you get the data into S3 with the correct bucket and key name.
Generating an EC2 Key Pair
We need a key pair to enable us to SSH onto our Hive cluster, when we've started it. If you don't have a suitable key pair already, sign in to the Amazon Web Services console and go to the Amazon EC2 tab. Near the bottom of the left-hand column, use the "Key Pairs" function to generate a key pair and save the secret key to your local machine.
Be aware of the "Region" you're using - key pairs will only work for servers of the same region. I'm using "EU-West", but it doesn't matter which you use, as long as you're consistent.
Sign in to the Amazon Web Services console and go to the Amazon Elastic MapReduce tab (you won't see the tab if you haven't signed up to the service, so make sure you do that first).
Click "Create New Job Flow". Make sure you're using the same region you used when you generated your key pair.
Give the job flow a name, and select "Hive Program" for the job type.
On the next screen, choose "Start an Interactive Hive Session".
On the next screen, we choose the number and size of the machines we want to comprise our cluster. In real life use, using a lot of big machines will make things go faster. For the purpose of this exercise, one small instance will do. We're not doing anything heavyweight here, and we only have one data file, so there isn't much point spending the extra money to run lots of large machines.
Select the key pair you generated earlier, and start the job flow. Don't forget to terminate the job flow when you've finished, otherwise you'll be paying to keep an idle cluster going.
Now we have to wait for the cluster to start up and reach the point where it's ready to do some work. This usually takes a few minutes.
When the job flow status is "WAITING", click on the job flow and scroll down in the lower pane to get the "Master Public DNS Name" assigned to your cluster so that we can SSH to it.
From a terminal window, ssh onto your cluster like this;
ssh -i key/hive.pem hadoop@ec2-79-125-30-42.eu-west-1.compute.amazonaws.com
Replace key/hive.pem with the location and filename of the secret key you created and saved earlier.
Replace "ec2-79-125-30-42.eu-west-1.compute.amazonaws.com" with the Master Public DNS Name of your cluster. The username 'hadoop' is required.
You should now have a terminal prompt like this;
Type "hive" to get to the hive console. This is an interactive shell that works in a similar way to the "mysql" command-line client.
Creating a table
We're almost ready to start querying our data. First, we have to tell Hive where it is, and what kind of data is contained in our file.
Type these lines into the hive shell;
hive> create external table wordfreqs (freq int, word string)
> row format delimited fields terminated by '\t'
> stored as textfile
> location 's3://hive-emr/tables/wordfreqs';
Time taken: 1.29 seconds
Note that we didn't need to put "shakespeare.txt" as part of the location. Hive will look at the location we gave it and, provided all the "files" in that "directory" have the right kind of contents (lines consisting of an integer, a tab character and a string), all of their contents will be accessible in the 'wordfreqs' table.
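To make the "every file in the directory becomes rows in one table" idea concrete, here is a toy model of it in Python. It is only an illustration of the concept, not how Hive is implemented, and the function names are invented. In particular, this sketch raises an error on a malformed line, which is not necessarily what Hive itself does.

```python
def parse_row(line):
    """Split one 'freq<TAB>word' line into an (int, str) pair."""
    freq, word = line.rstrip("\n").split("\t", 1)
    return int(freq), word


def load_table(files):
    """Treat every file under the table location as rows of one table.

    'files' maps a file name to that file's text contents.
    """
    rows = []
    for name in sorted(files):
        for line in files[name].splitlines():
            if line:
                rows.append(parse_row(line))
    return rows
```

Two files under the same location simply give you one longer list of rows, which is why stray files of another type in that "directory" cause trouble.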
Now that we've told Hive how to find and parse our data, we can start asking questions in almost the same way as we would do if it were in a MySQL table.
hive> select * from wordfreqs limit 5;
Time taken: 4.868 seconds
So far, so good - even though that's a long time to take for a very simple query. Let's try something a little more interesting;
hive> select count(word) from wordfreqs;
Here is the output I got from this;
Total MapReduce jobs = 1
Number of reduce tasks determined at compile time: 1
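In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>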
Starting Job = job_200911121319_0001, Tracking URL = http://ip-10-227-111-150.eu-west-1.compute.internal:9100/jobdetails.jsp?jobid=job_200911121319_0001
Kill Command = /home/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=ip-10-227-111-150.eu-west-1.compute.internal:9001 -kill job_200911121319_0001
2009-11-12 01:37:33,004 map = 0%, reduce =0%
2009-11-12 01:37:46,653 map = 50%, reduce =0%
2009-11-12 01:37:47,665 map = 100%, reduce =0%
2009-11-12 01:37:55,709 map = 100%, reduce =100%
Ended Job = job_200911121319_0001
Time taken: 28.455 seconds
All of that is Hive translating our SQL-like query into MapReduce jobs that are then farmed out to our cluster. Since we're using a single, small instance, and since we only have one data file, there isn't any parallelisation happening, and the whole thing runs quite slowly. But, in principle, we could have terabytes of data files in our S3 bucket, and be using more and much larger machines in our cluster. Under those circumstances, we should see major gains from using Hive.
FYI, the reason the first query didn't have this kind of output is that Hive is smart enough to figure out that no MapReduce trickery is necessary for this request - it can just read a few lines from the file to satisfy the query.
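For anyone wondering what a count looks like as a map/reduce job, here is a toy version in Python. It is a sketch of the general idea only, with invented function names, and says nothing about the plan Hive actually generates: each "map" step counts the rows in its own chunk of the data, and a single "reduce" step adds up the partial counts.

```python
def map_count(lines):
    """Map step: count the rows in one chunk that have a word column."""
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t", 1)
        # Assumption: a row with no second column stands in for a NULL word,
        # which count(word) would skip.
        if len(fields) == 2:
            count += 1
    return count


def reduce_count(partial_counts):
    """Reduce step: add up the per-chunk counts."""
    return sum(partial_counts)


if __name__ == "__main__":
    chunks = [["25\tthe\n", "3\tking\n"], ["1\tzounds\n"]]
    print(reduce_count(map_count(chunk) for chunk in chunks))  # prints 3
```

With one small file there is only one chunk, so all the job start-up overhead buys nothing. With many large files, the map steps can run on different machines at the same time.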
This has been a very quick and simple introduction to Hive on Amazon EMR. I hope you found it useful. I plan to go into more advanced, and hopefully more useful, territory in future posts.
PS: Don't forget to terminate your Hive cluster!
Posted by digitalronin at 2:33 pm
I often find myself working on a quick hack, using a local git repository. Eventually, whatever I've hacked up becomes something I need to keep, and I want it in my remote git repository, with the local copy tracking the remote.
After setting this up manually several times, I finally got around to scripting it.
Assuming you've got a local git repo called 'myproject', and your current working directory is something like '/home/me/projects/myproject', then running this script will create a directory called 'myproject.git' on your remote git server, push your code to the remote repo and set the local copy to track it.
Don't forget to edit the script first to set the correct server name and main git directory, below which all your projects live.
The script assumes you're using SSH as the transport layer for git.
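In outline, the steps are: create a bare repository on the server, add it as a remote, then push and set up tracking. Here is a sketch of those steps in Python. It is not the script itself: the server name, the git directory, the remote name "origin" and the branch name are placeholder assumptions, and it only builds the commands so you can inspect them before running anything.

```python
import os
import shlex


def publish_commands(local_dir, server, git_root, branch="master"):
    """Return the commands that publish a local repo to a remote server."""
    name = os.path.basename(os.path.normpath(local_dir))
    remote_path = "%s/%s.git" % (git_root.rstrip("/"), name)
    return [
        # 1. create an empty bare repository on the server, over SSH
        ["ssh", server, "git init --bare " + shlex.quote(remote_path)],
        # 2. point the local repo at it
        ["git", "remote", "add", "origin", "ssh://%s%s" % (server, remote_path)],
        # 3. push, and make the local branch track the remote one
        ["git", "push", "--set-upstream", "origin", branch],
    ]


if __name__ == "__main__":
    for cmd in publish_commands(os.getcwd(), "git.example.com", "/srv/git"):
        print(" ".join(cmd))
```

To actually run them, pass each command to subprocess.run(cmd, check=True) from inside the project directory.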
Now that "Hello, world" is out of the way, let's look at the next step in writing our log processing script. We want to be able to read lines from standard input.
OCaml has an "input_line" function, which takes a channel as a parameter. Standard input is available without doing any extra work as the channel 'stdin'. So, to read a line of text from standard input, we just need to call;
input_line stdin
In OCaml, parameters to functions are not enclosed in parentheses. There are plenty of places you do need to use parentheses, but surrounding parameters is not one of them.
To do something useful with our line of text, we'll need to assign it to a variable. OCaml uses the "let" keyword for that, but we'll also need to declare a scope for our variable, using "in". So, the code we want is something like this;
let line = input_line stdin in
... a block of code ...
To read all the lines of text from standard input, until we run out, we'll need a loop of some kind. OCaml does allow us to write code in an imperative style, so we can just use a while loop. While loops are pretty basic in OCaml (and in functional languages in general), because you're meant to do much cleverer things with recursion.
Our loop will need to terminate when we run out of lines to read. The simplest way to do that in OCaml is to catch the "End_of_file" exception. I'm not a big fan of using exceptions for normal control flow, but we can live with it for now.
So, a simple program to read lines from standard input and echo them to standard output might look like this;
try
  while true do
    let line = input_line stdin in
    Printf.printf "%s\n" line
  done;
  None
with
  End_of_file -> None
There are a few points to note here. The semi-colon after "done" is necessary to tell OCaml that it should evaluate everything before the semi-colon first, and then evaluate the stuff after it. Without the semi-colon, you'll get a syntax error. It needs to be ";" and not ";;" because we're not terminating a block of code.
We're using "End_of_file -> None" to discard the exception we get when "input_line" tries to read a line that isn't there. "None" is a bit like "nil" in Ruby or "undef" in Perl.
The "None" at the end of the block is required to keep the return type consistent. OCaml, like Perl or Ruby, returns whatever is the last thing evaluated in the block. OCaml requires that the try block return the same type of value as we will return if we catch an exception and end up in the with block. If you try running the code without the "None" before with, you'll get an error saying "This expression has type 'a option but is here used with type unit" (OCaml error messages are translated from French, so they're a little idiosyncratic).
The type "unit" is the type of expressions that don't return anything useful, like void in Java, and it is what our while loop returns. Our with block is returning "None", so its return type is 'a option, and the try block must return the same type.
If we change the with to say;
End_of_file -> "whatever"
Then the error becomes "This expression has type string but is here used with type 'a option". So, we can make it go away by replacing the earlier None with any string constant (like "hello" - try it).
The last thing we're going to do is to take our inline "Printf.printf" statement and turn it into a function call, so that we can do something more interesting with line later.
In OCaml, functions are values we can assign to variables. So, to define a function, we use the same let statement as we used to define line. Here's a function to print out our line;
let out = Printf.printf "%s\n";;
Notice that we terminated the statement without specifying what is supposed to be printed. If you type the code above into the interactive ocaml interpreter, you get this;
# let out = Printf.printf "%s\n";;
val out : string -> unit = <fun>
That's saying "the value out is a function which takes a single string and doesn't return anything". OCaml decided we were defining a function because we didn't specify all the arguments. If we had, it would have simply evaluated it and assigned the result to 'out'.
Now, we can simplify our program a little;
let out = Printf.printf "%s\n";;
try
  while true do
    let line = input_line stdin in
    out line
  done;
  None
with
  End_of_file -> None
Try running the program like this "ls | ocaml foo.ml", or by compiling it as shown in part 1.
So far, we haven't done anything very useful overall, but we've covered reading from standard input and writing to standard output, looping over all the available input, assigning each line to a variable and calling a function with that variable.
In part 3, we'll actually do something!