Debian is a world-class Linux distribution. It is used on its own
for all sorts of applications (desktop, laptop, workstation, handheld,
server, etc.) and serves as the foundation for many wonderful
projects ((U|K|X)buntu, Maemo, etc.). Personally, I run Debian on
my laptop as well as my servers. So when I went to see about
setting up a little ad-hoc cluster, I was rather disappointed.
Though there are a few clustering tools available, as well as several
distributed filesystems (GFS, GlusterFS, OCFS2, and Lustre),
shockingly, I could not find any implementation of MapReduce
available in the Debian repositories. For those who might not know,
MapReduce is a novel data-processing system developed by Google for
internal use and described in their paper,
"MapReduce: Simplified Data Processing on Large Clusters".
For the enlightened out there, it should be clear that the name and
mechanism are derived from Lisp’s map and reduce functions. In
any case, though Google’s implementation is proprietary, several
implementations based on their paper have appeared, written in
and geared toward a variety of programming languages.
Unfortunately, none of these are available in the Debian
repositories. In all fairness, Debian does include
CouchDB, which uses map and reduce
functions for generating views. However, it’s not a solution aimed
at sorting and processing huge amounts of data, though it is an
interesting and capable piece of software. So, to try and get
things moving, I have filed three Debian RFPs (Request For Package)
for a few separate MapReduce implementations.
- Hadoop
  - Probably the most well-known of the Free/Open Source
    implementations. Includes a distributed filesystem (HDFS), a
    scalable distributed database (HBase) and tools to get you going
    from start to finish. Hadoop is written in Java, though it can
    interoperate with other languages
    ([Scala](http://scala-blogs.org/2008/09/scalable-language-and-scalable.html),
    too). It's a top-level project of the
    [Apache Software Foundation](http://www.apache.org/) and licensed
    under the
    [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0.html).
  - [http://hadoop.apache.org](http://hadoop.apache.org/)
- Skynet
  - A MapReduce implementation written in Ruby. It’s designed to be fault-tolerant and distributed, just like the big boys. Originally written for use at Geni.com and licensed under the MIT License.
  - http://skynet.rubyforge.org/
- Disco
  - Though the implementation itself is written in Erlang, thus providing excellent distributed fault-tolerance, Disco jobs can be written in Python. It was developed as an in-house tool for rapid data analysis at Nokia and they seem to be quite keen on it. Disco is licensed under a modified BSD License. Page at http://discoproject.org/ and code at http://github.com/tuulos/disco/tree/master
OK, there might be a few objections to my choices. Why did I leave out neat projects like GridGain, FileMap and BashReduce? Well, for starters, GridGain is another Java implementation that doesn’t seem (at least to me) to have the same momentum Hadoop does. FileMap and BashReduce, while novel, useful and fascinating, are not designed for use in networked environments and are therefore unsuitable for cluster situations. So then why not MapSharp? Well, primarily because of all the Debian Mono debates going on right now (Gnome’s fail!). I’ve done work in C# and it’s got some neat features, but cool stuff does not and will not protect users from patent litigation.

Also, it seems like those RFPs have some mistakes, so if anyone figures out how to edit them, let me know so I can clean them up.
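
For anyone who hasn't met the model before, here is a toy, single-process sketch in Python of the map, group and reduce steps that MapReduce is built around, using the classic word-count example. It is only meant to show the idea. It does nothing distributed or fault-tolerant, it is not how Hadoop, Skynet or Disco are implemented, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own inventions for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, the step a real framework
    performs between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all the values collected for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of strings (hypothetical driver)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog the end"]))
```

In a real cluster the map calls would run on many machines at once, each over its own slice of the input, and the grouping step would move data across the network so that every key ends up at a single reducer. That plumbing is exactly what the packages requested above would provide.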