Distributed Computing

As industry and academia alike have seen an increasing need to work with massive datasets -- far too big to fit on a single computer -- they have faced two choices: either invest in massive super-computers, or learn how to network lots of standard computers ("commodity hardware") to work in concert.

In 2004, Google published a paper describing its framework for coordinating commodity hardware (referred to as MapReduce), which was quickly developed into the now-ubiquitous open-source Hadoop platform. Given that buying lots of commodity hardware is much cheaper than building equivalent super-computers, Hadoop quickly became an industry standard and is now extremely widely used, and services like Amazon Web Services make setting up a Hadoop cluster (a set of computers connected into a Hadoop system) exceedingly straightforward.

Is Distributed Computing for Me?

Don't jump into Distributed Computing blindly assuming more computers means faster execution, as Distributed Computing has two shortcomings (aside from requiring you to spend time learning a new skill).

First, there are huge overhead costs (in terms of computer processing time) to distributing your data and computations across lots of computers -- data has to be shuffled between machines, code for allocating parts of the problem across machines has to be run, etc. As a result, a single machine with the same number of processing cores as a distributed cluster will almost always do a job faster.

Second, not everything can be parallelized. Before you jump into Distributed Computing, make sure whatever task you want to undertake can easily be broken into small pieces and parallelized across the computers you plan to employ.

But with that said, if you can't fit your data on a single computer, or you really want to use lots and lots of processors and can't fit that many in a single computer (and are confident your code will allow you to take advantage of all those processors!), then distributed computing is the place to go.

Specific Tools

Hadoop and Pig

Hadoop is the original open-source platform for this kind of distributed computing. Lots of services will set up a Hadoop cluster for you with little effort, making it relatively easy to use.

With that said, you probably don't want to use Hadoop itself, but rather a program like Pig. Pig is a high-level language that sits on top of Hadoop. Pig basically accepts SQL-like queries (written in its own language, Pig Latin), then converts those into the more complicated instructions needed to break up your program and distribute it across lots of computers. If you don't use Pig, you need to learn how to write MapReduce code, which is almost certainly not worth your effort.

Spark

Spark is basically an attempt to create the next generation of Hadoop (indeed, it can run on top of Hadoop infrastructure like HDFS and YARN). The idea is basically the following:

When you send a command to Hadoop, each computer in the cluster loads some data from its hard drive and processes it; the results from all the computers are then aggregated, and the output is written back to the various hard drives.

This is fine for many things, but if you're doing a set of computations where the result of each computation serves as the input to the next (like a Markov Chain Monte Carlo simulation or a numerical optimization), the fact that Hadoop keeps writing results back to hard drives and then has to read them from hard drives all over again for the next iteration can REALLY slow down operations. Spark was designed to let users specify when results should be kept in RAM rather than moved back to the hard drive, so they can be quickly accessed for the next operation; this can speed up operations by 100-fold in some situations.
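
To make the idea concrete, here is a minimal sketch of in-memory caching using Spark's Scala API (e.g. typed into spark-shell). The file path, column name, and the toy iterative loop are made up for illustration; the point is that .cache() keeps the data in RAM, so only the first pass pays the disk-read cost and later passes reuse the cached copy.

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder().appName("caching-sketch").getOrCreate()

 // Load the data once and ask Spark to keep it in RAM across operations.
 val points = spark.read.parquet("hdfs:///data/points.parquet").cache()

 // An iterative computation: each pass depends on the previous result,
 // and each pass reuses the cached data instead of re-reading it from disk.
 var threshold = 0.0
 for (i <- 1 to 10) {
   val above = points.filter(s"value > $threshold").count()
   threshold += above / 1e6
 }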

As with Hadoop, Spark has a tool for writing your data transformations in SQL, called Spark SQL (its predecessor was called "Shark"). If you're going to work with Spark, you probably want to work with Spark SQL.
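
For example, a Spark SQL session in Scala looks roughly like the sketch below (it assumes the spark session object provided by spark-shell; the file path, table name, and column names are made up for illustration).

 // Read a dataset and expose it to SQL under a table name.
 val sales = spark.read.option("header", "true").csv("hdfs:///data/sales.csv")
 sales.createOrReplaceTempView("sales")

 // Write the transformation as ordinary SQL; Spark plans and distributes
 // the work across the cluster for you.
 val totals = spark.sql("SELECT region, COUNT(*) AS n_orders FROM sales GROUP BY region")
 totals.show()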

GraphX

One of the places RAM caching really shines is in network analysis algorithms, since most of them iterate over the same graph many times. GraphX is a Spark library for network analysis on distributed platforms. If you're going to do network analytics on a distributed platform, it's probably the tool to use!
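
As a rough sketch of what that looks like, the Scala snippet below builds a toy graph and runs one of GraphX's built-in algorithms (PageRank). The vertices and edges are made-up data, and it assumes the sc SparkContext provided by spark-shell.

 import org.apache.spark.graphx.{Edge, Graph}

 // A toy graph: vertex IDs with names, and directed edges between them.
 val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol"), (4L, "dave")))
 val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(4L, 1L, 1)))
 val graph = Graph(vertices, edges)

 // Run a built-in, distributed network algorithm -- here, PageRank.
 val ranks = graph.pageRank(0.001).vertices
 ranks.collect().foreach(println)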