Getting started with Hadoop and Pentaho

These days all the main Hadoop vendors provide VMs ready to go for testing and development – very convenient!

However, a downside of these VMs is that they tend to require a LOT of resources. My laptop only has 8 GB of RAM, and that is the most it can take! So I downloaded the Cloudera 5.3 VM and tried running it with 4 GB, but no chance.

But there is another way… The VM above is stuffed to the gunwales with every Hadoop service under the sun. It is a demo, after all! The solution is just to install locally (which also removes the overhead of a VM, small though that is) using this guide:

You can literally get away with installing just the basic packages and Hive, and away you go. And the download is MUCH smaller – we’re talking about 300 MB vs 3.7 GB for the VM!

Unfortunately, it doesn’t quite stop there. You’ll already know that you need to set the correct “active hadoop configuration” in this file:

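If you end up doing this on more than one machine, the edit is easy to script. Here is a minimal sketch in Python, assuming the file is a plain Java-style properties file with simple key=value lines. The function name, the path and the key in the usage line are placeholders of mine, not anything PDI ships, so substitute the real ones from your install.

```python
from pathlib import Path


def set_property(path, key, value):
    """Set key=value in a simple properties file, leaving other lines alone.

    Returns True if an existing entry was updated, False if the entry
    had to be appended. Only plain `key=value` lines are understood.
    """
    lines = Path(path).read_text().splitlines()
    updated = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith(("#", "!")):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            updated = True
    if not updated:
        # The key was not present, so add it at the end.
        lines.append(f"{key}={value}")
    Path(path).write_text("\n".join(lines) + "\n")
    return updated
```

You would call it along the lines of set_property("path/to/the/file", "the.property.key", "cdh51"), where both the path and the key are stand-ins for the real ones.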

But that won’t be the end of it. The out-of-the-box cdh51 shim expects a secure, kerberised install, which is not the case here. So you have to make one more change: in the configuration file for the cdh51 shim, comment out all the lines containing:

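Commenting lines out by hand is fine for a one-off, but here is a rough Python sketch of the same step in case you want to repeat it. It assumes the shim file uses # for comments, which is my assumption rather than something checked against every shim. The function name and the marker text are hypothetical, so pass in whatever string the lines you need to disable actually contain.

```python
from pathlib import Path


def comment_out_lines(path, marker, prefix="#"):
    """Comment out every active line containing `marker`.

    Lines that are already commented are left untouched. Returns the
    number of lines changed. A `.bak` copy of the original is written
    before the file is modified.
    """
    original = Path(path).read_text()
    lines = original.splitlines()
    changed = 0
    for i, line in enumerate(lines):
        if marker in line and not line.lstrip().startswith(prefix):
            lines[i] = prefix + line
            changed += 1
    if changed:
        # Keep a backup so the shipped config can be restored later.
        Path(str(path) + ".bak").write_text(original)
        Path(path).write_text("\n".join(lines) + "\n")
    return changed
```

Running it a second time changes nothing, so it is safe to leave in a setup script.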

Then, finally, PDI will start. What is confusing is that the shim configuration already has NO AUTH specified. But the error message does tell you, in a very verbose way, that you’re connecting a kerberised class with a non-kerberised one. And it’s a fatal error: Kitchen/Spoon won’t even start. No doubt in future versions we’ll be able to:

  1. Mess with the config without PDI utterly crashing
  2. Connect to multiple different Hadoop clusters, with different shims, at the same time!

Well, hopefully!

