Getting started with Hadoop and Pentaho

These days all the main Hadoop vendors provide VMs ready to go for testing and development – very convenient!

However, a downside of these VMs is that they tend to require a LOT of resources. My laptop only has 8GB of RAM, and that is all it can have!  So I downloaded the Cloudera 5.3 VM and tried it with 4GB, but no chance.

But there is another way…  The VM above is stuffed to the gunnels with every Hadoop service under the sun. It is a demo, after all!  The solution is just to install locally (hence also removing the guff of a VM, small though that is) using this guide:

You can literally get away with installing the basic packages plus Hive, and away you go.  And the download is MUCH smaller – we’re talking about 300MB vs 3.7GB for the VM!

Unfortunately it doesn’t quite stop there.  You’ll already know that you need to set the correct “active hadoop configuration” in this file:
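If memory serves, the file in question is the big data plugin’s plugin.properties – the path and shim name below are assumptions from a typical install, so check your own copy of PDI:

```properties
# plugins/pentaho-big-data-plugin/plugin.properties (path is an assumption;
# it may differ between PDI versions). Point PDI at the shim that matches
# your cluster, e.g. for a CDH 5.1-era shim:
active.hadoop.configuration=cdh51
```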


But that won’t be the end of it. The out-of-the-box cdh51 shim expects a secure/Kerberised install, which is not the case here. So you have to make one more change: in the cdh51 shim’s configuration, comment out all the lines containing:
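As a sketch of what that edit looks like – assuming the shim keeps its settings in a config.properties inside the cdh51 folder, and that the offending lines are the authentication ones (both the file name and the property names here are assumptions, so verify against your own shim before editing):

```properties
# plugins/pentaho-big-data-plugin/hadoop-configurations/cdh51/config.properties
# (file and property names are assumptions - check your own shim)
#authentication.superuser.provider=NO_AUTH
#authentication.kerberos.principal=
#authentication.kerberos.password=
```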


Then finally PDI will start.  What is confusing is that the configuration already has NO_AUTH specified.  But the error message does, in a very verbose way, tell you that you’re connecting a Kerberised class with a non-Kerberised one. And it’s a terminal error: Kitchen/Spoon won’t even start.  No doubt in future versions we’ll be able to:

  1. Mess with the config without PDI utterly crashing
  2. Connect to multiple different Hadoop clusters, with different shims, at the same time!

Well hopefully!

2015 Q1 Pentaho London Meetup

Happy New Year, everyone!  With 2015 being the year that #BigData delivers, you can guarantee it’s going to be a spectacular year for #Pentaho.

We’ve set a date for the next Pentaho London meetup.  It’s a while off, but we’re already seeing some good signups, despite not yet having an agenda! Well, that makes it easier!  Anyway, register here:

So why come along to PLUG?  Well..

  1. Networking
  2. Networking
  3. Networking
  4. Content
  5. Beer
  6. Sometimes Pizza depending on Sponsor

See you all at Skillsmatter, and of course down the Slaughtered Lamb afterwards. It’s a charming place… really.  As always, if you have anything you’d like to present then shout now and we’ll get that agenda tied down.


Non-Native Metadata Injection

What? What’s that then?  Well, actually it’s an idea that Hazamonzo developed with Matt, but unfortunately his blog post disappeared, so this is basically a re-write of the technique.

Well, Metadata Injection is one of the single most powerful features of PDI.  However, it has a dark secret: only *some* steps support it.  A year ago that list was very small; thankfully it has grown recently due to concerted efforts by the dev team. But as PDI has hundreds of steps, with more being added weekly, there’s always going to be a scenario where the step you want to inject doesn’t support it.

By the way here is the list of those that do:

So what do you do? Well, PDI just saves its transforms and jobs as either XML files or data in a repository.  We could hack the XML…  Err, NO, STOP RIGHT THERE.  Don’t even consider it.  Much better: use the API.  The API is how Spoon itself sets the metadata for a step.

So the steps are:

  1. Open the transformation file
  2. Find the step(s) we want to change
  3. Configure it accordingly
  4. Save the file


The code can be found in my samples repository:

As always, there are lots of ways to skin a cat.   Firstly (counter-intuitively), you can use the auto-documentation step to load the metadata, but in my case, as it’s a single line of code, I do it in the JavaScript step.

Secondly, when a step has multiple rows you are better off generating the whole object that configures each row. However, in this simple case I’ve assumed there will always be one row in the UDJE step, and it’s that row we want to configure.
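For the multi-row case, here is a sketch of building a whole new row object in the same Modified JavaScript step. The constructor argument order is an assumption based on my reading of the 5.x source, and the field names and expression are placeholders – check JaninoMetaFunction in your own version before relying on this (it only runs inside PDI, as it needs the PDI classes):

```javascript
// Build a fresh formula row rather than mutating an existing one.
// Assumed argument order: new field name, Java expression, value type,
// length, precision, replace-value field.
var fn = new org.pentaho.di.trans.steps.janino.JaninoMetaFunction(
    "full_name",                                            // output field
    'first + " " + last',                                   // the expression
    org.pentaho.di.core.row.ValueMetaInterface.TYPE_STRING, // result type
    -1, -1,                                                 // length, precision
    null                                                    // no field replaced
);
var formulas = UDJEStepMeta.getFormula();
formulas[0] = fn;  // Rhino lets you assign straight into the Java array
```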

Finally, there are LOADS and loads of helper functions you can use.  If you’re building SQL you can use the following function to get a correctly quoted field:

theField = inputDatabaseMeta.quoteField(source_name);
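Where does inputDatabaseMeta come from? One way – assuming your transformation defines a connection, whose name here ("my_connection") is made up for illustration – is to look it up on the TransMeta you already have open:

```javascript
// findDatabase() looks a connection up by name on the loaded TransMeta
var inputDatabaseMeta = meta.findDatabase("my_connection");
// quoteField() then applies that database's own quoting rules
var theField = inputDatabaseMeta.quoteField(source_name);
```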

How do I access the API?

A few lines of code in a Modified Javascript step will do it:

var meta=new org.pentaho.di.trans.TransMeta( ktrFileName );
var UDJEStep = meta.findStep("User Defined Java Expression");
var UDJEStepMeta = UDJEStep.getStepMetaInterface();

Note that you can get all the steps if you want and loop around them, but “findStep” is an easier way of doing that.  You will need both the Step object and the Meta object.  (Exactly what these are is a question you can answer with Google!)
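If you do want to loop rather than use findStep – say, to patch every step whose name matches a pattern – a sketch looks like this (the name being matched is a placeholder, and this only runs inside PDI):

```javascript
// Walk every step in the transformation and pick out the ones we care about
for (var i = 0; i < meta.nrSteps(); i++) {
    var step = meta.getStep(i);
    if (step.getName().indexOf("UDJE") >= 0) {
        var stepMeta = step.getStepMetaInterface();
        // ...configure stepMeta here...
    }
}
```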

How do I know how to configure a step?

At this point you must dig into the code.  Thankfully this is very easy and nothing to be scared of. In reality, all PDI developers will benefit from understanding the basics of the underlying code.

  1. Check out PDI from GitHub in Eclipse, NetBeans or any other tool
  2. Know that all the step code lives in engine/src/org/pentaho/di/trans/steps
  3. Work out what your step is called. Usually this is easy, but in the “User Defined Java Expression” case the step is actually called “Janino” (due to the underlying library which compiles the expression)
  4. Open the step’s Meta class – in this case, JaninoMeta.java
  5. Look at all the getters and setters

So, let’s take a look at JaninoMeta:


Note: In this case we have a setFormula/getFormula pair. They are an array because you can have multiple formulas in this step.  The array holds “JaninoMetaFunction” objects, whatever those are.  So, to get the function for the first row it is simply:

var savedobject = UDJEStepMeta.getFormula()[0];

Now, what does JaninoMetaFunction look like then:


Well this is very common. It’s simply an object with a bunch of getters and setters which match up with the columns in the UDJE grid.  So, to replace the expression we can use code like this:

savedobject.setFormula("some new expression here");

So, you’ve figured out the code; now all you need to do is save the KTR, and that is as simple as:

meta.writeXML( targetFile );

OK, did I say it already? Simple!  Do not be afraid of the code – we’re only talking a few lines here.
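To recap, the whole open/find/configure/save sequence fits in a single Modified JavaScript step. A sketch, pulling the snippets above together (the step name, file name variables and expression are placeholders, and it needs the PDI classes, so it only runs inside PDI):

```javascript
// 1. Open the transformation file
var meta = new org.pentaho.di.trans.TransMeta(ktrFileName);
// 2. Find the step we want to change
var UDJEStep = meta.findStep("User Defined Java Expression");
var UDJEStepMeta = UDJEStep.getStepMetaInterface();
// 3. Configure it: swap the expression on the first (and only) formula row
UDJEStepMeta.getFormula()[0].setFormula("some new expression here");
// 4. Save the file
meta.writeXML(targetFile);
```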

Now, if you do find a step which cannot be injected natively, please do make sure there is a JIRA for it; ultimately you can then move your code over to native injection as and when it comes in.

In all seriousness, I don’t know of any other ETL tool that offers this feature. I’d love to be proven wrong, so if other tools have similar concepts then please let me know – I’d like to look at them and see exactly how they do it.