Executing R from Pentaho Data Integration (PDI / Kettle)

These days everyone has heard of R (or RStats if you want something more googleable), and it is doing an amazing job of replacing SAS.  SAS is a traditional, old-school package of tools and is extremely expensive, although it is generally accepted that if you can afford it, it is the best tool.

R is open source and has a mighty impressive selection of libraries. There are commercial offerings too. I don’t pretend to know the market in depth, but one of the leaders seems to be RevolutionR.

As usual I judge the success of a product by 2 things: job opportunities and meetups.  Given that any knowledge of R means you can call yourself a data scientist, you can really pick and choose between many different jobs.  It’s an extremely hot sector at the moment.  On the meetup side, LondonR is extremely successful (not jealous, ahem..) and never fails to sell out.

OK, so it’s a cool tech. Let’s see how it fits in with PDI/Kettle.

Hang on. Why would I want to do that?  R can do everything PDI can do, right?  Err, yes and no.  PDI is extremely good at the plumbing or data architecture side. R is extremely good at the number crunching or stats side.  So use the right tool for the job.

So, here’s how I set everything up. Note that I’m on (K)Ubuntu 12, so some of the steps may not be necessary for everyone.

  1. Download and install RevolutionR.  Don’t install the default r-base package, it’s ancient.
  2. Start “R”.
  3. Install rJava as per this wiki page.
  4. Now pick your PDI plugin. The wiki page above refers to the R Script Executor step, which is an enterprise-only plugin.  (It comes with the enterprise edition of PDI by default.)
  5. Mess around with libjri.so and keep trying various places to put it until PDI finds it.  Try an assortment of libswt/linux, libswt/linux/x86_64 and even ../libswt/linux (outside the PDI folder).
  6. Download the examples on the wiki page above and confirm they work.
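For steps 3 and 5, here is a minimal sketch of how you might ask R where the JRI files ended up, so you at least know which libjri.so to copy around. It assumes rJava has been installed with install.packages("rJava") and that your build ships its native library in a jri folder inside the package. The helper name find_jri_dir is made up for this example and is not part of rJava.

```r
# Hypothetical helper (not part of rJava): locate the jri folder of an
# installed rJava package. Assumes rJava was installed from CRAN with
# install.packages("rJava"), which in turn needs a JDK on the machine.
find_jri_dir <- function() {
  # system.file() returns "" when the package or the folder is missing
  jri_dir <- system.file("jri", package = "rJava")
  if (nzchar(jri_dir)) jri_dir else NA_character_
}

jri_dir <- find_jri_dir()
if (is.na(jri_dir)) {
  message("No jri folder found; is rJava installed?")
} else {
  message("JRI directory: ", jri_dir)
  print(list.files(jri_dir))  # on Linux you would hope to see libjri.so here
}
```

Whatever directory this prints is the one to copy libjri.so from when trying the libswt locations in step 5.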

I went down the enterprise route, but there is also a community version of the R script executor available in the marketplace from these guys: http://dekarlab.de/wp/?p=5 However, I was not able to get their example to work.  For some reason R didn’t like the simple a+b calculation.  I don’t know if this is an issue in the plugin, in my code, or even something different in PDI 5.3.  I think it would be good if they included a full working example on GitHub alongside the plugin source code.

So now what?  Well, now we can do something interesting.  Last October I noticed the Twitter engineering team had released an R package for breakout detection.  (A breakout is when something you are measuring over time reaches a new high, i.e. it breaks out of its previous normal boundaries.  The term is common in trading, as typically when a share breaks out it goes up quite a lot.)  I admit it also piqued my interest that this was from Twitter!  You can find the library on GitHub.

(As an aside, it would be very interesting to understand more about how Twitter manages to scale R and use it with their vast quantities of data, as that is not something R has traditionally been very good at.)

Later on this Twitter blog post appeared about a similar library, this time for anomaly detection.  Very nice. I’ve done a lot of work in the past around automatic alerting and it’s very hard to get this kind of thing right.  Where is the line between simply a quiet day due to a national holiday and an intermittent problem causing your site to lose traffic?

I followed that blog post only to find that devtools doesn’t install on Ubuntu. The solution to that is here.

Oh, also see this page for how to install from GitHub. The command in that blog post doesn’t work either.
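For reference, installing from GitHub can be sketched roughly like this. It assumes devtools itself installs on your machine and that the library lives at twitter/AnomalyDetection on GitHub. The wrapper function install_anomaly_detection is a made-up name, and it is only defined here, not called, so nothing gets installed by accident.

```r
# Sketch only: install_anomaly_detection is a hypothetical wrapper.
# Assumes devtools can be installed and that the GitHub repo is
# "twitter/AnomalyDetection".
install_anomaly_detection <- function(repo = "twitter/AnomalyDetection") {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  # install_github() takes a "user/repo" string
  devtools::install_github(repo)
}

# Check whether the package is already available before installing anything
have_ad <- requireNamespace("AnomalyDetection", quietly = TRUE)
message("AnomalyDetection installed: ", have_ad)
# if (!have_ad) install_anomaly_detection()
```

Uncomment the last line once you are happy for it to download and build the package.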

Then we just need to work out how to call that stuff from PDI.  It’s pretty easy, here is the R script:

library(AnomalyDetection); data(raw_data)
res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=FALSE)
oput = as.data.frame(res$anoms)
oput

So what’s going on here?  Well, we load the library, use its sample dataset and detect our anomalies.  The only PDI-centric bit is the last 2 lines:

  1. Create a variable oput (it can be called whatever you want) which holds the result set converted to a data frame.
    1. Note: We don’t just use res, because res is a composite object that contains config as well as results.  By converting res$anoms we get just the anomaly data written to the result.
  2. Call it (just type its name) so that the contents of that variable are sent back to PDI.
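To see why res$anoms is picked out rather than res itself, here is a small stand-alone sketch that does not need the library at all. The res below is a hand-made stand-in with invented values, built on the assumption that the real result is a list holding an anoms data frame alongside other elements such as a plot.

```r
# Hand-made stand-in for the object AnomalyDetectionTs() returns.
# The structure (a list with an `anoms` data frame plus a `plot` element)
# is an assumption, and the values are invented.
res <- list(
  anoms = data.frame(
    timestamp = as.POSIXct(c("2015-01-01 10:00:00", "2015-01-02 03:00:00"),
                           tz = "UTC"),
    anoms = c(250.3, 12.1)
  ),
  plot = NULL
)

is.data.frame(res)        # FALSE: the whole result is a list, not a table
oput <- as.data.frame(res$anoms)
is.data.frame(oput)       # TRUE: just the anomaly rows
oput                      # typing the name returns it as the script's value
```

The point is simply that a step expecting rows wants the data frame, not the list that wraps it.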

The next step naturally would be to feed data into this step from PDI rather than using the sample data set from the library.  That looks pretty easy, but alas I ran out of time for today!
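I haven’t run this through PDI, but a rough sketch of the R side might look like the following. It assumes the step hands the incoming rows to R as a data frame with a timestamp string in the first column and a numeric value in the second. The names pdi_rows, prepare_ts and detect_anomalies are all made up for this example, and the timestamp format is a guess you would need to adjust.

```r
# Hypothetical helper: tidy a two-column frame (timestamp, value) into the
# shape AnomalyDetectionTs() expects. Assumes "YYYY-MM-DD HH:MM:SS" strings.
prepare_ts <- function(input_df) {
  stopifnot(is.data.frame(input_df), ncol(input_df) >= 2)
  ts <- data.frame(
    timestamp = as.POSIXct(as.character(input_df[[1]]),
                           format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    count = as.numeric(input_df[[2]])
  )
  # Drop rows that failed to parse, then sort by time
  ts <- ts[!is.na(ts$timestamp) & !is.na(ts$count), ]
  ts[order(ts$timestamp), ]
}

# Hypothetical wrapper around the real library call; defined but not run
# here, as it needs AnomalyDetection installed and a realistic amount of data.
detect_anomalies <- function(input_df) {
  if (!requireNamespace("AnomalyDetection", quietly = TRUE)) {
    stop("AnomalyDetection is not installed")
  }
  res <- AnomalyDetection::AnomalyDetectionTs(
    prepare_ts(input_df), max_anoms = 0.02, direction = "both", plot = FALSE
  )
  as.data.frame(res$anoms)
}

# Invented rows standing in for whatever PDI would pass in
pdi_rows <- data.frame(
  ts = c("2015-01-01 02:00:00", "2015-01-01 01:00:00", "not a date"),
  hits = c("120", "95", "7"),
  stringsAsFactors = FALSE
)
clean <- prepare_ts(pdi_rows)
print(clean)
```

In the real step the last lines would become oput = detect_anomalies(pdi_rows) followed by oput, with pdi_rows replaced by whatever name the step gives the incoming data frame.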

By the way, I added a short ktr to my samples repo to show this example. You can find it here: https://github.com/codek/pdi-samples/tree/master/rstats Naturally you’ll need the AnomalyDetection library installed in R before running in PDI.