#Pentaho Meetups Galore..

So; A few weeks ago a new meetup came on the scene, and it was briefly a bit confusing. This seemed to be an official Pentaho meetup group in London. OK, err thats odd I thought. However a bit of digging and it soon becomes clear this is a different kettle of fish.

Now; It just so happened we had actually scheduled (albeit not published) The next PLUG in January, and as it turned out they moved Challenge big data so as not to clash with #PCM15.  So partly this was my fault for not promoting the January booking.

Anyway the big question is – can these two groups co-exist?  Well yes I think so. Should they have both been done under the “plug” banner, probably yes.  I think if you’re looking for a meetup group then having 2 in the same area will be confusing.  However as long as we’re clear with content, target audience and dates I don’t see a problem here.  And that’s the key thing – the target audience between PLUG and challenge big data is different. Sure there may be some cannibalisation, but lets see.

I shall be at the challenge event, promoting PLUG 🙂 And I fully expect that cross promotion to go both ways. I’m not sure I’ll attend every challenge event – I guess I’ll be a content tart!

In the mean time, we’re still after content for the next PLUG which now moves into February – Skillsmatter have shown in the past that the sooner content is locked down the more signups and ultimately the more people attend – so this isn’t crazy to be asking for talks now. If anyone has either content requests, or a talk they wish to do then please let me know.


Oh, and by the way don’t forget to support the first Kickstarter project in #Pentaho land – Saiku Reporting – this is a very worthy project! Details here

Skillsmatter are soon going to be crowdfunding themselves – they’re after £1/2 million to further the business. Given their stunning new venue and their frankly unique business model I’m pretty sure this is going to be a great success for them.  Details on crowdcube

#PCM15 thoughts

Good evening

So #PCM15 was over, and frankly was a resounding success. London delivered as a spectacular location, and the community provided some great content. Thanks to everyone involved!  Oh; And also we re-raised the T-shirt bar, so thats good too  (This had slipped in recent times!)

Personally I felt I actually missed a lot of the content, so i’ll be perusing the slides to see what I missed!

Diethard has already mentioned there are at least 2 strong candidates for #PCM16, there’ll be a vote some time in the new year and away we go again!  I suggest we set out a timescale, and allow a period of “PR” for the organisers!

A couple of things were discussed extensively over the weekend, or rather there’s a few things that I think should be put out widely into the community.

The first was event timing. I’d pondered on suggesting a move of date – I feared PWorld was canibalising PCM, but even before #PCM15 I had spoken to a few people and come to the conclusion that the pros outweigh the cons.  I got a similar impression from discussions during the event too – So it seems to make sense to stick to our early autumn time slot.  I do think it makes a lot of sense to come after PWorld though.

The second thing was about event structure – well we’ve fallen into a pretty regular routine now, albeit each new organiser tweaks it in their own way. We now have a Friday event (hack in our case), multi track conference on Saturday and social events on the Sunday.  The feedback around this is that people are looking for some change – but it’s not clear what.  Personally I don’t think PCM is ready to move to a multi-day conference – I like the fact that we blitz it in one day.  In reality maybe we could extend Friday, but again – would people travel any earlier? Or would you just get the same attendees turning up late as before?

Finally a quick dump of advice to the next organiser:  ((I may add to this, i’m sure i’ve forgotten stuff)

  1. Use meetup / eventbright to help organise
  2. Be super clear about directions, hotels and locations.  Send repeated emails (we didn’t do this some people complained!)
  3. Clearly the event is free – but do charge for lunch. We do recommend lunch onsite – otherwise you’ll never get people back from the cafe’s again! (Cascais!)
  4. Be sure to produce clear agendas with multiple paper copies available at the venue
  5. Spread the load – you’ll need a team to help you both organise and sponsor the event. There’s lots of different aspects so this is pretty easy to do.
  6. Feel free to add your own identity / twist to the event, but remember we are here because the formula above has evolved and matured nicely!
  7. Remember this is a community event organised by the community for the community.  That does not mean pentaho are excluded – precisely the opposite. It is important that there is a strong Pentaho presence at the event.

Either way I look forward to whatever shape #PCM16 brings. I’ll be there!

Last nights Pentaho London Usergroup, and upcoming labs session

Last night we held our Q2 Pentaho London usergroup at the ever excellent skillsmatter.  Actually that may be the last time we go there as they are moving to a stunning new venue in the next few weeks, yay!

Anyway we had a great night, you can see the talks here:


And as discussed details about PCM London can be found here:


As well as meeting a guy from Cern there was another guy interested in PDI, Spark and R. It was a pleasure to meet all the regulars as well as quite a few new faces.  The discussion continued down the pub as ever (and ahem – started in the pub too) and we then got to wait many many hours for our food at Pizza “express”. Pfft.

The next event is not scheduled yet – however we do have a plan. We won’t be doing talks, instead we’ll be doing labs.  So how will this work?

Well there will be tickets available for you to bring your problem/solution to a Pentaho expert to discuss, review and resolve the issue.  You’ll be expected to show your issue, and then we’ll have a tech discussion and possibly even build a solution.

At the same time, you’ll be expected to share your experience with others.  So you’ll probably need to make sure there’s no personal data on display!

There’ll probably just be 2 sessions for each pillar of the stack and the sessions will run in parallel. The pillars covered will be Mondrian, PDI, Reporting and cTools.  We may do a Weka session too if there is interest.

So; If people think this is a good idea, we’ll setup a ticket system which:

a) allows you to get a ticket to present to an expert

b) allows you to get a ticket to watch a particular session.

Feedback please…  Would this work at PLUG?

Performance testing Pentaho Data Integration (Get Variables)

Back in the past I worked as a dedicated performance tester on several OLTP systems, and recently have spent quite a lot of time working performance testing/tuning PDI (Pentaho Data Integration)

The one thing to remember about performance related work, is never underestimate how long it takes.  Doesn’t matter what the technology is either.  This is complex stuff and takes a seriously long amount of time.  (Hence one of the reasons why a performance tester is paid more than a senior tester, and is deemed a separate skill area). Also; Set an end goal. In a complex system you could tune forever.  So decide on a line in the sand and stop once you’ve reached it.

As with any system performance testing PDI has its own set of challenges.  There are a lot of ways to go about this, but here are some notes on a quick bit of step benchmarking

But Why? Well a colleague said to me “why is your get variables step inline rather than a join – a join would be faster”. So lets see if thats true..  Actually lets just tell you now – It’s not. The inline approach is faster.  How did I test this?

  1. Benchmark “Generate rows” Step – to prove what speed we can source and write data. Note: It is important to write it somewhere (e.g. a dummy) otherwise you’re not testing the step in its entirety.
  2. Benchmark “Get variables” Step inline.  Run 3 times and take the average. Run for a number of rows that takes a good 30s+ to run so that process init time becomes neglible. Just to add to the work PDI has to do make sure the variable is converted to an Integer.
  3. Benchmark with join rows approach
  4. Benchmark with getting 5 variables – both ways.

And the results:

  1. 2,374,000 records per second
  2. 1,528,000 r/s
  3. 1,316,000 r/s
  4. 1,491,000 r/s
  5. 1,300,000 r/s

So you can see the “join” approach (whether 1 variable or 5) only performs at 86% of the speed of simply using the step inline.  I’m not sure if this has always been the case, but it’s true right now with PDI 5.4.  My hunch is that this is nothing to do with the join per-se, but probably more closely related to the fact that there are more hops.

Now clearly this is just a benchmark. It’s a best case. It’s unlikely your incoming data is coming in at 2M r/s so therefore changing to inline probably wont help you. But if you’re CPU starved then it can’t hurt either.  And as always with performance testing there’s a lot of differences, so what works today, may not work optimally tomorrow in a different scenario!

One day we’ll get support for using variables in filter rows, and also the variable substitution in calculator will work, and then there’ll be a lot less get variables steps anyway!

Here’s a shot of the simple transformation, in the lovely new skin of 5.4.  Also note the use of canvas notes to document the performance scenarios.


A note on garbage collection with Pentaho Data Integration / PDI / Kettle

Garbage Collection

PDI is a Java application (dur..) and this means memory within the java virtual machine is freed up using a process called garbage collection.

Now; Garbage collection is a wildly complicated thing and many people say that GC tuning is a black art. Whilst I wouldn’t say that, I would say that you should always start with the defaults and work from there – if something has no effect, then give it up.  One thing is true with garbage collection is the nature of the app (And the way it has been coded) has a significant impact.  So don’t think that because you’ve tuned a tomcat app before, that knowledge will apply to PDI – it won’t!

Why should I do this?

Well if you have no issues with PDI then you should not.  Simple!  But if you’re working with masses of data and large heap sizes then this will bite you at some point.

And before you come to tuning the GC, have you tried a bigger heap size first? (Xmx setting). This is not necessarily a panacea but if you’re doing a lot of sorting or simply a lot of steps and hence a lot of hops then you will need proportionately more memory.

How do I know there is a problem?

This is easy – if you have a process that seems to hang for significant amounts of time, then fly along, then pause again etc, this is most likely GC.  If your process trundles through at a consistent throughput then you’re probably doing ok.

What settings?

Well despite my saying above that you should tune for your application, there is one setting which you should most definitely apply for PDI and thats the concurrent collector.  Having just googled this to find a reference to it I’ve realised there is both a concurrent collector and a parallel collector, and hence i now need to go to another PC to check which it is I use

<short break insert piped jazz music here>

OK, found it:

-XX:+UseConcMarkSweepGC -verbose:gc =XX:+PrintGCTimeStamps -XX::+PrintGCDetails -XX:PrintTenuringDistribution -Xloggc:/tmp/gc.log

OK – so seems I need to do some research on the parallel collector then, has anyone used that?

Either way, there are 2 things in the options above:

  1. The instruction to the VM to use the new collector ( UseConcMarkSweepGC )
  2. Everything else to configure logging – note the configuration of the log file.

These settings need to be put somewhere where PDI picks them up every time, i.e. in the environment in $PENTAHO_JAVA_OPTIONS, or actually in the spoon/carte/pan/kitchen scripts.

It is important to enable GC logging so you can see whether or not you do have a GC problem. Generally if you have GC full collections of more than a few seconds you may have a problem. And if you see full GC taking minutes or hours then you definitely have an issue!  The other options that I use relating to the logging – they’re pretty self explanatory, and google/stackoverflow will give further detail

And that’s it – More later in the week on the topic of big data with PDI.

Working with Big (lots) Data and Pentaho – Extreme Performance

OK, firstly, I’m not talking proper BigData here.  This is not Hadoop, or even an analytical database.  (Lets not get into whether an analytical database counts as bigdata though!). And it’s certainly not NoSQL.  Disk space we’re looking at 100’s of gigabytes, not terabytes.  Yet this project involves more data than the Hadoop projects I’ve done.

So tens of billions of records. Records that must be processed in a limited environment in extremely tight time windows.  And yes; I’m storing all of that in MySQL!

Hey, wake up, yes, I did say billions of records in MySQL, try not to lose consciousness again…  (It’s not the first time I’ve had billions of rows in MySQL either – Yet I know some of you will guffaw at the idea)

In fact, in this project we are moving away from a database cluster, to a single box. The database cluster has 64 nodes and 4TB of RAM.  Our single box has 500GB RAM and that was hard fought for after we proved it wasn’t going to work with the initial 64GB!  Impossible? Yup, that’s what I thought.  But we did it anyway.

Oh; and just for a laugh, why don’t we make the whole thing metadata driven and fully configurable so you never even know which fields will be in a stream. Sure; Lets do that too.  No one said this was easy..

Now; how on earth have I managed that?  Well firstly this was done with an enormous amount of testing, tuning and general graft.  You cannot do something like this without committing a significant amount of time and effort.  And it is vital to POC all the way. Prove the concept basically works before you go too far down the tuning route – As tuning is extremely expensive.  Anyway we built a solution that works very well for us – your mileage may vary.

I do accept that this is very much at the edge of sanity…

So what did we learn?  How did we do this?  Some of this is obvious standard stuff. But there are some golden nuggets in here too.

  1. Disk usage other than at the start or end of the process is the enemy.  Avoid shared infrastructure too.
  2. Sorting (which ultimately ends up using disk) is evil. Think extremely hard about what is sorted where.
  3. Minimise the work you do in MySQL. Tune the living daylights out of any work done there.
  4. MyISAM all the way
  5. NO INDEXES on large tables. Truncate and reload.
  6. RAM is not always your friend. You can have too much.
  7. Fastest CPUs you can find (Caveats still apply.. Check specs very carefully Intel do some weird things)
  8. Partitioning utterly rocks.
  9. Test with FULL production loads or more, PDI/java doesn’t scale how you might expect (primarily due to garbage collection), in fact it’s impossible to predict.  This is not a criticism, it is just how it is.
  10. In fact, PDI Rocks.
  11. Performance tune every single component separately. Because when you put it all together it’s very hard to work out where the bottlenecks are.  So if you start off knowing they ALL perform you’re already ahead of the game.
  12. Use munin or some other tool that automates performance stat gathering and visualisation.  But not exclusively. Also use top/iostat/sar/vmstat.  Obviously use Linux.
  13. What works on one box may not work on another. So if you’re planning on getting a new box then do it sooner rather than later.
  14. Be prepared to ignore emails from your sysadmin about stratospheric load averages <grin>

I plan to follow this up with a further blog going into details of sorting in PDI in the next few days – complete with UDJC code samples. This post is necessary to set the scene and whet appetites!

Looking forward to hearing other similar war stories.

Of course if anyone wants to know more then talk to me at the next Pentaho London Meetup

Getting started with Hadoop and Pentaho

These days all the main Hadoop vendors provide VMs ready to go for testing and development – very convenient!

However a downside of these VMs is that they tend to require a LOT of resources. My laptop only has 8gb of ram, and that is all it can have!  So I downloaded the Cloudera 5.3 VM and tried it with 4gb, but no chance.

But there is another way…  The VM above is stuffed to the gunnels with every hadoop service under the sun. It is a demo after all!  The solution is just to install locally ( hence also removing the guff of a VM small though that is) using this guide:


You can literally get away with installing the basic packages, and hive and away you go.  And the download is MUCH smaller – we’re talking about 300MB vs 3.7GB for the VM!

Unfortunately it doesn’t quite stop there though.  You’ll know already that you need to set the correct “active hadoop configuration” in this file:


But that won’t be the end of it. The out of the box cdh51 shim expects a secure/kerberised install which is not the case here. So you have to make one more change, in config.properties for the cdh51 shim comment out all the lines containing:


Then finally PDI will start.  What is confusing is that the config.properties already has NO AUTH specified.  But the error message does in a very verbose way tell you you’re connecting a kerberised class with a non kerberised one. And it’s a terminal error kitchen/spoon won’t even start.  No doubt in future versions we’ll be able to:

  1. mess with the config without PDI utterly crashing
  2. Connect to multiple different hadoop clusters, with different shims at the same time!

Well hopefully!

2015 Q1 Pentaho London Meetup

Happy new year everyone!  With 2015 being the year that #BigData delivers you can guarantee it’s going to be a spectacular year for #Pentaho

We’ve set a date for the next Pentaho London meetup.  It’s a while off, but we’re already seeing some good signups, despite not yet having an agenda! Well that makes it easier!  Anyway register here:


So why come along to PLUG?  Well..

  1. Networking
  2. Networking
  3. Networking
  4. Content
  5. Beer
  6. Sometimes Pizza depending on Sponsor

See you all at Skillsmatter, and of course down the Slaughtered Lamb afterwards. It’s a charming place.. really.  As always if you have anything you’d like to present then shout now and we’ll get that agenda tied down.


Pentaho London Usergroup July 22 2014

Hi, we’re way ahead of the game with the usergroup this quarter.  You can find details of the agenda and registration links here:


With a distinctly big data theme it’s bound to be a busy one. Beers will be sponsored as usual, and with an earlier start we’ll have more time down the pub too.  I do genuinely believe the networking side of this event is just as important as the content itself.

Also curious to see what Tom has to show us with Apache OODT – An interesting project currently rarely seen in the BI world.  You saw it here first!

Its also Diddys (Diethard Steiner) first time up, which is amazing really, he’s such a pillar of the blog community he deserves a place. But lets be sure to heckle him too!

Please do share and promote the event!  Any questions, feedback or comments welcome!