Last night’s Pentaho London Usergroup, and upcoming labs session

Last night we held our Q2 Pentaho London usergroup at the ever excellent skillsmatter.  Actually that may be the last time we go there as they are moving to a stunning new venue in the next few weeks, yay!

Anyway we had a great night, you can see the talks here:

And as discussed details about PCM London can be found here:

As well as meeting a guy from CERN, there was another guy interested in PDI, Spark and R. It was a pleasure to meet all the regulars as well as quite a few new faces.  The discussion continued down the pub as ever (and, ahem, started in the pub too) and we then got to wait many, many hours for our food at Pizza “express”. Pfft.

The next event is not scheduled yet – however we do have a plan. We won’t be doing talks, instead we’ll be doing labs.  So how will this work?

Well there will be tickets available for you to bring your problem/solution to a Pentaho expert to discuss, review and resolve the issue.  You’ll be expected to show your issue, and then we’ll have a tech discussion and possibly even build a solution.

At the same time, you’ll be expected to share your experience with others.  So you’ll probably need to make sure there’s no personal data on display!

There’ll probably just be 2 sessions for each pillar of the stack and the sessions will run in parallel. The pillars covered will be Mondrian, PDI, Reporting and cTools.  We may do a Weka session too if there is interest.

So, if people think this is a good idea, we’ll set up a ticket system which:

a) allows you to get a ticket to present to an expert

b) allows you to get a ticket to watch a particular session.

Feedback please…  Would this work at PLUG?

Performance testing Pentaho Data Integration (Get Variables)

In the past I worked as a dedicated performance tester on several OLTP systems, and recently I have spent quite a lot of time performance testing and tuning PDI (Pentaho Data Integration).

The one thing to remember about performance-related work is: never underestimate how long it takes.  It doesn’t matter what the technology is either.  This is complex stuff and it takes a seriously long time.  (That is one of the reasons why a performance tester is paid more than a senior tester, and why it is deemed a separate skill area.) Also, set an end goal. In a complex system you could tune forever, so decide on a line in the sand and stop once you’ve reached it.

As with any system, performance testing PDI has its own set of challenges.  There are a lot of ways to go about this, but here are some notes on a quick bit of step benchmarking.

But why? Well, a colleague said to me “why is your get variables step inline rather than a join – a join would be faster”. So let’s see if that’s true.  Actually, let’s just tell you now: it’s not. The inline approach is faster.  How did I test this?

  1. Benchmark “Generate rows” Step – to prove what speed we can source and write data. Note: It is important to write it somewhere (e.g. a dummy) otherwise you’re not testing the step in its entirety.
  2. Benchmark “Get variables” Step inline.  Run 3 times and take the average. Run for a number of rows that takes a good 30s+ to run so that process init time becomes negligible. Just to add to the work PDI has to do, make sure the variable is converted to an Integer.
  3. Benchmark with join rows approach
  4. Benchmark with getting 5 variables – both ways.
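
To make the "run 3 times and take the average" part concrete, here is a minimal sketch of that kind of harness in Python. It is not how PDI measures anything; `stand_in_workload` is a hypothetical placeholder for whatever actually launches the transformation, and all the function names are my own.

```python
import time
from statistics import mean


def throughput(rows, seconds):
    """Rows processed per second for a single run."""
    return rows / seconds


def timed_run(workload, rows):
    """Run the workload once and return its throughput in rows/s."""
    start = time.perf_counter()
    workload(rows)
    return throughput(rows, time.perf_counter() - start)


def benchmark(workload, rows, repeats=3):
    """Average throughput over several runs (3 by default)."""
    return mean(timed_run(workload, rows) for _ in range(repeats))


def stand_in_workload(rows):
    """Hypothetical placeholder for a real run: string-to-integer conversion."""
    return sum(int(str(i)) for i in range(rows))


if __name__ == "__main__":
    # In a real test, pick a row count that keeps each run going for 30s+,
    # so that start-up time becomes negligible. This small count is only a demo.
    print(f"{benchmark(stand_in_workload, 200_000):,.0f} rows/s")
```

Averaging the per-run throughput keeps one slow run from dominating, and the long run time is what makes the start-up cost disappear into the noise.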

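And the results (the last test was run both ways, hence five numbers):

  1. Generate rows only: 2,374,000 records per second
  2. Get variables inline, 1 variable: 1,528,000 r/s
  3. Join rows, 1 variable: 1,316,000 r/s
  4. Get variables inline, 5 variables: 1,491,000 r/s
  5. Join rows, 5 variables: 1,300,000 r/s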

So you can see the “join” approach (whether 1 variable or 5) only performs at roughly 86-87% of the speed of simply using the step inline.  I’m not sure if this has always been the case, but it’s true right now with PDI 5.4.  My hunch is that this has nothing to do with the join per se, and is probably more closely related to the fact that there are more hops.
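
For anyone who wants to check the arithmetic, here is a quick Python sketch of the comparison. The figures are the ones listed in the results; the labels are my reading of which number belongs to which test (inline first, then join, for 1 and then 5 variables), so treat them as an assumption.

```python
# Throughput in rows per second, copied from the results list.
# The key names are assumed labels, not anything PDI reports.
results = {
    "inline_1_var": 1_528_000,
    "join_1_var": 1_316_000,
    "inline_5_vars": 1_491_000,
    "join_5_vars": 1_300_000,
}


def relative_speed(join_rps, inline_rps):
    """Join throughput as a fraction of inline throughput."""
    return join_rps / inline_rps


one_var = relative_speed(results["join_1_var"], results["inline_1_var"])
five_vars = relative_speed(results["join_5_vars"], results["inline_5_vars"])

print(f"1 variable:  {one_var:.1%}")    # 86.1%
print(f"5 variables: {five_vars:.1%}")  # 87.2%
```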

Now clearly this is just a benchmark. It’s a best case. It’s unlikely your incoming data is arriving at 2M r/s, so changing to inline probably won’t help you. But if you’re CPU starved then it can’t hurt either.  And as always with performance testing there are a lot of variables, so what works today may not work optimally tomorrow in a different scenario!

One day we’ll get support for using variables in filter rows, and variable substitution in calculator will work too, and then there’ll be far fewer get variables steps anyway!

Here’s a shot of the simple transformation, in the lovely new skin of 5.4.  Also note the use of canvas notes to document the performance scenarios.


Pentaho Community Meetup 2015

So, surely everyone knows what PCM is right? Pentaho Community Meetup?

Well, it’s the first regular community meetup in the world.  It’s no longer the biggest, but it still holds a special place for a lot of people.  Perhaps we need to add “European” into the name somehow.  It is extremely well attended by the key architects / founders of the Pentaho Analytics stack.  The importance of this cannot be overstated!

So what is the concept? Well, it’s simply a tech-oriented meetup based loosely around the Pentaho analytics stack.  There are strictly no sales presentations allowed (as with PLUG). How does it differ from PLUG? Well, being a yearly event, it looks much more to the future: discussing the roadmap and seeing the latest bells and whistles that are available.

So this is the 8th year (counting from the first meetup in 2008). The locations so far, plus a possible one for next year:

2008. Mainz
2009. Barcelona
2010. Cascais, PT
2011. Rome
2012. Amsterdam
2013. Sintra, PT
2014. Antwerp
2015. London
2016. Madrid?

In fact, back in 2008, tech meetups barely existed.  Sure we had conferences by vendors, but only your CTO went to those as you had to pay! (bonkers).  Nevertheless there were still 40 of us at that first meetup and it’s only grown from there.

Historically we picked places that were sunny. At the end of the day some of us were paying out of our own pockets, so there has always been an unapologetic social side to the event.  The last few years, however, we lost our way. Maybe next year we can return to the sun!

Why did we pick London? Well, as the event has grown we’ve come to realise it’s essential to have people on the ground who live in and know the location. You just can’t organise an event like this remotely.  There’s another reason too: London rocks! It’s been the most popular tourist destination for 5 of the last 7 years!

To find out more you can see the latest logistical details including dates, times, locations and agenda here.  If you’d like to talk then please send in your details.

Finally, who are we?  Well, this year’s organisers are myself, Diethard Steiner and Nelson Sousa (all of us are Pentaho veterans and well known in the community!).  So if you have any questions then please do contact us.