Q3 Pentaho London meetup

Last night we had our Q3 Pentaho London meetup, and it was different from the norm for three reasons:

1. We visited CodeNode, the AMAZING new venue for Skills Matter. Really, it's very good – these guys are doing something right. Even finding a space that large in the City must have been quite a battle.

2. We did “labs” (more below)

3. There was no beer (to be fixed by next time!)

Anyway, why labs? Well, it's an idea we stole from the South American community. I've already blogged about it, but here is what we covered:

Community – how to contribute etc. – worth a blog post in itself; this is an important topic!
Plugins – how to build them (a BA Server-oriented discussion)
General Sparkl chat (a natural progression from the above)
Rules engines – JARE and Drools (another blog post one day...)
Cloud integration applications
AES connection security (EE only?)
WSO2 for web service integration
Apache Drill
Vertica pros and cons

And we also saw a bemusing issue from Richard where Sparkl just didn't start. Weird!

Anyway, see you all at PCM – no excuses!

To register for the PCM Hackathon go here

To see all the logistical details for PCM, go to GitHub

And to register for PCM, go to Eventbrite

Scaling merge joins in Pentaho Data Integration

A while ago I was working on a project which used brutal amounts of partitioning within PDI so we could get decent throughput on our jobs. The main reason for this was the constraint of using just a single box – not a common architecture these days. Anyway, we managed to get the throughput we needed (getting on for 1M rows per second over the whole job), so that's fine.

We hit a minor issue, though, when it came to the Merge Join step. (A similar, albeit different, issue exists for Stream Lookup.) Basically, if you have a lovely partitioned stream coming in and want to join it with another, already partitioned stream, PDI doesn't support this. In fact, it gives you a great big red hop just to make the point:


If you hover over the hop it tells you why you cannot do this.

That appears to mean you must de-partition (via a dummy step), run a single-threaded merge join, then re-partition moving forward. Meh; that's just not nice.
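To see why that hurts, here's a minimal sketch of what a sorted merge join does (Python purely for illustration – PDI itself is Java, and `merge_join` is just a hypothetical stand-in for the step's logic). Both inputs must arrive fully sorted on the join key as single streams, which is exactly why the step can't accept partitioned hops:

```python
def merge_join(left, right, key=lambda r: r[0]):
    """Classic sorted merge join: walks both inputs with two cursors.
    Each input must be one fully sorted stream, so the step itself is
    inherently single-threaded."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # emit every right-hand row sharing this key, then advance left
            j0 = j
            while j0 < len(right) and key(right[j0]) == kl:
                out.append(left[i] + right[j0][1:])
                j0 += 1
            i += 1
    return out
```

The two-cursor walk only works over one ordered sequence per side – hence the de-partition bottleneck.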

There is, however, another way in some cases – whether it works depends entirely on your data. But for us it did, and it scaled wonderfully: the simple mapping.

This new(ish) mapping component has one important feature over the original mapping component (do we call that one the complex mapping?) – it can be partitioned. And the reason it can is that it must have exactly one input and one output.

So our new transformation looks like this:


And the mapping itself looks like this:


Now, you must test this carefully, because there's an awful lot more work in doing it this way – and I can imagine that in some cases you'll actually get worse performance. But it's another handy tool to have in the belt! And YES, it does mean the input query for the right-hand side of the join gets run once per partition – but that's a price you may have to pay. (And again, this only works if one stream is an order of magnitude larger than the other.)
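The shape of the workaround can be sketched in plain Python (again purely illustrative – `partitioned_join` and the sequential loop are my invention, and it assumes unique keys on the small side; in PDI the partitions run as parallel step copies). Each partition joins its own slice of the big stream against its own copy of the small right-hand input, so no de-partitioning is needed:

```python
def partitioned_join(left_rows, right_rows, n_partitions, key=lambda r: r[0]):
    """Illustrative per-partition join: hash-partition the big (left)
    stream, then join inside each partition independently. The small
    right-hand input is read once per partition -- the price noted above."""
    results = []
    for p in range(n_partitions):  # in PDI these would be parallel step copies
        # the same hash rule routes matching keys on both sides to partition p
        lp = [r for r in left_rows if hash(key(r)) % n_partitions == p]
        rp = {key(r): r for r in right_rows if hash(key(r)) % n_partitions == p}
        # inside one partition a plain keyed lookup is enough
        for row in lp:
            if key(row) in rp:
                results.append(row + rp[key(row)][1:])
    return results
```

Because both sides are routed by the same hash rule, matching keys always land in the same partition, and each copy of the join sees a consistent slice of the data.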

Don’t forget to check out the free professional support sessions we’re offering at PLUG next week!

Hadoop Usergroup UK September 2015 #hug_uk

Last night I went to the monthly Hadoop user group at the impressive Digital Catapult venue opposite the British Library. This is an enormous glass box perched on top of an old building. And what a curious company – a not-for-profit private company (sort of funded by another not-for-profit government organisation) working to improve various aspects of the digital economy. It must be a pretty interesting place to work.

Anyway, the agenda was most interesting – a couple of keynotes followed by pitches of 8 minutes each. I was initially unclear about the purpose of the pitches, but as the evening went on it became clear – the audience was an even balance of business folk, tech people and VC types, so the purpose of each pitch was generally either recruitment or funding. A pretty clever place to recruit people.

Here are my comments/views on the 6 pitches. One thing to note – at an event like this you may as well turn off predictive text on your phone. None of these company names are real words!


Machine learning as a service (ahem – again). Interesting ideas, with a specific focus on time-series data and teaching the machine to truly understand time. The audience was incredulous!


A company trying to bring down the cost of satellite imagery while improving the accuracy and timeliness of the data. Social observations were mentioned, but I didn’t really see how they came into it. An interesting response to some pretty crucial privacy issues.


Bringing the commoditised analytics readily available in the mobile world to IoT. Machine learning with dedicated models for specific situations. The point was made that there’s really nothing that makes this specific to IoT – it’s just that IoT is a good market to chase right now (few standards, no existing dedicated analytics solutions, etc.). However, I don’t really see why IoT is different from any other analytics situation – it’s just data – but each to their own...

Another interesting point, though – they hope to publish open dashboard templates. This could be interesting: for a long time there has been a lack of standardisation in the data world, and we all end up building the same dashboards again and again. This sort of idea definitely has mileage.


Monitoring and reporting on news. Text analytics and intelligent categorisation using data from, e.g., Wikipedia. A team of 25 with 15 clients already, and a sales team that has only just got off the ground. Using the currently uber-cool language Clojure.


The world’s most accurate speech recognition software, claimed to be 25% more accurate than the other leaders. Interesting applications beyond basic speech recognition, such as automated language tests. It’s self-learning, so it can pick up new languages quickly. They’re going to have a battle to beat the big boys, but the offline option for handsets is certainly interesting.


By far and away the most impressive pitch. A model where unstructured data is structured into events and then analysed; used by the defence, finance and airline industries. Flexible. The model identified a significant event in Ukraine a week before any of the Western news agencies started reporting it. There are potentially exciting applications in the area of aid distribution in disaster zones. Apparently the first tech company to be spun out of Cambridge University. These guys are recruiting rather than looking for funding. The technology is Kafka (cool) and Storm (err, really? Someone is using it then...).

Looking forward to more pitch events – they’re very interesting!

Maybe we should do one at Pentaho London Usergroup?