#Pentaho Meetups Galore..

So, a few weeks ago a new meetup came on the scene, and it was briefly a bit confusing. This seemed to be an official Pentaho meetup group in London. OK, err, that’s odd, I thought. However, a bit of digging and it soon became clear this is a different kettle of fish.

Now, it just so happened we had actually scheduled (albeit not published) the next PLUG in January, and as it turned out they moved Challenge big data so as not to clash with #PCM15.  So partly this was my fault for not promoting the January booking.

Anyway the big question is – can these two groups co-exist?  Well yes, I think so. Should they have both been done under the “PLUG” banner? Probably yes.  I think if you’re looking for a meetup group then having two in the same area will be confusing.  However as long as we’re clear with content, target audience and dates I don’t see a problem here.  And that’s the key thing – the target audience for PLUG and challenge big data is different. Sure there may be some cannibalisation, but let’s see.

I shall be at the challenge event, promoting PLUG 🙂 And I fully expect that cross promotion to go both ways. I’m not sure I’ll attend every challenge event – I guess I’ll be a content tart!

In the meantime, we’re still after content for the next PLUG, which now moves into February. Skillsmatter have shown in the past that the sooner content is locked down, the more signups there are and ultimately the more people attend – so it isn’t crazy to be asking for talks now. If anyone has either content requests, or a talk they wish to do, then please let me know.


Oh, and by the way don’t forget to support the first Kickstarter project in #Pentaho land – Saiku Reporting – this is a very worthy project! Details here

Skillsmatter are soon going to be crowdfunding themselves – they’re after £1/2 million to further the business. Given their stunning new venue and their frankly unique business model I’m pretty sure this is going to be a great success for them.  Details on Crowdcube

10th Spark London Meetup

On Monday I attended the 10th Spark London Meetup. This was an impressively well-attended meetup: 90 people came, virtually everyone who registered turned up, and there were 170 people on the waiting list!  One day PLUG will be like that!

Anyway the presentation was some hardcore Spark stuff from Chris Fregly, who has had one impressive career in this tech.  Here are the key points I noted down:

  • A lot of talk about off-heap coding in Java.  This is a very interesting topic – I was only vaguely aware this was possible.  A lot of it seems to live in “unsafe” code areas – sounds exciting!
  • There are some interesting numbers from the Daytona GraySort challenge (100TB).  Hadoop sorted at a rate of 1.4TB/min with 2,100 nodes, and Spark sorted at 4.27TB/min with only 206 nodes. Wow.
  • Spark also scaled linearly to a 1PB sort
  • Bizarrely they did this on OpenJDK.  I’m amazed they did that, would love to understand why.
  • There are some cool features (epoll?) that allow direct NIC -> DISK transfers without using CPU user time. Nice.
  • We saw some details of some clever optimisations in the Spark shuffle process. Not dissimilar to the pipelining feature with partitioning in PDI
  • There was talk about mappers and reducers. What? I need to know more about that 🙂 Didn’t realise they still existed in a Spark world.
  • LZF is now used in preference to Snappy. LZF lets you process data on the CPU while it is STILL compressed – just as Actian manages to do. Clever stuff.
  • Tuning is extremely painful. There’s only really a try and see approach.
  • An interesting matrix multiplication optimisation via transposing.  Amusingly once you’ve done that, it’s not always any faster 🙂 Nice theory though
  • IBM are forming a team of 3500 engineers (this is old news)
    • One has to wonder why IBM doesn’t just buy Databricks?
  • CPU is often the bottleneck. Disk/Network upgrades often only improve performance by small margins.
  • 100+ Spark SQL functions have been converted to Janino for improved performance and flexibility
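
To put those GraySort numbers in perspective, here is a quick back-of-envelope calculation of throughput per node. It is only a sketch using the figures quoted in the notes above; the names are mine, and I am assuming decimal units (1TB = 1,000GB).

```python
# Figures as quoted in the notes above (100TB Daytona GraySort).
runs = {
    "Hadoop": {"tb_per_min": 1.4, "nodes": 2100},
    "Spark": {"tb_per_min": 4.27, "nodes": 206},
}


def gb_per_min_per_node(tb_per_min, nodes):
    """Cluster sort rate divided evenly across nodes, in GB/min (1TB = 1,000GB assumed)."""
    return tb_per_min * 1000 / nodes


per_node = {name: gb_per_min_per_node(**run) for name, run in runs.items()}
ratio = per_node["Spark"] / per_node["Hadoop"]

for name, rate in per_node.items():
    print(f"{name}: {rate:.2f} GB/min per node")
print(f"Spark moved about {ratio:.0f}x more data per node")
```

So roughly 0.67GB/min per node for Hadoop against about 20.7GB/min per node for Spark, around 31 times more per node. That ignores any differences in the hardware on each node, so treat it as a rough comparison only.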

And then we continued on the 2nd talk…

  • Noted the demo had 2GB PermSize defined, blimey
  • Demo didn’t work
  • Supposedly “simple” demo had about 15 different significant working parts – bonkers!
  • Biggest Spark cluster is Tencent, possibly up to 14k nodes now.
  • Use DataFrames, not RDDs
  • Question about Parquet vs Avro – slightly odd given they address different needs.
    • Apparently once you go beyond 22 columns in Parquet, performance degrades.
    • It used to have a hard 22 column limit – so presumably removing that limit is a hack
    • Scan speed benchmarking shown – but again, this isn’t the point of Parquet so not sure where this was going.
  • Everyone uses Kafka. EVERYONE!
  • HyperLogLog is an interesting approximate distinct count algorithm
  • Interesting debate about how you do cold start problems
  • Zeppelin used again for demo
  • Job management seems to be a pain in Spark.
  • There’s a hidden REST API
  • Cunning matrix factorization example
  • Boil the frog!
  • Flink – if Flink does take over, then for the IBM crew it won’t be a case of trying to beat it, more of joining them.
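
For the curious, the HyperLogLog idea mentioned in the notes above can be sketched in a few lines. This is a minimal illustration and not what Spark actually ships: the class name, the use of SHA-1 as the hash and the 12-bit register index are all my own assumptions. Each item is hashed, the top bits pick a register, and the register keeps the longest run of leading zeros seen in the remaining bits. A harmonic mean over the registers gives the estimate, with the usual linear counting correction for small counts.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative approximate distinct counter (hypothetical class, not Spark's)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to choose a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        idx = h >> (64 - self.p)                       # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        # Rank = 1-based position of the first 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)             # small-range (linear counting) correction
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50000):
        hll.add(f"user-{i}")
    print(f"estimated distinct count: {hll.estimate():.0f}")
```

With 4,096 registers the expected error is around 1.6%, and the memory used stays fixed however many items you feed in. Adding the same item twice never changes the registers, which is what makes it a distinct count.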

Tonight I’m off to hear about Spark at the Hadoop London usergroup – I’m sure it’ll be similarly busy and the content looks just as interesting.

How to sort data faster in PDI

As anyone who follows my previous blogs will know, I’ve been involved in a project recently involving billions of rows, where PDI is processing over a trillion records step to step.

(Massively simplified) Part of that includes one major sort.  And for various reasons it’s done in PDI, hey ho.

So, how do I make sort faster?

Simple: Reduce the heap, use more temp files on disk.

Errr, what?!  Typo?  Increase, surely?

Nope. Here’s my setup:
Scenario 1: Sort a bunch of data.  Set heap to 48GB, and monitor tmp disk space. PDI uses 77GB of temp space, and it takes 8 hours.
Scenario 2: Look at the above and think, ooh, don’t use tmp space, give PDI 77+48GB of heap. Surely it’ll be faster?  Sort in memory, no-brainer.  EVERYONE is talking in-memory these days.  (And has been for the last 5 years.)  Err, no: 19 hours. OUCH.
The reason is the enormous cost of garbage collection in the second scenario.  (And that’s with the concurrent garbage collector too!)  On a 32-CPU box I see stretches of hours where only 1 CPU is being used.  Then suddenly it goes crazy, and then stops again.
Perhaps the different code path PDI uses when using tmp files to sort results in more efficient object usage?
Now; Our disk is SSD so in scenario 1 the impact of using tmp files is not as bad as it would normally be.  I had pondered on setting up a ~77gb ramdisk but I’m guessing any improvements would be very minor.  (I hardly ever see utilisation go up on the SSD itself)
Java 8 has some VM enhancements specifically around sorting – I wonder what it would take for PDI to start using those features?  That’s assuming support for Java 8 is added at all!
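
For anyone wondering what “sort with temp files” looks like in principle, here is a minimal sketch of an external merge sort. To be clear, this is not PDI’s code (PDI is Java, and I am not reproducing its sort step here); the function name `external_sort` and the `chunk_size` parameter are made up for illustration, and rows are assumed to be plain strings with no newlines in them. The idea is to sort small chunks in memory, spill each one to a temp file, then merge the sorted files so that only one row per file needs to be in memory at a time.

```python
import heapq
import os
import tempfile


def external_sort(rows, chunk_size):
    """Yield rows in sorted order, holding at most chunk_size rows in memory per run."""
    run_paths = []
    chunk = []

    def spill():
        # Sort the in-memory chunk and write it out as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(row + "\n" for row in chunk)
        run_paths.append(path)
        chunk.clear()

    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    files = [open(path, encoding="utf-8") for path in run_paths]
    try:
        # heapq.merge streams the runs lazily, reading a line at a time from each file.
        for line in heapq.merge(*files, key=lambda s: s.rstrip("\n")):
            yield line.rstrip("\n")
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)


if __name__ == "__main__":
    data = ["pear", "apple", "fig", "kiwi", "banana", "cherry", "date"]
    print(list(external_sort(data, chunk_size=3)))
```

The relevant point for the GC story above is that each chunk becomes garbage as soon as it is spilled, so the heap only ever holds a small set of short-lived objects. Whether that is the actual reason scenario 1 wins is still my speculation.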
Happy Friday!

A note on garbage collection with Pentaho Data Integration / PDI / Kettle

Garbage Collection

PDI is a Java application (duh..) and this means memory within the Java virtual machine is freed up using a process called garbage collection.

Now, garbage collection is a wildly complicated thing and many people say that GC tuning is a black art. Whilst I wouldn’t say that, I would say that you should always start with the defaults and work from there – if changing something has no effect, then give it up.  One thing that is true of garbage collection is that the nature of the app (and the way it has been coded) has a significant impact.  So don’t think that because you’ve tuned a Tomcat app before, that knowledge will apply to PDI – it won’t!

Why should I do this?

Well if you have no issues with PDI then you should not.  Simple!  But if you’re working with masses of data and large heap sizes then this will bite you at some point.

And before you come to tuning the GC, have you tried a bigger heap size first (the -Xmx setting)? This is not necessarily a panacea, but if you’re doing a lot of sorting, or simply have a lot of steps and hence a lot of hops, then you will need proportionately more memory.

How do I know there is a problem?

This is easy – if you have a process that seems to hang for significant amounts of time, then fly along, then pause again etc, this is most likely GC.  If your process trundles through at a consistent throughput then you’re probably doing ok.

What settings?

Well, despite my saying above that you should tune for your application, there is one setting which you should most definitely apply for PDI, and that’s the concurrent collector.  Having just googled this to find a reference to it, I’ve realised there is both a concurrent collector and a parallel collector, and hence I now need to go to another PC to check which it is I use.

<short break insert piped jazz music here>

OK, found it:

-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log

OK – so seems I need to do some research on the parallel collector then, has anyone used that?

Either way, there are 2 things in the options above:

  1. The instruction to the VM to use the new collector ( UseConcMarkSweepGC )
  2. Everything else to configure logging – note the configuration of the log file.

These settings need to be put somewhere where PDI picks them up every time, i.e. in the environment in $PENTAHO_DI_JAVA_OPTIONS, or actually in the spoon/carte/pan/kitchen scripts.

It is important to enable GC logging so you can see whether or not you do have a GC problem. Generally if you have full GC collections of more than a few seconds you may have a problem. And if you see full GC taking minutes or hours then you definitely have an issue!  The other options I use relate to the logging – they’re pretty self-explanatory, and Google/Stack Overflow will give further detail.

And that’s it – More later in the week on the topic of big data with PDI.

Working with Big (lots) Data and Pentaho – Extreme Performance

OK, firstly, I’m not talking proper BigData here.  This is not Hadoop, or even an analytical database.  (Let’s not get into whether an analytical database counts as bigdata though!) And it’s certainly not NoSQL.  In disk space terms we’re looking at hundreds of gigabytes, not terabytes.  Yet this project involves more data than the Hadoop projects I’ve done.

So tens of billions of records. Records that must be processed in a limited environment in extremely tight time windows.  And yes; I’m storing all of that in MySQL!

Hey, wake up, yes, I did say billions of records in MySQL, try not to lose consciousness again…  (It’s not the first time I’ve had billions of rows in MySQL either – Yet I know some of you will guffaw at the idea)

In fact, in this project we are moving away from a database cluster, to a single box. The database cluster has 64 nodes and 4TB of RAM.  Our single box has 500GB RAM and that was hard fought for after we proved it wasn’t going to work with the initial 64GB!  Impossible? Yup, that’s what I thought.  But we did it anyway.

Oh, and just for a laugh, why don’t we make the whole thing metadata driven and fully configurable so you never even know which fields will be in a stream. Sure, let’s do that too.  No one said this was easy..

Now, how on earth have I managed that?  Well firstly this was done with an enormous amount of testing, tuning and general graft.  You cannot do something like this without committing a significant amount of time and effort.  And it is vital to POC all the way. Prove the concept basically works before you go too far down the tuning route, as tuning is extremely expensive.  Anyway we built a solution that works very well for us – your mileage may vary.

I do accept that this is very much at the edge of sanity…

So what did we learn?  How did we do this?  Some of this is obvious standard stuff. But there are some golden nuggets in here too.

  1. Disk usage other than at the start or end of the process is the enemy.  Avoid shared infrastructure too.
  2. Sorting (which ultimately ends up using disk) is evil. Think extremely hard about what is sorted where.
  3. Minimise the work you do in MySQL. Tune the living daylights out of any work done there.
  4. MyISAM all the way
  5. NO INDEXES on large tables. Truncate and reload.
  6. RAM is not always your friend. You can have too much.
  7. Fastest CPUs you can find (caveats still apply: check specs very carefully, Intel do some weird things)
  8. Partitioning utterly rocks.
  9. Test with FULL production loads or more. PDI/Java doesn’t scale how you might expect (primarily due to garbage collection); in fact it’s impossible to predict.  This is not a criticism, it is just how it is.
  10. In fact, PDI Rocks.
  11. Performance tune every single component separately. Because when you put it all together it’s very hard to work out where the bottlenecks are.  So if you start off knowing they ALL perform you’re already ahead of the game.
  12. Use munin or some other tool that automates performance stat gathering and visualisation.  But not exclusively. Also use top/iostat/sar/vmstat.  Obviously use Linux.
  13. What works on one box may not work on another. So if you’re planning on getting a new box then do it sooner rather than later.
  14. Be prepared to ignore emails from your sysadmin about stratospheric load averages <grin>

I plan to follow this up with a further blog going into details of sorting in PDI in the next few days – complete with UDJC code samples. This post is necessary to set the scene and whet appetites!

Looking forward to hearing other similar war stories.

Of course if anyone wants to know more then talk to me at the next Pentaho London Meetup

Pentaho London Usergroup July 22 2014

Hi, we’re way ahead of the game with the usergroup this quarter.  You can find details of the agenda and registration links here:


With a distinctly big data theme it’s bound to be a busy one. Beers will be sponsored as usual, and with an earlier start we’ll have more time down the pub too.  I do genuinely believe the networking side of this event is just as important as the content itself.

Also curious to see what Tom has to show us with Apache OODT – An interesting project currently rarely seen in the BI world.  You saw it here first!

It’s also Diddy’s (Diethard Steiner) first time up, which is amazing really; he’s such a pillar of the blog community he deserves a place. But let’s be sure to heckle him too!

Please do share and promote the event!  Any questions, feedback or comments welcome!