Upcoming Pentaho events – Summer 2017

As we head into the crazy event season for the Pentaho community, there won't be another PLUG (Pentaho London Usergroup) meetup until around December.

So, keep an eye on social media and your inboxes for the latest news on when and where PCM17 will be. Hint: it'll be November again.

Also – don't forget the official PentahoWorld conference is on again this year in Orlando – that's one not to miss. Find the details on the Pentaho website.

Finally – Mark Hall, creator of Weka, is in town in early June, and there's a meetup with him where you can find out about "The future of machine learning":


(Think Cyberdyne...)

If anyone wants to talk in December then put your hands up and let me know; otherwise have a great summer. In a similar vein – if you have any feedback about the group, content, location or timings, send that too.


#Serverless #AWS PDI

Hmm, what what?  Serverless PDI?

Yes – serverless is *the* thing at the moment, partly driven by amazing advances in the devops space. Fundamentally, we've all had enough of managing servers, patching and so on. You know the story.

“Run code not computers”

Why do this? Simple: integration. If you need to hook up two APIs from separate systems, it's actually pretty expensive to have a server sitting there running 24×7. What we want is to literally pay for the time we use and nothing more – and we don't want to have to start up and shut down a whole server either!

Why Pentaho? The single most important argument is visual programming. It's faster to get started with PDI than with a scripted solution, it's more maintainable, and it lets you capitalise on general ETL skills (experience with any ETL tool is enough to work with PDI). PDI has also done the boring input/output/API stuff, so all you need to focus on is your business logic. Simple!

So, how to do this? Well, AWS Lambda is where to start. I assume Google Cloud has a similar service, but I've already got stuff running in AWS so this was a no-brainer.

The stats sound good. Upload your app and you only pay for run time; everything else is handled. There's even API Gateway, so you can trigger your 'functions' over HTTP. And finally – my favourite automation service, Skeddly, can also trigger AWS Lambda functions. Great!

There is one issue. The jar has to be less than 100MB. What! PDI is 1GB – how can that possibly make sense? Sure enough, some googling shows lots of other people trying to use PDI in Lambda and finding this limit far too low.

But: Matt Casters pointed out to me that the Kettle engine is only 6MB. What? Really? I took a look – and sure enough, with a few dependencies thrown in you can build a PDI engine archive that comes in at only 22MB. We're on.

To start, read these two pages:

Java programming model

Java packaging with Maven


  1. Create a pom.xml pulling in the Kettle engine and the AWS Lambda Java dependencies.
  2. Add in your example Java code (a minimal handler sketch follows this list).
  3. Build the jar (mvn package).
  4. Remove any signed files: cd target; zip -d <file>.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
  5. Upload it as a Lambda function.
  6. Set an environment variable KETTLE_HOME=/tmp (if you don't, PDI will crash, as the default home directory in Lambda isn't writable).
  7. TEST!
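For step 2, here's a minimal sketch of what the handler might look like. To be clear, this is my illustration rather than the exact code behind the screenshot below: the class name and the bundled /sample.ktr path are hypothetical, and it assumes the pom.xml pulls in the kettle-engine and aws-lambda-java-core artifacts.

```java
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class PdiLambdaHandler implements RequestHandler<Map<String, Object>, String> {

  @Override
  public String handleRequest(Map<String, Object> input, Context context) {
    try {
      // Belt and braces alongside step 6: /tmp is the only writable path in
      // Lambda, and Kettle also picks KETTLE_HOME up as a system property.
      System.setProperty("KETTLE_HOME", "/tmp");

      // Boot the (tiny!) Kettle engine.
      KettleEnvironment.init();

      // Load a transformation bundled inside the jar (hypothetical file name).
      TransMeta transMeta = new TransMeta(
          getClass().getResourceAsStream("/sample.ktr"), null, false, null, null);

      Trans trans = new Trans(transMeta);
      trans.execute(null); // no command-line arguments
      trans.waitUntilFinished();

      if (trans.getErrors() > 0) {
        throw new RuntimeException("Transformation finished with errors");
      }
      return "Done"; // hence "provide better output than Done" on the to-do list!
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```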

And here’s the proof:

[Screenshot: the Lambda test run completing successfully, 25 April 2017]

It's slightly disconcerting that it took 5.7s to run – on my laptop the same transformation executed in 0.5s. I guess the Lambda physical boxes are busy and low-spec!

What’s next?

  1. Find a better way to package the ktr
  2. Hook the input file into PDI parameters (see the sketch after this list)
  3. Provide better output than "Done"!
  4. Set up the API Gateway trigger
  5. Schedule via Skeddly
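On point 2, Kettle's named-parameter API should make this fairly painless. Again a hedged sketch, not working code from this project – it assumes the transformation declares matching named parameters, and it slots into the handler shown earlier:

```java
// Inside handleRequest, after new Trans(transMeta) but before execute():
// forward each entry of the Lambda input map to a PDI named parameter.
// (setParameterValue throws a checked exception, so keep this inside the
// handler's existing try block.)
for (Map.Entry<String, Object> entry : input.entrySet()) {
  trans.setParameterValue(entry.getKey(), String.valueOf(entry.getValue()));
}
trans.activateParameters();
```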

I will be releasing all the code for this soon – in the meantime, if anyone is particularly interested in this right now, please do contact me. I think it's a very interesting area, and this simple integration opens up a vast amount of power.

#DevOpsManc Feb 2017

This evening I attended a DevOps meetup in Manchester. Why? Well, interesting story, but primarily because:

  • I was in Manchester anyway
  • Tom was presenting his NASA stuff
  • I'm working on a project with heavy DevOps requirements, so I dragged some of the geeks down with me!

So how did it work? Well, they're lucky – they seem to have some great sponsors, and the venue was brilliant.

First up was Tom, showing off what they get up to catching organised criminals at NASA (the UK police are interested in the Memex programme). Not just that – there's a huge genomics project too. They're trying to improve overall standards, and for them it's vital they can test and deploy in different environments around the world seamlessly.

(He sadly pointed out that not being a US citizen means he’s not allowed near any cool space tech!)

There's an I/O-based monster with 3,000 cores called Wrangler too – and they have very few privileges on it, so it's all Ansible-based there.

Good points about the pros/cons of containers (issues with patching etc.), and of course a quick mandatory demo of Juju. One key thing about Juju that I hadn't appreciated: you can save your "project" as a bundle and then do a one-line install of that bundle. How cool is that!

Next up was Matt Skelton – slides here. This was all about designing teams for software development, with a view on how DevOps fits in.

He talked about business agility – and how quite a lot of people have never experienced working in a high-performing team. I agree with this – once you've experienced it, it's quite a thing, and you don't want to go back!

There are some fundamental rules: teams work best at 6-9 people, and a lot of this stuff is only applicable in a large enterprise environment. Most important is that the team must be stable (albeit slowly changing), not gathered and thrown away for every project (a model I've seen previously).

Anyway another key point is this:

  • Organisation architecture always wins out

What does this mean? It's been shown that software architecture always ends up mirroring the organisational setup (Conway's law). That's crazy! But when you think about it, we've all seen it, right? So fix the TEAM FIRST before doing your software architecture.

Finally he finished with a point that generated much discussion: cognitive load. Essentially you need to ensure the team is not overloaded. This is best gauged by simply asking them whether they're confident to manage the running of their current project(s) (e.g. do they have sufficient knowledge and time to deal with a P1 incident?). This all matters because stress impacts a team's ability to perform. It seems to me this is very close to "velocity". Discussion ongoing there!

Other comments – well, I think the comment about Google being way above everyone else on the tech front was a bit misleading; it's not all roses over there...

Finally, there was some talk of “the cost of collaboration”.

Many thanks to the organisers, who did a great job, and the sponsors of course! A genuinely interesting group with a good vibe going on (barring the mandatory dissing of London – kudos to the guy brave enough to mention that they also host meetups there!)




Pentaho London Usergroup (#PLUG17)

So, on 19th January we met for our first Pentaho London Usergroup of 2017. We struggled to get an agenda together, but thankfully Diethard and Nelson came up trumps and we had two excellent talks.

Additionally, we decided to revisit our free consulting thing. This was great fun – I think we did a whole night of it about a year ago. The format is simple: bring your problems, issues etc. and ask questions! We'll then propose a solution as a group.

Last time we did this it took a little while to get going, but by the end of the session the questions were pouring in! This time we didn't just suggest a solution, we actually implemented a POC, showing how it would be done, on Nigel's laptop! Keep an eye on Nigel's feed – he'll be reporting progress soon – I've already heard good things.

Even better – this issue had only come to Nigel's attention that very day! So what a great story, and what an AMAZING response time.

So, the talks. Diethard started with an excellent summary of the history of Pentaho, in a quiz-style way. Great fun, and many positive comments afterwards – it's worth seeking out and understanding this story, as it explains a lot about how we got to where we are now! Diethard then presented his slides.


Nelson then proceeded to show off CBF2 – which, compared to the old CBF, looks pretty amazing. An essential tool for anyone working with multiple clients or multiple environments.


The next meetup will be on 4th May – a Star Wars special. Let me know if you have something for the agenda! We'll be broadcasting this one live too – so no swearing!

#PCM16 – 9th community meetup!

So, a few weeks ago was PCM16 – a three-day blur of beer, tech talk, beer and sightseeing in Antwerp, Belgium.

Before I give my view, do make sure you read others' write-ups – there were two streams, so it was not possible to see everything! So, the others I'm aware of:


Happy to add others!  Actually I can’t even find the live blog link right now, but you’ll find it on Know.bi somewhere!

So how was the event? Well, as always, every year has a different theme. This year it was clearly all about the product. And it's not just webSpoon! There has been a slew of product upgrades in the last year, all very interesting.

And now the talks:

Pedro – Pentaho 7

  • Released in time for the event!
  • Analytics anywhere
  • This is a progression along that path
  • Hadoop security improvements (only CDH for now)
  • A single server in technology terms – now the aim is to make the product feel singular to users too.

Tom – DevOps

  • Well we already saw this at PLUG!
  • Mentioned use of the Bigtop distro.
  • How is it that Nagios still looks as shite now as it did when it came out 10+ years ago? 🙂
  • PDI charms are available, with the BA server being worked on (Juju)

Nelson/Miguel – Snapshots in Hadoop

  • Definition discussion on “in” and “out” labels.
  • The approach only works if you can identify your “last” event in a chain – Unfortunately that’s not always possible.

Duarte – Viz API

  • v3 of the API is nearly done
  • Lots of very interesting tech details – but the gist is that it will become easier to just "plug in" your own visualisations
  • Not clear how many people *do* build their own visualisations though!
  • Probably in the platform in 7.1, with Analyzer switching shortly after
  • Good to see improvement and hard work going into the visualisation part of the product

Joao – CBF2

  • Now a shell script (hmm!)
  • The usual benefits of CBF – quick switching etc.
  • Deploys all your code – not sure if the Juju charm does this?
  • Works with CE and EE

Jens – PDI7

  • Metadata injection!
  • 11 more steps to be supported in 7.1
  • The Filter Rows step uses the same XML from the KTR to define the condition to inject. A cunning workaround to a tricky problem!
  • 3 phases of metadata injection
    • Simple, bog-standard example
    • Data flow
    • 2-phase
  • :8080/pentaho/kettle/listServices
  • JSON performance has improved again – you don't have to pass the massive blob of JSON on to the next step
  • COPY AND PASTE BUG FIXED!!!!!! This got a huge cheer
  • Labs – the team are working with Hitachi labs too. Exciting times

Matt Casters

  • In just the 2-3 weeks since unit testing was shown at PLUG, it's moved on massively and is now fully usable!
  • A 300+ strong team/community using PDI at Hitachi
  • PDI is spreading wildly.

Hiromu Mota – webSpoon

  • Hurrah – web-based Spoon is here
  • An amazing architectural solution – essentially just run Spoon on the server, then you don't have to maintain another UI

Julien + gang

  • Showed a handy SSO solution for embedded analytics.

Wael – data science!

  • Same as PLUG
  • Everything we do as ETL guys will be for the data scientists in the future!

Bart / @rvanbruggen

  • A great talk showcasing Neo4j, and beer

#BigDataWeek London

So, on Thursday I attended BigDataWeek in London, just down from Liverpool St. station. This was the conference we tied this week's Pentaho London Usergroup up with, so we shared some attention and advertising. Cunning, eh?

Anyway, here follows my brain dump / brief summary / key points from the tech stream. Yes – tech stream only, and I'll not apologise for that!

So – A quick brain dump:


  • Very clever data driven marketing
  • Amazing insights
  • Good numbers of people using it – Surprised.
  • Integration with FB advertising, with Twitter integration on the way. This enables them to build a "discussion" with a brand.


Worldpay

  • A very familiar secure data lake story – Hortonworks this time.
  • Use your credit card 3 times, and Worldpay will by then have your details. ~47% of all UK transactions.
  • 36M transactions/day
  • Explicit policy to exploit open source
  • Team of 30
  • 18 months, in production
  • Ability to burst to cloud (double tokenisation). Great idea!
  • Disks never leave the data centre. Security obsessed!
  • The aim is 70% permanent staff, 10% consultancy and 20% contractors. Not at those levels yet
  • Lots of history from legacy systems still to load
  • 56 nodes – 20 cores, 12 × 4TB disks, 256GB RAM, etc.


  • This was interesting – looking at graphs for recommendations. It's the opposite of a recent project I worked on 🙂
  • They look at expected degrees of separation between businesses.
  • An insane number of paths – it becomes a massive matrix
  • Some results were surprising and not really believable
  • They didn't seem to know who they would sell this data to. The audience asked twice.
  • Spark, with no graph DB – and using a PageRank algorithm in there somehow too.

Google data platform

  • Looked at the Street View dataset – initially done for a laugh, but then with image recognition came a tonne of new opportunities
  • Ability to do massive crunching like that at scale and low cost
  • Pay only for what you need.
  • BigQuery now supports full SQL – hurrah. Need to try this against Mondrian…
  • Interesting discussion on dealing with the ever-problematic issue of late-arriving data in a streaming system
  • Beam – batch AND stream.
  • Mentioned loads of DI tools, including Talend and various commercial ones.
  • A Spark cluster can be spun up in 90s.
  • Good point about Google's history: they went from publishing papers, to making those services available, and now do both at the same time.

Smart Cities

  • @JonnyVoon
  • A great talk. Sadly he was up against a blockchain talk in the other room, so a lot of people left.
  • It’s time to focus on the present
  • Buzzword bingo gives you the excuse to forget about the why
  • Example of the @seesense_cc (?) smart bike light.
  • There is something called the London Datastore – This seems worth checking out! ***

Bigstep / Ansible

  • Within 2 minutes you can have your services deployed on bare metal (which is faster)
  • Pay per second. The usual as-a-service stuff
  • The usual full app stacks available.
  • Interesting use of the "tree" unix command.
  • Safe scale-down – the ability to evict data from data nodes before shutting them down.
  • Unclear how they handle security – how do they securely wipe that data?
  • Unclear architecture diagrams
  • Lots of good learnings (see their slides)

Skyscanner Data Quality

  • The only one of the many talks at this event on data quality. That is not good!
  • Must define confidence in your numbers
  • Must measure quality
  • Huge costs are associated with bad data quality.
  • github.com/Quartz/bad-data-guide
  • They push all validation failures into their own topic so they can be fixed and re-published


Facebook

  • After lunch the tech room was re-invigorated and raring to go again!
  • Working on the Workplace product, which was originally an internal Facebook tool
  • A scalable metrics platform
  • Bemoaning the lack of open source data integration tools until recently. Seriously? Are Facebook devs not allowed to use Google?
  • Daily/monthly accounting
  • Lots of standardised metrics.
  • Good metrics are not necessarily trivial to understand.
  • They're hiring!
  • As the Facebook ETL framework is not open-sourced, they're unlikely to open-source the metric framework
  • No screenshots. Boo.
  • They move fast – faster even than a startup (the speaker had startup experience)


  • This was very interesting. Lots of legacy systems, lack of visibility of data, and massive political challenges
  • Road space management
  • Still using lots of RDBMS
  • Pressure to publish data
  • Relying on opening up data and letting integrators do the rest. IMHO this is risky – in the past I've seen it not work at all – but times change…
  • 14K traffic sensors at 4Hz. 400M events/day
  • Use a lot of R and have a vibrant community of users
  • Check out their architecture diagrams – very interesting.
  • The problem is a lack of skills, not technology. Workaround: use hackathons!


  • Healthcare in US example
  • Anomaly detection
  • ROI – $22 for every $1 spent. Nice!
  • The Aadhaar ID system (in India) – very similar to the Aegate drug-tracking system (but on the right technology this time!)
  • 60% of the population, 20% reduction in fraud, $50B savings
  • Pretty much a sales talk. No tech.


  • Struggling with many disparate companies, different data systems and different tech, so they created a global corporate model
  • Hortonworks – they want to use tech that's as open as possible
  • So no SAS – R, Spark, Hadoop instead
  • Mentions the job title "Data Journalist" – what is that!!?
  • Also apologises for no tech detail
  • Talked about what level people try to connect at, e.g. 2G, 3G, 4G – but that's weird; surely everyone starts at 4G these days and works down?
  • They have a team to train local people to implement the global model
  • Issues with getting data out of/into some countries mean they can't use cloud

NATS – air traffic control

  • Some low-bandwidth data links from the planes make things interesting
  • Real time, of course
  • 70 data scientists, 700 engineers
  • Future prototypes on MapR, Spark and Bigstep
  • Looking at GPUs
  • 3PB
  • Looking at how to monetise the data
  • Cracked some complicated data – encodings, binary. Common in IoT it seems! Radar

Overall summary of the event

  • Good talks – I'd go again next year
  • Good location
  • Good balance of people – business/tech, even the odd salesperson. And even a balance of women and men.


Finally, here are some links they sent out to some of the presentations/talks – there are lots of goodies within!

#Devops, #JUJU, #BigData and #Pentaho

Yup – that was PLUG (Pentaho London Usergroup) in London last week! And boy, did it deliver.

This quarter we headed to Canonical's head office, which is nicely located just behind the Tate Modern in London. I must admit I didn't know much about what they do, but a bit of research later and I quickly saw they're at the heart of open source, with a truly impressive culture centred on community. Anyway – we're always grateful for a venue, and doubly so when they provide beer!

We started off with a talk from Tom Barber showing how easy and quick it is to get started with Juju. I guess Juju is like Docker on steroids. I particularly liked the way you can simulate your entire cluster (networking and all) on one box (a laptop!). This feature is especially interesting to the tester in me. We also had an interesting discussion about how you'd go about hooking jobs into the existing monitoring API. The slides from Tom's talk are here, and the video should be available soon.

Next up was Luis from UbiquisBI, showing off their statistical analysis of World Cup stats and trying to predict the result part-way through a game. I did like the way (as you have to when the event is past) you could easily simulate replaying from a certain date/point in time. Given the complexity of parsing some of the HTML data from Wikipedia, I did wonder whether they had attempted to use the HTML-parsing adapter that used to be in Apache Calcite – it lets you present a simple Wikipedia page as a SQL table – but unfortunately I cannot find that tool any more, so I guess that's why not!

What was different this time, other than the venue? Well, we had a slight change in the balance of regulars vs new folk, with more new people coming – I put this down to the venue and topics.

More interestingly the vibe was pretty exciting this time. We had loads of positive feedback at the end of the event, both on meetup and face to face.  I think the subjects were good, and we nailed it, well done everyone.  So lets keep that going, our next meetup is likely to be one of our biggest ever, as we’re doing a tie in with the huge #bigdata London event. You can find details here. If anyone has a talk, then let me know and I’ll schedule you in.