Hadoop UK meetup Feb 2016

So, Tuesday 23rd Feb was the first #HUGUK meetup of 2016. As is generally the case, all 200 tickets sold out almost immediately. I managed to get in quickly and headed down along with fellow Pentaho community legend Diethard.

The agenda was particularly interesting this time round. Here’s my summary of the talks – remember this is only my view, and everyone has a bias one way or another. I would love to hear what others thought though.

A quick initial conclusion though: 3 of the 4 talks (plus the sponsor) pitched their current recruitment drive. The market is HOT!

Home Office

First up were a couple of chaps from the Home Office. With 30k staff they’re not the biggest org, but neither are they tiny. With a quarter of a billion border crossings a year, the data is certainly in “big data” territory. They’re trying something new though: moving away from the silo approach and trying to model the entire customer journey in a clever fashion.

Interestingly, they’ve chosen Hortonworks when other UK government departments have gone with Cloudera. Joined-up thinking here? Hmm..

They face the same challenges as everyone else, and some of their data is highly sensitive; the term “threat to life” was used. Their budget is being cut by 30% over the next 5 years too. They’ve been on the same journey as everyone else: trying outsourcing, failing, and bringing it all in-house again. Their underlying hardware seemed pretty high spec, but it was unclear what BI tools they’re using at the front end.

OpenStreetMap

Steve Knox rather bravely presented his conclusion that big data wasn’t the answer to the scaling issues that OpenStreetMap currently has. The current database is 500GB-1TB ish (eh, how can they not know?) and growing by 1TB a year. All this on a box with only 8 cores!

A couple of interesting points – one challenge is how to index GIS data efficiently. An interesting challenge, but it is based upon an assumption that you need to index at all. Perhaps there is a better way?
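For anyone wondering what “indexing GIS data” looks like in practice, here is a toy sketch of one of the simplest approaches: bucketing points into a fixed grid of latitude/longitude cells. To be clear, this is purely my own illustration and not what OpenStreetMap runs or what was shown in the talk; the `GridIndex` class, its method names and the cell size are all made up for the example.

```python
import math
from collections import defaultdict


class GridIndex:
    """Toy spatial index: buckets points into fixed-size lat/lon cells.

    Hypothetical illustration only; not OpenStreetMap's actual indexing.
    """

    def __init__(self, cell_size_deg=0.01):
        self.cell_size = cell_size_deg
        self.cells = defaultdict(list)  # (row, col) -> [(id, lat, lon), ...]

    def _cell(self, lat, lon):
        # Map a coordinate to the integer grid cell that contains it.
        return (math.floor(lat / self.cell_size),
                math.floor(lon / self.cell_size))

    def insert(self, point_id, lat, lon):
        self.cells[self._cell(lat, lon)].append((point_id, lat, lon))

    def query_bbox(self, min_lat, min_lon, max_lat, max_lon):
        """Return ids of points inside the bounding box (edges inclusive)."""
        lo_row, lo_col = self._cell(min_lat, min_lon)
        hi_row, hi_col = self._cell(max_lat, max_lon)
        hits = []
        # Only visit the cells the box overlaps, then filter exactly.
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                for pid, lat, lon in self.cells.get((row, col), []):
                    if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                        hits.append(pid)
        return hits


index = GridIndex(cell_size_deg=0.01)
index.insert("london_eye", 51.5033, -0.1196)
index.insert("big_ben", 51.5007, -0.1246)
index.insert("edinburgh_castle", 55.9486, -3.1999)
print(sorted(index.query_bbox(51.49, -0.13, 51.51, -0.11)))
```

The trade-off is the usual one: a lookup only touches the handful of cells the box overlaps rather than every point, but you pay to build and maintain the structure, and a fixed grid copes badly with data that is dense in cities and sparse everywhere else.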

The benchmarks were mixed. On like-for-like hardware, some of the Hadoop kit seemed slower, but the Hadoop solution scaled MUCH better as the data size increased, as expected.

Finally, a lot of the issues mentioned with MapReduce/YARN (debugging, jobs failing, re-compiling Java programs) are ALL addressed with Pentaho MapReduce. #JustSaying

(Err, and anyway, why look at MapReduce in this day and age?)

The conclusions regarding Hadoop were pretty conservative too – a surprisingly corporate opinion for an organisation that isn’t a corporation! But making the most of the skills you already have has some merit.

IBM Watson

This was a great talk by Alexandre about IBM Watson. Although – I’d like to see Watson give the talk next time :o)

He talked about 3 eras of computing, with our current state being the 2nd era.

Watson has an open API so you can give it a go, which sounds pretty interesting. There was a video of a demo, which is always a negative; a live demo would have been far more impressive, but hey, maybe it’s not quite there yet.

Great audience questions, one sounded almost staged :o)

There is an excellent partnership economy with Watson: they take a revenue share of your product, and have a VC fund too (who doesn’t!?). This is pretty impressive – IBM have really turned themselves around these days and are in my book an extremely exciting company to watch. Continuing with the partnership stuff, your data remains yours, so if Watson learns some cool stuff from your questions, that “knowledge” remains yours.

Privitar

Finally there was Privitar – this was also a very interesting presentation, talking about how to find better ways of anonymising data to encourage more sharing. They really sound like they should be heavily involved in the open data scene and government, but actually their current clients are banks and finance. Maybe I misunderstood the presentation somewhat here, but they made an excellent point about the crucial difference between privacy and security. They also mentioned some funny security breaches caused by data not being sufficiently blurred. The final great point was the concept of knowing the degree of “data loss” that the blurring has incurred on a data set.
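To make that last point a bit more concrete, here is a minimal sketch of what “blurring” a column and then measuring the loss might look like. This is my own toy illustration and NOT Privitar’s method: I’m assuming blurring means generalising a number into a fixed-width band, and I’m assuming “data loss” can be approximated by how far each original value sits from the midpoint of its band. The function names `blur_to_band` and `mean_blur_error` are invented for the example.

```python
def blur_to_band(value, width):
    """Generalise a number to the (low, high) band of the given width containing it.

    The band is low-inclusive, high-exclusive, e.g. 23 with width 10 -> (20, 30).
    """
    low = (value // width) * width
    return (low, low + width)


def mean_blur_error(values, width):
    """Average distance between each original value and its band midpoint.

    A crude, assumed stand-in for "data loss": wider bands give a larger error.
    """
    total = 0.0
    for v in values:
        low, high = blur_to_band(v, width)
        total += abs(v - (low + high) / 2)
    return total / len(values)


ages = [23, 37, 41, 58]
for width in (5, 10, 20):
    bands = [blur_to_band(a, width) for a in ages]
    print(width, bands, mean_blur_error(ages, width))
```

The wider the band, the harder it is to pick out an individual, but the bigger the error figure gets, and having that number to hand is what lets you argue about whether the blurred data set is still useful.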


Conclusions? Well, I think I may have been to 4 or 5 of the Hadoop meetups now, and this was the best! Good talks, great organisation and hosting, impressive stuff.

Anyone interested in Pentaho and BigData/Hadoop would do well to join our meetup!

[Pentaho London Usergroup] Q1 News Roundup

Hi everyone,
With the next PLUG just days away I thought I’d send out a reminder as well as some exciting news that is bouncing around the Pentaho world right now.
#Pentaho News Roundup
Gartner have now formally recognised Pentaho and placed it slap bang where it belongs in the right quadrant. Press release here:
#HDS Appliance
These are probably the first fruits of the new HDS ownership of Pentaho – an out-of-the-box appliance. I’d love to see the details of how this works! It’s impressive though, and a first in the OSBI world. Details on The Register
Challenge BigData
A couple of weeks ago Pentaho held a meetup called Challenge Big Data, and that we did. As is often the case with meetups, the really interesting stories came from meeting people from machine learning backgrounds, data scientists and, perhaps most interestingly, some folk from the insurance industry.
So, the next PLUG is Wednesday, you can register here. Pedro Alves will be giving an overview of the imminent 6.1 release, and Nelson will be talking about our favorite topic of all! Please also register on the Skills Matter site; it will make it quicker to get in, and therefore quicker to get to the beer.
Additionally the next meetup is penciled in for 11th May. Please register on meetup.com and if you’d like to talk then let me know! There are a few candidates for talks which look pretty interesting. Kafka? Saiku? Let’s see..
We’ve still not had a vote on PCM16 location yet because so far we only have one candidate!  There’s still time to make a proposal though.  I’m hoping for somewhere a touch sunnier than London managed!
Diethard Steiner
So, as usual Diethard is doing his thing, and this time we’re tackling the theme of best practices with a focus on database schema management. Whilst there are plenty of tools out there, both commercial and open source, Diethard has produced a nice simple one in PDI. Details here
So, do you have a Pentaho opportunity available? Want me to list it here? Sure, why not. No agencies of course!