Hadoop UK meetup Feb 2016

So, Tuesday 23rd Feb was the first #HUGUK meetup of 2016. As is generally the case, the event sold out all 200 tickets almost immediately. I managed to get in quickly and headed down along with fellow Pentaho community legend Diethard.

The agenda was particularly interesting this time round. Here's my summary of the talks. Remember this is only my view, and everyone has bias one way or another, so I'd love to hear what others thought.

A quick initial conclusion though: three of the four talks (plus the sponsor) all pitched their current recruitment drives. The market is HOT!

Home Office

First up were a couple of chaps from the Home Office. With 30k staff they're not the biggest organisation, but neither are they tiny, and with a quarter of a billion border crossings a year the data is certainly in "big data" territory. They're trying something new though: moving away from the silo approach and attempting to model the entire customer journey in a clever fashion.

Interestingly, they've chosen Hortonworks when other UK government departments have gone with Cloudera. Joined-up thinking here? Hmm...

They face the same challenges as everyone else, and some of their data is highly sensitive; the term "threat to life" was used. Their budget is being cut by 30% over the next five years too. They've been on the same journey as everyone else: trying outsourcing, failing, and bringing it all in-house again. Their underlying hardware seemed pretty high spec, although it was unclear what BI tools they're using at the front end.

OpenStreetMap

Steve Knox rather bravely presented his conclusion that big data wasn't the answer to the scaling issues OpenStreetMap currently has. The current database is somewhere between 500 GB and 1 TB (eh, how can they not know?) and growing by 1 TB a year. All this on a box with only 8 cores!

A couple of interesting points came up. One challenge is how to efficiently index GIS data. It's an interesting problem, but it rests on the assumption that you need to index at all. Perhaps there is a better way?
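For what it's worth, one common trick for making GIS data indexable with ordinary tools is geohashing: interleave the bisection bits of longitude and latitude so that nearby points share a string prefix, and a plain sorted index then gives you cheap spatial buckets. A minimal sketch of the standard geohash algorithm (my own illustration, nothing OSM-specific):

```python
# Minimal geohash encoder (standard algorithm, pure stdlib).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Interleave longitude/latitude bisection bits, 5 bits per base32 char.
    Nearby points share a common prefix, so an ordinary B-tree index on
    the resulting string doubles as a crude spatial index."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, ch, bit, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = ch * 2 + 1   # point is in the upper half
            rng[0] = mid
        else:
            ch = ch * 2       # point is in the lower half
            rng[1] = mid
        use_lon = not use_lon
        bit += 1
        if bit == 5:          # emit one base32 character per 5 bits
            chars.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(chars)

# Two nearby central-London points share a 4-character prefix:
print(geohash(51.5074, -0.1278))
print(geohash(51.5101, -0.1340))
```

Range queries become prefix scans, which is exactly the kind of access pattern both relational and Hadoop-style stores are good at.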

The benchmarks were mixed. On like-for-like hardware some of the Hadoop kit was slower, but the Hadoop solution scaled MUCH better as the data size increased, as expected.

Finally, a lot of the issues mentioned with MapReduce/YARN (debugging, jobs failing, recompiling Java programs) are ALL addressed by Pentaho MapReduce. #JustSaying

(Err, and anyway, why look at MapReduce in this day and age?)

The conclusions regarding Hadoop were pretty conservative too: a surprisingly corporate opinion for an organisation that isn't a corporation! But making the most of the skills you already have has some merit.

IBM Watson

This was a great talk by Alexandre about IBM Watson. Although I'd like to see Watson give the talk next time :o)

He talked about three eras of computing, with our current state being the second era.

Watson has an open API so you can give it a go, which sounds pretty interesting. There was a video of a demo, which is always a slight negative; a live demo would have been far more impressive, but hey, maybe it's not quite there yet.

Great audience questions, one sounded almost staged :o)

There is an excellent partnership economy around Watson: IBM take a revenue share of your product, and they also have a VC fund (who doesn't!?). This is pretty impressive. IBM have really turned themselves around and are, in my book, an extremely exciting company to watch. Continuing with the partnership theme, your data remains yours, so if Watson learns some cool stuff from your questions, that "knowledge" remains yours.


Privacy

Finally came Privitar, which was also a very interesting presentation, about finding better ways of anonymising data in order to encourage more sharing. They really sound like they should be heavily involved in the open-data and government scene, but their current clients are actually banks and finance houses. Maybe I misunderstood the presentation somewhat, but they made an excellent point about a crucial difference between privacy and security. They also mentioned some funny security breaches caused by data that wasn't sufficiently blurred. A final great point was the concept of knowing the degree of "data loss" that blurring has incurred on a data set.
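That "data loss" idea is neat because you can actually put a number on how much distinguishing power anonymisation destroys. A toy illustration (my own sketch using Shannon entropy, not Privitar's actual method): generalise exact ages into bands and compare the entropy of the column before and after.

```python
from collections import Counter
import math

def generalise_age(age, band=10):
    """Blur an exact age into a band, e.g. 37 -> '30-39'."""
    lo = (age // band) * band
    return f"{lo}-{lo + band - 1}"

def entropy(values):
    """Shannon entropy (bits) of a list of values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

ages = [23, 25, 31, 37, 38, 41, 44, 52, 58, 63]
blurred = [generalise_age(a) for a in ages]

# "Data loss": how many bits of distinguishing power the blurring removed.
loss = entropy(ages) - entropy(blurred)
print(f"lost {loss:.2f} bits of information")
```

A wider band blurs more, loses more bits, and protects privacy better; the metric makes that trade-off explicit instead of leaving it as a gut feel.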


Conclusions? Well, I think I've been to 4 or 5 of the Hadoop meetups now, and this was the best: good talks, great organisation and hosting, impressive stuff.

Anyone interested in Pentaho and BigData/Hadoop would do well to join our meetup!
