#BigDataWeek London

So, on Thursday I attended BigDataWeek in London, just down from Liverpool St. Station. This was the conference we tied this week's Pentaho London Usergroup in with, so we both shared some attention and advertising. Cunning, eh?

Anyway, herewith follows my brain dump / brief summary / key points from the tech stream. Yes – tech stream only, and I'll not apologise for that!

Shazam

  • Very clever data driven marketing
  • Amazing insights
  • A surprisingly good number of people using it.
  • Integration with FB advertising, Twitter integration on the way. Enables them to build a “discussion” with a brand.

Worldpay

  • A very familiar secure datalake story – Hortonworks this time.
  • Use your credit card three times and Worldpay will by then have your details – they handle ~47% of all UK transactions.
  • 36M transactions/day
  • Explicit policy to exploit open source
  • Team of 30
  • 18 months, in production
  • Ability to burst to cloud (double tokenisation). Great idea!
  • Disks never leave the data center. Security obsessed!
  • Aim to have 70% permies, 10% consultancy, and 20% contractors. Not at those levels yet
  • Lots of history from legacy systems still to load
  • 56 nodes; 20 cores, 12 × 4 TB disks and 256 GB RAM per node, etc.
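(Back-of-envelope: 56 nodes × 12 × 4 TB ≈ 2.7 PB of raw disk across the cluster, before you allow for HDFS replication.)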

Barclays

  • This was interesting – Looking at graphs for recommendations. It’s the opposite of a recent project I worked on 🙂
  • They look at expected degrees of separation between businesses.
  • Insane number of paths – becomes a massive matrix
  • Some results surprising and not really believable
  • They didn’t seem to know who they would sell this data to. Audience asked twice.
  • Spark, no graph DB – but using a PageRank algorithm in there somehow (a hedged sketch of one way to do that follows this list).
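The talk gave no implementation detail beyond Spark-plus-PageRank, so here's a minimal sketch of one way that combination can work without a graph database, using the GraphFrames package on PySpark. The data, names and parameters are my own illustration, not Barclays' actual setup.

```python
# Minimal sketch: PageRank over a business-relationship graph in Spark,
# no graph DB involved. Needs the GraphFrames package on the classpath,
# e.g. spark-submit --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("biz-graph-sketch").getOrCreate()

# Vertices are businesses; edges are observed relationships between them.
vertices = spark.createDataFrame(
    [("b1", "Acme Ltd"), ("b2", "Bolt PLC"), ("b3", "Cogs & Co")],
    ["id", "name"])
edges = spark.createDataFrame(
    [("b1", "b2"), ("b2", "b3"), ("b3", "b1"), ("b1", "b3")],
    ["src", "dst"])

g = GraphFrame(vertices, edges)

# PageRank as a proxy for how central a business is in the network.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "name", "pagerank").show()
```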

Google data platform

  • Looked at the Street View dataset – initially done for a laugh, but then image recognition brought a tonne of new opportunities.
  • Ability to do massive crunching like that at scale and low cost
  • Pay only for what you need.
  • BigQuery now supports full SQL – hurrah! Need to try this against Mondrian…
  • Interesting discussion on dealing with the ever-problematic issue of late-arriving data in a streaming system (see the sketch after this list).
  • Beam – batch AND stream.
  • Mentioned loads of DI tools, including Talend and various commercial ones.
  • A Spark cluster can be spun up in 90 seconds.
  • Good point about Google's history: they went from publishing papers, to later making those services available, to now doing both at the same time.
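On the late-data point, here's a minimal sketch of how Beam expresses it in the Python SDK – the topic name and durations are my own illustration. Results fire when the watermark passes the end of a one-minute window, then re-fire for anything arriving up to ten minutes late.

```python
# Minimal sketch of late-data handling in Apache Beam (Python SDK).
# A real run would need streaming pipeline options; the topic is hypothetical.
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as p:
    _ = (
        p
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | beam.WindowInto(
            window.FixedWindows(60),                  # 1-minute event-time windows
            trigger=AfterWatermark(late=AfterProcessingTime(0)),
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=Duration(seconds=600))   # accept 10 min of lateness
        | beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults())
```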

Smart Cities

  • @JonnyVoon
  • A great talk. Sadly he was up against a blockchain talk in the other room so a lot of people left.
  • It’s time to focus on the present
  • Buzzword bingo gives you the excuse to forget about the why
  • Example of the @seesense_cc (?) smart bikelight.
  • There is something called the London Datastore – This seems worth checking out! ***

Bigstep / Ansible

  • Within 2 minutes you can have your services deployed on bare metal (which is faster).
  • Pay per second – the usual as-a-service stuff.
  • Usual full app stacks available.
  • Interesting use of the “tree” unix command.
  • Safe scale down – ability to evict data from data nodes before shutting them down.
  • Unclear how they handle security – how do they securely wipe that data?
  • Unclear architecture diagrams
  • Lots of good learnings (see their slides)

Skyscanner Data Quality

  • The only talk on data quality out of the many at this event. That is not good!
  • Must define confidence in your numbers
  • Must measure quality
  • Huge costs are associated with bad data quality.
  • github.com/Quartz/bad-data-guide
  • They push all validation failures into their own topic so they can be fixed and re-published (a sketch of the idea follows below).
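Here's a minimal sketch of that failed-validation-topic pattern using the kafka-python client. The topic names and the validate() rule are my own illustration, not Skyscanner's actual pipeline.

```python
# Sketch: route records that fail validation to their own Kafka topic,
# where they can be inspected, fixed and re-published to the main stream.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def validate(event):
    # Hypothetical rule: a price must be present and non-negative.
    return isinstance(event.get("price"), (int, float)) and event["price"] >= 0

def route(event):
    topic = "events.clean" if validate(event) else "events.validation-failures"
    producer.send(topic, event)

route({"flight": "LHR-EDI", "price": 49.99})  # -> events.clean
route({"flight": "LHR-EDI", "price": None})   # -> events.validation-failures
producer.flush()
```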

Facebook

  • After lunch the tech room was re-invigorated and raring to go again!
  • Working on the “workplace” product which was originally an internal facebook tool
  • Scalable metrics platform
  • Bemoaning the lack of open-source data integration tools until recently. Seriously? Are Facebook devs not allowed to use Google?
  • Daily/Monthly accounting
  • Lots of standardised metrics.
  • Good metrics are not necessarily trivial to understand.
  • They’re hiring!
  • As the Facebook ETL framework is not open-sourced, they're unlikely to open-source the metrics framework.
  • No screenshots. Boo.
  • They move fast – faster even than a startup (the speaker had startup experience).

TFL

  • This was very interesting. Lots of legacy systems, lack of visibility of data, and massive political challenges
  • Road space management
  • Still using lots of RDBMS
  • Pressure to publish data
  • Relying on opening up data and letting integrators do the rest. IMHO this is risky – in the past I've seen it not work at all – but times change…
  • 14K traffic sensors at 4 Hz; 400M events/day.
  • Use a lot of R and have a vibrant community of users
  • Check out their arch. diagrams – very interesting.
  • The problem is lack of skills, not technology. Their workaround: hackathons!

MapR

  • Healthcare in US example
  • Anomaly detection
  • ROI – $22 for every $1 spent. Nice!
  • The Aadhaar ID system (in India) – very similar to the Aegate drug-tracking system (but on the right technology this time!)
  • 60% of the population covered, 20% reduction in fraud, $50B savings
  • Pretty much a sales talk. No tech.

Telefonica

  • Struggling with so many disparate, separate companies, different data systems, different tech. They created a global corporate model.
  • Hortonworks – they want to use tech that's as open as possible.
  • So no SAS – instead R, Spark, Hadoop.
  • Mentioned the job title "Data Journalist" – what is that!!?
  • Also apologised for the lack of tech detail.
  • Talked about what level people try to connect at (2G, 3G, 4G) – but that's weird; surely everyone starts at 4G these days and works down.
  • They have a team to train local people to implement the global model.
  • Issues with getting data out of/into some countries – meaning they can't use cloud.

NATS – air traffic control

  • Some low-bandwidth data links from the planes make things interesting.
  • Real time, of course.
  • 70 data scientists, 700 engineers.
  • Future prototypes on MapR, Spark and Bigstep.
  • Looking at GPUs.
  • 3 PB.
  • Looking at how to monetise the data.
  • Cracked some complicated data – binary encodings, radar feeds. Common in IoT, it seems!

Overall summary of the event

  • Good talks – would go next year
  • Good location
  • Good balance of people – business/tech, even the odd salesperson. And even a balance of women and men.


Finally, here are some links they sent out to some of the presentations/talks – there are lots of goodies within these!


KEYNOTES

TECHNICAL TRACK

BUSINESS TRACK


#Devops, #JUJU, #BigData and #Pentaho

Yup – That was PLUG (Pentaho London UserGroup) in London last week!  And boy did it deliver.

This quarter we headed to Canonical's head office, which is nicely located just behind the Tate Modern in London. I must admit I don't know much about what they do, but a bit of research later and I quickly saw they're at the heart of open source, with a truly impressive culture centred around community. Anyway – we're always grateful for a venue, and doubly so when they provide beer!

We started off with our talk from Tom Barber showing how easy and quick it is to get started with Juju. I guess Juju is like Docker on steroids. I particularly liked the way you could simulate your entire cluster (networking and all) on one box (a laptop!). This feature is especially interesting to the tester in me. We also had an interesting discussion about how you'd go about hooking jobs into the existing monitoring API. The slides from Tom's talk are here, and the video should be available soon.

Next up was Luis from UbiquisBI, showing off their statistical analysis of World Cup stats and trying to predict the result part-way through a game. I did like the way (as you have to when the event has passed) you could easily simulate replaying from a certain date/point in time. Given the complexity of parsing some of the HTML data from Wikipedia, I did wonder whether they had tried the HTML-parsing stuff that used to be in Apache Calcite – that allows you to present a simple Wikipedia page as a SQL table – but unfortunately I cannot find that tool any more, so I guess that's why not!

What was different this time other than the venue? Well we had a slight change in the balance of regulars vs new folk – with more new people coming – I put this down to the venue and topics.

More interestingly, the vibe was pretty exciting this time. We had loads of positive feedback at the end of the event, both on Meetup and face to face. I think the subjects were good and we nailed it – well done everyone. So let's keep that going: our next meetup is likely to be one of our biggest ever, as we're doing a tie-in with the huge #bigdata London event. You can find details here. If anyone has a talk, then let me know and I'll schedule you in.

How to use sub-mappings in Pentaho map-reduce

So, short and sweet. How do you use sub-mappings in a mapper or reducer?

Natively, you can't – it doesn't work. The reason is simple: the Map/Reduce step in Pentaho doesn't make the sub-mapping .ktr available to the job; it only publishes the top-level job.

So the solution is to use an HDFS URL for the name of the sub-mapping transformation, i.e.:

hdfs://hdfsserver:8020/user/rockstar/mysubtransformation.ktr

This, however, has side effects – namely, Spoon will hang rather a lot. So the only way to apply this is to hack the XML of your transformation. Yuck!
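If you'd rather not hand-edit the file, a throwaway script can do the hack for you. A minimal sketch, assuming the usual .ktr layout where a Mapping step carries a filename element – check this against your own transformation before relying on it:

```python
# Sketch: repoint every Mapping step in a .ktr at an HDFS URL, since
# editing the path in Spoon tends to hang. Element names assume the
# usual KTR layout - verify against your own file first.
import xml.etree.ElementTree as ET

KTR = "mymapper.ktr"  # hypothetical transformation file
HDFS_URL = "hdfs://hdfsserver:8020/user/rockstar/mysubtransformation.ktr"

tree = ET.parse(KTR)
for step in tree.getroot().iter("step"):
    if step.findtext("type") == "Mapping":
        filename = step.find("filename")
        if filename is not None:
            filename.text = HDFS_URL  # repoint the sub-mapping
tree.write(KTR, encoding="utf-8", xml_declaration=True)
```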

You could actually use any resolvable URL. I think it makes sense to use HDFS, but be sure to put the .ktr into the distributed cache so it'll always be on the local node. BOOM!

Naturally there is now a JIRA, but as we're all going Spark, I don't see it being fixed too quickly 🙂

Building a lambda architecture

So, in recently researching lambda architectures I came across these links, and I thought some were worth sharing here:

This document has a great slide which shows how you keep the data stores separate but merge at the serving layer.
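A toy sketch of that idea – the batch and speed views live in separate stores and only get combined at query time (all names here are illustrative):

```python
# Toy sketch of a lambda serving layer: the batch view (recomputed from
# the immutable master dataset) and the speed view (incremental, covering
# data since the last batch run) are merged only when a query arrives.
def serve_query(key, batch_view, speed_view):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"page_a": 10_000}  # e.g. rebuilt nightly by a batch job
speed_view = {"page_a": 42}      # e.g. maintained by a streaming job
assert serve_query("page_a", batch_view, speed_view) == 10_042
```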

Just to keep things interesting, there is a subtly different view here (from a LinkedIn guy):
That solution is not dissimilar to this document here:
An important comment about the fundamental principle of immutable data in lambda:
(Don’t worry, the page is nothing about Talend itself – A common marketing trick that tech companies seem to be using a lot these days – talk about cool tech, just to get yourself linked to. Oh damnit, I just did that. Damn!)
Then there's the outsider – Kudu. Kudu seems to be going back to mutability. BUT Kudu is far from suitable for production use, and it has a horrible deployment architecture.
Finally, Inquidia (big data Pentaho partners in the States) have a page on it – a good summary of the options, latency implications etc. This can be found here:

Community News – July 2016

Hi Everyone,
So it's been a while since the last news update – there's been a lot going on, but really it's all about the meetups…
#Pentaho News Roundup
 
Greenplum use case
Released only today, here's a good read on EMC and their use of Greenplum.
PDI running in Snowflake
So, let's see if we can get Inquidia to cross the pond and present this to PLUG. In the meantime, to see how they've released a plugin that enables PDI to run in the cloud via Snowflake, go here.
 
#pentahomeetup
#PCM16
Ok, this is SURELY THE BIG NEWS!  Pentaho Community Meetup 2016 is back in Antwerp this year. For those that don’t know, it’ll follow a fairly familiar pattern along the lines of:
Friday – arrive in Antwerp, chill
Friday evening – hackathon-esque event
Saturday – main conference (at a stunning location!)
Saturday evening – trialling the stunning diversity of beer available in Belgium
Sunday – sightseeing / crawling home
There’s no formal agenda yet, but we do know the date – 11th – 13th November.  So get your flights/trains/hotels booked now.
As usual from Pentaho themselves there will be an extensive turnout of Pentaho developers, product owners, Pedros etc.  Then there’ll be the usual blend of developers, users and data ninjas.
Also, with no "Pentaho World" this year there'll be no cannibalisation of attendees, so this looks set to be a huge one.
#PLUG
But! Don't forget good old Pentaho London – PLUG! We're moving to Canonical's offices for a meetup to discuss big data DevOps, amongst other things. You'll also see Nelson presenting something top secret – sounds intriguing. Register here.
#blogs
Dynamic processing engines
 
This has generated quite some interest. Err, from myself!  Read it here
 
#jobs
Jobs
Not sure of any specifics going around at the moment, but there's loads out there – the market is crazy right now!

Dynamic number of processing engines in Pentaho

So, about 3 years ago this foxed me, and as it wasn't particularly important at the time we went with a hard-coded number of processing engines. Then, randomly, whilst driving through the beautiful British countryside last week, I realised how simple the solution really is. Here is what I was passing at the time… (Brocket Hall – they do food, I think!)


Whoa! Hang on a minute, what am I on about? OK, the scenario is this – common in a metadata-driven system – you want to process all your data and, depending on some attribute, send the data to one template or another. Fine.

BUT, because we love metadata and we love plugins, you don't want to have to change the code just to add a brand-new template – even if it would just be a copy-and-paste operation…

Concrete example? Sure. You’re processing signal data. You want to store the data differently depending on the signal.  So you have metadata that maps the signal to the processing engine.  Your options could include:

  • Store as-is with no changes, all data
  • Store only the records where the signal value has changed from the previous value
  • Perform some streaming FFT analysis to move the data to the frequency domain
  • Perform some DCT or other to reduce the granularity of the data
  • etc!

The solution in the end is ridiculously simple. Just use a simple mapping, partitioned, and use the partition ID in the mapping file name!
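Outside PDI, the same metadata-driven dispatch idea looks something like this toy sketch – all names are my own illustration. In the PDI version, the partition ID plays the role of the lookup key and each engine is its own sub-transformation (.ktr).

```python
# Toy sketch of metadata-driven engine dispatch (names illustrative).
ENGINES = {
    "raw":   lambda rows: rows,                             # store as-is
    "delta": lambda rows: [r for i, r in enumerate(rows)
                           if i == 0 or r != rows[i - 1]],  # changed values only
}

# Metadata: signal name -> engine. Adding an engine is a new dict entry
# plus a new metadata row; the dispatch code below never changes.
SIGNAL_TO_ENGINE = {"temperature": "delta", "vibration": "raw"}

def process(signal, rows):
    return ENGINES[SIGNAL_TO_ENGINE[signal]](rows)

print(process("temperature", [20, 20, 21, 21, 22]))  # -> [20, 21, 22]
```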

As always you can find the code here. (Notice the creation of a dummy engine called enginepartition-id.ktr, which just keeps PDI happy and stops it moaning and preventing you from closing the dialog!)