So last night I went to the Hadoop user group in London #huguk, primarily because the talks were on an Apache Spark theme again. A great venue at Expedia – not least because it’s a short walk from Kings Cross, so extremely convenient!
Session 1 – a Teradata company
Quick straw poll of the audience – more engineers than data scientists
Mentioned data lakes, and swamps – always amusing to hear that term used 🙂
Worked in 3 main areas – devices, risk and customer analytics
People model too quickly and don’t involve the business. Very keen on involving the business more, right from the start.
When to consider Big Data? Until recently, most data scientists worked on a workstation.
In their experience, full-cluster data science is rare.
Being data scientists, they tried to make the argument for sampling – but that’s always a hard sell to a room of engineers.
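The sampling argument is easy to demonstrate. A minimal sketch (a made-up population of a million values, standing in for a dataset too big for a workstation) shows how closely a 1% simple random sample estimates the mean:

```python
import random

random.seed(42)

# Hypothetical population: a million values standing in for a dataset
# that would normally need a cluster to process in full.
population = [random.gauss(50.0, 10.0) for _ in range(1_000_000)]

# A 1% simple random sample is often enough to estimate a mean closely:
# the standard error here is roughly sigma / sqrt(n) = 10 / 100 = 0.1.
sample = random.sample(population, 10_000)

true_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"true mean:   {true_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

The sample mean lands within a fraction of a unit of the true mean – which is the data scientists’ point: for many estimation tasks, the full cluster isn’t needed.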
Many algos don’t parallelise
Mahout seems dead
Sometimes it’s easier to use R – which requires reducing the dataset first.
We’re no longer ok with slow models. Instant response required!
Scala performs much better than python
Gave an IoT example where a factory may have 200 failures a year, so from that perspective it’s small data. Sure, the raw sensor data may be enormous, but it’s the failures that count. Interesting point, well made, worth more thought.
Multi-armed bandit. Like A/B testing on steroids!
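For anyone who hasn’t met the idea: unlike fixed A/B splits, a bandit shifts traffic towards the better-performing variant as evidence accumulates. A toy epsilon-greedy sketch (the three click-through rates are invented for illustration):

```python
import random

random.seed(0)

# Hypothetical example: three ad variants with unknown click-through rates.
true_ctr = [0.02, 0.05, 0.11]
counts = [0, 0, 0]        # pulls per arm
values = [0.0, 0.0, 0.0]  # running mean reward per arm
epsilon = 0.1             # fraction of traffic reserved for exploration

def pull(arm):
    """Simulate showing variant `arm` once: 1.0 on a click, else 0.0."""
    return 1.0 if random.random() < true_ctr[arm] else 0.0

for _ in range(20_000):
    if random.random() < epsilon:
        arm = random.randrange(3)        # explore a random arm
    else:
        arm = values.index(max(values))  # exploit the current best estimate
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("estimated best arm:", values.index(max(values)))
print("pulls per arm:", counts)
```

The “steroids” part is visible in the pull counts: most traffic ends up on the winning variant instead of being split evenly for the whole experiment.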
Surprisingly good point about governance and metadata
Session 2 – Sean R Owen, Cloudera
Presented Oryx – a lambda architecture in a box.
Sean is JVM-centric.
A full end-to-end production solution is usually far more engineering than data science – only a small part is data science.
Exploratory to operational
Seldon in London – look this up!
PredictionIO – also look this up!
Sean’s 5th attempt at this framework!
OK performance – 100 q/s on a 16-core machine.
Production but not supported
Real time anomaly detection
Question about why Tomcat – well explained; it is pretty standard after all.
Geared towards known/fixed features
Mired in questions from hell at the end!
Session 3 – Flink
Kafka of course!
Time windowing or event windowing. This is nice.
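The event-time idea is that records are grouped by when they happened, not when they arrived. This isn’t the Flink API – just a toy Python sketch, assuming made-up sensor readings and 5-second tumbling windows:

```python
from collections import defaultdict

# Hypothetical out-of-order sensor readings: (event_time_seconds, value).
# Note (2, 5) arrives after (4, 20) – arrival order != event order.
events = [(1, 10), (4, 20), (2, 5), (9, 7), (6, 3), (11, 8)]

WINDOW = 5  # 5-second tumbling windows, keyed by event time

windows = defaultdict(list)
for ts, value in events:
    # Assign each record to the window containing its *event* timestamp.
    windows[ts // WINDOW * WINDOW].append(value)

for start in sorted(windows):
    print(f"window [{start}, {start + WINDOW}): sum={sum(windows[start])}")
```

The out-of-order reading still lands in the right window, which is the property processing-time windowing can’t give you. (A real engine also needs watermarks to decide when a window can be closed – omitted here.)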
In memory or off heap
SQL – unclear. Briefly mentioned, but not sure what exists.
Chains tasks like Spark, but can chain from the mapper to the reducer too.
Exactly once. Yeah yeah..
Distributed snapshots are complicated – but this is how it does disaster recovery. Ties in with Kafka, which can replay from an offset.
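The recovery idea itself is simple, even if the distributed coordination isn’t. A minimal sketch (not the Flink or Kafka API – the log is just a list, offsets are list indices, and the counter state is invented) of restoring a snapshot and replaying from its offset:

```python
# Hypothetical sketch: a Kafka-like log as an append-only list, where each
# record's offset is its index. The operator state is a simple counter.
log = [f"evt-{i}" for i in range(10)]

# Snapshot taken before the "crash": state after processing offsets 0..5,
# plus the offset from which replay must resume.
snapshot = {"offset": 6, "state": {"count": 6}}

# Recovery: restore the snapshotted state...
state = dict(snapshot["state"])

# ...then re-apply every record from the snapshot offset onwards.
for offset in range(snapshot["offset"], len(log)):
    state["count"] += 1

print(state["count"])  # state is as if no crash had happened
```

Nothing before the snapshot offset is reprocessed, which is why the snapshot and the replayable log together give exactly-once state, provided the log retains records at least as long as the snapshot interval – presumably what the configurable snapshot survival time below is about.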
Framework handles differing speeds of operators
Configurable snapshot survival time, depends on your latency.
On- and off-heap, with batch processing pipelines.
And finally, right at the end I met a Channel 4 chap with a “legacy” batch Spark architecture – he’s looking to move to real time, and seeing where Pentaho can help. Great!
Apologies for the brain dump – I’m out of time. Better than nothing!