So, the Pentaho Community Meetup hackathon is only a week away.
First, the boring stuff – the location and signup page can be found here. NOTE: Skillsmatter has moved!
There is a bar, and we’ll try to arrange some snacks – but no pizza this time, I’m afraid!
There are prizes too, thanks to Harris, although I’m not sure what they are. I did reject signed photos of him, though.
Now, how will it work? Well, #PCM14 was a “chaos” hack – meaning “do anything you like”, forming teams as you see fit. We’ll carry on in that theme, but with a few crucial changes:
- We would like to encourage “random”-ish teams to form. If you turn up with five of your workmates, please don’t form a team with them – what’s the point in that?
- Whilst we do encourage you to bring your own tools and existing frameworks, please do NOT do any preparation in advance – no pre-building the ETL, the cube, etc.
- Given that, we would like to introduce a theme for the data this year. It’s not mandatory, but maybe you’ll get more points! The theme is going to be #opendata, and there is a great place to start looking here: https://data.gov.uk/data/search
Note: if you pick a horrid XML dump which is massive and impossible to parse, then good luck.
Turn up from 6pm – some of us will be there earlier. We’ll officially kick off at about 6.30, aiming to form teams and begin hacking by 7pm. The hack will finish at 8pm, giving us an hour for presentations.
At the end we’ll head off to a pub in the local vicinity, as yet undecided. Some mention has been made of finding a decent Scotch whisky place – I’m open to ideas! It’s not an area I personally know well, though, because Skillsmatter only recently moved.
So last night I went to the Hadoop usergroup in London (#huguk), primarily because the talks were on an Apache Spark theme again. A great venue at Expedia – not least because it’s a short walk from King’s Cross, so extremely convenient!
Session 1 – a Teradata company
A quick straw poll of the audience: more engineers than data scientists.
They mentioned data lakes – and swamps. Always amusing to hear that term used 🙂
They work in three main areas: devices, risk and customer analytics.
People model too quickly and don’t involve the business. The speakers are very keen on involving the business right from the start.
When should you consider Big Data? Until recently, most data scientists worked on a single workstation.
In their experience, full-cluster data science is rare.
Being data scientists, they tried to make the argument for sampling – but that’s always a hard sell to a room of engineers.
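The sampling argument is easy to demonstrate: a modest random sample often estimates an aggregate well enough without touching the full dataset. A minimal sketch in Python – the dataset and sample size here are invented for illustration:

```python
import random

random.seed(42)

# A "full" dataset we pretend is too big to analyse comfortably:
# a million transaction amounts.
population = [random.uniform(0, 100) for _ in range(1_000_000)]
full_mean = sum(population) / len(population)

# A 1% simple random sample is usually enough to estimate the mean.
sample = random.sample(population, 10_000)
sample_mean = sum(sample) / len(sample)

print(f"full mean   = {full_mean:.2f}")
print(f"sample mean = {sample_mean:.2f}")
```

Whether this flies depends on the question, of course – sampling works for means and rates, less so for finding rare failures.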
Many algorithms don’t parallelise.
Mahout seems dead.
Sometimes it’s easier to use R – which requires reducing the dataset first.
We’re no longer OK with slow models – instant responses are required!
Scala performs much better than Python.
They gave an IoT example where a factory may have 200 failures a year, so from that perspective it’s small data. Sure, the raw sensor data may be enormous, but it’s the failures that count. An interesting point, well made, and worth more thought.
Multi-armed bandits – like A/B testing on steroids!
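The “on steroids” bit is that a bandit shifts traffic towards the winning variant while the experiment is still running, rather than splitting it evenly until the end like classic A/B testing. A toy epsilon-greedy sketch – the variants and conversion rates are made up:

```python
import random

random.seed(7)

# True conversion rates of two page variants (unknown to the algorithm).
true_rates = {"A": 0.05, "B": 0.10}

counts = {"A": 0, "B": 0}    # times each arm was shown
rewards = {"A": 0, "B": 0}   # conversions observed per arm
epsilon = 0.1                # fraction of traffic kept for exploration

def choose_arm():
    # Explore occasionally (or until every arm has been tried once)...
    if random.random() < epsilon or not all(counts.values()):
        return random.choice(list(true_rates))
    # ...otherwise exploit the arm with the best observed rate so far.
    return max(counts, key=lambda a: rewards[a] / counts[a])

for _ in range(10_000):
    arm = choose_arm()
    counts[arm] += 1
    rewards[arm] += random.random() < true_rates[arm]

print(counts)  # most traffic should have drifted to the better arm, B
```

Real bandit libraries use smarter policies (Thompson sampling, UCB), but the shape of the idea is the same.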
A surprisingly good point about governance and metadata.
Session 2 – Sean R Owen, Cloudera
He presented Oryx – a lambda architecture in a box.
Sean is JVM-centric.
A full end-to-end production solution is usually far more engineering than data science – only a small part is data science.
Exploratory to operational.
Seldon in London – look them up!
Prediction.io – also look this up!
This is Sean’s fifth attempt at this framework!
OK performance: 100 queries/sec on a 16-core machine.
Used in production, but not supported.
Real-time anomaly detection.
A question about why Tomcat was well answered – it is pretty standard, after all.
Geared towards known/fixed features.
Mired in questions from hell at the end!
Session 3 – Flink
Kafka, of course!
Time windowing or event windowing – this is nice.
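Windowing is just bucketing an unbounded stream into finite groups so you can aggregate it. A tiny pure-Python sketch of a tumbling time window over event timestamps – the events are invented, and real Flink adds watermarks, late-data handling, and so on:

```python
from collections import defaultdict

# (event_time_seconds, value) pairs, deliberately out of order --
# windowing on the event's own timestamp still groups them correctly.
events = [(1, 10), (12, 5), (3, 7), (25, 2), (14, 1)]

WINDOW = 10  # 10-second tumbling windows

windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW  # bucket the event falls into
    windows[window_start] += value

print(dict(windows))  # {0: 17, 10: 6, 20: 2}
```

The same loop with a counter instead of a timestamp gives you a count-based window; Flink lets you pick either.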
In-memory or off-heap.
SQL – unclear. It was briefly mentioned, but I’m not sure what exists.
It chains tasks like Spark, but can chain from the mapper to the reducer too.
Exactly once. Yeah, yeah…
Distributed snapshots are complicated – but this is how it does disaster recovery. It ties in with Kafka, which can replay from an offset.
The framework handles differing speeds of operators.
Snapshot survival time is configurable, depending on your latency.
On- and off-heap batch-processing pipelines.
And finally, right at the end, I met a Channel 4 chap with a “legacy” batch Spark architecture – he’s looking to move to real time, and seeing where Pentaho can help. Great!
Apologies for the brain dump – I’m out of time. Better than nothing!
On Monday I attended the 10th Spark London Meetup. This was an impressively well-attended meetup, with 90 people – virtually everyone who registered turned up – and 170 people on the waiting list! One day PLUG will be like that!
Anyway, the presentation was some hard-core Spark material from Chris Fregly, who has an impressive career in this tech. Here are the key points I noted down:
- A lot of talk about off-heap coding in Java. This is a very interesting topic – I was only vaguely aware it was possible. A lot of it seems to live in “unsafe” code areas – sounds exciting!
- There are some interesting numbers from the Daytona GraySort challenge (100TB): Hadoop sorted at a rate of 1.4TB/min with 2,100 nodes, while Spark sorted at 4.27TB/min with only 206 nodes. Wow.
- Spark also scaled linearly to a 1PB sort.
- Bizarrely, they did this on OpenJDK. I’m amazed they did that – I would love to understand why.
- There are some cool features (epoll?) that allow direct NIC -> disk transfers without using CPU user time. Nice.
- We saw details of some clever optimisations in the Spark shuffle process – not dissimilar to the pipelining feature with partitioning in PDI.
- There was talk about mappers and reducers. What? I need to know more about that 🙂 I didn’t realise they still existed in a Spark world.
- LZF is now used in preference to Snappy. With LZF you can process data on the CPU while it is still compressed – just as Actian manages to do. Clever stuff.
- Tuning is extremely painful; there’s really only a try-and-see approach.
- An interesting matrix-multiplication optimisation via transposing. Amusingly, once you’ve done that, it’s not always any faster 🙂 Nice theory, though.
- IBM is forming a team of 3,500 engineers (this is old news).
- One has to wonder why IBM doesn’t just buy Databricks?
- CPU is often the bottleneck; disk/network upgrades often only improve performance by small margins.
- 100+ Spark SQL functions have been converted to use Janino for improved performance and flexibility.
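The matrix-multiplication point above is about memory locality: a naive A·B walks B column by column, striding across memory, whereas transposing B first turns the inner loop into contiguous row-wise reads. A minimal pure-Python sketch of the idea – whether it is actually faster depends on the runtime and data layout, exactly as the talk noted:

```python
def matmul_transposed(a, b):
    """Multiply a @ b by first transposing b, so the inner loop
    reads both operands row-wise (contiguously)."""
    bt = list(zip(*b))  # bt[j] is column j of b, now stored as a row
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

a = [[1, 2],
     [3, 4]]
b = [[5, 6],
     [7, 8]]

print(matmul_transposed(a, b))  # [[19, 22], [43, 50]]
```

In pure Python the interpreter overhead swamps any cache effect, which is a fair illustration of the “not always any faster” caveat; in C, Java, or vectorised NumPy the locality win is real.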
And then we continued with the second talk…
- Noted the demo had a 2GB PermSize defined – blimey!
- The demo didn’t work.
- The supposedly “simple” demo had about 15 different significant moving parts – bonkers!
- The biggest Spark cluster is at Tencent – possibly up to 14k nodes now.
- Use DataFrames, not RDDs.
- A question about Parquet vs Avro – slightly odd, given they address different needs.
- Apparently, performance degrades once you go beyond 22 columns in Parquet.
- It used to have a hard 22-column limit – so presumably removing that limit was a hack.
- Scan-speed benchmarking was shown – but again, that isn’t the point of Parquet, so I’m not sure where this was going.
- Everyone uses Kafka. EVERYONE!
- HyperLogLog is an interesting approximate distinct-count algorithm.
- An interesting debate about how you handle cold-start problems.
- Zeppelin was used again for the demo.
- Job management seems to be a pain in Spark.
- There’s a hidden REST API.
- A cunning matrix-factorization example.
- Boil the frog!
- Flink – if Flink does take over, it won’t be a case of the IBM crew trying to beat it, but more of joining them.
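The HyperLogLog point above is worth unpacking: it estimates a distinct count from the pattern of leading zero bits in hashed values, using a few kilobytes of registers instead of remembering every item. A simplified sketch – this omits HLL’s small- and large-range corrections, so treat it as illustrative only:

```python
import hashlib

P = 10            # 2**10 = 1024 registers, ~3% standard error
M = 1 << P
registers = [0] * M

def add(item):
    # Hash the item to a large integer with a well-mixed hash.
    h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
    bucket = h & (M - 1)        # low P bits pick a register
    rest = h >> P
    # Rank = position of the first 1-bit in the remaining bits.
    rank = 1
    while rest & 1 == 0 and rank < 64:
        rank += 1
        rest >>= 1
    registers[bucket] = max(registers[bucket], rank)

for i in range(50_000):
    add(f"user-{i}")
    add(f"user-{i}")  # duplicates must not inflate the estimate

# Harmonic-mean estimator with the standard bias-correction constant.
alpha = 0.7213 / (1 + 1.079 / M)
estimate = alpha * M * M / sum(2.0 ** -r for r in registers)
print(int(estimate))  # should land close to 50,000, not 100,000
```

Spark’s `approxCountDistinct` / `approx_count_distinct` is built on the same family of sketch.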
Tonight I’m off to hear about Spark at the Hadoop London usergroup – I’m sure it’ll be similarly busy, and the content looks just as interesting.