On Monday I attended the 10th Spark London Meetup. This was an impressively well-attended meetup with 90 people – virtually everyone who registered turned up – and 170 people on the waiting list! One day PLUG will be like that!
Anyway, the presentation was some hard-core Spark stuff from Chris Fregly – who has had one impressive career in this tech. Here are the key points I noted down:
- A lot of talk about off-heap coding in Java. This is a very interesting topic – I was only vaguely aware this was possible. A lot of it seems to live in “unsafe” code areas – sounds exciting! (There’s a small sketch of what that looks like after this list.)
- There are some interesting numbers from the Daytona GraySort challenge – 100 TB. Hadoop sorted at a rate of 1.4 TB/min with 2,100 nodes, and Spark sorted at 4.27 TB/min with only 206 nodes. Wow.
- Spark also scaled linearly to a 1 PB sort.
- Bizarrely, they did this on OpenJDK. I’m amazed they did that – I’d love to understand why.
- There are some cool features (epoll?) that allow direct NIC -> disk transfers without using CPU user time. Nice.
- We saw details of some clever optimisations in the Spark shuffle process – not dissimilar to the pipelining feature with partitioning in PDI.
- There was talk about mappers and reducers. What? I need to know more about that 🙂 I didn’t realise they still existed in a Spark world.
- LZF is now used in preference to Snappy. With LZF the CPU can process data while it is still compressed – just as Actian manages to do. Clever stuff.
- Tuning is extremely painful – there’s only really a try-it-and-see approach.
- An interesting matrix multiplication optimisation via transposing (sketched after this list). Amusingly, once you’ve done that it’s not always any faster 🙂 Nice theory though.
- IBM are forming a team of 3500 engineers (this is old news)
- One has to wonder why IBM doesn’t just buy Databricks?
- CPU is often the bottleneck. Disk/Network upgrades often only improve performance by small margins.
- 100+ Spark SQL functions have been converted to Janino-generated code for improved performance and flexibility – there’s a tiny example of what Janino does below.
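On the off-heap point: the usual trick is to grab the sun.misc.Unsafe singleton via reflection and allocate memory outside the JVM heap, where the garbage collector never sees it. This is just a minimal sketch of the idea, not anything shown in the talk:

```scala
import sun.misc.Unsafe

object OffHeapSketch {
  def main(args: Array[String]): Unit = {
    // Unsafe has no public constructor, so grab the singleton field via reflection
    val field = classOf[Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    val unsafe = field.get(null).asInstanceOf[Unsafe]

    // Allocate 1 KB of raw memory outside the JVM heap (invisible to the GC)
    val address = unsafe.allocateMemory(1024)
    try {
      unsafe.putLong(address, 42L)       // write a long at the raw address
      println(unsafe.getLong(address))   // read it back: 42
    } finally {
      unsafe.freeMemory(address)         // off-heap memory must be freed by hand
    }
  }
}
```

The appeal for something like Spark is that big, long-lived buffers held this way stop contributing to GC pressure.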
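And on the transposition trick: in a naive row-times-column multiply the inner loop strides down a column of B, touching a different cache line on every step; transposing B up front makes both inner reads sequential. A rough illustration in plain Scala arrays (nothing Spark-specific, just my own sketch of the idea):

```scala
object TransposeSketch {
  // Naive multiply: the inner loop reads b(k)(j), striding down a column of B.
  def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val c = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices) {
      var sum = 0.0
      var k = 0
      while (k < b.length) { sum += a(i)(k) * b(k)(j); k += 1 }
      c(i)(j) = sum
    }
    c
  }

  // Transpose B once up front; now both inner reads walk rows sequentially.
  def multiplyTransposed(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val bt = Array.tabulate(b(0).length, b.length)((j, k) => b(k)(j))
    val c = Array.ofDim[Double](a.length, bt.length)
    for (i <- a.indices; j <- bt.indices) {
      var sum = 0.0
      var k = 0
      while (k < a(0).length) { sum += a(i)(k) * bt(j)(k); k += 1 }
      c(i)(j) = sum
    }
    c
  }
}
```

As noted above, the transpose itself costs time and memory, which is why it doesn’t always come out faster in practice.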
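For anyone (like me) who hadn’t met Janino before: it’s an embeddable compiler that turns a string of Java source into bytecode at runtime, which is how Spark SQL can generate specialised code for an expression rather than interpreting it. A trivial example of Janino on its own – this is just my sketch, not Spark’s actual generated code:

```scala
import org.codehaus.janino.ExpressionEvaluator

object JaninoSketch {
  def main(args: Array[String]): Unit = {
    val ee = new ExpressionEvaluator()

    // Declare two int parameters and compile the Java expression "a + b" to bytecode
    ee.setParameters(Array("a", "b"), Array[Class[_]](classOf[Int], classOf[Int]))
    ee.setExpressionType(classOf[Int])
    ee.cook("a + b")

    // Invoke the freshly compiled code – no per-call interpretation or reflection
    val result = ee.evaluate(Array[AnyRef](Integer.valueOf(19), Integer.valueOf(23)))
    println(result) // 42
  }
}
```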
And then we continued on to the 2nd talk…
- Noted the demo had a 2 GB PermGen size defined – blimey.
- The demo didn’t work.
- The supposedly “simple” demo had about 15 different significant working parts – bonkers!
- The biggest Spark cluster is at Tencent – possibly up to 14k now.
- Use DataFrames, not RDDs (see the sketch after this list).
- A question about Parquet vs Avro – slightly odd given they address different needs (columnar analytics storage vs row-oriented serialisation).
- Apparently once you go beyond 22 columns in Parquet, performance degrades.
- It used to have a hard 22-column limit (likely the old Scala limit of 22 fields per tuple/case class) – so presumably removing that limit is a hack.
- Scan-speed benchmarking was shown – but again, that isn’t the point of Parquet, so not sure where this was going.
- Everyone uses Kafka. EVERYONE!
- HyperLogLog is an interesting approximate distinct-count algorithm (see the sketch after this list).
- An interesting debate about how you handle cold-start problems.
- Zeppelin was used again for the demo.
- Job management seems to be a pain in Spark.
- There’s a hidden REST API
- A cunning matrix factorization example – there’s a rough sketch of that sort of thing after this list.
- Boil the frog!
- Flink – if Flink does take over, then for the IBM crew it won’t be a case of trying to beat it – more a case of joining them.
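On the “DataFrames, not RDDs” advice: the point is that a DataFrame expresses the query declaratively, so the Catalyst optimiser can see it, whereas an RDD lambda is opaque to Spark. A minimal sketch of the contrast – the people.json path is made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameVsRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-vs-rdd").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // RDD style: the filter is an opaque lambda, so Spark just runs it as-is
    val rddCount = sc.textFile("people.json")
      .filter(line => line.contains("\"age\":21"))
      .count()

    // DataFrame style: the filter is declarative, so the optimiser can push it
    // down, prune unused columns and generate efficient code for it
    val people = sqlContext.read.json("people.json")
    val dfCount = people.filter(people("age") === 21).count()

    println(s"rdd=$rddCount df=$dfCount")
    sc.stop()
  }
}
```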
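The HyperLogLog bit in practice: Spark exposes it on RDDs as countApproxDistinct, which answers “roughly how many distinct values?” from a small fixed-size sketch instead of shuffling every distinct value. A quick illustrative example of mine:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ApproxDistinct {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hll-sketch").setMaster("local[*]"))

    // One million values with 100,000 distinct keys
    val ids = sc.parallelize(0 until 1000000).map(i => i % 100000)

    val exact  = ids.distinct().count()                       // full shuffle of every distinct value
    val approx = ids.countApproxDistinct(relativeSD = 0.05)   // HyperLogLog sketch, small fixed memory

    println(s"exact=$exact approx=$approx")
    sc.stop()
  }
}
```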
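And for the matrix factorization example – I don’t have the speaker’s actual code, but MLlib’s ALS is the standard way to do it in Spark: factor the sparse user-by-item ratings matrix into low-rank user and item feature vectors, then multiply them back to predict missing ratings. A toy sketch with made-up data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch").setMaster("local[*]"))

    // Tiny made-up (user, product, rating) triples; a real job would load these from storage
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 5.0), Rating(1, 20, 1.0),
      Rating(2, 10, 4.0), Rating(2, 30, 5.0),
      Rating(3, 20, 2.0), Rating(3, 30, 4.0)
    ))

    // Factor the ratings matrix into rank-5 user and product feature vectors
    val model = ALS.train(ratings, rank = 5, iterations = 10, lambda = 0.01)

    // Predict how user 1 would rate product 30
    println(model.predict(1, 30))
    sc.stop()
  }
}
```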
Tonight I’m off to hear about Spark at the Hadoop London User Group – I’m sure it’ll be similarly busy, and the content looks just as interesting.