#PCM15 thoughts

Good evening

So #PCM15 is over, and frankly it was a resounding success. London delivered as a spectacular location, and the community provided some great content. Thanks to everyone involved!  Oh, and we also re-raised the T-shirt bar, so that's good too (this had slipped in recent times!)

Personally I feel I missed a lot of the content, so I'll be perusing the slides to see what I missed!

Diethard has already mentioned there are at least two strong candidates for #PCM16; there'll be a vote some time in the new year and away we go again!  I suggest we set out a timescale and allow a period of "PR" for the organisers!

A couple of things were discussed extensively over the weekend – or rather, there are a few things that I think should be put out widely to the community.

The first was event timing. I'd pondered suggesting a move of date – I feared PWorld was cannibalising PCM – but even before #PCM15 I had spoken to a few people and come to the conclusion that the pros outweigh the cons.  I got a similar impression from discussions during the event, so it seems to make sense to stick to our early-autumn slot.  I do think it makes a lot of sense to come after PWorld, though.

The second thing was event structure. We've fallen into a pretty regular routine now, albeit each new organiser tweaks it in their own way: a Friday event (a hackathon in our case), a multi-track conference on Saturday and social events on the Sunday.  The feedback is that people are looking for some change – but it's not clear what.  Personally I don't think PCM is ready to move to a multi-day conference – I like the fact that we blitz it in one day.  In reality maybe we could extend Friday, but again – would people travel any earlier? Or would the same attendees just turn up late as before?

Finally, a quick dump of advice for the next organiser (I may add to this – I'm sure I've forgotten stuff):

  1. Use Meetup / Eventbrite to help organise.
  2. Be super clear about directions, hotels and locations.  Send repeated emails (we didn't do this and some people complained!).
  3. Clearly the event is free – but do charge for lunch. We recommend lunch onsite – otherwise you'll never get people back from the cafés! (Cascais!)
  4. Be sure to produce clear agendas, with multiple paper copies available at the venue.
  5. Spread the load – you'll need a team to help you both organise and sponsor the event. There are lots of different aspects, so this is pretty easy to do.
  6. Feel free to add your own identity / twist to the event, but remember we are here because the formula above has evolved and matured nicely!
  7. Remember this is a community event organised by the community, for the community.  That does not mean Pentaho is excluded – precisely the opposite. It is important that there is a strong Pentaho presence at the event.

Either way I look forward to whatever shape #PCM16 brings. I’ll be there!

10th Spark London Meetup

On Monday I attended the 10th Spark London Meetup. This was an impressively well-attended meetup: 90 people came, virtually everyone who registered turned up, and there were 170 people on the waiting list!  One day PLUG will be like that!

Anyway, the presentation was some hard-core Spark material from Chris Fregly, who has had one impressive career in this tech.  Here are the key points I noted down:

  • A lot of talk about off-heap coding in Java.  This is a very interesting topic – I was only vaguely aware it was possible.  A lot of it seems to live in "unsafe" code areas – sounds exciting! (There's a minimal sketch of this after the list.)
  • There are some interesting numbers from the Daytona GraySort challenge (100 TB): Hadoop sorted at a rate of 1.4 TB/min with 2,100 nodes, while Spark sorted at 4.27 TB/min with only 206 nodes. Wow.
  • Spark also scaled linearly to a 1 PB sort.
  • Bizarrely they did this on OpenJDK.  I'm amazed they did that – I'd love to understand why.
  • There are some cool features (epoll?) that allow direct NIC -> disk transfers without using CPU user time. Nice. (See the zero-copy sketch after the list.)
  • We saw details of some clever optimisations in the Spark shuffle process. Not dissimilar to the pipelining feature with partitioning in PDI.
  • There was talk about mappers and reducers. What? I need to know more about that 🙂 I didn't realise they still existed in a Spark world.
  • LZF is now used in preference to Snappy. LZF lets you work on data on the CPU while it's STILL compressed – just as Actian manages to do. Clever stuff.
  • Tuning is extremely painful. There's really only a try-and-see approach.
  • An interesting matrix multiplication optimisation via transposing.  Amusingly, once you've done that, it's not always any faster 🙂 Nice theory though. (Sketched after the list.)
  • IBM are forming a team of 3,500 engineers (this is old news).
    • One has to wonder why IBM doesn't just buy Databricks?
  • CPU is often the bottleneck; disk/network upgrades often only improve performance by small margins.
  • 100+ Spark SQL functions have been converted to Janino-generated code for improved performance and flexibility.
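
On the off-heap point: here's a minimal sketch of the kind of thing that's possible with sun.misc.Unsafe – allocating and poking raw memory outside the GC heap. This is my own illustration, not code from the talk, and Unsafe is an internal JVM API that may change or disappear:

```scala
import sun.misc.Unsafe

object OffHeapDemo extends App {
  // Unsafe isn't meant to be obtained directly; grab the singleton via reflection.
  val field = classOf[Unsafe].getDeclaredField("theUnsafe")
  field.setAccessible(true)
  val unsafe = field.get(null).asInstanceOf[Unsafe]

  // Allocate 16 bytes outside the GC heap, write and read a long, then free it.
  val addr = unsafe.allocateMemory(16)
  try {
    unsafe.putLong(addr, 42L)
    println(s"read back: ${unsafe.getLong(addr)}")
  } finally {
    unsafe.freeMemory(addr) // off-heap memory is NOT garbage collected
  }
}
```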
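On the zero-copy point: I can't speak to the kernel-level NIC-to-disk path mentioned in the talk, but the closest JVM-level analogue I know of is FileChannel.transferTo, which hands the copy to the OS (sendfile on Linux) rather than shuttling bytes through user space. A minimal sketch – the file names are hypothetical:

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption => O}

object ZeroCopyDemo extends App {
  // Copy input.dat to copy.dat without pulling the bytes into user space.
  val src = FileChannel.open(Paths.get("input.dat"), O.READ)
  val dst = FileChannel.open(Paths.get("copy.dat"), O.WRITE, O.CREATE, O.TRUNCATE_EXISTING)
  try {
    var pos = 0L
    val size = src.size()
    // transferTo may move fewer bytes than requested, so loop until done.
    while (pos < size) pos += src.transferTo(pos, size - pos, dst)
  } finally { src.close(); dst.close() }
}
```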
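And the transposition trick, as I understood it: in a naive row-major multiply, the inner loop walks down a column of B, a cache-unfriendly stride; transposing B first makes both inner-loop reads sequential. A rough reconstruction of the idea (mine, not the speaker's code):

```scala
object MatMulDemo extends App {
  type Matrix = Array[Array[Double]]

  // Naive multiply: b(k)(j) strides down a column – poor cache locality.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val n = a.length; val m = b(0).length; val p = b.length
    Array.tabulate(n, m) { (i, j) =>
      var sum = 0.0; var k = 0
      while (k < p) { sum += a(i)(k) * b(k)(j); k += 1 }
      sum
    }
  }

  // Transpose B once; now both inner-loop reads are sequential in memory.
  def multiplyTransposed(a: Matrix, b: Matrix): Matrix = {
    val bt = b.transpose
    val n = a.length; val m = bt.length; val p = a(0).length
    Array.tabulate(n, m) { (i, j) =>
      var sum = 0.0; var k = 0
      while (k < p) { sum += a(i)(k) * bt(j)(k); k += 1 }
      sum
    }
  }
}
```

Which squares with the "not always faster" remark: the transpose itself costs a full pass over B, so for small matrices the naive loop can win.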

And then we continued with the second talk…

  • Noted the demo had a 2 GB permsize defined – blimey!
  • The demo didn't work.
  • The supposedly "simple" demo had about 15 different significant moving parts – bonkers!
  • The biggest Spark cluster is at Tencent – possibly up to 14k nodes now.
  • Use DataFrames, not RDDs. (There's a quick sketch of the difference after this list.)
  • A question about Parquet vs Avro – slightly odd, given they address different needs.
    • Apparently once you go beyond 22 columns in Parquet, performance degrades.
    • It used to have a hard 22-column limit – so presumably removing that limit was a hack.
    • Scan-speed benchmarking was shown – but again, that isn't the point of Parquet, so I'm not sure where this was going.
  • Everyone uses Kafka. EVERYONE!
  • HyperLogLog is an interesting approximate distinct-count algorithm. (Toy sketch after the list.)
  • An interesting debate about how you handle cold-start problems.
  • Zeppelin was used again for the demo.
  • Job management seems to be a pain in Spark.
  • There's a hidden REST API.
  • A cunning matrix factorisation example.
  • Boil the frog!
  • Flink – if Flink does take over, then for the IBM crew it won't be a case of trying to beat it – more a case of joining them.
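
On the "DataFrames, not RDDs" point: a DataFrame gives Spark a schema and a logical plan to optimise (via Catalyst), whereas an RDD of objects is opaque to the engine. A minimal sketch of the same aggregation in both styles, using the Spark 1.x API that was current at the time – the column names and data here are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DfVsRdd extends App {
  val sc = new SparkContext(new SparkConf().setAppName("df-vs-rdd").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val rdd = sc.parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 2)))

  // RDD style: opaque lambdas – Spark can't see inside them to optimise.
  val rddTotals = rdd.reduceByKey(_ + _).collect()

  // DataFrame style: declarative – the schema and plan let Catalyst optimise.
  val df = rdd.toDF("name", "score")
  val dfTotals = df.groupBy("name").sum("score").collect()

  rddTotals.foreach(println)
  dfTotals.foreach(println)
  sc.stop()
}
```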
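And since HyperLogLog came up: the trick is that the maximum run of leading zeros seen in hashed values estimates log2 of the cardinality, and averaging many bucket registers (with a bias-correction constant) tightens the estimate. A deliberately tiny toy version – the real algorithm adds small- and large-range corrections that I've omitted here:

```scala
import scala.util.hashing.MurmurHash3

object HllDemo extends App {
  val p = 10                            // 2^10 = 1024 registers
  val m = 1 << p
  val alpha = 0.7213 / (1 + 1.079 / m)  // bias-correction constant for large m
  val registers = new Array[Int](m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val bucket = h >>> (32 - p)         // top p bits pick the register
    val rest = h << p                   // remaining bits feed the zero count
    val rank = Integer.numberOfLeadingZeros(rest) + 1
    registers(bucket) = math.max(registers(bucket), rank)
  }

  // Harmonic mean of the per-register estimates, scaled by alpha * m^2.
  def estimate: Double =
    alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum

  (1 to 100000).foreach(i => add(s"user-$i"))
  println(f"true: 100000, estimated: $estimate%.0f")
}
```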

Tonight I'm off to hear about Spark at the Hadoop London usergroup – I'm sure it'll be similarly busy, and the content looks just as interesting.

Saiku, Open Source and sustainability

Recently Tom Barber (Magicaltrout) made a post here talking about the troubles of funding an open source product such as Saiku.

In talking to some colleagues, there's an angle here which perhaps some have not considered: the community version of the Pentaho BI server.  Should Saiku disappear, where does that leave the BI server?

Well, in a very bad state, I would suggest. Essentially you're then a BI server with no OLAP client.  So what is included in CE? Nothing other than reporting (and not even ad-hoc reporting as of 5.0).  Hmm – that's strangely like another recently acquired "open source" BI server out there whose community version is rarely seen in production.

So I just hope that those in the Pentaho community (including those at the corporation itself) realise the benefit that Saiku brings to the BI suite.

And before anyone points out the dashboards in CE (which are actually industry-leading): they're not part of the default CE product, and that is a great shame – a surprising number of people don't even know about the whole CTools stack.

I personally like some of the models I've seen recently where the community collaboratively funds specific features. Maybe that could work for Saiku!