So, last night (11/1/18) I attended the ApacheBeamLondon meetup, only the second one so far, and it was a very interesting affair.
Firstly, the venue: Qubit, right bang in the middle of Covent Garden. What a cool location. Not sure what they do, but the offices were pretty nice!
The first talk was about Futureflow, a system for tracking money at the unit level, implemented using Apache Beam (previously Dataflow). The data is persisted in Bigtable. They are only interested in the flow of money, not who it moves between, and so think they can allay any privacy or regulatory concerns. They also think that using Pub/Sub makes it easy to get the data from the banks.
This is not dissimilar to another situation I’ve seen concerning grocery shopping data. In that market too, getting access to the data can be very long-winded. By simplifying things up front for the supplier you’re more likely to succeed.
Developing in a pipeline is good because you solidify your inputs and outputs, and then you can just get on with the boxes in the middle without affecting anyone else. And it’s those boxes in the middle that take the work!
There is some creative table design which trades storage for fast lookup. It’s a very dedicated data model for a row-scan-centric system. But in their case they have to be able to show scale, so it must be considered up front. The whole system relies on very fast transaction history lookup for a given unit of money.
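As a rough illustration of that trade-off (this is not Futureflow's actual schema, which these notes don't capture), here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the `unit_id#reversed_timestamp` key layout, the `MAX_TS` bound, and the `SortedTable` class, which is just an in-memory stand-in for a store that keeps rows sorted by key. The idea is to write one row per unit per transaction, which costs storage, so that a unit's history becomes a single prefix scan.

```python
from bisect import bisect_left

# Hypothetical upper bound on millisecond timestamps (13 digits).
MAX_TS = 10**13 - 1


def row_key(unit_id: str, ts_millis: int) -> str:
    """Build a key that sorts by unit first, then newest transaction first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{unit_id}#{reversed_ts:013d}"


class SortedTable:
    """Tiny in-memory stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._rows[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) for every row whose key starts with prefix."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


def history(table, unit_id):
    """Return a unit's transactions, newest first, using one prefix scan."""
    return [value for _, value in table.scan_prefix(unit_id + "#")]
```

The `#` separator keeps unit `u1` from matching rows for `u10`, and reversing the timestamp means the most recent transaction is the first row the scan returns.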
The second talk, from JB, was a deep dive into IOs in Apache Beam. This was very interesting, and I was pleased the organisers combined a deeply technical talk with a real use case talk.
Curiously, I saw a lot of similarities between some of the internals of Pentaho PDI and the PCollections/PTransforms in Beam. In particular, a PCollection === rowset, and a PTransform === step.
Anyway, it was very interesting to see how the guts of the IO steps work, and how batching is handled, including the new architecture for the Splittable DoFn.
There is even a MapReduce runner for Beam! Why?
— Jean-Baptiste Onofré (@jbonofre) January 11, 2018
Makes sense, especially when you think about those people who are stuck on older clusters but want to prepare for an upgrade.
On the IO side I liked the model of a bounded or unbounded source. It allows you to split the read over X readers and keep control of it.
There is a runner compatibility matrix, but this just covers functionality, NOT performance 🙂
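To picture the "split the read over X readers" part for a bounded source, here is a minimal sketch in plain Python. It is not Beam's source API; `split_range` and `read_range` are hypothetical helpers, and the input is assumed to be an indexable list of records of known size. An unbounded source has no fixed end to divide up like this, so it isn't covered here.

```python
def split_range(start: int, end: int, num_readers: int):
    """Split [start, end) into at most num_readers contiguous sub-ranges."""
    if num_readers < 1:
        raise ValueError("num_readers must be at least 1")
    base, extra = divmod(end - start, num_readers)
    ranges = []
    pos = start
    for i in range(num_readers):
        # The first `extra` readers take one more record each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer records than readers: leave the rest idle
        ranges.append((pos, pos + size))
        pos += size
    return ranges


def read_range(records, sub_range):
    """One reader's work: return only the records in its own sub-range."""
    lo, hi = sub_range
    return records[lo:hi]
```

Because the sub-ranges are contiguous and don't overlap, each record is read exactly once, and concatenating the readers' outputs in order gives back the original data.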
Finally, there was a really good discussion about TDD and mocking Beam pipelines for unit testing. This should be easy and there’s nothing in Beam to prevent it, but it seems it’s actually quite hard (although the unit tests of Beam itself make use of this technique). Now just imagine if there was a product that explicitly supported unit testing from within, and/or provided examples; it would be amazing. I think it’s a great sign that this came up in the discussion.
So thanks to the speakers, organisers and sponsors, and well done for putting on such an interesting event.
See you all at PLUG in 2 weeks!