So last night I went to the first @ApacheKafka meetup in London. I’d mentioned on Twitter how surprising it was that there wasn’t one already! The organisers did an excellent job, and the meetup sold out quickly, just like a lot of the other uber cool tech meetups in London.
You can find details on who talked here: http://www.meetup.com/Apache-Kafka-London/events/226160410/
We were hosted at Connected Homes (British Gas), owners of Hive and various other IoT devices. They explained how even with 10k customers you soon hit problems with scaling efficiently, i.e. without throwing money at the problem! They are now at 250k customers, with an average of 9 devices per house.
Paul Makkar (previously a particle physicist) then presented “Why Streaming”
Half the audience seem to be newbs, and half experts. 1/3 have written kafka code.
He mentioned how you may choose to throw away a degree of the data. This ties into a comment at another data science meetup a few months ago, where the guy was pointing out that if you have terabytes of sensor data which covers 10 component failures, then actually you don’t have big data, do you? You ONLY have 10 failures. Interesting.
Kafka is basically your redo logs from a relational DB. Good analogy!
Logs survive a specific time
Ticker tape comparison
Mentions of stagnant data lakes, and Kafka can feed into these, so it’s a data river? Hmm.
Flavio from Confluent @fpjunqueira
You use brokers so you can easily handle consumption at different rates
And independent failures
It might crash! So replicate it.
Timing! Use a sequencer. One leader several followers
Multiple Web servers
3 streams. ‘Topics’ can be replicated and partitioned.
Can partition a massive topic over several replica sets.
Key-based partitioning, round robin, or custom; usual stuff here.
Each consumer stores its own offset and is responsible for it; the new consumer is better at this though
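To make the three partitioning styles concrete for myself, here is a rough Python sketch. It is not Kafka's own partitioner: CRC32 is just a stand-in hash, and the partition count, function names and the "vip-" rule are all made up for illustration.

```python
import itertools
import zlib

NUM_PARTITIONS = 3  # assumed partition count, purely for illustration


def key_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Key-based: the same key always maps to the same partition.

    CRC32 is a stand-in hash here, not the hash Kafka itself uses.
    """
    return zlib.crc32(key) % num_partitions


def make_round_robin(num_partitions: int = NUM_PARTITIONS):
    """Round robin: return a function that cycles through partitions, ignoring keys."""
    counter = itertools.count()
    return lambda: next(counter) % num_partitions


def custom_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hypothetical custom rule: pin 'vip-' keys to partition 0, hash the rest."""
    if key.startswith(b"vip-"):
        return 0
    return 1 + zlib.crc32(key) % (num_partitions - 1)
```

The point of key-based partitioning is ordering: every message for one key (say, one house) ends up in the same partition, so a consumer sees that key's messages in order.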
More consumers means faster processing
Offset persistence is important to avoid dupes (It’s not really a dupe – it’s just you’ve consumed it twice)
Can now store offset in kafka!
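The "consumed it twice" point is easier to see with a toy example. This is a minimal sketch using an in-memory list as a single partition; the `consume` function and its `crash_after` parameter are hypothetical, and real consumers commit through the Kafka client rather than like this.

```python
log = ["m0", "m1", "m2", "m3", "m4"]  # one partition, held in memory


def consume(log, committed, crash_after=None):
    """Process messages from the committed offset, committing after each one.

    If crash_after is set, simulate a crash after processing that many
    messages but before committing the last of them.
    Returns (processed messages, committed offset).
    """
    processed = []
    for offset in range(committed, len(log)):
        processed.append(log[offset])
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed  # crashed before this commit happened
        committed = offset + 1  # commit: the next read starts after this message
    return processed, committed


first_run, committed = consume(log, 0, crash_after=3)
second_run, committed = consume(log, committed)
# "m2" shows up in both runs: processed, but its offset was never committed
```

So the log only ever holds one copy of "m2"; it is the stale committed offset that makes the consumer read it again after a restart.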
Zookeeper for replica management
Log compaction – this is clever. With a KV store it just keeps the latest value for each key.
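As I understood it, compaction boils down to something like the sketch below. This is only my simplified model of the idea on an in-memory list of (key, value) pairs; the `compact` function and the sample data are made up, and the real thing works on log segments in the background.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    # Later records overwrite earlier ones, so this ends up as key -> last offset.
    latest_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [
    ("thermostat", 19),
    ("boiler", "off"),
    ("thermostat", 21),
    ("boiler", "on"),
    ("thermostat", 20),
]
compacted = compact(log)  # [("boiler", "on"), ("thermostat", 20)]
```

A new consumer reading the compacted log still ends up with the current state of every key, without replaying the whole history.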
End to end compression
Originally From LinkedIn
ASF top-level project
Confluent adds a platform
Hiring of course
Lots of good questions!
Ben Stopford: practical Kafka
Showed some pretty easy to write code
Tuned for latency over throughput but tunable, so you can choose what you want
ISR – “in-sync replica”
Demo: will it work?
Running as root!
New consumer approach is polling. Eh? The OLD tech was streaming. This is bizarre. Did I get that right?
New API is better for tracking offsets (the API handles it)
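For my own benefit, a polling consumer that tracks its own position looks roughly like this. `ToyConsumer` is a hypothetical in-memory stand-in I wrote to illustrate the pattern, not the Kafka client API, and `max_records` is just a made-up batch size.

```python
class ToyConsumer:
    """Hypothetical in-memory stand-in for a polling consumer (not the Kafka client)."""

    def __init__(self, log):
        self._log = log
        self._position = 0  # next offset to read, tracked by the consumer object

    def poll(self, max_records=2):
        """Return up to max_records new messages and advance the position."""
        batch = self._log[self._position:self._position + max_records]
        self._position += len(batch)
        return batch

    @property
    def position(self):
        return self._position


consumer = ToyConsumer(["a", "b", "c"])
batches = []
while True:
    batch = consumer.poll()
    if not batch:  # an empty poll means nothing new to read
        break
    batches.append(batch)
```

The calling code never touches an offset itself; it just keeps asking for the next batch, which is the "API handles it" part.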
Fail over is nice and dynamic – Demo shows this