#Pentaho Meetups Galore..

So; A few weeks ago a new meetup came on the scene, and it was briefly a bit confusing. It seemed to be an official Pentaho meetup group in London. OK, err, that's odd, I thought. However, a bit of digging soon made it clear this is a different kettle of fish.

Now; It just so happened we had actually scheduled (albeit not published) the next PLUG in January, and as it turned out they had moved Challenge Big Data so as not to clash with #PCM15. So this was partly my fault for not promoting the January booking.

Anyway, the big question is – can these two groups co-exist? Well yes, I think so. Should they both have been run under the "PLUG" banner? Probably, yes. If you're looking for a meetup group then having two in the same area will be confusing. However, as long as we're clear about content, target audience and dates I don't see a problem here. And that's the key thing – the target audience of PLUG and Challenge Big Data is different. Sure, there may be some cannibalisation, but let's see.

I shall be at the Challenge event, promoting PLUG 🙂 And I fully expect that cross-promotion to go both ways. I'm not sure I'll attend every Challenge event – I guess I'll be a content tart!

In the meantime, we're still after content for the next PLUG, which now moves into February. Skillsmatter have shown in the past that the sooner content is locked down, the more signups and ultimately the more people attend – so it isn't crazy to be asking for talks now. If anyone has either content requests or a talk they wish to give, please let me know.


Oh, and by the way, don't forget to support the first Kickstarter project in #Pentaho land – Saiku Reporting – this is a very worthy project! Details here.

Skillsmatter are soon going to be crowdfunding themselves – they're after half a million pounds to further the business. Given their stunning new venue and their frankly unique business model, I'm pretty sure this is going to be a great success for them. Details on Crowdcube.

RabbitMQ and PDI

So last week I was at a Kafka meetup and this week here I am working with RabbitMQ. Funny world..

RabbitMQ has been around forever. If you're interested in how it compares to Kafka, there's a good Quora post here.

So; Let's install Rabbit and get going. The server is incredibly small, about 4MB. Wow, really?

It doesn't install out of the box on Ubuntu Precise, so you have to fudge around with Erlang as described here, but after that away you go. The server starts up and whoosh, you have a queue. (Well, actually you have nothing, but you can start creating queues!)

So, let's populate the queue with some data. On the PDI marketplace there is a step called "IC AMQP". Install that and you should be able to both consume from and populate a queue. Unfortunately I couldn't get it to populate, so I just did it in a User Defined Java Class step instead. Pretty easy, and I can populate the queue (single threaded) at a rate of 25,000 messages per second (200-byte messages). This pushes my CPU pretty high, so it's probably as far as my four-year-old laptop is going to go. Oh; you need to download the RabbitMQ client jars into PDI to do this – being careful not to duplicate one of the commons libraries (PDI ships a newer version, so stick with that).
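For the curious, here's a minimal sketch of the kind of publisher code I'm wrapping in the User Defined Java Class step. The host and queue name ("localhost", "pdi.test") are placeholders of mine, not anything PDI or the plugin mandates:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class RabbitPublisher {
    public static void main(String[] args) throws Exception {
        // Assumed local broker and queue name - adjust for your environment.
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Declaring the queue is idempotent: it's created if it doesn't exist.
        String queue = "pdi.test";
        channel.queueDeclare(queue, true, false, false, null);

        // Publish a batch of ~200-byte messages to the default exchange,
        // which routes directly to the queue by name.
        byte[] payload = new byte[200];
        for (int i = 0; i < 25000; i++) {
            channel.basicPublish("", queue, null, payload);
        }

        channel.close();
        connection.close();
    }
}
```

Declaring the queue up front means the transformation doesn't care whether the queue already exists – the declare is a no-op if it does.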

I was a bit surprised there seemed to be no command line tools for this, or maybe I missed them – I assumed there'd be some way of creating queues and configuring routes/topics and such things in advance. (As far as I can tell, rabbitmqctl covers broker administration, and the management plugin ships a rabbitmqadmin script that can declare queues, exchanges and bindings – but nothing jumped out of the default install.)

On the consumption side I was able to read the data using the IC plugin, but it was a bit more sluggish. By running multiple copies of the step I was able to read at 7,000 messages per second. I suspect this relates to the implementation of the IC AMQP step rather than anything to do with Rabbit.

Additionally, the IC step doesn't allow streaming from a queue. Once the queue is emptied, or once you hit a record limit, it exits. We'll ultimately need an option that'll sit there and listen forever. (Pretty simple to code, so it looks like a tiny modification or re-write of that step – something along the lines of the sketch below.)
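To illustrate, a minimal "listen forever" consumer using the RabbitMQ Java client's push API – again, the host and queue name are my own placeholders:

```java
import java.io.IOException;

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

public class RabbitListener {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed local broker

        Connection connection = factory.newConnection();
        final Channel channel = connection.createChannel();
        String queue = "pdi.test"; // hypothetical queue name

        // basicConsume registers a callback and returns immediately;
        // the client's own threads keep delivering messages indefinitely,
        // which is exactly the "sit there and listen" behaviour we want.
        channel.basicConsume(queue, false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body)
                    throws IOException {
                // In a PDI step this is where you'd push a row downstream.
                System.out.println(new String(body));
                channel.basicAck(envelope.getDeliveryTag(), false);
            }
        });
    }
}
```

Because the client delivers messages on its own non-daemon threads, the process simply stays alive and keeps consuming.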

What next? I’m curious to see if the message size affects the population/consumption rate.  That’ll be important when scaling this up.   Potentially we can then use an AWS auto scaling group to scale out the PDI servers if they are unable to consume from the queue at a sustainable rate.

And beyond that? Well, clearly exactly the same approach can be used with Kafka. It should be pretty easy to build an input/output step.. Something for a rainy(er) day, I think!


#KafkaLondon Inaugural Meetup

So last night I went to the first @ApacheKafka meetup in London. I'd actually mentioned on Twitter that it was surprising there wasn't one! The guys did an excellent job, and the meetup sold out quickly – just like a lot of the other uber-cool tech meetups in London.

You can find details on who talked here: http://www.meetup.com/Apache-Kafka-London/events/226160410/

So we were hosted at Connected Homes (British Gas), owners of Hive and various other IoT devices. They explained how, even with 10k customers, you soon hit problems with scaling efficiently – i.e. without throwing money at the problem! They are now at 250k customers, with an average of 9 devices per house.

Paul Makkar (previously a particle physicist) then presented on "Why Streaming":

- Half the audience seem to be newbies, half experts; about a third have written Kafka code.
- He mentioned how you may choose to throw away a degree of the data. This ties into a comment at another data science meetup a few months ago, where the speaker pointed out that if you have terabytes of sensor data covering 10 component failures, then actually you don't have big data, do you – you ONLY have 10 failures. Interesting.
- Kafka is basically the redo log from a relational DB. Good analogy!
- Logs survive for a specific time.
- Ticker tape comparison.
- Mentions of stagnant data lakes, and Kafka can feed into these – so it's a data river? Hmm.

Flavio Junqueira from Confluent (@fpjunqueira):

- You use brokers so you can easily handle consumption at different rates, and independent failures.
- It might crash! So replicate it.
- Timing! Use a sequencer: one leader, several followers.
- Multiple web servers.
- 3 streams. "Topics" can be replicated and partitioned.
- You can partition a massive topic over several replica sets.
- Key-based partitioning, round robin, or custom – the usual stuff here.
- Each consumer stores its offset and is responsible for it; the new consumer is better at this, though.
- More consumers means faster processing.
- Offset persistence is important to avoid dupes (it's not really a dupe – it's just that you've consumed it twice).
- Offsets can now be stored in Kafka itself!
- ZooKeeper for replica management.
- Log compaction – this is clever. For a KV store it just keeps the latest value for each key.
- End-to-end compression.
- Originally from LinkedIn; now an ASF top-level project, currently at 0.8.2.2.
- Confluent adds a platform on top. Hiring, of course.
- Lots of good questions!

Ben Stopford – practical Kafka:

- Showed some pretty easy-to-write code (a flavour of it is sketched below).
- Use callbacks.
- Tuned for latency over throughput, but tunable, so you can choose what you want.
- ISR – "in-sync replica".
- Demo – will it work? Running as root!
- The new consumer approach is polling. Eh? The OLD tech was streaming. This is bizarre? Did I get that right?
- The new API is better for tracking offsets (the API handles it).
- Failover is nice and dynamic – the demo showed this.
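For a flavour of that "easy to write" producer code, here's a minimal sketch against the new Java producer API with a callback. The broker address and topic name are placeholders of mine, not anything from the talk:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) {
        // Assumed broker address and topic name - adjust for your cluster.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // send() is asynchronous; the callback fires once the broker acks
        // (or fails) the write, so the producer isn't blocked per message.
        producer.send(new ProducerRecord<>("test-topic", "key-1", "hello kafka"),
                new Callback() {
                    public void onCompletion(RecordMetadata metadata, Exception e) {
                        if (e != null) {
                            e.printStackTrace();
                        } else {
                            System.out.println("Wrote to partition " + metadata.partition()
                                    + " at offset " + metadata.offset());
                        }
                    }
                });

        producer.close();
    }
}
```

You can also block on the Future that send() returns if you'd rather trade latency for simplicity.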

#PCM15 thoughts

Good evening

So #PCM15 is over, and frankly it was a resounding success. London delivered as a spectacular location, and the community provided some great content. Thanks to everyone involved! Oh; and we also re-raised the T-shirt bar, so that's good too (this had slipped in recent times!).

Personally I felt I actually missed a lot of the content, so I'll be perusing the slides to see what I missed!

Diethard has already mentioned there are at least two strong candidates for #PCM16. There'll be a vote some time in the new year, and away we go again! I suggest we set out a timescale and allow a period of "PR" for the organisers!

A couple of things were discussed extensively over the weekend – or rather, there are a few things that I think should be put out more widely to the community.

The first was event timing. I'd pondered suggesting a change of date – I feared PWorld was cannibalising PCM – but even before #PCM15 I had spoken to a few people and come to the conclusion that the pros outweigh the cons. I got a similar impression from discussions during the event too, so it seems to make sense to stick to our early autumn time slot. I do think it makes a lot of sense to come after PWorld, though.

The second thing was event structure. Well, we've fallen into a pretty regular routine now, albeit each new organiser tweaks it in their own way: a Friday event (a hack in our case), a multi-track conference on Saturday and social events on the Sunday. The feedback is that people are looking for some change – but it's not clear what. Personally I don't think PCM is ready to move to a multi-day conference – I like the fact that we blitz it in one day. In reality maybe we could extend Friday, but again – would people travel any earlier? Or would you just get the same attendees turning up late as before?

Finally, a quick dump of advice for the next organiser (I may add to this – I'm sure I've forgotten stuff):

  1. Use Meetup / Eventbrite to help organise.
  2. Be super clear about directions, hotels and locations. Send repeated emails (we didn't do this, and some people complained!).
  3. Clearly the event is free – but do charge for lunch. We recommend lunch onsite – otherwise you'll never get people back from the cafés again! (Cascais!)
  4. Be sure to produce clear agendas with multiple paper copies available at the venue
  5. Spread the load – you’ll need a team to help you both organise and sponsor the event. There’s lots of different aspects so this is pretty easy to do.
  6. Feel free to add your own identity / twist to the event, but remember we are here because the formula above has evolved and matured nicely!
  7. Remember this is a community event organised by the community for the community. That does not mean Pentaho are excluded – precisely the opposite: it is important that there is a strong Pentaho presence at the event.

Either way I look forward to whatever shape #PCM16 brings. I’ll be there!