#Pentaho Meetups Galore..

So, a few weeks ago a new meetup came on the scene, and it was briefly a bit confusing. It seemed to be an official Pentaho meetup group in London. OK, err, that’s odd, I thought. However, a bit of digging and it soon became clear this is a different kettle of fish.

Now, it just so happened we had actually scheduled (albeit not published) the next PLUG for January, and as it turned out they moved Challenge Big Data so as not to clash with #PCM15. So this was partly my fault for not promoting the January booking.

Anyway, the big question is – can these two groups co-exist? Well yes, I think so. Should they have both been done under the “PLUG” banner? Probably yes. If you’re looking for a meetup group then having two in the same area will be confusing. However, as long as we’re clear about content, target audience and dates I don’t see a problem here. And that’s the key thing – the target audiences of PLUG and Challenge Big Data are different. Sure, there may be some cannibalisation, but let’s see.

I shall be at the Challenge event, promoting PLUG 🙂 And I fully expect the cross-promotion to go both ways. I’m not sure I’ll attend every Challenge event – I guess I’ll be a content tart!

In the meantime, we’re still after content for the next PLUG, which now moves into February. Skillsmatter have shown in the past that the sooner content is locked down, the more signups and ultimately the more attendees – so it isn’t crazy to be asking for talks now. If anyone has either content requests or a talk they wish to give, then please let me know.


Oh, and by the way, don’t forget to support the first Kickstarter project in #Pentaho land – Saiku Reporting – a very worthy project! Details here

Skillsmatter are soon going to be crowdfunding themselves – they’re after £500,000 to further the business. Given their stunning new venue and their frankly unique business model, I’m pretty sure this is going to be a great success for them. Details on Crowdcube.

#PCM15 thoughts

Good evening

So #PCM15 is over, and frankly it was a resounding success. London delivered as a spectacular location, and the community provided some great content. Thanks to everyone involved! Oh, and we also re-raised the T-shirt bar, so that’s good too (this had slipped in recent times!).

Personally I felt I missed a lot of the content, so I’ll be perusing the slides to catch up on what I missed!

Diethard has already mentioned there are at least two strong candidates for #PCM16; there’ll be a vote some time in the new year and away we go again! I suggest we set out a timescale and allow a period of “PR” for the organisers!

A couple of things were discussed extensively over the weekend – or rather, there are a few things that I think should be put out more widely to the community.

The first was event timing. I’d pondered suggesting a move of date – I feared PWorld was cannibalising PCM – but even before #PCM15 I had spoken to a few people and come to the conclusion that the pros outweigh the cons. I got a similar impression from discussions during the event too, so it seems to make sense to stick to our early autumn time slot. I do think it makes a lot of sense to come after PWorld though.

The second thing was event structure – we’ve fallen into a pretty regular routine now, albeit each new organiser tweaks it in their own way. We now have a Friday event (a hack in our case), a multi-track conference on Saturday and social events on the Sunday. The feedback is that people are looking for some change – but it’s not clear what. Personally I don’t think PCM is ready to move to a multi-day conference – I like the fact that we blitz it in one day. In reality maybe we could extend Friday, but again – would people travel any earlier? Or would you just get the same attendees turning up late as before?

Finally, a quick dump of advice for the next organiser (I may add to this – I’m sure I’ve forgotten stuff):

  1. Use Meetup / Eventbrite to help organise.
  2. Be super clear about directions, hotels and locations.  Send repeated emails (we didn’t do this and some people complained!).
  3. Clearly the event is free – but do charge for lunch. We recommend lunch onsite – otherwise you’ll never get people back from the cafés! (Cascais!)
  4. Be sure to produce clear agendas with multiple paper copies available at the venue
  5. Spread the load – you’ll need a team to help you both organise and sponsor the event. There are lots of different aspects, so this is pretty easy to do.
  6. Feel free to add your own identity / twist to the event, but remember we are here because the formula above has evolved and matured nicely!
  7. Remember this is a community event organised by the community for the community.  That does not mean Pentaho are excluded – precisely the opposite. It is important that there is a strong Pentaho presence at the event.

Either way I look forward to whatever shape #PCM16 brings. I’ll be there!

#PCM15 Hackathon – Details!

So, the Pentaho Community Meetup hackathon is only a week away.

Firstly, the boring stuff – the location and signup page can be found here. NOTE: Skillsmatter has moved!

There is a bar, and we will try to arrange some snacks, but no pizza this time I’m afraid!

There are prizes too, thanks to Harris, but I’m not sure what they are. I did reject signed photos of him though.

Now, how will it work? Well, #PCM14 was a “chaos” hack, which meant “do anything you like” and form teams as you see fit. We’ll carry on in that theme but with a few crucial changes:

  1. We would like to encourage “random”-ish teams to form – if you turn up with five of your workmates, please don’t form a team with them – what’s the point in that!
  2. Whilst we do encourage you to bring your own tools and existing frameworks, please do NOT do any preparation in advance.  No pre-building the ETL, cubes, etc.
  3. That said, we would like to introduce a theme for the data this year – not mandatory, but maybe you’ll get more points!  The theme is going to be #opendata, and there is a great place to start looking here: https://data.gov.uk/data/search

Note: If you pick a horrid XML dump which is massive and impossible to parse then good luck.

Schedule:

Turn up from 6pm – some of us will be there earlier.

We’ll officially kick off about 6.30pm, aiming to form teams and begin hacking by 7pm.

The hack will finish at 8pm to give us an hour for presentations.

At the end we’ll head off to a pub in the local vicinity, as yet undecided.  Some mention has been made of finding a decent Scotch whisky place – I’m open to ideas!  It’s not an area I personally know well though, because Skillsmatter recently moved.

Last night’s Pentaho London Usergroup, and upcoming labs session

Last night we held our Q2 Pentaho London usergroup at the ever-excellent Skillsmatter.  Actually, that may be the last time we go there, as they are moving to a stunning new venue in the next few weeks – yay!

Anyway we had a great night, you can see the talks here:

https://skillsmatter.com/meetups/7220-pentaho-london-2015-q2-meetup#overview

And as discussed details about PCM London can be found here:

https://github.com/PentahoCommunityMeetup2015/info/blob/master/README.md

As well as meeting a guy from CERN, there was another guy interested in PDI, Spark and R. It was a pleasure to meet all the regulars as well as quite a few new faces.  The discussion continued down the pub as ever (and, ahem, started in the pub too), and we then got to wait many, many hours for our food at Pizza “express”. Pfft.

The next event is not scheduled yet; however, we do have a plan. We won’t be doing talks – instead we’ll be doing labs.  So how will this work?

Well, there will be tickets available for you to bring your problem or solution to a Pentaho expert to discuss, review and resolve.  You’ll be expected to show your issue, and then we’ll have a tech discussion and possibly even build a solution.

At the same time, you’ll be expected to share your experience with others.  So you’ll probably need to make sure there’s no personal data on display!

There’ll probably be just two sessions for each pillar of the stack, and the sessions will run in parallel. The pillars covered will be Mondrian, PDI, Reporting and cTools.  We may do a Weka session too if there is interest.

So, if people think this is a good idea, we’ll set up a ticket system which:

a) allows you to get a ticket to present to an expert

b) allows you to get a ticket to watch a particular session.

Feedback please…  Would this work at PLUG?

Performance testing Pentaho Data Integration (Get Variables)

In the past I worked as a dedicated performance tester on several OLTP systems, and recently I have spent quite a lot of time performance testing and tuning PDI (Pentaho Data Integration).

The one thing to remember about performance-related work is to never underestimate how long it takes.  It doesn’t matter what the technology is either – this is complex stuff and takes a seriously long time.  (Hence one of the reasons why a performance tester is paid more than a senior tester, and it is deemed a separate skill area.) Also: set an end goal. In a complex system you could tune forever, so decide on a line in the sand and stop once you’ve reached it.

As with any system, performance testing PDI has its own set of challenges.  There are a lot of ways to go about it, but here are some notes on a quick bit of step benchmarking.

But why? Well, a colleague said to me, “why is your Get Variables step inline rather than a join – a join would be faster”. So let’s see if that’s true… Actually, let me just tell you now – it’s not. The inline approach is faster.  How did I test this?

  1. Benchmark the “Generate rows” step – to prove at what speed we can source and write data. Note: it is important to write the rows somewhere (e.g. to a Dummy step), otherwise you’re not testing the step in its entirety.
  2. Benchmark the “Get variables” step inline.  Run three times and take the average. Run for a number of rows that takes a good 30s+ so that process init time becomes negligible. Just to add to the work PDI has to do, make sure the variable is converted to an Integer.
  3. Benchmark the “Join rows” approach.
  4. Benchmark getting 5 variables – both ways.

And the results:

  1. 2,374,000 records per second
  2. 1,528,000 r/s
  3. 1,316,000 r/s
  4. 1,491,000 r/s
  5. 1,300,000 r/s

So you can see the “join” approach (whether 1 variable or 5) only performs at about 86% of the speed of simply using the step inline (1,316,000 / 1,528,000 ≈ 86%).  I’m not sure if this has always been the case, but it’s true right now with PDI 5.4.  My hunch is that this is nothing to do with the join per se, but is probably more closely related to the fact that there are more hops.

Now clearly this is just a benchmark – a best case. It’s unlikely your incoming data arrives at 2M r/s, so changing to inline probably won’t help you. But if you’re CPU-starved then it can’t hurt either.  And as always with performance testing there are a lot of variables, so what works today may not work optimally tomorrow in a different scenario!

One day we’ll get support for using variables in Filter Rows, and variable substitution in the Calculator step will work, and then there’ll be a lot fewer Get Variables steps anyway!

Here’s a shot of the simple transformation, in the lovely new skin of 5.4.  Also note the use of canvas notes to document the performance scenarios.

[Image: performance-testing – screenshot of the benchmark transformation]

Executing R from Pentaho Data Integration (PDI / Kettle)

These days everyone has heard of R (or RStats if you want something more google-able), and it is doing an amazing job of replacing SAS.  SAS is a traditional, old-school package of tools and is extremely expensive – although it is generally accepted that if you can afford it, it is the best tool.

R is open source and has a mighty impressive selection of libraries. There are commercial offerings too – I don’t pretend to know the market in depth, but one of the leaders seems to be RevolutionR.

As usual I judge the success of a product by two things – job opportunities and meetups.  Well, given that any knowledge of R means you can classify yourself as a data scientist, you can really pick and choose from many different jobs.  It’s an extremely hot sector at the moment.  On the meetup side, LondonR is extremely successful (not jealous, ahem…) and never fails to sell out.

OK, so it’s a cool tech… Let’s see how it fits in with PDI/Kettle.

Hang on – why would I want to do that?  R can do everything PDI can do, right?  Err, yes and no.  PDI is extremely good at the plumbing/data architecture side. R is extremely good at the number crunching/stats side.  So use the right tool for the job…

So, here’s how I set everything up. Note I’m on (K)Ubuntu 12, so some of the steps may not be necessary for everyone.

  1. Download and install RevolutionR.  Don’t install the default r-base package, it’s ancient.
  2. Start “R”.
  3. Install rJava as per this wiki page.
  4. Now pick your PDI plugin. The wiki page above refers to the R Executor step, which is an enterprise-only plugin.  (This comes with enterprise PDI by default.)
  5. Mess around with libjri.so and keep trying various places to put it until PDI finds it.  Try an assortment of libswt/linux, libswt/linux/x86_64 and even ../libswt/linux (outside the PDI folder).
  6. Download the examples on the wiki page above and confirm they work.
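
Before involving PDI at all, a quick sanity check inside R itself is worthwhile – a minimal sketch, assuming rJava installed cleanly:

library(rJava)
.jinit()    # initialise the JVM from within R; returns 0 on success
.jcall("java/lang/System", "S", "getProperty", "java.version")    # which JVM has R picked up?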

Now, I went down the enterprise route, but there is a community version of the R script executor available in the marketplace from these guys: http://dekarlab.de/wp/?p=5. However, I was not able to get their example to work – for some reason R didn’t like the simple a+b calculation.  I don’t know if this is an issue in the plugin, in my code, or even something different in PDI 5.3.  I think it would be good if they included a full working example on GitHub alongside the plugin source code.

So now what?  Well, now we can do something interesting.  Last October I noticed the Twitter engineering team had released an R module for breakout detection.  (A breakout is when something you are measuring over time reaches a new high, i.e. it breaks out of its previous normal boundaries.  It is commonly used in trading, as typically when a share breaks out it goes up quite a lot.)  I admit it also piqued my interest that this was from Twitter!  You can find the library on GitHub.

(As an aside, it would be very interesting to understand more about how Twitter are managing to scale R and use it with their vast quantities of data – not something that R has traditionally been very good at.)

Later on, this Twitter blog appeared about a similar library – this time for anomaly detection.  Very nice – I’ve done a lot of work in the past around automatic alerting, and it’s very hard to get this kind of thing right.  Where is the line between simply a quiet day due to a national holiday and an intermittent problem causing your site to lose traffic?

I followed that blog only to find devtools doesn’t install on Ubuntu. The solution to that is here.

Oh, also see this page for how to install from GitHub. The command in that blog doesn’t work either.
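
For the record, something along these lines should do it – a sketch, assuming the missing libcurl dev package is your Ubuntu problem (the usual culprit) and that the GitHub repo names are still twitter/AnomalyDetection and twitter/BreakoutDetection:

# devtools needs the libcurl dev headers on Ubuntu first:
#   sudo apt-get install libcurl4-openssl-dev
install.packages("devtools")
library(devtools)
install_github("twitter/AnomalyDetection")    # anomaly detection library
install_github("twitter/BreakoutDetection")   # breakout detection library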

Then we just need to work out how to call that stuff from PDI.  It’s pretty easy; here is the R script:

library(AnomalyDetection)
data(raw_data)
res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=FALSE)
oput = as.data.frame(res$anoms)
oput

So what’s going on here?  Well, we load the library, use the sample dataset and detect our anomalies.  The only PDI-centric bit is the last two lines:

  1. Create a variable oput (it can be called whatever you want) which converts the result set to a data frame.
    1. Note: we don’t just use res, because res is a composite object that contains config as well as results.  By converting res$anoms we get just the anomaly data written to the result.
  2. Call it (just type its name) so that the contents of that variable are sent back to PDI.

The next step naturally would be to feed data into this step from PDI rather than using the sample dataset from the library.  That looks pretty easy – alas, I ran out of time today!
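
For what it’s worth, I’d expect it to look roughly like this – a hypothetical sketch only, since I ran out of time before trying it. It assumes the step exposes the incoming rows as a data frame (check the step’s options for the exact name – pdi_input and the column names below are made up for illustration), and AnomalyDetectionTs wants a two-column data frame of POSIXct timestamps and numeric values:

library(AnomalyDetection)

# pdi_input is an assumed name for the data frame of incoming PDI rows;
# event_time and hits are made-up column names purely for illustration.
ts_data <- data.frame(
  timestamp = as.POSIXct(pdi_input$event_time),
  count     = as.numeric(pdi_input$hits)
)

res <- AnomalyDetectionTs(ts_data, max_anoms=0.02, direction='both', plot=FALSE)
oput <- as.data.frame(res$anoms)
oput   # last expression – this is what goes back to PDI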

By the way, I added a short ktr to my samples repo to show this example. You can find it here: https://github.com/codek/pdi-samples/tree/master/rstats Naturally you’ll need the AnomalyDetection library installed in R before running in PDI.

How to sort data faster in PDI

As anyone who follows my previous blogs will know, I’ve recently been involved in a project involving billions of rows, where PDI is processing over a trillion records step to step.

(Massively simplified) part of that includes one major sort, and for various reasons it’s done in PDI – hey ho.

So, how do I make sort faster?

Simple: Reduce the heap, use more temp files on disk.

Errr, what?!  Typo?  Increase, surely?

Nope. Here’s my setup:

Scenario 1: Sort a bunch of data.  Set the heap to 48GB and monitor tmp disk space. PDI uses 77GB of temp space, and it takes 8 hours.

Scenario 2: Look at the above and think: ooh, don’t use tmp space, give PDI 77+48GB of heap. Surely it’ll be faster?  Sort in memory, no-brainer.  EVERYONE is talking in-memory these days (and has been for the last 5 years).  Err, no – 19 hours. OUCH.

The reason is the enormous cost of garbage collection in the second process (and that’s with the concurrent garbage collector too!).  On a 32-CPU box I see stretches of hours where only 1 CPU is being used.  Then suddenly it goes crazy, and then stops again.

Perhaps the different code path PDI uses when sorting with tmp files results in more efficient object usage?

Now, our disk is SSD, so in scenario 1 the impact of using tmp files is not as bad as it would normally be.  I had pondered setting up a ~77GB ramdisk, but I’m guessing any improvements would be very minor (I hardly ever see utilisation go up on the SSD itself).

Java 8 has some VM enhancements specifically around sorting – I wonder what it would take for PDI to start using those features?  That’s assuming support for Java 8 is added at all!

Happy Friday!

A note on garbage collection with Pentaho Data Integration / PDI / Kettle

Garbage Collection

PDI is a Java application (dur…), which means memory within the Java virtual machine is freed up using a process called garbage collection.

Now, garbage collection is a wildly complicated thing, and many people say that GC tuning is a black art. Whilst I wouldn’t say that, I would say that you should always start with the defaults and work from there – if something has no effect, then give it up.  One thing that is true with garbage collection is that the nature of the app (and the way it has been coded) has a significant impact.  So don’t think that because you’ve tuned a Tomcat app before, that knowledge will apply to PDI – it won’t!

Why should I do this?

Well, if you have no issues with PDI then you shouldn’t.  Simple!  But if you’re working with masses of data and large heap sizes, then this will bite you at some point.

And before you come to tuning the GC, have you tried a bigger heap size first (the -Xmx setting)? This is not necessarily a panacea, but if you’re doing a lot of sorting, or simply have a lot of steps and hence a lot of hops, then you will need proportionately more memory.

How do I know there is a problem?

This is easy – if you have a process that seems to hang for significant amounts of time, then fly along, then pause again, etc., this is most likely GC.  If your process trundles through at a consistent throughput, then you’re probably doing OK.

What settings?

Well, despite my saying above that you should tune for your application, there is one setting which you should most definitely apply for PDI, and that’s the concurrent collector.  Having just googled this to find a reference to it, I’ve realised there is both a concurrent collector and a parallel collector, and hence I now need to go to another PC to check which it is I use.

<short break – insert piped jazz music here>

OK, found it:

-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log

OK – so it seems I need to do some research on the parallel collector then. Has anyone used that?

Either way, there are 2 things in the options above:

  1. The instruction to the VM to use the concurrent collector (UseConcMarkSweepGC).
  2. Everything else configures logging – note the location of the log file.

These settings need to be put somewhere PDI will pick them up every time, i.e. in the environment in $PENTAHO_JAVA_OPTIONS, or directly in the spoon/carte/pan/kitchen scripts.
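
For example, something along these lines – a sketch only, so adjust the heap and log path to taste, and check which variable your spoon.sh/kitchen.sh actually reads, since some PDI versions use PENTAHO_DI_JAVA_OPTIONS instead:

export PENTAHO_JAVA_OPTIONS="-Xmx8g -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log"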

It is important to enable GC logging so you can see whether or not you do have a GC problem. Generally, if you have full GC collections of more than a few seconds, you may have a problem. And if you see full GCs taking minutes or hours, then you definitely have an issue!  The other options I use relate to the logging – they’re pretty self-explanatory, and Google/Stack Overflow will give further detail.

And that’s it – More later in the week on the topic of big data with PDI.

Working with Big (lots) Data and Pentaho – Extreme Performance

OK, firstly, I’m not talking proper BigData here.  This is not Hadoop, or even an analytical database.  (Let’s not get into whether an analytical database counts as big data though!) And it’s certainly not NoSQL.  Disk-space-wise we’re looking at hundreds of gigabytes, not terabytes.  Yet this project involves more data than the Hadoop projects I’ve done.

So, tens of billions of records – records that must be processed in a limited environment in extremely tight time windows.  And yes, I’m storing all of that in MySQL!

Hey, wake up – yes, I did say billions of records in MySQL; try not to lose consciousness again…  (It’s not the first time I’ve had billions of rows in MySQL either – yet I know some of you will guffaw at the idea.)

In fact, in this project we are moving away from a database cluster to a single box. The database cluster has 64 nodes and 4TB of RAM.  Our single box has 500GB of RAM, and that was hard fought for after we proved it wasn’t going to work with the initial 64GB!  Impossible? Yup, that’s what I thought.  But we did it anyway.

Oh, and just for a laugh, why don’t we make the whole thing metadata-driven and fully configurable, so you never even know which fields will be in a stream? Sure, let’s do that too.  No one said this was easy…

Now, how on earth have I managed that?  Well, firstly, this was done with an enormous amount of testing, tuning and general graft.  You cannot do something like this without committing a significant amount of time and effort.  And it is vital to POC all the way – prove the concept basically works before you go too far down the tuning route, as tuning is extremely expensive.  Anyway, we built a solution that works very well for us – your mileage may vary.

I do accept that this is very much at the edge of sanity…

So what did we learn?  How did we do this?  Some of this is obvious standard stuff. But there are some golden nuggets in here too.

  1. Disk usage other than at the start or end of the process is the enemy.  Avoid shared infrastructure too.
  2. Sorting (which ultimately ends up using disk) is evil. Think extremely hard about what is sorted where.
  3. Minimise the work you do in MySQL. Tune the living daylights out of any work done there.
  4. MyISAM all the way.
  5. NO INDEXES on large tables. Truncate and reload (there’s a minimal sketch of what I mean just after this list).
  6. RAM is not always your friend. You can have too much.
  7. Fastest CPUs you can find (caveats still apply – check the specs very carefully, Intel do some weird things).
  8. Partitioning utterly rocks.
  9. Test with FULL production loads or more; PDI/Java doesn’t scale how you might expect (primarily due to garbage collection) – in fact it’s impossible to predict.  This is not a criticism, it’s just how it is.
  10. In fact, PDI Rocks.
  11. Performance tune every single component separately. Because when you put it all together it’s very hard to work out where the bottlenecks are.  So if you start off knowing they ALL perform you’re already ahead of the game.
  12. Use Munin or some other tool that automates performance stat gathering and visualisation – but not exclusively. Also use top/iostat/sar/vmstat.  Obviously use Linux.
  13. What works on one box may not work on another. So if you’re planning on getting a new box then do it sooner rather than later.
  14. Be prepared to ignore emails from your sysadmin about stratospheric load averages <grin>
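
To make points 4, 5 and 8 a little more concrete, here’s a minimal, purely illustrative sketch of the shape of table I mean – the table and column names are made up, and your partitioning scheme will obviously depend on your data:

CREATE TABLE fact_example (
  event_date  DATE NOT NULL,
  customer_id INT NOT NULL,
  amount      DECIMAL(12,2)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(event_date)) (
  PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
  PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- no indexes at all on the big tables; each run is simply truncate and bulk reload:
TRUNCATE TABLE fact_example;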

I plan to follow this up with a further blog going into details of sorting in PDI in the next few days – complete with UDJC code samples. This post is necessary to set the scene and whet appetites!

Looking forward to hearing other similar war stories.

Of course, if anyone wants to know more, then talk to me at the next Pentaho London Meetup.