How to use sub-mappings in Pentaho MapReduce

So, short and sweet. How do you use sub-mappings in a mapper or reducer?

Natively, you can’t. It doesn’t work, and the reason is simple: the Pentaho MapReduce step doesn’t make the sub-mapping .ktr available to the job; it only publishes the top-level job.

So the solution is to use an HDFS URL as the name of the sub-mapping transformation, i.e.:

hdfs://hdfsserver:8020/user/rockstar/mysubtransformation.ktr

This, however, has side effects: namely, Spoon will hang rather a lot. So the only way to apply this is to hack the XML of your transformation. Yuck!
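For illustration, this is roughly what the hacked mapping step entry inside the mapper/reducer .ktr ends up looking like. The step name is made up and the element names are from memory (they can differ between PDI versions), so treat it as a sketch of the idea rather than a copy-paste recipe:

  <step>
    <name>Call sub mapping</name>
    <type>Mapping</type>
    <!-- Point the mapping at a URL every node can resolve,
         instead of a repository or local file path -->
    <specification_method>filename</specification_method>
    <filename>hdfs://hdfsserver:8020/user/rockstar/mysubtransformation.ktr</filename>
  </step>

Make the change in a text editor, and only reopen the transformation in Spoon when you really have to, given the hanging mentioned above.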

You could actually use any resolvable URL. I think it makes most sense to use HDFS, but make sure you put the .ktr into the distributed cache so it’ll always be on the local node. BOOM!

Naturally there is now a JIRA, but as we’re all going Spark, I don’t see it being fixed too quickly 🙂

Building a lambda architecture

So, while recently researching lambda architectures I came across these links, and I thought some were worth sharing here:

This document has a great slide which shows how you keep the data stores separate but merge at the serving layer:

Just to keep things interesting, there is a subtly different view here (from a LinkedIn guy):

That solution is not dissimilar to this document here:

An important comment about the fundamental principle of immutable data in lambda:

(Don’t worry, the page is nothing about Talend itself. A common marketing trick that tech companies seem to be using a lot these days: talk about cool tech just to get yourself linked to. Oh damnit, I just did that. Damn!)

Then there’s the outsider: Kudu. Kudu seems to be going back to mutability, BUT Kudu is far from suitable for production use, and it has a horrible deployment architecture.

Finally, Inquidia (Pentaho big data partners in the States) have a page on it with a good summary of the options, latency implications, etc. This can be found here:

Community News – July 2016

Hi Everyone,
So it’s been a while since the last news update; there’s been a lot going on, but really it’s all about the meetups…
#Pentaho News Roundup
 
Greenplum use case
Released only today, here’s a good read on EMC and their use of Greenplum.
PDI running in Snowflake
So, let’s see if we can get Inquidia to cross the pond and present this to PLUG. In the meantime, to see how they’ve released a plugin that enables PDI to run in the cloud via Snowflake, go here.
 
#pentahomeetup
#PCM16
OK, this is SURELY THE BIG NEWS! Pentaho Community Meetup 2016 is back in Antwerp this year. For those that don’t know, it’ll follow a fairly familiar pattern along the lines of:
  • Friday – Arrive in Antwerp, chill
  • Friday evening – Hackathon-esque event
  • Saturday – Main conference (at a stunning location!)
  • Saturday evening – Trialling the stunning diversity of beer available in Belgium
  • Sunday – Sightseeing / crawling home
There’s no formal agenda yet, but we do know the dates: 11th–13th November. So get your flights/trains/hotels booked now.
As usual there will be an extensive turnout from Pentaho themselves: developers, product owners, Pedros, etc. Then there’ll be the usual blend of developers, users and data ninjas.
Also, with no “PentahoWorld” this year there’ll be no cannibalisation, so this looks set to be a huge one.
#PLUG
But! Don’t forget good old Pentaho London – PLUG! We’re moving to Canonical’s offices for a meetup to discuss big data DevOps amongst other things. You’ll also see Nelson presenting something top secret, which sounds intriguing. Register here.
#blogs
Dynamic processing engines
 
This has generated quite some interest. Err, from myself! Read it here.
 
#jobs
Jobs
Not sure of any specifics going around, but there’s loads out there; the market is crazy right now!

Dynamic number of processing engines in Pentaho

So, about 3 years ago this foxed me, and as it wasn’t particularly important at the time we went with a hard-coded number of processing engines. Randomly, whilst driving through the beautiful British countryside last week, I realised how simple the solution really is. Here is what I was passing at the time… (Brocket Hall, they do food I think!)


Whoa! Hang on a minute, what am I on about? OK, the scenario is this, and it’s common in a metadata-driven system: you want to process all your data and, depending on some attribute, send the data to one template or another. Fine.

BUT because we love metadata, and we love plugins, you don’t want to have to change the code just to add a brand new template. Even if it would just be a copy-and-paste operation…

Concrete example? Sure. You’re processing signal data. You want to store the data differently depending on the signal.  So you have metadata that maps the signal to the processing engine.  Your options could include:

  • Store as-is with no changes, all data
  • Store only the records where the signal value has changed from the previous value
  • Perform some streaming FFT analysis to move the data to the frequency domain
  • Perform some DCT or other to reduce the granularity of the data
  • etc!

The solution in the end is ridiculously simple. Just use a simple mapping, partitioned, and use the partition ID in the mapping file name!
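To make that concrete, here’s a sketch of what the partitioned mapping step could look like in the .ktr XML. I’m assuming the partition ID is picked up via PDI’s internal variable ${Internal.Step.Partition.ID}, that the engine transformations follow an engine<partition id>.ktr naming convention, and that ${ENGINES_DIR} is a variable pointing at wherever they live; the element names and partitioning block are from memory and may differ by PDI version, and the code in the repo below may use slightly different names:

  <step>
    <name>Run engine</name>
    <type>SimpleMapping</type>
    <partitioning>
      <!-- One copy of the step runs per partition, and each copy
           resolves its own partition ID -->
      <method>ModPartitioner</method>
      <schema_name>engines</schema_name>
    </partitioning>
    <specification_method>filename</specification_method>
    <!-- The partition ID becomes part of the sub-transformation name,
         so a new engine is just a new .ktr plus a metadata row -->
    <filename>${ENGINES_DIR}/engine${Internal.Step.Partition.ID}.ktr</filename>
  </step>

Adding a brand new processing engine then means dropping another engine .ktr next to the existing ones and updating the signal-to-engine metadata, with no change to the parent transformation.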

As always you can find the code here. (Notice the creation of a dummy engine called enginepartition-id.ktr, which just keeps PDI happy and stops it moaning and preventing you from closing the dialog!)