Pentaho Security – Hybrid LDAP / JDBC

Pentaho uses Spring Security under the hood – version 4.1.3 as of Pentaho 8.0. You don't really need to know much about it, except that it's an industry-standard (for Java, at least) security layer.

The great thing about that is the flexibility it gives users and tweakers of the Pentaho platform.

For the Pentaho developers (way back in the day) it also meant they didn't have to re-invent the wheel. Rather handily, following an industry standard is also better from a security standpoint – hence there have been very FEW security vulnerabilities in the Pentaho platform.

Anyway – it's very common to see these two things in virtually all environments:

  • LDAP / Active Directory
  • Roles/Permissions available in a database.

Now, I've been at a few places where LDAP contains both the users (for authentication) and the roles (for authorisation). In those that didn't have the latter, we often recommended that LDAP was the right place for it. In some places this was achieved by creating distribution groups in Outlook (!)

However, in a lot of environments it can be very hard/slow to get data in LDAP updated, hence it may be nicer to store the authorisation data elsewhere, such as in a database.

Lo and behold! I was perusing the docs the other day, and this is clearly and concisely documented as an LDAP hybrid security option in the official documentation.

In fact, if you have to do any security configuration, LDAP or not, be sure to get up to speed with these docs and the files involved – it’ll help you understand the basic concepts.


RequireJS, JQuery Plugins and #Pentaho CDE

So, what seems like a year or so ago, but actually turns out to be 2015, Pedro Alves posted about a huge new change to CDF – support for RequireJS. Great! What's that then?

Well actually, one of the main advantages is embeddability, and the ability to communicate with other objects on the page. This is great in theory, but in practice rarely used. So it's a shame that such a significant underlying change has to impact everyone – it's not a backwards-compatible change.

However, another advantage – although one that is forced upon us – is that all the modern components, such as the templateComponent and possibly a few others, now REQUIRE a RequireJS dashboard. So we'll all have to move eventually – it's not a question of choosing, it's a migration job. In reality, the way require handles dependencies is much nicer, and does solve some headaches. It's interesting to see that Sparkl (App Builder) has not been modified to work in the RequireJS paradigm yet.

One of the enormous benefits of CDF, and a key point about the architecture, is that it uses open-source libraries where possible – RequireJS in fact being one of them! So how do we use some of these additional libraries now?

Well, the first thing is that if your plugin is not available as an AMD module, you have to create a shim. Here's how this works, using jeditable as an example:

  1. Put jquery.jeditable.js into your solution – anywhere really; I put it in /public/dashboards/plugins
  2. Put this code in a resource section in your dashboard (no need to give it a name in this case)
var requireConfig = requireCfg.config;

if(!requireConfig['amd']) {
  requireConfig['amd'] = {};
}

if(!requireConfig['amd']['shim']) {
  requireConfig['amd']['shim'] = {};
}

requireConfig['amd']['shim']["cde/resources/public/dashboards/plugins/jquery.jeditable"] = {
  exports: "jQuery",
  deps: {
    "cdf/lib/jquery": "jQuery"
  }
};

// Then load the shimmed module (this part is reconstructed from a
// fragment of the original snippet -- it requires the path registered above):
require([
  "cde/resources/public/dashboards/plugins/jquery.jeditable"
], function($) {
  // $ now has the jeditable plugin attached
});

Now there are two things going on here. You're setting up your config first, then loading the shim. The config is important because it defines that jeditable depends on jQuery, and it's this that resolves issues with $ being uninitialised.
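To see why the config half matters, here's a minimal standalone sketch of what that resource snippet does to the config object (plain JavaScript – registerShim is a made-up helper name for illustration; the module path is the same one used above):

```javascript
// Sketch: safely create the nested amd.shim section of CDF's require
// config and register a non-AMD jQuery plugin against it. The shim says
// "this module exports jQuery and must load after cdf/lib/jquery".
function registerShim(requireConfig, modulePath) {
  if (!requireConfig.amd) {
    requireConfig.amd = {};
  }
  if (!requireConfig.amd.shim) {
    requireConfig.amd.shim = {};
  }
  requireConfig.amd.shim[modulePath] = {
    exports: "jQuery",
    deps: { "cdf/lib/jquery": "jQuery" }
  };
  return requireConfig;
}

// Same path as the jeditable example above:
var path = "cde/resources/public/dashboards/plugins/jquery.jeditable";
var cfg = registerShim({}, path);
// cfg.amd.shim now holds the jeditable entry
```

The guards are there because other resources may already have populated parts of the config – you only ever add your own entry.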

Note: I took the jquery.jeditable.js from CDF, rather than downloading the latest. It seems to work but, like a lot of Pentaho libraries, it's probably quite out of date.

Unfortunately this shim approach doesn't always work – you just need to have a go. I found it didn't work for bootstrap-editable, for example. That code appears to have exactly the same structure, but for now jeditable will do the job for me.

Anyway, how do you then use jeditable? Pretty simple. Create a Kettle transformation endpoint in your App Builder plugin, with two parameters – ID and value:


Then add some HTML in your dashboard:


Then add this into a dashboard_postinit function in the components section:

function() {
  // Selector and endpoint URL reconstructed -- adjust the selector to
  // match your own HTML; "bow" and "testpdiupdate" are the app and
  // transformation names described below:
  $('.editable').editable(
    '/pentaho/plugin/bow/api/testpdiupdate',
    {
      name: 'paramvalue',
      id:   'paramid'
    }
  );
}

Note you must rename the parameters because CDA puts this 'param' prefix onto them for you, for some reason. In my example above, bow is the name of our app and testpdiupdate is the transformation name. Note: if you edit the transformation in place, don't forget to click the refresh button on the endpoints screen, otherwise the old endpoint code will run.
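The renaming convention can be sketched as a tiny helper (an illustration only – toEndpointParams is a hypothetical name, not part of CDA or App Builder): each Kettle parameter name gets a 'param' prefix on the request.

```javascript
// Sketch: map plain Kettle parameter names to the "param"-prefixed
// request fields the endpoint expects.
function toEndpointParams(kettleParams) {
  var out = {};
  Object.keys(kettleParams).forEach(function (key) {
    out["param" + key] = kettleParams[key];
  });
  return out;
}

// A transformation with parameters "id" and "value" is called with:
var fields = toEndpointParams({ id: 42, value: "new text" });
// fields.paramid === 42, fields.paramvalue === "new text"
```

This is exactly why jeditable's name/id options are set to 'paramvalue' and 'paramid' above.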


That's it! Now run your dashboard. Click the field, change the value, hit enter and watch your server logs. When using this in production, be sure to apply a security check on any parameter values being submitted, to be sure the user really does have permission to edit that field. (This is bread-and-butter security stuff.)


There is documentation on the old Redmine site, but that's gone now – I did find a version here, not sure for how long though. There's also a really good summary on the forums.


Uploading files with CFR and Pentaho

For a long time Pentaho has had a plugin called CFR.

This is actually a really great plugin – check out the facilities it offers. Secure and easy file transfer to and from your server. Great.

The API is excellent – clearly well thought out. It even offers security! Fantastic! In true Google style, it does a simple, clear thing very well without being overcomplicated. Precisely what you expect of a plugin.

However, the downside is that the embedded CDE components either don't work at all, or are incredibly flaky/inflexible. (They only recently got updated for Pentaho 8 and don't seem to have been tested.)

So, at the end of the day, the UI side is simple – why do you need one of these components? It's the API that is the real value of CFR, so just use it directly. Here's how:

  • Make sure you’re NOT using a requireJS dashboard.
  • Import the jquery form resource:

[Screenshot: importing the jQuery Form resource]

  • Add a text component
  • Put this in the expression:
function() {
 var uploadForm = '<form id="uploadForm" action="http://SERVER:8080/pentaho/plugin/cfr/api/store" method="post" enctype="multipart/form-data">';
 uploadForm = uploadForm + '<p><input id="fileField" type="file" class="file" name="file"/></p>';
 uploadForm = uploadForm + '<input type="hidden" name="path" value=""/>';
 uploadForm = uploadForm + '<p><button type="submit" class="submitBtn">Upload File</button></p></form>';
 return uploadForm;
}
  • Put this in the postExecution:
function() {
 $('#uploadForm').ajaxForm({
  dataType: 'json',
  success: function(res) {
   var filename = $('#fileField').val().split(/[\\/]/).pop();
   alert("Success! " + filename + "-" + JSON.stringify(res));
   Dashboards.fireChange('paramUploadedFile', filename);
  },
  error: function(res) {
   alert("Error:" + JSON.stringify(res));
  }
 });
}
Test it!

How does it work? Well, the expression creates a standard HTML file-upload form on your page. This is bog-standard HTML, nothing unusual here. The hidden input field for the path can be set accordingly if you like (this is the target folder in CFR; I just used the root for now).

The postExecution is where you use ajaxForm to hook into the form. This is where you handle the response, errors and so on. At this point, once your file has uploaded, you'll probably want to hit a Sparkl endpoint to trigger loading of the file you've just uploaded. That's a simple runEndpoint call.
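For illustration, here's how that endpoint URL can be built up (a sketch: endpointUrl is a made-up helper, and "bow"/"processUpload" are placeholder plugin and endpoint names – the /pentaho/plugin/&lt;plugin&gt;/api/&lt;endpoint&gt; pattern is the same one the CFR form above posts to):

```javascript
// Sketch: build the URL for an App Builder (Sparkl) kettle endpoint,
// applying the same "param" prefix convention as the jeditable example.
function endpointUrl(plugin, endpoint, params) {
  var qs = Object.keys(params)
    .map(function (key) {
      return encodeURIComponent("param" + key) + "=" +
             encodeURIComponent(params[key]);
    })
    .join("&");
  return "/pentaho/plugin/" + plugin + "/api/" + endpoint + (qs ? "?" + qs : "");
}

var url = endpointUrl("bow", "processUpload", { file: "data.csv" });
// url === "/pentaho/plugin/bow/api/processUpload?paramfile=data.csv"
```

You'd call something like this from the success handler, passing the uploaded filename through as a parameter.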


It makes more sense to control the UI from scratch anyway, rather than use a component – primarily because you gain 100% control.

How did I figure all this out? Pretty easy really – just look at the source code (while it's still available!)


For folk local to London, PLUG is on Monday April 23rd – don't miss it!


Second #ApacheBeamLondon Meetup

So, last night (11/1/18) I attended only the second ApacheBeamLondon meetup, and it was a very interesting affair.

Firstly, the venue – Qubit – right bang in the middle of Covent Garden, what a cool location. Not sure what they do, but the offices were pretty nice!

The first talk was about an implementation of a money (unit-based) tracking system called Futureflow, implemented using Apache Beam (previously Dataflow). The data is persisted in BigTable. They are only interested in the flow of money, not who it goes between, and thus think they can allay any privacy or regulatory concerns. They also think using Pub/Sub makes it easy to get the data from the banks.

This is not dissimilar to another situation I've seen concerning grocery shopping data. Again, in that market, getting access to the data can be very long-winded. By simplifying it up front for the supplier you're more likely to succeed.

Developing in a pipeline is good because you solidify your inputs/outputs and can then just get on with the boxes in the middle without affecting anyone else. And it's those boxes in the middle that take the work!

There is some creative table design which trades storage for fast lookup – it's a very dedicated data model for a row-scan-centric system. But in their case they have to be able to show scale, so it must be considered up front. The whole system relies on very fast transaction history lookup for a given unit of money.

The second talk, from JB, was a deep dive into IOs in Apache Beam. This was very interesting, and I was pleased the organisers combined a deeply technical talk with a real use-case talk.

Curiously, I saw a lot of similarities between some of the internals of Pentaho PDI and the PCollections/PTransforms in Beam – in particular, a PCollection === rowset, and a PTransform === step.

Anyway, it was very interesting to see how the guts of the IO steps work and how batching is handled – including the new architecture for the SplittableDoFn.

There is even a MapReduce runner for Beam! Why?

It makes sense – especially when you think about those people who are stuck on older clusters, but want to prepare for an upgrade.

On the IO side I liked the model of a bounded or unbounded source. It allows you to split the read over X readers and keep control of it.

There is a runner compatibility matrix – but this just covers functionality, NOT performance 🙂

Finally there was a really good discussion about TDD and mocking Beam pipelines for unit testing. This should be easy, and there's nothing in Beam to prevent it, but it seems it's actually quite hard. (Although the unit tests of Beam itself make use of this approach.) Now just imagine if there was a product that explicitly supported unit testing from within, and/or provided examples – it would be amazing. I think it's a great sign that this came up in the discussion.

So thanks to the speakers, organisers and sponsors, and well done for putting on such an interesting event.

See you all at PLUG in 2 weeks!

The single server in Pentaho 7.0 #topology


Just a quick one, this, on the move to a single-server configuration in Pentaho 7. This resolved a long-running quirk with the Pentaho server stack, but don't take that as a recommendation for installation in production! It's still very important to separate your DI and front-end analytic workloads, and I doubt we'll see anything other than the very smallest installations using the single server for both tasks simultaneously.

Separating the workload gives several important advantages:

  • independent scaling (reduced cost and no wasted resources)
  • security
  • protecting either side from over-ambitious processing

Of course! Don’t take my word for it – Pedro said the same in the release announcement:

[Screenshot: Pedro's comment from the 7.0 release announcement]

And luckily the Pentaho docs on the website give clear instructions for adding/removing plugins from the server – the key thing being: don't install PDD or PAZ on your DI server.

Final point – you can of course choose whether to extend the logical separation to the repository itself. Separating the repository as well gives you ultimate control over your system, even if for now it is hosted on the same database.

Serverless PDI in AWS – Building the jar

So, following on from the first post in this series, here’s all the technical gubbins.

Firstly, how do you build PDI as an engine? Well, simple – you need to create a pom.xml and use Maven.

The key parts of that file are:

  1. Adding the Pentaho repository
  2. Defining pentaho.kettle.version
  3. Adding the core lambda java libraries
  4. Figuring out that the vfs library version needs to be this weird thing: 20050307052300
  5. And then the key point – using the Maven Shade plugin, which basically gathers up the whole lot and dumps it into a jar suitable for uploading directly to AWS.
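Putting those pieces together, the relevant fragments of the pom.xml look roughly like this (a sketch only: the kettle version shown and the shade-plugin version are assumptions – check the artifact ids and versions against the Pentaho Maven repository for your release):

```xml
<!-- Property used by the kettle dependencies below (version is an example) -->
<properties>
  <pentaho.kettle.version>8.0.0.0-28</pentaho.kettle.version>
</properties>

<dependencies>
  <!-- The small kettle engine plus core classes -->
  <dependency>
    <groupId>pentaho-kettle</groupId>
    <artifactId>kettle-core</artifactId>
    <version>${pentaho.kettle.version}</version>
  </dependency>
  <dependency>
    <groupId>pentaho-kettle</groupId>
    <artifactId>kettle-engine</artifactId>
    <version>${pentaho.kettle.version}</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <!-- Gathers everything into one fat jar suitable for upload to AWS -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.1.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```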

What next? Well, topics for the next few weeks include:

  • The java code wrapper to launch PDI
  • Logging/Monitoring
  • Triggering
  • Persistence (S3 / redshift)

Upcoming Pentaho events – Summer 2017

As we're heading into crazy event season for the Pentaho community, there won't be another PLUG (Pentaho London Usergroup) until around December time.

So, keep an eye on social media and your inboxes for the latest news on when and where PCM17 will be.  Hint: It’ll be November time again.

Also – don't forget the official PentahoWorld conference is on again this year in Orlando – that's one not to miss. Find it on the Pentaho website.

Finally – Mark Hall, creator of Weka, is in town in early June, and there's a meetup with him where you can find out about "The future of machine learning":

(Think Cyberdyne..)

If anyone wants to talk in December then put your hands up and let me know, otherwise have a great summer.  In a similar vein – any feedback about the group, content, location or timings – send that too.

#Serverless #AWS PDI

Hmm, what what?  Serverless PDI?

Yes, so serverless is *the* thing at the moment, partly driven by amazing advances in the devops space. Fundamentally, we've all had enough of managing servers, patching etc. You know the story.

“Run code not computers”

Why do this? Simple – integration. If you need to hook up two APIs of separate systems, it's actually pretty expensive to have a server sitting there running 24×7. What we want is to literally pay for the time we use and nothing more – and we don't want to have to start up and shut down a whole server either!

Why Pentaho? The single most important argument is visual programming. It's faster to get started with PDI than with a scripted solution, it's more maintainable, and it allows you to capitalise on general ETL skills (experience with any ETL tool is enough to work with PDI). PDI has also done the boring input/output/API stuff, so all you need to focus on is your business logic. Simple!

So, how to do this? Well, Amazon AWS Lambda is where to start. I assume Google Cloud has a similar function, but I've already got stuff running in AWS, so this was a no-brainer.

The stats sound good: upload your app and you only pay for run time; everything else is handled. There's even something called API Gateway so you can trigger your 'functions'. And finally, my favourite automation service, Skeddly, can also trigger AWS Lambda functions. Great!

There is one issue: the jar has to be less than 100MB. What! PDI is 1GB – how can that possibly make sense? Sure enough, some googling shows lots of other people trying to use PDI in Lambda and finding this limit is far too low.

But Matt Casters pointed out to me that the Kettle engine is only 6MB. What? Really? I took a look – and sure enough, with a few dependencies thrown in, you can build a PDI engine archive that's only 22MB. We're on.

To start, read these two pages:

Java programming model

Java packaging with Maven


  1. Create a pom.xml
  2. Add in your example java code
  3. Build the jar (mvn package).
  4. Remove any signed files: cd target; zip -d <file>.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
  5. Upload as a lambda function
  6. Set an environment variable KETTLE_HOME=/tmp (if you don't, PDI will crash, as the default home dir in Lambda isn't writable)
  7. TEST!

And here’s the proof:

[Screenshot: successful Lambda test run]

Slightly disconcerting that it took 5.7s to run – on my laptop the same thing executed in 0.5s. I guess the Lambda physical boxes are busy and low-spec!

What’s next?

  1. Find a better way to package the ktr
  2. Hook the input file into PDI parameters
  3. Provide better output than “Done”!
  4. Set up triggering via API Gateway
  5. Schedule via Skeddly

I will be releasing all the code for this soon – in the meantime, if anyone is particularly interested in this right now, please do contact me. I think it's a very interesting area, and this simple integration opens up a vast amount of power.

#DevOpsManc Feb 2017

This evening I attended a DevOps meetup in Manchester. Why? Well, interesting story, but primarily because:

  • I was in Manchester anyway
  • Tom was presenting his Nasa stuff
  • I'm working on a project with heavy DevOps requirements, so I dragged some of the geeks down with me!

So how did it work? Well they are lucky, they seem to have some great sponsors, and the venue was brilliant.

First up was Tom, showing off what they get up to catching organised criminals at NASA (the UK police are interested in the Memex programme). Not just that, but there's a huge genomics project too. They are trying to improve overall standards, and for them it's vital they can test and deploy in different environments around the world seamlessly.

(He sadly pointed out that not being a US citizen means he’s not allowed near any cool space tech!)

There's an I/O-based monster with 3k cores called Wrangler too – and they have very few privileges on it, so it's all Ansible-based there.

Good points about the pros/cons of containers (issues with patching etc.), and of course a quick mandatory demo of Juju. One key thing about Juju that I hadn't appreciated: you can save your "project" as a bundle and then do a one-line install of that bundle. How cool is that!

Next up was Matt Skelton – Slides here. This was all about designing teams for software development, with a view on how devops fits in.

He talked about business agility, and how quite a lot of people have never experienced working in a high-performing team. I agree with this – once you've experienced it, it's quite a thing, and you don't want to go back!

There are some fundamental rules – teams work best at 6-9 people, and a lot of this stuff is only applicable in a large enterprise environment. Most important is that the team must be stable (albeit slowly changing), and not gathered and thrown away for every project (which is a model I've seen previously).

Anyway another key point is this:

  • Organisation architecture always wins out

What does this mean? It's been shown that the software architecture always ends up mirroring the organisation setup (this is Conway's law). That's crazy! But when you think about it, we've all seen it, right? So fix the TEAM FIRST before doing your software architecture.

Finally he finished with a point that generated much discussion: cognitive load. Essentially you need to ensure the team is not overloaded. This is best achieved by simply asking them if they're confident to manage the running of their current project(s) (e.g. do they have sufficient knowledge and time to deal with a P1 incident?). This is all because stress impacts a team's ability to be performant. It seems to me this is very close to "velocity". Discussion ongoing there!

Other comments – well, I think the comment about Google being way above everyone else on the tech front was a bit misleading; it's not all roses over there.

Finally, there was some talk of “the cost of collaboration”.

Many thanks to the organisers, who did a great job, and the sponsors of course! A genuinely interesting group with a good vibe going on (barring the mandatory dissing of London – kudos to the guy brave enough to mention that they also host meetups there!)




Pentaho London Usergroup (#PLUG17)

So, on 19th January we met for our first Pentaho London Usergroup of 2017. We struggled to get an agenda together, but thankfully Diethard and Nelson came up trumps and we had two excellent talks.

Additionally, we decided to revisit our free consulting thing again. This was great fun – I think we did a whole night of it about a year ago. The format is simple: bring your problems, issues etc. and ask questions! We'll then propose a solution as a group.

Last time we did this it took a little while to get going, but by the end of the session the questions were pouring in! This time we didn't just suggest a solution – we actually implemented a POC solution, showing how it would be done, on Nigel's laptop! Keep an eye on Nigel's feed – he'll be reporting progress soon – I've already heard good things.

Even better – this issue had only come to Nigel's attention that very day! So what a great story, and what an AMAZING response time.

So, the talks. Well, Diethard started with an excellent summary of the history of Pentaho, in a quiz-style way. Great fun, and many positive comments afterwards – it's worth seeking out and understanding this story, as it explains a lot about how we got to where we are now! Diethard then presented

Nelson then proceeded to show off CBF2, which compared to the old CBF looks pretty amazing. An essential tool for anyone working with multiple clients or multiple environments.

The next meetup will be on 4th May – a Star Wars special. Let me know if you have something for the agenda! We will be broadcasting this one live too – so no swearing!