Updating dimensions in your DWH – Python, Kettle and Athena

I’m quite reassured that across talks at BigDataLDN and others at Skillsmatter (RIP), spanning many different technologies, the recurring theme is building a data warehouse.

Now, a few years ago everyone declared the DWH was dead. Well, that couldn’t be further from the truth. It did fall poorly for some time, but it seems to me to have found a second wind. It’s just virtualised now, but the modelling and concepts are more important than ever.

So: how do you maintain your dimensions? In a dimension record your data can be versioned, and you can choose whether you keep a history of your attributes (type 2) or just overwrite with the latest value (type 1). There are all sorts of combinations and variants of these too.
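To make that distinction concrete, here’s a minimal pure-Python sketch (the customer dimension and its fields are made up for illustration – real implementations use a database, of course):

```python
from datetime import date

# A tiny in-memory "customer" dimension. Each row carries a surrogate key,
# plus valid_from/valid_to/version fields that only type 2 handling uses.
dim = [{'key': 1, 'name': 'Dan', 'city': 'London',
        'valid_from': date(2019, 1, 1), 'valid_to': None, 'version': 1}]

def type1_update(dim, name, city):
    """Type 1: overwrite the attribute in place - no history kept."""
    for row in dim:
        if row['name'] == name and row['valid_to'] is None:
            row['city'] = city

def type2_update(dim, name, city, today):
    """Type 2: end-date the current version and insert a new one."""
    for row in dim:
        if row['name'] == name and row['valid_to'] is None:
            row['valid_to'] = today            # close off the old version
            dim.append({'key': len(dim) + 1, 'name': name, 'city': city,
                        'valid_from': today, 'valid_to': None,
                        'version': row['version'] + 1})
            return

type2_update(dim, 'Dan', 'Leeds', date(2019, 11, 28))
# The dimension now holds both versions; existing facts keep pointing at
# the surrogate key that was current when they occurred.
```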

Now, Kettle or PDI (an industry-standard open source ETL tool) has a lovely step which does ALL of this for you. So I can maintain a dimension with just two steps. It’s also incredibly rich, so I can choose on a field-by-field basis exactly what I want to happen. Here’s a picture:

[Screenshot: a two-step transformation – a Table input feeding a Dimension lookup/update step]

So, you can see here a table input, which runs a simple bit of SQL, and a lookup/update step which has about five fields of configuration, and boom! We have a fully maintained and updated dimension. Now rinse and repeat. Even a user new to Kettle could create this in 5–10 minutes.

Oh; And this is gloriously database independent too.  Doesn’t matter what the DB is, source or target, it’ll work. Perhaps your SQL may occasionally need to be tweaked, but nothing in Kettle would change.

And also, while I’m at it (this is NOT a sales pitch, btw!), Kettle will also generate the SQL for the target tables. And it’ll insert a zero record for the case when you can’t find the dimension. (A quirk of dimensional modelling: you must ALWAYS have a key in the fact table.)

Now, these days, where it’s at is Python, not ETL tools. OK, so how do we do this in Python? Well actually, it’s also extremely simple. This time it took me about half an hour to get it working, but that is more reflective of my Python knowledge than the tool itself. So how does it work? Let’s work through the code line by line:

The first thing is to install the library “pygrametl” via pip. This seems to be a mature library, still occasionally updated, so a good one to use.

Then, in our code:

[Screenshot: the usual imports at the top of the script]

So that’s the usual imports, etc, fine, boring..

[Screenshot: defining the two Postgres connections and running the source SQL]

This is quite nice: here we’re defining two connections to Postgres, and running some SQL. We’re also defining the mapping of the columns – something you don’t need to do in Kettle. Nevertheless, with just a couple of lines we have our connection and we’ve run our SQL. Good stuff.

Now let’s define the dimension itself:

[Screenshot: defining the type 1 and type 2 dimension objects]

So there are two here. The first is type 1 (overwrite). The second is type 2 (update with history). So that’s also pretty simple: we’re just defining the fields and their uses. Good stuff. In fact you could actually generate this from metadata… Ooh…

Next, let’s update the dimension!

[Screenshot: looping over the source rows and calling ensure on the dimension]

Again, that’s gloriously simple. For each row in the data, we simply make one call to our library to “ensure” the dimension. This then does either an insert, an update, or nothing. (And then don’t forget to close our connections!)
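For a feel of what a single “ensure” call is doing under the hood, here’s a hand-rolled sketch of the same insert/update/do-nothing logic against SQLite (pygrametl’s real implementation is far richer – caching, SCD versioning and so on – and the table and column names here are illustrative):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, "
            "name TEXT, category TEXT)")

def ensure(con, row):
    """Insert, update, or do nothing, depending on what's already there."""
    found = con.execute("SELECT product_id, category FROM product_dim "
                        "WHERE name = ?", (row['name'],)).fetchone()
    if found is None:
        cur = con.execute("INSERT INTO product_dim (name, category) "
                          "VALUES (?, ?)", (row['name'], row['category']))
        return cur.lastrowid                   # new member: insert
    key, category = found
    if category != row['category']:            # type 1: overwrite in place
        con.execute("UPDATE product_dim SET category = ? "
                    "WHERE product_id = ?", (row['category'], key))
    return key                                 # otherwise: nothing to do

for row in [{'name': 'Pale Ale', 'category': 'beer'},
            {'name': 'Pale Ale', 'category': 'ale'},   # triggers an update
            {'name': 'Stout',    'category': 'beer'}]:
    ensure(con, row)
con.commit()
```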


So, now what? All of this is great, but there’s one niggling problem here: both tools rely on the data being in a relational database – or, more specifically, one that supports updates. Athena (and other big data SQL engines on Hadoop) typically don’t support update, so this doesn’t work.

That means you can either rebuild your whole dimension each time using inserts, or find another way. And if you’re going to rebuild, you need all the data to do so – sometimes the source doesn’t actually have this data, because it’s data you’ve curated – so in that case be sure to read the dimension first, before clearing it down!

So one way I’ve seen is to abuse SQL to look forwards (or backwards) over the data and replicate the valid from / valid to and version fields in a view. Whilst this is really nice, as it hugely simplifies the stored data, it puts an awful lot of trust in the database being happy to run some more complex SQL.
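Here’s a sketch of that view trick using window functions (illustrative names throughout; SQLite is standing in for Athena/Presto here, which support the same LEAD/ROW_NUMBER functions). Only raw snapshots are stored – the validity ranges are derived on the fly:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE customer_snapshots "
            "(name TEXT, city TEXT, loaded_on TEXT)")
con.executemany("INSERT INTO customer_snapshots VALUES (?, ?, ?)",
                [('Dan', 'London',  '2019-01-01'),
                 ('Dan', 'Leeds',   '2019-06-01'),
                 ('Dan', 'Bristol', '2019-11-01')])

# The "view": valid_from/valid_to/version computed by looking forwards over
# the snapshots, instead of being stored and maintained via updates.
con.execute("""
CREATE VIEW customer_dim AS
SELECT name,
       city,
       loaded_on AS valid_from,
       LEAD(loaded_on) OVER (PARTITION BY name ORDER BY loaded_on)
           AS valid_to,
       ROW_NUMBER() OVER (PARTITION BY name ORDER BY loaded_on)
           AS version
FROM customer_snapshots
""")
```

The current version of each member is simply the row where valid_to is NULL.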

So I think that leaves a hybrid. This is actually common in data architecture, and basically the advice is: use the right tool for the job. So in a hybrid setup your golden source of dimensional data is kept and maintained in an RDBMS, but then blatted over (technical term, sorry) to S3 or HDFS after updates. Again, in Kettle this is embarrassingly easy! (And probably in Python too.)

Any other thoughts? How do you manage your dimensions when your DWH is based upon a filestore rather than an RDBMS?

What’s in a name? #PLUG

Good morning,

An off-the-cuff comment got me thinking the other day: should we change the name of the #Pentaho London Usergroup (meetup) to KLUG, i.e. Kettle London Usergroup?

Well, our keen supporters will remember that at one point we tried to rebrand as an open source BI meetup, and we brought in all sorts of complementary tools and talks – not all open source, however! This unfortunately didn’t work and just spread things too thin. There is value in meetups being focused on a narrow area.

So what does the word Pentaho vs the word Kettle mean? Historically Pentaho was the company name of course, and also used interchangeably as the product name, but Hitachi have now disowned the word Pentaho – in fact, Pentaho swag is now banned in the offices. So it means nothing to them. In my world it’s simple – it’s the original tech stack, including the BI server, all modules within that, the metadata layer, Reporting and Mondrian, and of course Kettle. Oh, and ctools 🙂

Now, all of those modules apart from ctools are active independently of the BI server and used across various other stacks. So “Pentaho” is really just a collection of libraries at its heart. (Think Saiku, Tibco etc.)

Ctools on the other hand is still an extremely capable tool. There’s little else out there that is so flexible, despite it being positively ancient now. Just imagine if someone took the ctools architecture and rebuilt it for modern times – it would be a killer app.

I digress. So, Pentaho is just a collective name for a bunch of BI tools!  Therefore, I think PLUG (for now) can remain our acronym! What do you think?  Any other options?

PS: Of course, with Kettle being renamed to Hop and moving to ASF, that presents another renaming opportunity, but somehow HLUG just doesn’t flow off the tongue correctly 🙂


Using #Kettle (#Hop) with #Apache #Airflow

A few days ago someone mentioned using Airflow to orchestrate their messaging solution, so I figured it was worth a look. In essence, it’s a scheduler that understands the individual steps in your job, and the dependencies between them. Of course, it scales out to workers, etc. All good.

In fact, I recommend watching a few videos on YouTube and you’ll get a feel for it. Then, whilst searching for something else, I came across this blog – do take a read, it is quite interesting.

Reading through it, the barrier to entry was learning all of Airflow’s operators (and learning to love all the bugs/quirks of each one!). They propose a generic operator, and then the developers just concentrate on the code. In fact the philosophy of Airflow is, and should be, to separate the actual workflow code from the job, so this makes a lot of sense.

Now, you know where I’m going with this. Kettle IS that generic operator, so this could be incredibly powerful. So let’s have a go…

I think the devil in the detail is at what level you allow the orchestration to happen in Airflow, and at what level you do it in the job definition. My personal feeling is that jobs in Kettle should always be self-contained anyway, so if you keep to that you won’t go far wrong. There’ll always be a need to run jobs independently, so don’t lose that feature – and you don’t want duplication either, as the two will get out of sync. However, Airflow is EXACTLY the scheduler we’ve been looking for to manage the interactions of your hundreds of jobs. Unfortunately, if the job does all of the orchestration you lose the clever bits of Airflow, including re-runnability/resume and performance tracking. See, I told you there’s a devil here.

Some hybrid may be the best way to go, where you do keep a complete job for manual runs, but break out the stages for Airflow. I would only put really high-level breaks, and only very simple logic, into Airflow however – for example, at this level:

  • Load data to stage
  • Process dimensions
  • Process Facts
    • Could be argued to be broken down further, if you already have rerunnability set up.

So, having written all that out, and looked at the typical DAGs in Airflow, I have concluded that you can generate a DAG from a Kettle job, and boom! You have the best of both worlds. Of course, this would be a post-commit hook and happen automatically, so they can never get out of sync. #CI
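As a rough sketch of the idea: parse the job’s entries and hops out of the .kjb XML and emit DAG source. The XML below is heavily simplified, and the run_entry.sh wrapper script is hypothetical – real .kjb files carry much more structure, so treat the tag names here as assumptions:

```python
import xml.etree.ElementTree as ET

def job_to_dag_source(kjb_xml, dag_id):
    """Emit Airflow DAG source: one BashOperator per job entry, wired up
    according to the job's hops."""
    root = ET.fromstring(kjb_xml)
    entries = [e.findtext('name') for e in root.find('entries')]
    hops = [(h.findtext('from'), h.findtext('to')) for h in root.find('hops')]

    lines = ["from airflow import DAG",
             "from airflow.operators.bash_operator import BashOperator",
             "",
             f"dag = DAG('{dag_id}', schedule_interval=None)"]
    for name in entries:
        safe = name.replace(' ', '_')
        # run_entry.sh is a stand-in for however you launch a single entry
        lines.append(f"{safe} = BashOperator(task_id='{safe}', "
                     f"bash_command='run_entry.sh {name}', dag=dag)")
    for frm, to in hops:
        lines.append(f"{frm.replace(' ', '_')} >> {to.replace(' ', '_')}")
    return '\n'.join(lines)

sample = """
<job>
  <entries>
    <entry><name>load stage</name></entry>
    <entry><name>dimensions</name></entry>
    <entry><name>facts</name></entry>
  </entries>
  <hops>
    <hop><from>load stage</from><to>dimensions</to></hop>
    <hop><from>dimensions</from><to>facts</to></hop>
  </hops>
</job>"""
print(job_to_dag_source(sample, 'KettleNightlyBatch'))
```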

Note: this approach will of course break down if you’re ingesting data in a metadata-driven fashion (and therefore only running one job). But maybe Airflow has an API and you can report progress that way? One to research…

OK, let’s get it started. Well, that’s easy – just four commands; follow the Quick Start. If you’re using Anaconda then be sure to install using conda, not pip. Once done, it starts straight away – talk about easy. Interesting that the UI is a separate service to the scheduler too. Nice – a sensible separation there.

So how are we to do this? Well, the quick and easy approach is of course to use the bash operator and call Kitchen. There’s some discussion of this approach on Stack Overflow.

And in fact, it’s super simple. What I love about Airflow is that the schedule itself is in the code – in the DAG. And that means all you need to do to update your schedules is pull your latest code. So here it is:

[Screenshot: the DAG definition, including a cron-style schedule string]

You’ll notice the schedule is very familiar – it’s basically a cron string. And that’s it: our job is now running. Airflow then captures duration, failures, logs, etc.

Another approach would be to build a new operator, I guess, which would do little more than launch Kettle anyway. Perhaps pyJava could be used, but I suspect that’s overkill. The advantage of the bash operator would be around separation of the processes, resources etc.

But hang on, there’s an HTTP operator. Why don’t we just call Carte? I’m not quite sure how that would work in scale-out terms – perhaps each Airflow worker has a single Carte running on it, and when you call Carte you just always call the localhost one? Or perhaps you can devise a way to teach Airflow about the Carte cluster.

So what else is there? Well, DAGs are very dynamic – they can of course instantiate other DAGs. And the original product came from Airbnb; I guess that’s a good thing too 🙂. It’s easy to pass parameters as well, and there are lots of examples of how to do that.

For anyone putting up with Windows Scheduler, or cron jobs, and the sometimes obtuse places you have to look for logs/problems, I do recommend taking a closer look at Airflow. Let me know how you get on!

PS: The DAG itself is as simple as:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('KettleGetBeerTags', default_args=default_args,
          schedule_interval=timedelta(days=1))

# task_id and dag were missing from the original snippet; both are required
t1 = BashOperator(
    task_id='extract_beer_tags',
    bash_command='/Users/codek/apps/kettle-remix-20190130/data-integration/kitchen.sh -file=/Users/codek/projects/airflow/jb-extract-beer-tags.kjb',
    dag=dag)

Dunnhumby and their use of GCP

So, a few days ago I came across this video, and as it happens Jamie is actually talking at the GCloud meetup in Manchester next week – it’s quite tempting to go, in fact!

It’s a really interesting watch. Especially after I spent quite some time at Aimia wrestling with exactly the sort of problems that are described here!

Key points of interest (to me!) are really:

Transient (ephemeral) clusters, enabled via a shared Hive metastore using Cloud SQL. This is clever stuff. You run production jobs on their own clusters, which also means it’s one client per cluster – single tenancy removes so much complexity! And you’re not going to have an architecture flexible enough to do that without using the cloud.

The usage stats are amazing! Just look at these numbers:

  • 17k compute hours/day
  • 175 data scientists
  • About $7,600 a day (that’s pretty cheap! And remember it’s all traceable down to the job)
  • 35 clients
  • ~3k nodes

Terraform is used to handle updates, and can build / tear down the entire cluster.

Oh; Another really interesting thing – each solution is spread across two GCP projects, per client: one which is the store, and the other called the runtime (all the ephemeral clusters etc.).

They don’t use BigQuery (much), partly because some of their clients prescribe different cloud providers.

A particularly amazing graph at the start shows how their costs reduced drastically once they introduced ephemeral clusters. Now this is particularly interesting – Cloud gives you the horsepower to throw resources at the problem and prove your product. THEN as the tech improves you’re able to achieve the same with less – which really must have profound implications for your business.


So some questions I thought of..


  • What about analytics across multiple clients?  Is that a strict no-no?  Not even aggregate comparisons etc?
  • Is there any standard data model? Or is each client different?
  • Cloud portable, but not cloud agnostic. Why’s that then? Is it because any agnostic layer will only ever support the lowest common denominator tech wise?
  • Do we see an end to needing to deploy network layers (firewalls etc) as things move more towards serverless?



RequireJS again!

Remember this?


Well, I came across a more complex example – a library with various dependencies. After reading all the docs MANY times I built a shim, but it just didn’t work; hence, in a moment of frustration, I went for the last-resort option of a post on the community site:


Note the deathly silence!

However, later on I was editing my shim inline rather than as an external file, and hang on – I spotted a warning: there was a missing comma! Arrgh. How annoying. So after all that the shim WAS correct, and my understanding of requireJS was indeed valid.

So here’s how to do it:

var requireConfig = requireCfg.config;

if(!requireConfig['amd']) {
    requireConfig['amd'] = {};
}

if(!requireConfig['amd']['shim']) {
    requireConfig['amd']['shim'] = {};
}

requireConfig['amd']['shim']["cde/resources/public/dashboard/plugins/alpaca"] = {
    exports: "jQuery",
    deps: {
        "cdf/lib/jquery": "jQuery"
    }
};

requireConfig['amd']['shim']["cde/resources/public/dashboard/plugins/handlebars"] = {
    exports: "jQuery",
    deps: {
        "cdf/lib/jquery": "jQuery"
    }
};

// ...and then the shimmed modules are required as usual:
require(["cde/resources/public/dashboard/plugins/alpaca"], function($) {
    // use the plugin here
});


Note: this is a simplified example, there are more dependencies, but as you see you can simply copy and paste those as needed.

There’s another point too: do NOT call your resource alpaca, because that clashes with the alpaca call itself. Instead call it alpacaShim or something.

I hope you all continue to enjoy the horrors of requireJS!

Until next time!


Security – Custom handling of login failures in #Pentaho

Not so long ago I blogged about using Salted authentication


Well, it turns out the Pentaho authentication layer, despite being Spring Security, is not really “up to date” in a lot of things.

A classic example is preventing brute force authentication.

Now, I’m currently integrating with a home-grown security system, hence this is an issue. If we were going LDAP like most would, then it’s a non-issue, because your LDAP would handle this.

So, what are the requirements? Well, simply:

  • Call a specific procedure once logged on
  • Call a different procedure if login fails

The former is easily achieved using a session startup action – see sessionStartupActions.xml in pentaho-solutions/system. In an ideal world you could use the startup rules cTool, which lets you code these rules using Kettle, but unfortunately it’s broken in 8.1 (I believe it’s fixed in 8.2, but not proven).

The latter is somewhat harder. You must build a custom Spring authentication failure handler. Thankfully that’s as easy as:

  1. following this ultra simple blog example: https://www.baeldung.com/spring-security-custom-authentication-failure-handler
  2. enabling the new handler in applicationContext-spring-security.xml file.

By the way, the username is available in the request passed to the handler, so you can pull it from there.


So there you go – we now have a relatively easy hook into authentication failures in the BA Server.



#BigDataLDN Wrap up

On Tuesday and Wednesday I attended the BigDataLDN conference. It was pretty big – in fact, by far and away the biggest conference I’ve been to! I actually had a two-fold purpose in visiting: I wanted to visit a particular vendor, and also had a colleague I needed to heckle!
So, before I comment on the talks and vendors, here are my highlights/learnings:
  • Data ops really is a thing, but it’s just a name for what we’re all doing anyway!
    • (Everyone who mentioned data ops started by trying to explain what it is!)
  • Everyone is going cloud
    • And if you’re going to go cloud, you really should re-architect, not just dump your existing app in as-is.
  • Everyone is moving about between employers at a crazy pace!
    • A LOT of ex-Pentaho folk there.
  • Architecture
    • Everyone has the same diagram.
    • The “performance” solving solutions (e.g. GPUs) don’t solve the problem per-se, they simply allow you to get it all on one box which hides the problem.
      • Although, on that note, snowflake genuinely does seem to solve the problem
  • Conspicuous in their absence?
    • Talend not there.
    • Hitachi Vantara not there.
    • Neo4j, despite talking, didn’t have a stand. That’s quite bizarre!
  • Why were Mercedes themselves there? This I don’t get at all. Tibco must have found some way to encourage them. Certainly there’s nothing in it for Mercedes in being there.
  • The data catalog guys were all very flexible – if you already have a metastore then we’ll use that, or we can be your metastore, etc. This is really clever stuff. However, they’re all ripe for acquisition; I can’t see a “data catalog” company being sustainable as its own thing. Imagine if Pentaho combined IO Tahoe with their metadata-driven data ingestion framework… That would be amazing (and indeed, it’s what I’m actually planning in the CE version!)

Anyway – the talks. This was a mixed bag. If you only attended Day 1 then you did miss out:

  • Jay Kreps – Kafka keynote. Not much content in this; I guess being a keynote doesn’t mean it’s necessarily interesting.
  • Tamr – good agile points. Good points about always allowing data feedback. Get it out fast, and react.
  • Attunity actually still talking about Hadoop. Huh!
  • Mercedes – One of 3 great talks.
    • Concentrate on making sure their 200 analysts never look at boring data.
    • Don’t have that much data, only 15TB per week
    • 60GHz wifi allows them to transfer 2GB of data from the car at 60mph in the time it takes to travel 100m. Wow! That includes security and handshaking!
    • ITERATE!
  • Zaf Khan – Arcadia Data. Turned into a bit of a sales pitch, BUT used the good old “use the right tool” adage.
  • Serverless talk – this was good. I need to understand what (if anything) the difference is between AWS Lambda and FaaS!?
    • Event thinking. Events as the api not commands
      • If you think about analytics – A lot of our day job is converting stateful data into events!  A fact table by definition is a record of events…
    • Domain-driven design – Greg Young
    • Cube.js – analytics on FaaS, interesting.
  • Matt Aslett, 451 Research – the future!
    • Calling out the Cloudera acquisition for what it is! lol
    • Total data warehouse
    • Blockchain
    • Agility and data ops
    • Operationalisation
  • Google
    • New architecture new possibilities
    • Complexity kills innovation
    • You have to solve data before effective ml
    • It’s never the first that defines it. Google rarely first
    • Moving from client server to fundamentally distributed
    • Deploy…
    • HSBC example: $57 per run, 6 mins – for a process that they spent millions on, and that used to take 6 days. Then they upgraded it and it still took 6 hours. Then BigQuery and boom. Sorted.
    • BigQuery/Dremel (internal implementation) is a SQL interface that actually works – when you have that power and flexibility amazing possibilities open up
    • The presenter’s job was commercialising Google’s internal tools.
    • Interesting that Google Cloud has no graph DB. They must be using one internally though?
  • Jim Webber – Neo4j
    • Very funny talk.
    • Much needed at the end of the day
    • Neo is clearly great. It’s finding a use for it that is the trick…
  • Hannah Fry
    • Amazing talk – If you’ve not seen her before on TV then check her out on BBC4.
    • City data expert – What a fun job!


Vendor Visits

  • SQream – GPU-based DWH. Actually there was no end of fast DWHs at the show – must have been 10+. Nothing different here.
  • Influx Data – impressive time-series DB, worth a look.
  • Data catalogs (ALL of these are very interesting – using ML on your metadata to improve quality and linkage)
    • IO Tahoe
    • Tamr
    • Waterline
    • Collibra (didn’t visit this one)
  • Snowflake – This has to be worth a look.

#Neo4j #London #Meetup

Last night I attended the Neo4j meetup, which wasn’t far from this pretty spectacular building:


Anyway, I digress.

So the talk was all about knowledge graphs, and was presented by Petra Selmer, who is driving the direction and development of Cypher – the Neo4j query language.

So, some very interesting points were made, here are my highlights, in no particular order!

  • Neo4j and Edinburgh University are working to define and lock down the semantics of Cypher – or rather, of a graph query language. The aim is to produce a standard that all vendors actually adhere to, unlike SQL where every dialect is different. This is a noble aim; however, if graph tech does take off, I can’t see it happening!
  • It’s quite curious that Cypher queries a graph, yet returns a table. This struck me as odd from the very start, but subsequently Petra pointed out that in the next version you have the option to return a graph – and indeed to build chains of queries. Interesting stuff. (Composition, was it called?)
  • Another interesting point: typically when querying your graph it’s not uncommon to find unexpected insights – the whole “you don’t know what you don’t know”. It’s hard to see from the query syntax how that is encouraged, but I guess you need to delve deep into it to see.
  • When scaling out Neo4j they use causal consistency – so even if writes occur on different boxes, they are guaranteed to occur in the correct order.
    • This is related to another point – Neo4j seems very focused on OLTP: insert speed, ACID etc. It’ll be interesting to see how (if) that can also translate to a more analytic tool (which is the way they’re going now they’re moving to a graph “platform”).
    • It’s very operationally focussed. All the connectors are geared towards keeping the Neo graph up to date in real time – presumably so that analytics etc are always up to date.  In that sense it’s more like another part of your operational architecture. It’s not like a datalake/warehouse.
    • Obviously there’s connectors for all sorts of sources. Plus you can use kettle where there isn’t – they didn’t mention that though!
    • However, in pointing out that you’re trying to move away from silo’d data etc., you are of course creating another silo, albeit one that reads from multiple other sources.
  • Next versions will have support for tenancy, more data types, multiple graphs, etc.
  • Indexing is not what you think – typically when querying a graph you find a node to start, and then traverse from there. So indexing is all about finding that initial node.
  • A really good point I liked a lot – the best graphs are grown and enriched organically. As you load more data, enrich your graph. It’s a cycle.
    • Additionally you can use ML (machine learning) to add labels to your graph.  Then the enriched graph becomes input to your ML again, and round you go!
    • So, start simple, and build up.  Let the benefits come.

All in all very interesting. It seems a tool well worth playing with, and Kettle makes this super easy of course, with the existing connectors developed by Know.bi. So have a go and see what you can find. The barriers to starting are incredibly low.

I’m particularly interested in seeing where treating relationships as first-class citizens leads us – but I’m also curious to see how that fits alongside properties and data storage within the graph. I can see some interesting examples with clinical data and, indeed, some fun examples in the beer world!

If you went, what did you think? Strike up a discussion on twitter!

Pentaho Security – Full JDBC – Passwords with Salts

Following on from this post:


you don’t always have users/passwords stored in LDAP. Admittedly this seems more legacy these days, but imagine you have a webapp which all your (1,000+) users are registered with, and you want to share those credentials. I was in EXACTLY this situation about 12 years ago, and we hit a snag: the password was hashed. Luckily, thanks to Spring Security, this was quickly resolved by simply configuring a passwordEncoder. NO code changes, nice! (At the time none of this was documented!)

Now move things forward. These days, passwords are not simply hashed, they are salted. This is primarily a reaction to increases in compute speed making brute-force/rainbow-table attacks a lot easier.

Ah ha, you may think – Spring will handle it for us! Well, yes and no…

Firstly, this is not simply a password encoder. Unfortunately you need access to the username, and the encoder does not have this. However, there is something else: you need to create a “salt source”. Ideally you’d use the reflection one, and specify a userDetails property for the salt (e.g. username), BUT in this case the salt was assigned by the webapp… and is in the users table.

As I understand it, the correct/clean way to do this would therefore be to override the Userdetails object and add support for getting/setting the salt on the object. Then you can use the reflection salt source, and boom.

However, with Pentaho, that’s not so easy.

Instead, you can create a saltSource, something like this:

package org.dan.salts;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.security.authentication.dao.SaltSource;
import org.springframework.security.core.userdetails.UserDetails;

public class DBSaltSource implements SaltSource {

    // Injected from the spring security bean xml
    private String dbdriver;
    private String url;
    private String username;
    private String password;

    private static final Log logger = LogFactory.getLog(DBSaltSource.class);

    public Object getSalt(UserDetails userDetails) {
        // Table and column names are illustrative - use whatever your schema has
        try {
            Class.forName(dbdriver);
            try (Connection con = DriverManager.getConnection(url, username, password);
                 PreparedStatement pstmt = con.prepareStatement(
                         "SELECT salt FROM users WHERE username = ?")) {
                pstmt.setString(1, userDetails.getUsername());
                try (ResultSet rs = pstmt.executeQuery()) {
                    if (rs.next()) {
                        return rs.getString(1);
                    }
                }
            }
        } catch (Exception e) {
            logger.error("Unable to fetch salt for user", e);
        }
        return null;
    }

    // Getters and setters for the properties set in the spring security bean xml
    public void setDbdriver(String dbdriver) {
        this.dbdriver = dbdriver;
    }

    public String getDbdriver() {
        return this.dbdriver;
    }

    // ...and the same again for url, username and password
}


So; Now what?

Well, build your class, dump it into a jar, and throw it into the BA server. At this point I’ll assume you’ve done all the JDBC configuration.

Now, open up applicationContext-spring-security-jdbc.xml, which you should already be familiar with, and look at these changes:

<bean id="authenticationProvider">
  <property name="userDetailsService">
    <pen:bean class="org.springframework.security.core.userdetails.UserDetailsService"/>
  </property>
  <property name="passwordEncoder">
    <ref bean="jdbcPasswordEncoder"/>
  </property>
  <property name="saltSource">
    <!-- Not sure why, but the vars used below don't work here. Suspect because we're in the authenticationProvider scope? -->
    <bean class="org.dan.salts.DBSaltSource">
      <property name="dbdriver" value="net.sourceforge.jtds.jdbc.Driver"/>
      <property name="url" value="jdbc:jtds:sqlserver://localserver:1433/adbsomewhere"/>
      <property name="username" value="dbuser"/>
      <property name="password" value="password"/>
    </bean>
  </property>
</bean>

Note the key change: we link the class we coded above to the “saltSource” property of the authentication provider.

To see how this works, look at the code here:


Make sure you look at the right version of the code. Check the libs in Pentaho server to be sure.

Anyway, you’ll see that’s how the salt thingumabob works. So now we have our code, which gets a user-specific salt and sends it to the password encoder.

If you’re lucky you can use a standard password encoder. If not, you can customise one! In doing so, pay very careful attention to encodings (base64) but also the charset. In my case, the password was base64 encoded and SHA-256 hashed, but the original string that was hashed had to be UTF-16LE. (This equates to Encoding.Unicode in C#.)

In fact, this is a key learning here: before you go anywhere near custom encoders or Spring, make absolutely sure you can write code that matches the passwords in the database FIRST. (You can do this in PDI, quick and easy.)
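That sanity check boils down to a few lines of Python. Here’s a sketch of the scheme described above – note that whether the salt is appended, prepended, or mixed in some other way is specific to your webapp, so treat that part as an assumption to verify:

```python
import base64
import hashlib

def matches(candidate_password, salt, stored_hash_b64):
    """Re-create the stored hash - UTF-16LE encode password+salt,
    SHA-256 it, base64 encode - and compare to what's in the DB."""
    raw = (candidate_password + salt).encode('utf-16-le')
    digest = hashlib.sha256(raw).digest()
    return base64.b64encode(digest).decode('ascii') == stored_hash_b64

# Build a reference value the same way, then check a match and a miss.
salt = 'a9F3'
stored = base64.b64encode(
    hashlib.sha256(('s3cret' + salt).encode('utf-16-le')).digest()
).decode('ascii')
```

Only once this round-trips against real rows from your users table is it worth wiring the equivalent logic into a custom encoder.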

One word of warning: all the encoder and saltSource stuff has changed in Spring 5, and the passwordEncoder is now deprecated. It doesn’t look like the solution above will work in the same way, so as always, when upgrade time comes you’ll have to test and re-write these snippets of code. (No sign of any plan to upgrade Spring at the moment, however.)

Finally, huge thanks and shout out to Alex Schurman for spending 5 minutes guiding me along the way to a solution!

Let’s also thank the #Opensource gods. None of this could have been done with closed, proprietary code.

Pentaho Security – Hybrid LDAP / JDBC

Pentaho uses Spring Security under the hood – version 4.1.3 as of 8.0. You don’t really need to know much about this except that it’s an industry-standard (for Java, at least) security layer.

The great thing about that, is the flexibility it gives for users/tweakers of the Pentaho platform.

For the Pentaho developers (way back in the day) it also meant they didn’t have to re-invent the wheel; and rather handily, following an industry standard is better from a security standpoint – hence there have been very FEW security vulnerabilities in the Pentaho platform.

Anyway – it’s very, very common to see these things in virtually all environments:

  • LDAP / Active Directory
  • Roles/Permissions available in a database.

Now, I’ve been at a few places where LDAP contains both the users (for authentication) and the roles (for authorisation). And in those where they didn’t have the latter, we often recommended that LDAP was the right place for it. In some places this was achieved by creating distribution groups in Outlook (!)

However, in a lot of environments it can be very hard/slow to get data in LDAP updated. Hence it may be nicer to store the authorisation data elsewhere, such as in a database.

Lo and behold! I was perusing the docs the other day, and this is clearly and concisely documented as an LDAP hybrid security option – read all about it here:


In fact, if you have to do any security configuration, LDAP or not, be sure to get up to speed with these docs and the files involved – it’ll help you understand the basic concepts.