Updating dimensions in your DWH – Python, Kettle and Athena

I’m quite reassured that the talks at BigDataLDN and at Skillsmatter (RIP), across many different technologies, keep coming back to the same recurring theme: building a data warehouse.

Now, a few years ago everyone declared the DWH was dead. Well, that couldn’t be further from the truth. It was poorly for a while, but it seems to me to have got a second wind. It’s just virtualised now, and the modelling and concepts are more important than ever.

So: how do you maintain your dimensions? In a dimension record, your data can be versioned, and you can choose whether to keep a history of your attributes (type 2) or just overwrite with the latest value (type 1). There are all sorts of combinations and variants of these too.

Now, Kettle or PDI (an industry-standard open-source ETL tool) has a lovely step which does ALL of this for you. So I can maintain a dimension with just two steps. It’s also incredibly rich, so I can choose on a field-by-field basis exactly what I want to happen. Here’s a picture:

[Screenshot 2019-11-28 at 11.41.36 – the two-step Kettle transformation]

So, you can see here a Table input step, which runs a simple bit of SQL, and a Dimension lookup/update step which has about five fields of configuration, and boom! We have a fully maintained and updated dimension. Now rinse and repeat. Even a new user to Kettle could create this in 5–10 minutes.

Oh, and this is gloriously database-independent too. It doesn’t matter what the DB is, source or target, it’ll work. Your SQL may occasionally need a tweak, but nothing in Kettle would change.

And while I’m at it (this is NOT a sales pitch, btw!), Kettle will also generate the SQL for the target tables, and it’ll insert a zero record for the case when you can’t find the dimension. (A quirk of dimensional modelling is that you must ALWAYS have a key in the fact table.)

Now, these days, where it’s at is Python, not ETL tools. OK, so how do we do this in Python? Well actually, it’s also extremely simple. This time it took me about half an hour to get it working, but that is more a reflection of my Python knowledge than of the library itself. So how does it work? Let’s work through the code line by line.

The first thing is to install the “pygrametl” library via pip. It seems to be a mature library, still occasionally updated, so a good one to use.

Then, in our code:

[Screenshot 2019-11-28 at 12.04.42 – imports]

So that’s the usual imports, etc. Fine, boring.
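For reference, the imports are something like this (a sketch, assuming psycopg2 as the Postgres driver – any PEP 249 driver works with pygrametl):

    # pip install pygrametl psycopg2-binary
    import psycopg2                      # Postgres driver

    import pygrametl
    from pygrametl.datasources import SQLSource
    from pygrametl.tables import SlowlyChangingDimension, TypeOneSlowlyChangingDimension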

[Screenshot 2019-11-28 at 12.05.28 – connections and source SQL]

This is quite nice: here we’re defining two connections to Postgres and running some SQL. We’re also defining the mapping of the columns – something you don’t need to do in Kettle. Nevertheless, with just a couple of lines we have our connection and we’ve run our SQL. Good stuff.
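Roughly like this – a sketch only, with made-up table and column names (a customers source table keyed by id):

    # Source and target connections (credentials are placeholders).
    source_conn = psycopg2.connect(host="localhost", dbname="source", user="etl", password="...")
    dw_conn = psycopg2.connect(host="localhost", dbname="dwh", user="etl", password="...")

    # pygrametl wraps the target connection; dimension objects use it by default.
    dw_wrapper = pygrametl.ConnectionWrapper(connection=dw_conn)

    # Run the extract SQL and map the result columns onto the names the dimension expects.
    customers = SQLSource(
        connection=source_conn,
        query="SELECT id, name, city FROM customers",
        names=("customer_id", "name", "city"),
    )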

Now lets define the dimension itself:

[Screenshot 2019-11-28 at 12.07.14 – dimension definitions]

There are two here. The first is type 1 (overwrite); the second is type 2 (keep history via versioned updates). So that’s also pretty simple: we’re just defining the fields and their uses. Good stuff. In fact you could actually generate this from metadata… Ooh.
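The definitions look roughly like this (again a sketch – the table, key and attribute names are my own, with version / valid_from / valid_to as the history columns on the type 2 table):

    # Type 1: overwrite changed attributes in place.
    customer_dim_t1 = TypeOneSlowlyChangingDimension(
        name="dim_customer",             # target table
        key="customer_key",              # surrogate key
        attributes=["customer_id", "name", "city"],
        lookupatts=["customer_id"],      # business key used to find an existing row
        type1atts=["name", "city"],      # attributes that simply get overwritten
    )

    # Type 2: keep history, with version and validity-date columns.
    customer_dim_t2 = SlowlyChangingDimension(
        name="dim_customer_hist",
        key="customer_key",
        attributes=["customer_id", "name", "city", "version", "valid_from", "valid_to"],
        lookupatts=["customer_id"],
        versionatt="version",
        fromatt="valid_from",
        toatt="valid_to",
    )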

Next, let’s update the dimension!

[Screenshot 2019-11-28 at 12.11.02 – the update loop]

Again, that’s gloriously simple. For each row in the data, we simply make one call to our library to “ensure” the dimension, which then does either an insert, an update, or nothing. (And then don’t forget to close our connections!)
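The loop itself is roughly this (a sketch, using the hypothetical names from above):

    for row in customers:
        # scdensure() looks the row up by its lookup attributes, then either inserts it,
        # expires the current version and adds a new one, or does nothing at all.
        customer_dim_t2.scdensure(row)
        # (the type 1 dimension takes the same call: customer_dim_t1.scdensure(row))

    dw_wrapper.commit()
    dw_wrapper.close()
    source_conn.close()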

Done!

So, now what? All of this is great, but there’s one niggling problem here – both tools rely on the data being in a relational database, or more specifically one that supports updates. Athena (and other big data SQL engines on Hadoop) typically don’t support UPDATE, so this approach doesn’t work there.

That means you can either rebuild your whole dimension each time using inserts, or find another way. And if you’re going to rebuild, you need to have all the data to do so – sometimes the source doesn’t actually hold that data, because it’s data you’ve curated in the warehouse, so in that case be sure to read the existing dimension first, before clearing it down!

One way I’ve seen is to abuse SQL to look forwards (or backwards) over the data and derive the valid-from / valid-to and version fields in a view. Whilst this is really nice, as it hugely simplifies the stored data, it puts an awful lot of trust in the database being happy to run some more complex SQL.
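By way of illustration only (a hypothetical staging table stg_customer with one row per customer per load, and a load_date column), the view would lean on window functions, something like:

    # Presto/Athena-flavoured SQL held as a Python string; run it with whichever
    # Athena client you prefer (boto3, PyAthena, ...).
    SCD_VIEW_SQL = """
    CREATE OR REPLACE VIEW dim_customer_v AS
    SELECT
        customer_id,
        name,
        city,
        load_date                                                           AS valid_from,
        LEAD(load_date) OVER (PARTITION BY customer_id ORDER BY load_date)  AS valid_to,
        ROW_NUMBER()    OVER (PARTITION BY customer_id ORDER BY load_date)  AS version
    FROM stg_customer
    """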

So I think that leaves a hybrid. This is actually common in data architecture, and basically the advice is: use the right tool for the job. In a hybrid setup, your golden source of dimensional data is kept and maintained in an RDBMS, but then blatted over (technical term, sorry) to S3 or HDFS after each update. Again, in Kettle this is something that is embarrassingly easy! (And probably in Python too.)
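As a rough sketch of the “blat it over” step in Python (the table, bucket and key names are all made up), you could dump the freshly maintained dimension out of Postgres and overwrite the object that an Athena external table points at:

    import io

    import boto3
    import psycopg2

    dw_conn = psycopg2.connect(host="localhost", dbname="dwh", user="etl", password="...")

    # Dump the whole dimension out of Postgres as CSV.
    buf = io.BytesIO()
    with dw_conn.cursor() as cur:
        cur.copy_expert("COPY dim_customer TO STDOUT WITH CSV HEADER", buf)
    buf.seek(0)

    # Overwrite the copy that Athena queries (an external table over this prefix).
    boto3.client("s3").upload_fileobj(buf, "my-dwh-bucket", "dwh/dim_customer/dim_customer.csv")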

Any other thoughts? How do you manage your dimensions when your DWH is based upon a filestore rather than an RDBMS?

What’s in a name? #PLUG

Good morning,

An off-the-cuff comment got me thinking the other day: should we change the name of the #Pentaho London Usergroup (meetup) to KLUG, i.e. Kettle London Usergroup?

Well, our keen supporters will remember that at one point we tried to rebrand as an open-source BI meetup, and we brought in all sorts of complementary tools and talks (not all open source, however!). Unfortunately this didn’t work and just spread things too thin. There is value in meetups being focussed on a narrow area.

So what does the word Pentaho, versus the word Kettle, actually mean? Historically Pentaho was the company name, of course, and was also used interchangeably as the product name, but Hitachi have now disowned the word Pentaho – in fact, Pentaho swag is now banned in the offices. So it means nothing to them. In my world it’s simple: it’s the original tech stack – the BI server and all the modules within it, the Metadata layer, Reporting and Mondrian, and of course Kettle. Oh, and ctools 🙂

Now, all of those modules apart from ctools are active independently of the BI server and used across various other stacks. So “Pentaho” is really just a collection of libraries at its heart. (Think Saiku, Tibco, etc.)

Ctools, on the other hand, is still an extremely capable tool. There’s little else out there that is so flexible, despite it being positively ancient now. Just imagine if someone took the ctools architecture and rebuilt it for modern times; it would be a killer app.

I digress. So, Pentaho is just a collective name for a bunch of BI tools!  Therefore, I think PLUG (for now) can remain our acronym! What do you think?  Any other options?

PS: Of course, with Kettle being renamed to Hop and moving to the ASF, that presents another renaming opportunity, but somehow HLUG just doesn’t roll off the tongue 🙂