#Serverless #AWS PDI

Hmm, what what?  Serverless PDI?

Yes, serverless is *the* thing at the moment, partly driven by amazing advances in the devops space. Fundamentally, we’ve all had enough of managing servers, patching and so on. You know the story.

“Run code not computers”

Why do this? Simple: integration. If you need to hook up the APIs of two separate systems, it’s actually pretty expensive to have a server sitting there running 24×7. So what we want is to pay for exactly the time we use and nothing more. We don’t want to have to start up and shut down a whole server either!

Why Pentaho? The single most important argument is visual programming. It’s faster to get started with PDI than it is with a scripted solution. It’s more maintainable, and it allows you to capitalise on general ETL skills (experience of any ETL tool is enough to work with PDI). PDI has also done the boring input/output/API stuff, so all you need to focus on is your business logic. Simple!

So, how to do this? Well, AWS Lambda is where to start. I assume Google Cloud has a similar service, but I’ve already got stuff running in AWS so this was a no-brainer.

The stats sound good. Upload your app and you only pay for run time; everything else is handled. There’s even something called API Gateway so you can trigger your ‘Functions’ over HTTP. And finally, my favourite automation service, Skeddly, can also trigger AWS Lambda functions. Great!

There is one issue. The jar has to be less than 100 MB. What! PDI is 1 GB, so how can that possibly work? Sure enough, some googling shows lots of other people trying to use PDI in Lambda and finding this limit is far too low.

But Matt Casters pointed out to me that the Kettle engine is only 6 MB. What? Really? I took a look, and sure enough, with a few dependencies thrown in you can build a PDI engine archive that is only 22 MB. We’re on.

To start, read these two pages:

Java programming model

Java packaging with Maven


  1. Create a pom.xml
  2. Add in your example Java code
  3. Build the jar (mvn package)
  4. Remove any signed files: cd target; zip -d <file>.jar 'META-INF/*.RSA' 'META-INF/*.DSA' 'META-INF/*.SF'
  5. Upload as a Lambda function
  6. Set an environment variable KETTLE_HOME=/tmp (if you don’t, PDI will crash, as the default home dir in Lambda isn’t writable)
  7. Test!
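
Step 6 is the one that bites if you forget it, so it can be worth failing fast with a clear message before the engine starts. Here is a minimal sketch of such a check, using only the JDK. `KettleHomeCheck` is a hypothetical helper (not part of PDI or the AWS SDK), and the fallback to `user.home` is my assumption about what "default home dir" means here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

/** Hypothetical helper: checks that the Kettle home directory can be written to. */
final class KettleHomeCheck {

    /**
     * Returns KETTLE_HOME from the given environment map.
     * Assumption: when it is unset, the user's home directory is used instead.
     */
    static Path resolveKettleHome(Map<String, String> env) {
        String configured = env.get("KETTLE_HOME");
        if (configured == null || configured.isEmpty()) {
            return Paths.get(System.getProperty("user.home"));
        }
        return Paths.get(configured);
    }

    /** True if the directory exists and a file can actually be created in it. */
    static boolean isUsable(Path home) {
        if (!Files.isDirectory(home)) {
            return false;
        }
        try {
            // Create and delete a probe file rather than trusting permission flags.
            Path probe = Files.createTempFile(home, "kettle-probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path home = resolveKettleHome(System.getenv());
        System.out.println(home + " usable: " + isUsable(home));
    }
}
```

Calling `isUsable(resolveKettleHome(System.getenv()))` at the top of the handler and throwing if it returns false should give a much more readable error than a crash deep inside the engine.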

And here’s the proof:

[Screenshot: Lambda test run, 2017-04-25]

Slightly disconcerting that it took 5.7s to run. On my laptop the same thing executed in 0.5s. I guess the Lambda physical boxes are busy and low spec!
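
I haven’t established where that time actually goes. One way to find out is to time the work inside the function and record whether it was the first call in that JVM, so you can compare a fresh start against a repeat run. A minimal sketch, using only the JDK; `InvocationTimer` is a hypothetical helper, and I’m assuming you would keep one instance in a static field of your handler class so it survives between invocations.

```java
import java.util.function.Supplier;

/** Hypothetical helper for timing work inside a function; not an AWS or PDI API. */
final class InvocationTimer {

    /** Result of one timed call. */
    static final class Timing<T> {
        final T result;
        final long elapsedMillis;
        final boolean firstCall;

        Timing(T result, long elapsedMillis, boolean firstCall) {
            this.result = result;
            this.elapsedMillis = elapsedMillis;
            this.firstCall = firstCall;
        }
    }

    private boolean used = false;

    /** Runs the work, measuring wall-clock time and noting if this timer was used before. */
    synchronized <T> Timing<T> time(Supplier<T> work) {
        boolean first = !used;
        used = true;
        long start = System.nanoTime();
        T result = work.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timing<>(result, elapsedMillis, first);
    }
}
```

Logging `elapsedMillis` alongside `firstCall` for a handful of test runs should show whether the slow number is a one-off on the first call or the steady state.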

What’s next?

  1. Find a better way to package the ktr
  2. Hook the input file into PDI parameters
  3. Provide better output than “Done”!
  4. Set it up with API Gateway
  5. Schedule via Skeddly
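
For items 2 and 3, here is a rough sketch of the shape I have in mind, not the code I actually ran. It assumes the incoming event arrives as a map of keys to values, turns each entry into a named parameter, and returns a small structured result instead of “Done”. `TransformationRequest` and `Runner` are hypothetical names; `Runner` stands in for whatever actually executes the ktr with the Kettle engine, which is left out here so the example only needs the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical glue between an incoming event and a transformation run. */
final class TransformationRequest {

    /** Stand-in for the code that runs the ktr; returns the number of errors. */
    interface Runner {
        int run(String ktrPath, Map<String, String> parameters) throws Exception;
    }

    /** Converts event entries to string parameters, skipping null values. */
    static Map<String, String> toParameters(Map<String, Object> event) {
        Map<String, String> parameters = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : event.entrySet()) {
            if (entry.getValue() != null) {
                parameters.put(entry.getKey(), String.valueOf(entry.getValue()));
            }
        }
        return parameters;
    }

    /** Runs the transformation and reports status, error count and elapsed time. */
    static Map<String, Object> handle(Map<String, Object> event, String ktrPath, Runner runner) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("transformation", ktrPath);
        long start = System.nanoTime();
        try {
            int errors = runner.run(ktrPath, toParameters(event));
            result.put("status", errors == 0 ? "OK" : "FAILED");
            result.put("errors", errors);
        } catch (Exception e) {
            // Report the failure to the caller instead of letting it escape.
            result.put("status", "FAILED");
            result.put("message", String.valueOf(e.getMessage()));
        }
        result.put("elapsedMillis", (System.nanoTime() - start) / 1_000_000L);
        return result;
    }
}
```

Keeping the engine call behind an interface also means the mapping and result handling can be tested on a laptop without PDI on the classpath.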

I will be releasing all the code for this soon. In the meantime, if anyone is particularly interested in this right now, please do contact me. I think it’s a very interesting area and this simple integration opens up a vast amount of power.

13 thoughts on “#Serverless #AWS PDI”

  1. That is really interesting, but what about source data that is outside Amazon (i.e. corporate data) and in the order of tens of millions of rows?

    • TBH I think this is most suitable for action-based activity, i.e. a record has been updated so send a message to some other system, a file has been delivered so trigger a remote process, etc. The way you can automatically hook into these events is excellent.

      If the data is “outside” I would guess you’re pulling it in to Amazon. I would assume AWS has VPN solutions for hooking into your architecture. But I leave that sort of thing to the devops!

      If you really need to transfer millions of rows continuously then you should look at a streaming platform such as Kinesis. Admittedly there is some overlap, and it seems you can implement streaming interfaces in Lambda!

  2. Pingback: Serverless PDI in AWS – Building the jar | Codeks Blog

  3. Pingback: "Massive amounts of power for very little costs"

  4. Hi Dan,
    Do you have more information (or a tar) of the small PDI engine? 22 MB instead of 1.somethingG sounds interesting 🙂 I tried the approach of Slawomir (but for Pentaho 7), but I get errors and a non-working engine… (but the error was fast).

  5. Hi Dan! Do you have any updates on this? Even if you don’t, could you share the code you got so far?

    I think it’s a subject of general interest and I’m willing to help from the point where you stopped!

