#Serverless #AWS PDI

Hmm, what what? Serverless PDI?

Yes, serverless is *the* thing at the moment, driven partly by amazing advances in the devops space. Fundamentally we’ve all had enough of managing servers, patching and so on. You know the story.

“Run code not computers”

Why do this? Simple: integration. If you need to hook up the APIs of two separate systems, it’s actually pretty expensive to have a server sitting there running 24×7. What we want is to pay for exactly the time we use and nothing more. We don’t want to have to start up and shut down a whole server either!

Why Pentaho? The single most important argument is visual programming. It’s faster to get started with PDI than with a scripted solution, it’s more maintainable, and it lets you capitalise on general ETL skills (experience of any ETL tool is enough to work with PDI). PDI has also done the boring input/output/API stuff, so all you need to focus on is your business logic. Simple!

So, how to do this? Well, AWS Lambda is where to start. I assume Google Cloud has a similar function, but I’ve already got stuff running in AWS so this was a no-brainer.

The stats sound good. Upload your app and you only pay for run time; everything else is handled. There’s even a service called API Gateway so you can trigger your ‘Functions’ over HTTP. And finally, my favourite automation service, Skeddly, can also trigger AWS Lambda functions. Great!

There is one issue. The jar has to be less than 100MB. What! PDI is 1GB, so how can that possibly work? Sure enough, some googling shows lots of other people trying to use PDI in Lambda and finding this limit far too low.

But Matt Casters pointed out to me that the Kettle engine is only 6MB. What? Really? I took a look, and sure enough, with a few dependencies thrown in you can build a PDI engine archive that is only 22MB. We’re on.
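
If you want to see where the megabytes go while slimming the archive down, a small report over the jar can help. This is a minimal sketch using only the JDK: `JarSizeReport` and its method names are made up for illustration, and the size limit is passed in rather than hard-coded, so you supply whatever figure you need to stay under.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for checking a jar against an upload limit. */
final class JarSizeReport {

    private JarSizeReport() {}

    /** True when the jar file on disk is smaller than the given limit in bytes. */
    static boolean fitsLimit(Path jar, long limitBytes) throws IOException {
        return Files.size(jar) < limitBytes;
    }

    /** Names of the entries that take the most compressed space, largest first. */
    static List<String> largestEntries(Path jar, int count) throws IOException {
        Comparator<ZipEntry> byCompressedSize =
                Comparator.comparingLong(ZipEntry::getCompressedSize);
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            return zip.stream()
                    .filter(entry -> !entry.isDirectory())
                    .sorted(byCompressedSize.reversed())
                    .limit(count)
                    .map(ZipEntry::getName)
                    .collect(Collectors.toList());
        }
    }
}
```

Calling `largestEntries(jar, 20)` on a fat jar is usually enough to spot which dependencies are worth excluding first.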

To start, read these two pages:

Java programming model

Java packaging with Maven

Then:

Create a pom.xml
Add in your example Java code
Build the jar (mvn package)
Remove any signed files: cd target; zip -d <file>.jar 'META-INF/*.RSA' 'META-INF/*.DSA' 'META-INF/*.SF'
Upload as a Lambda function
Set an environment variable KETTLE_HOME=/tmp (if you don’t, PDI will crash, as the default home dir in Lambda isn’t writable)
TEST!
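
The signed-file step is there because signature files copied in from dependency jars no longer match once everything is merged into a single jar, and the JVM can then refuse to load it. If you’d rather not shell out to zip, the same clean-up can be sketched in Java with the JDK’s built-in zip file system. `SignedEntryStripper` is a hypothetical name, it only looks at files directly under META-INF, and unlike the zip patterns above it matches the extensions case-insensitively.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: deletes signature files from META-INF inside a jar, in place. */
final class SignedEntryStripper {

    private SignedEntryStripper() {}

    /** True for .RSA, .DSA and .SF file names (case-insensitive, an assumption of this sketch). */
    static boolean isSignatureFile(String fileName) {
        String upper = fileName.toUpperCase(Locale.ROOT);
        return upper.endsWith(".RSA") || upper.endsWith(".DSA") || upper.endsWith(".SF");
    }

    /** Removes the matching entries and returns the paths that were deleted. */
    static List<String> strip(Path jar) throws IOException {
        List<String> removed = new ArrayList<>();
        // Open the jar as a file system; changes are written back when it is closed.
        try (FileSystem zipFs = FileSystems.newFileSystem(jar, (ClassLoader) null)) {
            Path metaInf = zipFs.getPath("/META-INF");
            if (!Files.isDirectory(metaInf)) {
                return removed;
            }
            List<Path> targets;
            try (Stream<Path> entries = Files.list(metaInf)) {
                targets = entries
                        .filter(Files::isRegularFile)
                        .filter(p -> isSignatureFile(p.getFileName().toString()))
                        .collect(Collectors.toList());
            }
            for (Path target : targets) {
                Files.delete(target);
                removed.add(target.toString());
            }
        }
        return removed;
    }
}
```

Run it against the jar in target before uploading; the returned list tells you what was actually removed, which is handy when a build suddenly starts pulling in a new signed dependency.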

And here’s the proof:

[Screenshot: Screen Shot 2017-04-25 at 12.53.39]

Slightly disconcerting that it took 5.7s to run. On my laptop the same thing executed in 0.5s. I guess the Lambda physical boxes are busy and low spec!

What’s next?

Find a better way to package the ktr
Hook the input file into PDI parameters
Provide better output than “Done”!
Set up with API Gateway
Schedule via Skeddly
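
For the parameter and output items, one possible shape is sketched below. It is not the released code, just an illustration under assumptions: `TransformationRunner` is a hypothetical stand-in for whatever actually executes the ktr, and the field names in the result (`status`, `errors`, `message`, `elapsedMs`) are invented. The idea is that flat fields of the incoming event become named parameters, and the function returns a small summary instead of “Done”.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the code that runs the .ktr; returns the error count. */
interface TransformationRunner {
    long run(Map<String, String> parameters) throws Exception;
}

/** Sketch of a handler; a real Lambda deployment would need a public class and method. */
final class PdiHandlerSketch {

    private final TransformationRunner runner;

    PdiHandlerSketch(TransformationRunner runner) {
        this.runner = runner;
    }

    /** Keeps only flat values; nested objects and arrays in the event are skipped. */
    static Map<String, String> toParameters(Map<String, Object> event) {
        Map<String, String> parameters = new LinkedHashMap<>();
        if (event == null) {
            return parameters;
        }
        for (Map.Entry<String, Object> entry : event.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof String || value instanceof Number || value instanceof Boolean) {
                parameters.put(entry.getKey(), String.valueOf(value));
            }
        }
        return parameters;
    }

    /** Runs the transformation and reports status, error count and elapsed time. */
    public Map<String, Object> handleRequest(Map<String, Object> event) {
        Map<String, Object> result = new LinkedHashMap<>();
        long start = System.nanoTime();
        try {
            long errors = runner.run(toParameters(event));
            result.put("status", errors == 0 ? "OK" : "FAILED");
            result.put("errors", errors);
        } catch (Exception e) {
            result.put("status", "FAILED");
            result.put("message", String.valueOf(e.getMessage()));
        }
        result.put("elapsedMs", (System.nanoTime() - start) / 1_000_000);
        return result;
    }
}
```

Keeping the runner behind an interface also means the mapping and reporting logic can be tested on a laptop without the Kettle engine on the classpath.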

I will be releasing all the code for this soon. In the meantime, if anyone is particularly interested in this right now, please do contact me. I think it’s a very interesting area, and this simple integration opens up a vast amount of power.

13 thoughts on “#Serverless #AWS PDI”

Jean Sagi

April 25, 2017 at 3:01 pm

That is really interesting, but what about source data that is outside Amazon (i.e. corporate data) and in the order of tens of millions of rows?
J.

- codek
  
  April 26, 2017 at 8:05 am
  
  TBH I think this is most suitable for action-based activity, e.g. a record has been updated so send a message to some other system, or a file has been delivered so trigger a remote process. The way you can automatically hook into these events is excellent.
  
  If the data is “outside” I would guess you’re pulling it in to Amazon. I would assume AWS has VPN solutions for hooking into your architecture. But I leave that sort of thing to the devops!
  
  If you really need to transfer millions of rows continuously then you should look at a streaming platform such as Kinesis. Admittedly there is some overlap, and it seems you can implement streaming interfaces in Lambda too!
  
Garry Bettle

April 26, 2017 at 6:31 am

Super idea. Looking forward to the followup\code release. Many thanks!

Pingback: Serverless PDI in AWS – Building the jar | Codeks Blog

Pingback: "Massive amounts of power for very little costs"

Jaap-Andre de Hoop

November 24, 2017 at 6:23 pm

Hi Dan,
Do you have more information (or a tar) of the small pdi engine? 22mb instead of 1.somethingG sounds interesting 🙂 I tried the approach of Slawomir (but for pentaho 7), but get errors and a non-working engine… (but the error was fast).

Shyam Prasath

November 28, 2017 at 12:52 pm

Hi Dan ,

When will you be releasing the code? It will be really helpful.

josh

January 29, 2018 at 8:43 pm

Dan the man. We need this code, please.

José Filipe Neis

September 19, 2018 at 12:16 pm

Hi Dan! Do you have any updates on this? Even if you don’t, could you share the code you got so far?

I think it’s a subject of general interest and I’m willing to help from the point where you stopped!

Tks.

Madhava Mahishi

January 16, 2019 at 10:47 am

Dan, Can you share the code for this please?

NxP

March 30, 2021 at 12:33 pm

Hi Dan, Can you please share the code?

- codek
  
  March 30, 2021 at 12:41 pm
  
  Keep an eye on Diethard’s blog – this is coming very soon.
  
Ian

January 27, 2022 at 6:26 pm

Hi – Thank you for writing this. I’m interested in standing up such a project. Could I possibly get a look at your code for this example?


Codeks Blog

Open source BI Consultant
