A note on garbage collection with Pentaho Data Integration / PDI / Kettle

Garbage Collection

PDI is a Java application (duh...), which means memory within the Java virtual machine is freed up by a process called garbage collection.

Now, garbage collection is a wildly complicated thing, and many people say that GC tuning is a black art. Whilst I wouldn’t say that, I would say that you should always start with the defaults and work from there; if a setting has no effect, then give it up. One thing that is true of garbage collection is that the nature of the app (and the way it has been coded) has a significant impact. So don’t think that because you’ve tuned a Tomcat app before, that knowledge will apply to PDI. It won’t!

Why should I do this?

Well if you have no issues with PDI then you should not.  Simple!  But if you’re working with masses of data and large heap sizes then this will bite you at some point.

And before you come to tuning the GC, have you tried a bigger heap size first (the -Xmx setting)? This is not necessarily a panacea, but if you’re doing a lot of sorting, or simply have a lot of steps and hence a lot of hops, then you will need proportionately more memory.

How do I know there is a problem?

This is easy – if you have a process that seems to hang for significant amounts of time, then fly along, then pause again etc, this is most likely GC.  If your process trundles through at a consistent throughput then you’re probably doing ok.

What settings?

Well, despite my saying above that you should tune for your application, there is one setting which you should most definitely apply for PDI, and that’s the concurrent collector. Having just googled this to find a reference to it, I’ve realised there is both a concurrent collector and a parallel collector, and hence I now need to go to another PC to check which one I use.

<short break insert piped jazz music here>

OK, found it:

-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log

OK – so seems I need to do some research on the parallel collector then, has anyone used that?

Either way, there are 2 things in the options above:

  1. The instruction to the VM to use the new collector ( UseConcMarkSweepGC )
  2. Everything else to configure logging – note the configuration of the log file.

These settings need to be put somewhere where PDI picks them up every time, i.e. in the environment in $PENTAHO_JAVA_OPTIONS, or actually in the spoon/carte/pan/kitchen scripts.
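As a rough sketch of the environment route, the options can be built up one per line and exported before launching spoon/carte/pan/kitchen. The GC_OPTS helper variable is my own invention for readability, and the variable name PENTAHO_JAVA_OPTIONS is taken from the text above; check your own launcher script for the name it actually reads, since some PDI versions use PENTAHO_DI_JAVA_OPTIONS. Note too that these flags belong to older HotSpot JVMs (Java 8 and earlier): the Print* logging flags were replaced by -Xlog:gc* in JDK 9, and the CMS collector was removed in JDK 14.

```shell
# GC options from the post, one per line with a note on what each does.
# GC_OPTS is a hypothetical helper variable, used only to keep this readable.
GC_OPTS="-XX:+UseConcMarkSweepGC"                  # use the concurrent mark-sweep (CMS) collector
GC_OPTS="$GC_OPTS -verbose:gc"                     # report each collection
GC_OPTS="$GC_OPTS -XX:+PrintGCTimeStamps"          # prefix entries with seconds since JVM start
GC_OPTS="$GC_OPTS -XX:+PrintGCDetails"             # per-generation sizes and timings
GC_OPTS="$GC_OPTS -XX:+PrintTenuringDistribution"  # object ages in the survivor spaces
GC_OPTS="$GC_OPTS -Xloggc:/tmp/gc.log"             # write the GC log to this file

# Assumption: the launcher script picks up this variable (name as given in the post).
export PENTAHO_JAVA_OPTIONS="$GC_OPTS"
```

Putting this in the profile of the user that runs PDI means every launch picks it up without editing the scripts themselves.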

It is important to enable GC logging so you can see whether or not you do have a GC problem. Generally, if you have full GC collections of more than a few seconds you may have a problem. And if you see full GCs taking minutes or hours then you definitely have an issue! The other options that I use relate to the logging; they’re pretty self-explanatory, and Google/Stack Overflow will give further detail.
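To check the log for the "more than a few seconds" case, something like the following could work. It is only a sketch: the function name long_full_gcs is made up, and it assumes the log lines for full collections contain the text "Full GC" and a "real=N secs" wall-clock figure, which is what PrintGCDetails typically writes on Java 8 and earlier. The exact log format varies between JVM versions and collectors, so check a few lines of your own gc.log first.

```shell
# Hypothetical helper: print Full GC log lines whose wall-clock time
# (the "real=" figure) exceeds a threshold in seconds (default 2).
long_full_gcs() {
  log_file="$1"
  threshold="${2:-2}"
  grep 'Full GC' "$log_file" | awk -v t="$threshold" '
    # Pull the number out of "real=1.23" and compare it numerically.
    match($0, /real=[0-9.]+/) {
      secs = substr($0, RSTART + 5, RLENGTH - 5) + 0
      if (secs > t) print
    }'
}
```

Calling long_full_gcs /tmp/gc.log 2 would then list any full collections that blocked the VM for more than two seconds; no output suggests GC pauses are probably not your problem.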

And that’s it – More later in the week on the topic of big data with PDI.

3 thoughts on “A note on garbage collection with Pentaho Data Integration / PDI / Kettle”

  1. It turns out thanks to this stackoverflow post:

    http://stackoverflow.com/questions/220388/java-concurrent-and-parallel-gc

    Concurrent garbage collection by default turns on parallel collection.

    It’s also worth remembering that at some point the VM will always grind to a halt and have to do a “FULL” collection. This process is a last resort and can take some time; when this occurs your entire VM is blocked until the process completes. And this process is always single-threaded.

  2. Hi codek,
    Amazing post. Performance in the JVM and the GC is a real problem. I’ve tried to find the same information and experience for Tomcat performance for Pentaho 5.3.

    • I’ve tuned the Tomcat JVM extensively for a Java service app before and I can tell you it’s very application-specific. I’ve never come across the need to tune Pentaho itself other than in the early days when the defaults were useless. If you have a problem then I highly recommend connecting VisualVM to the server and seeing what is going on. You at least need to identify which part of the stack is contributing to the issues! VisualVM is an excellent tool that can very quickly give you insight into what the app is doing.
