PDI is a Java application (dur..), which means memory within the Java virtual machine is freed up by a process called garbage collection.
Now, garbage collection is a wildly complicated thing and many people say that GC tuning is a black art. Whilst I wouldn’t say that, I would say that you should always start with the defaults and work from there – if a setting has no effect, then give it up. One thing that is true of garbage collection is that the nature of the app (and the way it has been coded) has a significant impact. So don’t think that because you’ve tuned a Tomcat app before, that knowledge will apply to PDI – it won’t!
Why should I do this?
Well, if you have no issues with PDI then you should not. Simple! But if you’re working with masses of data and large heap sizes then this will bite you at some point.
And before you come to tuning the GC, have you tried a bigger heap size first (the -Xmx setting)? This is not necessarily a panacea, but if you’re doing a lot of sorting, or simply have a lot of steps and hence a lot of hops, then you will need proportionately more memory.
How do I know there is a problem?
This is easy – if you have a process that seems to hang for significant amounts of time, then flies along, then pauses again, and so on, this is most likely GC. If your process trundles through at a consistent throughput then you’re probably doing OK.
Well, despite my saying above that you should tune for your application, there is one setting which you should most definitely apply for PDI, and that’s the concurrent collector. Having just googled this to find a reference to it, I’ve realised there is both a concurrent collector and a parallel collector, and hence I now need to go to another PC to check which one it is that I use.
<short break insert piped jazz music here>
OK, found it:
-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log
OK – so it seems I need to do some research on the parallel collector then. Has anyone used that?
Either way, there are 2 things in the options above:
- The instruction to the VM to use the new collector ( UseConcMarkSweepGC )
- Everything else to configure logging – note the configuration of the log file.
These settings need to be put somewhere where PDI picks them up every time, i.e. in the environment in $PENTAHO_DI_JAVA_OPTIONS, or directly in the spoon/carte/pan/kitchen scripts.
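As a rough sketch of the environment route, something like the following could go in the profile of the user that runs PDI. The variable name PENTAHO_DI_JAVA_OPTIONS is the one I believe the stock launch scripts read, so check your own spoon.sh before relying on it. The -Xmx4g heap size is purely a made-up value for illustration. I include a heap setting at all because, as far as I know, the scripts only apply their default memory options when the variable is unset.

```shell
# Hypothetical heap size, for illustration only; size it for your own workload.
HEAP_OPTS="-Xmx4g"

# Collector choice plus the GC logging flags discussed in this post.
GC_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log"

# Assumed variable name; confirm it against the launch scripts in your PDI version.
export PENTAHO_DI_JAVA_OPTIONS="$HEAP_OPTS $GC_OPTS"
```

Anything started from that shell afterwards (spoon, carte, pan, kitchen) should then pick up the same options, provided the scripts on your version honour the variable.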
It is important to enable GC logging so you can see whether or not you do have a GC problem. Generally, if you have full collections taking more than a few seconds you may have a problem. And if you see a full GC taking minutes or hours then you definitely have an issue! The other logging options that I use are pretty self-explanatory, and Google/Stack Overflow will give further detail.
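To pick the long full collections out of the log without reading it by eye, a small helper along these lines might do. It is a sketch that assumes each full collection sits on a single line containing "Full GC" and a "real=<seconds> secs" timing field, which is what I would expect from the detailed logging on JVMs of this vintage. The exact log format varies between Java versions, so treat the patterns as assumptions to adjust. The function name slow_full_gcs is my own invention.

```shell
# Print full collections whose wall-clock pause exceeds a threshold.
# Assumes one event per line, with "Full GC" and "real=<seconds>" on that line.
slow_full_gcs() {
  # $1 = GC log file, $2 = threshold in seconds (defaults to 3)
  awk -v limit="${2:-3}" '
    /Full GC/ && match($0, /real=[0-9.]+/) {
      # Skip the 5 characters of "real=" to get at the number itself.
      secs = substr($0, RSTART + 5, RLENGTH - 5) + 0
      if (secs > limit) print secs " s: " $0
    }
  ' "$1"
}
```

Called as slow_full_gcs /tmp/gc.log 3, it prints each matching line prefixed with its pause time. No output means no full collection went over the threshold, or that the log format did not match the assumptions above.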
And that’s it – more later in the week on the topic of big data with PDI.