#Neo4j #London #Meetup

Last night I attended the Neo4j meetup, which wasnt far from this pretty spectacular building:


Anyway, I digress.

So the talk was all about knowledge graphs, and was presented by Petra Selma who is driving the direction and development of Cypher – the neo4j query language.

So, some very interesting points were made, here are my highlights, in no particular order!

  • Neo4j and Edinburgh university are working to define and lock down the semantics of Cypher – or rather graph query language. The aim is to produce a standard that all vendors actually adhere to – Unlike SQL where every dialect is different. This is a noble aim, however if graph tech does take off, I can’t see it happening!
  • It’s quite curious that Cypher queries a graph, yet returns a table. This struck me as odd from the very start but subsequently Petra pointed out that in the next version you do have the option to return a graph – and indeed to build chains of queries.  Interesting stuff. (Composition was it called?)
  • Another interesting point – Typically when querying your graph it’s not uncommon to find unexpected insights – the whole “you dont know what you dont know”.  It’s hard to see from the query syntax how that is encouraged but I guess you need to delve deep into it to see.
  • When scaling out Neo4j they use causal consistency – so even if writes occur on different boxes, they are guaranteed to occur in the correct order.
    • This is related to another point – Neo4j seems very focussed on OLTP.  Insert speed. Acid etc.  It’ll be interesting to see how (if) that can also translate to a more analytic tool (which is the way they’re going now they’re moving to a graph “platform”
    • It’s very operationally focussed. All the connectors are geared towards keeping the Neo graph up to date in real time – presumably so that analytics etc are always up to date.  In that sense it’s more like another part of your operational architecture. It’s not like a datalake/warehouse.
    • Obviously there’s connectors for all sorts of sources. Plus you can use kettle where there isn’t – they didn’t mention that though!
    • However, in pointing out that you’re trying to move away from silo’d data etc, you are of course,  creating another silo, albeit one that reads from multiple other sources.
  • next versions will have support for tenancy, more data types, etc.  Multiple graphs. etc.
  • Indexing is not what you think – typically when querying a graph you find a node to start, and then traverse from there. So indexing is all about finding that initial node.
  • A really good point I liked a lot – the best graphs are grown and enriched organically.  As you load more data, enrich your graph. It’s a cycle
    • Additionally you can use ML (machine learning) to add labels to your graph.  Then the enriched graph becomes input to your ML again, and round you go!
    • So, start simple, and build up.  Let the benefits come.

All in all very interesting. It seems a tool well worth playing with, and kettle makes this super easy of course with the existing connectors developed by Know BI.  So have a go and see what you can find.  The barriers to starting are incredibly low.

I’m particularly interested in seeing where the putting relationships as first class citizens leads us – but i’m also curious to see how that fits alongside properties and data storage within the graph.  I can see some interesting examples with clinical data, and indeed, some fun examples in the beer world!

If you went, what did you think? Strike up a discussion on twitter!


Pentaho Security – Full JDBC – Passwords with Salts

Following on from this post:


you don’t always have users/passwords stored in LDAP.  Admittedly it seems this is more legacy these days, but imagine you have a webapp which all your (1000+) users are registered with and you want to share those credentials.  I was in EXACTLY this situation about 12 years ago, and we hit a snag – The password was hashed.  Luckily, thanks to spring security, this was quickly resolved by simply configuring a passwordEncoder.  NO code changes, nice!  (At the time none of this was documented!)

Now move things forward.  These days, passwords are not simply hashed, they are salted. This is primarily a reaction to an increase in compute speed making brute force/rainbow dictionary attacks a lot easier.

Ah ha you may think! Spring will handle it for us!  Well yes and no..

Firstly; This is not simply a password encoder.  Unfortunately you need access to the username and the encoder does not have this.  However; there is something else – You need to create a “Salt Source”.  Ideally you’d use the reflection one, and specify a userDetails property for the salt (e.g. username) BUT in this case the salt was assigned by the webapp…  And is in the users table.

As I understand it, the correct/clean way to do this would therefore be to override the Userdetails object and add support for getting/setting the salt on the object. Then you can use the reflection salt source, and boom.

However; With Pentaho, thats not so easy.

Instead, you can create a saltSource, something like this:

package org.dan.salts;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.security.authentication.dao.SaltSource;
import org.springframework.security.core.userdetails.UserDetails;

public class DBSaltSource implements SaltSource {

private PreparedStatement pstmt;
private Connection con;

private String dbdriver;
private String url;
private String username;
private String password;

private static final Log logger = LogFactory.getLog(DBSaltSource.class);

DBSaltSource() throws Exception


public Object getSalt(UserDetails userDetails)
** Use the userDetails.getUsername() in your query **
* It's easy! *
return null; 

* Getters and setters for properties in spring security bean xml.

public void setDbdriver(String dbdriver) { 
this.dbdriver = dbdriver;

public String getDbdriver() {
return this.dbdriver;

ETC ETC . (For all 4 properties)


So; Now what?

Well build your class, dump it into a jar, and throw it into the BA server.  At this point I’ll assume you’ve done all the JDBC configuration

Now, open up applicationContext-spring-security-jdbc.xml which you should already be familar with, and look at these changes:

 <bean id="authenticationProvider"
<property name="userDetailsService">
<pen:bean class="org.springframework.security.core.userdetails.UserDetailsService"/>
<property name="passwordEncoder">
<ref bean="jdbcPasswordEncoder" />
<property name="saltSource">
<bean class="org.dan.salts.DBSaltSource"> <!-- Not sure why, but the vars used below don't work here. Suspect because we're in the authenticationProvider scope? -->
<property name="dbdriver" value="net.sourceforge.jtds.jdbc.Driver"/>
<property name="url" value="jdbc:jtds:sqlserver://localserver:1433/adbsomewhere"/>
<property name="username" value="dbuser"/>
<property name="password" value="password"/>

Note two things:

We link our class which we coded above, to the “saltSource” for the authentication provider.

To see how this works, look at the code here:


Make sure you look at the right version of the code. Check the libs in Pentaho server to be sure.

Anyway you’ll see thats how the salt thingumabob works.  So now, we have our code, which gets a user specific salt and sends it to the password encoder.

If your lucky you can use a standard password encoder. If not, then you can customise one!  In doing so, pay very careful attention to encodings (base64) but also the charset. So in my case, the password was base64 encoded, SHA-256 hashed, but the original string that was hashed had to be UTF16_LE . (This equates to Encoding.UNICODE in C#)

In fact – this is a key learning here – before you even go near any custom encoders, or spring, make absolutely sure you can write the code to match the passwords in the database FIRST.  (You can do this in PDI, quick and easy).

One word of warning – All the encoder and saltSource stuff has changed in spring5, the passwordEncoder is now deprecated.  It doesn’t look like the solution above will work in the same way so as always, when coming to upgrade time you’ll have to test, and re-write these snippets of code. (No sign that there’s any plan to upgrade spring at the moment however)

Finally, huge thanks and shout out to Alex Schurman for spending 5 minutes guiding me along the way to a solution!

Lets also thank the #Opensource gods. None of this could have been done with closed proprietary code.

Pentaho Security – Hybrid LDAP / JDBC

Pentaho uses Spring security under the hood – Version 4.1.3 as of 8.0. You don’t really need to know much about this except it’s an industry standard (for java at least) security layer.

The great thing about that, is the flexibility it gives for users/tweakers of the Pentaho platform.

For the Pentaho developers (way back in the day) it also meant they didn’t have to re-invent the wheel, and also rather handily by following industry standard it’s better from a security standpoint – hence there’s been very FEW security vulnerabilities in the Pentaho platform.

Anyway – It’s very very common to see these things in virtually all environments

  • LDAP / Active Directory
  • Roles/Permissions available in a database.

Now, I’ve been at a few places where LDAP contains both the users (for authentication) and the roles (for authorisation).  And in those where they didn’t have the latter, we often recommend that LDAP is the right place for that.  In some places this was achieved by creating distribution groups in outlook (!)

However in a lot of environments it can be very hard / slow to get data in LDAP updated.  hence it may be nicer to store the authorisation data elsewhere, such as in a database.

Lo and behold! I was perusing the docs the other day, and this is clearly and concisely documented as a LDAP hybrid security option, read all about it here:


In fact, if you have to do any security configuration, LDAP or not, be sure to get up to speed with these docs and the files involved – it’ll help you understand the basic concepts.


RequireJS, JQuery Plugins and #Pentaho CDE

So, what seems like a year ago or so, but actually turns out to be 2015, Pedro Alves posted about this huge new change to CDF – Support for requireJS.  Great! Whats that then?

Well actually, one of the main advantages is embed-ability, and ability to communicate to other objects on the page.  This is great in theory, but in practice rarely used.  So it’s a shame that such a significant underlying change has to impact everyone.  It’s not a backwards compatible change.

However; Another advantage, although one that is forced upon us, is that all the modern components such as the templateComponent and possibly a few others now REQUIRE a requireJS dashboard. So we’ll all have to move eventually – So it’s not a question of choosing, it’s a migration job.  In reality, the way require handles the dependencies is much nicer, and does solve some headaches.  It’s interesting to see that sparkl (app builder) has not been modified to work in a requireJS paradigm yet.

One of the enormous benefits of CDF and a key point about the architecture is that it uses opensource libraries where possible – requirejs in fact being one of those!  So how do we use some of these additional libraries now?

Well the first thing, is that if your plugin is not available as an AMD module you have to create a shim, so here’s how this works, using jeditable as an example:

  1. Put jquery.jeditable.js into your solution, anywhere really, i put it in /public/dashboards/plugins
  2. Put this code in a resource section in your dashboard (no need to give it a name in this case)
var requireConfig = requireCfg.config;

if(!requireConfig['amd']) {
 requireConfig['amd'] = {};

if(!requireConfig['amd']['shim']) {
 requireConfig['amd']['shim'] = {};

requireConfig['amd']['shim']["cde/resources/public/dashboard/plugins/jquery.jeditable"] = {
 exports: "jQuery",
 deps: {
 "cdf/lib/jquery": "jQuery" 

], function($) {


Now there’s two things going on here.   You’re setting up your config first, then loading the shim.  The config is important because it defines that jeditable depends on jQuery, and it’s this that resolves issues with $ being uninitialised.

Note: I took the jquery.jeditable.js from CDF, rather than downloading the latest.  Seems to work, but like a lot of pentaho libraries it’s probably quite out of date.

Unfortunately this shim approach doesn’t always work – you just need to have a go. I found it didn’t work for bootstrap-editable for example.  This code appears to be exactly the same structure, but for now jeditable will do the job for me.

Anyway; How do you then use jeditable? Pretty simple. Create a Kettle transformation endpoint in your App Builder plugin, with 2 parameters – ID and value:


Then add some HTML in your dashboard:


Then add this into a dashboard_postinit function in the components section:

  name: 'paramvalue',
  id:   'paramid'

Note you must rename the parameters because CDA puts this ‘param’ onto them for you for some reason.  In my example above – Bow is the name of our app, and testpdiupdate is the transformation name.  Note: If you edit the transformation in-place don’t forget to click the refresh button on the endpoints screen otherwise the old endpoint code will run.


Thats it! Now run your dashboard. Click the field, change the value hit enter and watch your server logs.  Be sure when using this in production to apply a security check on any parameter values that are being submitted to be sure the user really does have permission to edit that field.  (This is bread and butter security stuff)


There is documentation on the old redmine site, but that’s gone now – I did find a version here, not sure for how long though.  There’s also a really good summary on the forums


Uploading files with CFR and Pentaho

For a long time Pentaho has had a plugin called CFR

This is actually a really great plugin – Check out the facilities it offers. Secure and easy file transfer to and from your server. Great.

The API is excellent – clearly well thought out. It even offers security!   Fantastic!  In true google style, it does a simple clear thing very well without being over complicated.  Precisely what you expect of a plugin.

However; The downside is that the embedded CDE components either don’t work at all, or are incredibly flaky/inflexible.  (They only recently got updated for Pentaho 8 and don’t seem to have been tested)

So; At the end of the day, the UI side is simple, why do you need one of these components. It’s the API that is the real value of CFR, so just use it directly.  Here’s how:

  • Make sure you’re NOT using a requireJS dashboard.
  • Import the jquery form resource:

Screen Shot 2018-04-13 at 15.38.05

  • Add a text component
  • Put this in the expression:
function() {
 var uploadForm = '<form id="uploadForm" action="http://SERVER:8080/pentaho/plugin/cfr/api/store" method="post" enctype="multipart/form-data">';
 uploadForm = uploadForm + '<p><input id="fileField" type="file" class="file" name="file"/>';
 uploadForm = uploadForm + '<input type="hidden" name="path" value=""/>';
 uploadForm = uploadForm + '<p><button type="submit" class="submitBtn">Upload File</button></form>';
 return uploadForm;
  • put this in the post execution
function() {
  dataType: 'json',
  success: function(res) { 
   var filename = $('#fileField').val().split(/[\\/]/).pop();
   alert("Success! " + filename + "-" + JSON.stringify(res)); 
   Dashboards.fireChange('paramUploadedFile', filename);
  error: function(res) {
   alert("Error:" + JSON.stringify(res));

Test it!

How does it work? Well the expression creates a standard HTML File upload form on your page. This is bog standard HTML, nothing unusual here. The hidden input field for the path can be set accordingly if you like (this is the target folder in CFR, I just used the root for now)

The Post Execution is where you use ajaxForm to hook into the form.  This is where you handle the response, errors and so on.  At this point, once your file has uploaded, you’ll probably want to hit a Sparkl endpoint to trigger loading of the file you’ve just uploaded.  Thats a simple runEndpoint call..


It makes more sense to control the UI from scratch anyway, rather than use a component – primarily because you gain 100% control.

How did I figure all this out? Pretty easy really – Just look at the source code (while it’s still available!)


For folk local to London, PLUG is on Monday April 23rd, Don’t miss it!


Second #ApacheBeamLondon Meetup

So, last night (11/1/18) I attended only the second ApacheBeamLondon meetup, and it was a very interesting affair.

Firstly – The venue – Qubit – right bang in the middle of covent garden, what a cool location. Not sure what they do – but the offices were pretty nice!

The first talk was about an implementation  of a money (unit based) tracking system called Futureflow – Implemented using ApacheBeam (or previously dataflow). The data is persisted in BigTable.  They are only interested in the flow of money, not who it goes between, and thus think they can allay any privacy or regulatory concerns. Using Pub/Sub they also think that makes it easy to get the data from the banks.

This is not dissimilar to another situation i’ve seen concerning grocery shopping data.  Again in that market to get access to the data can be very long winded.  By simplifying it up front for the supplier you’re more likely to succeed.

Developing in a pipeline is good because you solidify your inputs/outputs and then you can just get on with the boxes in the middle without affecting anyone else.  And it’s that box(s) in the middle that take the work!

There is some creative table design which trades storage for fast lookup – It’s a very dedicated data model for a row scan centric system.  But in their case they have to be able to show scale, so it must be considered up front.  The whole system relies on very fast transaction history lookup for a given unit of money.

The second talk from JB was a deep dive into IOs in Apache Beam.   This was very interesting and I was pleased the organisers combined a deeply technical talk, with a real use case talk.

Curiously I saw a lot of similarities between some of the internals of Pentaho PDI and some of the pcollection/ptransforms in Beam – In particular, a pcollection === rowset, and a ptransform === step.

Anyway it was very interesting to see how the guts of the IO steps work, and how batching is handled – including the new archicture for the SplittableDoFN.

There is even a mapreduce runner for Beam! Why? :

Makes sense – Especially when you think about those people who are stuck on older clusters, but want to prepare for an upgrade.

On the IO side i liked the model of a bounded or unbounded source.  Allows you to split the read over X readers and keep control of it.

There is a runner compatability matrix – but this just covers functionality NOT performance 🙂

Finally there was a really good discussion about TDD and mocking beam pipelines for unit testing. This should be easy and there’s nothing in beam to prevent it, but it seems it’s actually quite hard. (Although; The unit tests of beam itself make use of this technology) . Now just imagine if there was a product that explicitly supported unit testing from within – AND/OR provided examples, it would be amazing. I think it’s amazing and a great sign that this came up in the discussion.

So thanks to the speakers, organisers and sponsors, and well done for putting on such an interesting event.

See you all at PLUG in 2 weeks!

The single server in Pentaho 7.0 #topology


Just a quick one this on the move to a single server configuration in Pentaho 7.  This resolved a long running quirk with the Pentaho server stack, but don’t take that as a recommendation for installation in production!  It’s still very important to separate your DI and front end analytic workloads and I doubt we’ll see any other than the very smallest installations using the single server for both tasks simultaneously.

Separating the workload gives several important advantages:

  • independent scaling (reduced cost and no wasted resources)
  • security
  • protecting either side from over ambitious processing

Of course! Don’t take my word for it – Pedro said the same in the release announcement:

Screen Shot 2017-05-16 at 06.58.35

And luckily the Pentaho docs on the website give clear instructions for adding/removing plugins from the server – Key thing being don’t install PDD or PAZ on your DI server.

Final point – You can of course choose whether to extend the logical separation to the repository itself.  By separating the repository as well it gives you ultimate control over your system, even if for now it is hosted on the same database.