Dunnhumby and their use of GCP

So, a few days ago I came across this video, and as it happens Jamie is actually talking at the GCloud meetup in Manchester next week, It’s quite tempting to go in fact!

It’s a really interesting watch. Especially after I spent quite some time at Aimia wrestling with exactly the sort of problems that are described here!

Key points of interest (to me!) are really:

Transient (Ephemeral) clusters. Enabled via a shared hive metastore using Cloud SQL.  This is clever stuff.  So you run production jobs on their own clusters, which also means it’s one client, so single tenant removes so much complexity!  And you’re not going to have a flexible architecture where you can do that without using the cloud.

The usage stats are amazing! Just look at these numbers:

17k compute hrs/day
175 data scientists. about $7600 a day . (Thats pretty cheap! And remember its all traceable down to the job)
35 clients
~3k nodes.

Terraform is used to handle updates, and can build / tear down the entire cluster.

Oh; Another really interesting thing – each solution is spread across 2 GCP projects – one which is the store, and the other is called the runtime (all the ephemeral clusters etc) . Per client.

They don’t use bigquery (much). Partly because some of their clients prescribe different cloud providers.

A particularly amazing graph at the start shows how their costs reduced drastically once they introduced ephemeral clusters. Now this is particularly interesting – Cloud gives you the horsepower to throw resources at the problem and prove your product. THEN as the tech improves you’re able to achieve the same with less – which really must have profound implications for your business.


So some questions I thought of..


  • What about analytics across multiple clients?  Is that a strict no-no?  Not even aggregate comparisons etc?
  • Is there any standard data model? Or is each client different?
  • Cloud portable, but not cloud agnostic. Why’s that then? Is it because any agnostic layer will only ever support the lowest common denominator tech wise?
  • Do we see an end to needing to deploy network layers (firewalls etc) as things move more towards serverless?