So, a few days ago I came across this video, and as it happens Jamie is actually talking at the GCloud meetup in Manchester next week. It’s quite tempting to go, in fact!
It’s a really interesting watch, especially after I spent quite some time at Aimia wrestling with exactly the sort of problems described here!
The key points of interest (to me!) are:
Transient (ephemeral) clusters, enabled via a shared Hive metastore backed by Cloud SQL. This is clever stuff: production jobs each run on their own cluster, which also means one client per cluster, and single tenancy removes so much complexity! And you’re not going to get an architecture flexible enough to do that without the cloud.
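Here’s a minimal sketch of that pattern using the google-cloud-dataproc Python client. All the names (projects, region, Cloud SQL instance, cluster name) are invented for illustration, and the Cloud SQL proxy initialisation action is Google’s documented route to a shared Hive metastore; I’m not claiming this is exactly what dunnhumby run:

```python
# Minimal sketch of the ephemeral-cluster pattern: spin up a cluster
# wired to a shared Hive metastore (Cloud SQL), run the job, tear it
# down. Every name below is a made-up example.
from google.cloud import dataproc_v1

PROJECT = "my-client-runtime"   # hypothetical per-client "runtime" project
REGION = "europe-west1"
METASTORE_SQL = "my-client-store:europe-west1:hive-metastore"  # Cloud SQL instance

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "etl-20240101-abc123",   # one cluster per job run
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "gce_cluster_config": {
            # The init action points Hive at the shared Cloud SQL metastore,
            # so every ephemeral cluster sees the same table definitions.
            "metadata": {"hive-metastore-instance": METASTORE_SQL},
        },
        "initialization_actions": [
            {
                "executable_file": (
                    f"gs://goog-dataproc-initialization-actions-{REGION}"
                    "/cloud-sql-proxy/cloud-sql-proxy.sh"
                )
            }
        ],
    },
}

# Create the cluster, run the job (elided), then tear it all down again.
client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()
# ... submit Spark/Hive jobs here ...
client.delete_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": "etl-20240101-abc123",
    }
).result()
```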
The usage stats are amazing! Just look at these numbers (some back-of-envelope maths after the list):
- 17k compute hours/day
- 175 data scientists
- about $7,600 a day (that’s pretty cheap! And remember, it’s all traceable down to the job)
- 295 TB
- 35 clients
- ~3k nodes
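Taking those numbers at face value (and assuming the $7,600 is the total daily spend):

```python
# Back-of-envelope maths on the quoted numbers, taken at face value.
compute_hours_per_day = 17_000
cost_per_day = 7_600            # assumed to be total daily spend, in USD
data_scientists = 175

print(f"${cost_per_day / compute_hours_per_day:.2f} per compute hour")      # ~$0.45
print(f"${cost_per_day / data_scientists:.2f} per data scientist per day")  # ~$43.43
```

At roughly 45 cents per compute hour, it’s easy to see why paying per use beats keeping a ~3k-node estate running around the clock.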
Terraform is used to handle updates, and can build or tear down the entire cluster.
Oh, another really interesting thing: each solution is spread across two GCP projects per client, one of which is the store, and the other the runtime (all the ephemeral clusters etc.).
They don’t use BigQuery (much), partly because some of their clients prescribe different cloud providers.
A particularly amazing graph at the start shows how their costs dropped drastically once they introduced ephemeral clusters. This is really interesting: the cloud gives you the horsepower to throw resources at the problem and prove your product. Then, as the tech improves, you’re able to achieve the same with less, which must have profound implications for your business.
So, some questions I thought of…
- What about analytics across multiple clients? Is that a strict no-no? Not even aggregate comparisons etc?
- Is there any standard data model? Or is each client different?
- Cloud portable, but not cloud agnostic. Why’s that then? Is it because any agnostic layer will only ever support the lowest common denominator, tech-wise?
- Do we see an end to needing to deploy network layers (firewalls etc) as things move more towards serverless?
Hi Dan,
Thanks very much for posting this, it’s great to know that the stuff we’re doing here at dunnhumby is of interest to people.
I’ll try to address your questions:
What about analytics across multiple clients? Is that a strict no-no? Not even aggregate comparisons etc?
We certainly don’t mix and match data from different clients – that’s an absolute no-no. We don’t do many (any???) aggregated comparisons of our clients that I know of, and I’d question the value of doing that, to be honest – none of our clients are direct competitors of each other (as far as I know). If we were, hypothetically, to sign up two competitors such as (ooo) Next & Zara then perhaps there’d be scope to do such a thing, but I’m sure there’d be contractual restrictions on using one client’s data for the benefit of another. In short, no, we don’t run analyses across multiple clients’ data.
One thing we do do internally is compare the cost of running our infrastructure for each of our clients – the architecture described above makes that very easy. Hence we know our infrastructure cost-to-serve for each of our clients, and of course that plays into contract negotiations and the like.
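For anyone curious how that kind of attribution works mechanically on GCP: resource labels are the standard technique, and because every job gets its own ephemeral cluster, labelling the cluster is enough to trace spend down to the individual job in the billing export. A sketch, with invented label keys and values (not necessarily dunnhumby’s scheme):

```python
# Sketch: per-client / per-job cost attribution via resource labels.
# Dataproc propagates cluster labels to the underlying Compute Engine
# resources, so they show up in the billing export. The keys and
# values below are invented examples.
from google.cloud import dataproc_v1

cluster = dataproc_v1.Cluster(
    project_id="my-client-runtime",       # hypothetical "runtime" project
    cluster_name="etl-20240101-abc123",   # one cluster per job run
    labels={
        "client": "client-a",   # group spend by client...
        "job": "daily-etl",     # ...and, with one cluster per job, by job
    },
)
# Pass this (with a full config, as in the earlier sketch) to
# ClusterControllerClient.create_cluster(); spend can then be grouped
# by these labels in the exported billing data.
```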
Is there any standard data model? Or is each client different?
Yes, each client is different, and yes, we have a standard data model 🙂 I referred to it in the video above as our “common data model”.
Cloud portable, but not cloud agnostic. Why’s that then? Is it because any agnostic layer will only ever support the lowest common denominator, tech-wise?
A few answers to this. Firstly, is it really possible to be truly cloud agnostic? I’m not sure that it is; there will always be subtle nuances. We don’t know, because we haven’t yet needed to move our architecture to a different cloud – it will be interesting if/when we do.
Secondly, we have to weigh up the benefits of agnosticism against leveraging cloud-specific benefits/differentiators. BigQuery would be one such example.
Do we see an end to needing to deploy network layers (firewalls etc) as things move more towards serverless?
I don’t think so; it’s still important to ringfence the stuff that we provide. If/when we’re truly serverless I’ll be able to answer this better 🙂