November | 2016 | Codeks Blog

So, a few weeks ago was PCM16 – a 3-day blur of beer, tech talk, beer and sightseeing in Antwerp, Belgium.

Before I give my view – do make sure you read others; there were 2 streams, so it was not possible to see everything! The others I’m aware of:

http://www.bizcubed.com.au/pcm-16/

Happy to add others! Actually I can’t even find the live blog link right now, but you’ll find it on Know.bi somewhere!

So how was the event? Well, as always, every year has a different theme. This year it was clearly all about the product. And it’s not just webSpoon! There have been a slew of product upgrades in the last year, all very interesting.

And now the talks:

Pedro – Pentaho 7

Released in time for the event!
Analytics anywhere
This is a progression along that path
Hadoop security improvements (only CDH for now)
Single server in technology – now to make the product singular to the users.

Tom – Devops

Well we already saw this at PLUG!
Mentioned use of bigtop distro.
How is it that Nagios still looks as shite now as it did when it came out 10+ years ago? 🙂
PDI Charms available, BA being worked on (Juju)

Nelson/Miguel – Snapshots in hadoop

Definition discussion on “in” and “out” labels.
The approach only works if you can identify your “last” event in a chain – Unfortunately that’s not always possible.

Duarte – Viz API

v3 of API is nearly done
Lots of very interesting tech details – But the gist of it is it will become easier to just “plug in” your own visualisations
Not clear how many people *do* build their own visualisations though!
Will probably be in the platform in 7.1; Analyzer will switch shortly after
Good to see improvement and hard work going into the visualisation part of the product

Joao – CBF2

Now a shell script. (hmm!)
Usual benefits of CBF – quick switching etc.
Deploys all your code – not sure if the Juju charm does this?
CE and EE

Jens – PDI7

Metadata injection!
11 more steps to be supported in 7.1
Filter rows uses the same xml from the KTR to define the condition to inject. Cunning workaround to a tricky problem!
3 phases of metadata injection
- Simple bog standard example
- Data flow
- 2 phase
:8080/pentaho/kettle/listServices
JSON performance has improved again – you don’t have to pass the massive blob of JSON on to the next step
COPY AND PASTE BUG FIXED!!!!!! This got a huge cheer
Labs – team are working with Hitachi labs too. Exciting times

Matt Casters

In just 2-3 weeks since unit testing was shown at PLUG it’s moved on massively and is now fully usable!
300+ team / community using PDI at Hitachi
PDI Spreading wildly.

Hiromu Hota – Webspoon

Hurrah, web based spoon is here
Amazing architecture solution – Essentially just run spoon on the server, then you don’t have to maintain another UI

Julien + gang

Showed a handy SSO solution for embedded analytics.

Wael – data science!

Same as PLUG
Everything we do as ETL guys, will be for the data scientists in the future!

Bart / @rvanbruggen

Great talk showcasing neo4j, and beer

So, on Thursday I attended BigDataWeek in London, just down from Liverpool St. Station. This was the conference that we tied this week’s Pentaho London usergroup up with as well, so we both shared some attention and advertising. Cunning eh?

Anyway herewith follows my brain dump / brief summary / key points from the tech stream. Yes – Tech stream only, I’ll not apologise for that!

So – A quick brain dump:

Shazam

Very clever data driven marketing
Amazing insights
Good numbers of people using it – Surprised.
Integration with FB advertising, Twitter integration on the way. Enables them to build a “discussion” with a brand.

Worldpay

A very familiar secure datalake story – Hortonworks this time.
Use your CC 3 times, and Worldpay will by then have your details. ~47% of all UK transactions.
36M trx/day
Explicit policy to exploit open source
Team of 30
18 months, in production
Ability to burst to cloud (double tokenisation). Great idea!
Disks never leave the data center. Security obsessed!
Aim to have 70% permies, 10% consultancy, and 20% contractors. Not at those levels yet
Lots of history from legacy systems still to load
56 nodes, 20 cores, 12 x 4TB disks, 256GB RAM, etc.

Barclays

This was interesting – Looking at graphs for recommendations. It’s the opposite of a recent project I worked on 🙂
They look at expected degrees of separation between businesses.
Insane number of paths – becomes a massive matrix
Some results surprising and not really believable
They didn’t seem to know who they would sell this data to. Audience asked twice.
Spark. No graph DB. Using a PageRank algo in there too somehow.

Google data platform

Looked at the Street View dataset – initially done for a laugh, but then with image recognition came a tonne of new opportunities
Ability to do massive crunching like that at scale and low cost
Pay only for what you need.
BigQuery now supports full SQL, hurrah. Need to try this against Mondrian…
Interesting discussion on dealing with the ever problematic issue of late arriving data in a streaming system
Beam – batch AND stream.
Mentioned loads of DI tools, inc. Talend and various commercial ones.
Spark cluster can be spun up in 90 seconds.
Good point about the history: Google went from publishing papers, to then making those services available, and now doing both at the same time.
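
To make the late arriving data point a bit more concrete, here is a minimal sketch in plain Python (not the Beam API, and not what was shown in the talk). It counts events per event-time window and keeps each window open for a grace period after the watermark passes it. The window size, the grace period and the "watermark = highest event time seen so far" rule are all assumptions of mine for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # assumed tumbling window size
ALLOWED_LATENESS = 30    # assumed grace period after a window's end


def window_start(event_time):
    """Start of the tumbling window that an event-time timestamp falls into."""
    return event_time - (event_time % WINDOW_SECONDS)


def process(events):
    """Count events per event-time window, tolerating some late arrivals.

    `events` is an iterable of (event_time, arrival_time) pairs in arrival
    order. The watermark here is simply the highest event time seen so far
    (a simplification). Returns (counts_per_window, dropped_events).
    """
    counts = defaultdict(int)
    dropped = []
    watermark = 0
    for event_time, arrival_time in events:
        watermark = max(watermark, event_time)
        start = window_start(event_time)
        window_closes_at = start + WINDOW_SECONDS + ALLOWED_LATENESS
        if watermark >= window_closes_at:
            # Too late: this window has already been finalised.
            dropped.append((event_time, arrival_time))
        else:
            counts[start] += 1
    return dict(counts), dropped


events = [
    (10, 11),    # on time, window starting at 0
    (70, 71),    # on time, window starting at 60
    (50, 72),    # late, but within the grace period for window 0
    (130, 131),  # pushes the watermark to 130
    (20, 132),   # far too late for window 0, so it is dropped
]
counts, dropped = process(events)
print(counts)
print(dropped)
```

The trade-off is the usual one: a longer grace period catches more stragglers but delays the point at which a window's numbers can be called final.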

Smart Cities

@JonnyVoon
A great talk. Sadly he was up against a blockchain talk in the other room so a lot of people left.
It’s time to focus on the present
Buzzword bingo gives you the excuse to forget about the why
Example of the @seesense_cc (?) smart bikelight.
There is something called the London Datastore – This seems worth checking out! ***

Bigstep / Ansible

In 2 minutes, you can have your services deployed on bare metal. (faster)
Pay per second. Usual AAS stuff
Usual full app stacks available.
Interesting use of the “tree” unix command.
Safe scale down – ability to evict data from data nodes before shutting them down.
Unclear how they handle security, how do they securely wipe that data?
Unclear architecture diagrams
Lots of good learnings (see their slides)

Skyscanner Data Quality

The only talk of many at this event on data quality. That is not good!
Must define confidence in your numbers
Must measure quality
Huge costs are associated with bad data quality.
github.com/Quartz/bad-data-guide
They push all validation failures into their own topic so they can be fixed and re-published
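
A rough sketch of that "failures go to their own topic" idea, in plain Python with a dict standing in for the message broker. This is my own illustration, not Skyscanner's implementation: the topic names, the record fields and the validation rules are all hypothetical.

```python
from collections import defaultdict

# In-memory stand-in for a message broker: topic name -> list of messages.
topics = defaultdict(list)


def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price < 0:
        problems.append("bad price")
    return problems


def publish(record):
    """Send valid records to the main topic, invalid ones to a side topic."""
    problems = validate(record)
    if problems:
        topics["events.invalid"].append({"record": record, "problems": problems})
    else:
        topics["events"].append(record)


def republish_fixed(fix):
    """Apply `fix` to every parked record and try publishing it again."""
    parked = topics["events.invalid"]
    topics["events.invalid"] = []
    for item in parked:
        publish(fix(item["record"]))


publish({"id": "a1", "price": 120.0})
publish({"id": "a2", "price": -5})          # fails validation, gets parked
invalid_before = len(topics["events.invalid"])

# Hypothetical fix: the sign was wrong upstream, so take the absolute value.
republish_fixed(lambda r: {**r, "price": abs(r["price"])})
print(invalid_before, len(topics["events"]), len(topics["events.invalid"]))
```

The nice property is that bad records never block the main flow, and nothing is silently thrown away: anything still broken after a fix simply lands back in the invalid topic.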

Facebook

After lunch the tech room was re-invigorated and raring to go again!
Working on the “workplace” product which was originally an internal facebook tool
Scalable metrics platform
Bemoaning lack of open source data integration tools until recently. Seriously? Are Facebook devs not allowed to use Google?
Daily/Monthly accounting
Lots of standardised metrics.
Good metrics are not necessarily trivial to understand.
They’re hiring!
As the facebook ETL framework is not opensourced, they’re unlikely to opensource the metric framework
No screenshots. Boo.
They move fast – faster even than a startup (the guy had startup experience)

TFL

This was very interesting. Lots of legacy systems, lack of visibility of data, and massive political challenges
Road space management
Still using lots of RDBMS
Pressure to publish data
Relying on opening data and letting integrators do the rest. IMHO this is risky and in the past I’ve seen it not work at all – but times change..
14K traffic sensors, at 4Hz. 400M events/day
Use a lot of R and have a vibrant community of users
Check out their arch. diagrams – very interesting.
Problems are the lack of skills – not technology. Workaround to this: Use hackathons!

MapR

Healthcare in US example
Anomaly detection
ROI – $22 for every $1 spent. Nice!
Aadhaar ID system (in India) – very similar to Aegate drug tracking system (but on the right technology this time!)
60% population, 20% reduction in fraud, $50B savings
pretty much a sales talk. No tech.

Telefonica

Struggling with so many separate companies, different data systems, different tech. Created a global corp model
Hortonworks – they want to use as open tech as possible
So no SAS – R, Spark, Hadoop
Mentions the title “Data Journalist” – what is that!!?
also apologises for no tech detail
Talking about what level people try to connect, e.g. 2g, 3g, 4g, but that’s weird, surely everyone starts at 4g these days and works down.
they have a team to train local people to implement the global model
issues with getting data out/in some countries – meaning can’t use cloud

NATS – air traffic control

some low bandwidth data links from the planes make things interesting
real time of course
70 data scientists, 700 engineers
Future prototypes on MapR, Spark and Bigstep
Looking at GPU
3PB
Looking at how to monetise the data
Cracked some complicated data. Encodings. Binary. Common in IoT it seems! Radar

Overall summary of the event

Good talks – would go next year
Good location
Good balance of people – business/tech, even the odd sales. And even a balance of women and men.

Finally, here are some links they sent out to some of the presentations/talks – there’s lots of goodies within this!

KEYNOTES

TECHNICAL TRACK

BUSINESS TRACK



#PCM16 – 9th community meetup!

#BigDataWeek London