So, on Thursday I attended BigDataWeek in London, just down from Liverpool St. Station. This was the conference that we tied this week’s Pentaho London usergroup in with as well, so we both shared some attention and advertising. Cunning, eh?
Anyway herewith follows my brain dump / brief summary / key points from the tech stream. Yes – Tech stream only, I’ll not apologise for that!
So – A quick brain dump:
Shazam
- Very clever data driven marketing
- Amazing insights
- Good numbers of people using it – Surprised.
- Integration with FB advertising, Twitter integration on the way. Enables them to build a “discussion” with a brand.
Worldpay
- A very familiar secure datalake story – Hortonworks this time.
- Use your CC 3 times, and Worldpay will by then have your details. ~47% of all UK transactions.
- 36M trx/day
- Explicit policy to exploit open source
- Team of 30
- 18 months, in production
- Ability to burst to cloud (double tokenisation). Great idea!
- Disks never leave the data center. Security obsessed!
- Aim to have 70% permies, 10% consultancy, and 20% contractors. Not at those levels yet
- Lots of history from legacy systems still to load
- 56 nodes, 20 cores, 12 x 4TB disks, 256GB RAM, etc.
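As a rough sanity check on those cluster and volume figures, here is a back-of-envelope sketch. It assumes the disk spec is per node and that HDFS is left at its default replication factor of 3; neither was stated in the talk, so treat the results as ballpark only.

```python
# Back-of-envelope sizing from the figures quoted in the talk.
# Assumptions (NOT from the talk): the disk spec is per node, and
# HDFS replication is left at its default of 3.

NODES = 56
DISKS_PER_NODE = 12
DISK_TB = 4
REPLICATION = 3  # assumed HDFS default

TRX_PER_DAY = 36_000_000
SECONDS_PER_DAY = 24 * 60 * 60

raw_tb = NODES * DISKS_PER_NODE * DISK_TB        # total raw disk
usable_tb = raw_tb / REPLICATION                 # after replication
avg_trx_per_sec = TRX_PER_DAY / SECONDS_PER_DAY  # average, not peak

print(f"Raw disk: {raw_tb} TB")
print(f"Usable at {REPLICATION}x replication: {usable_tb:.0f} TB")
print(f"Average load: {avg_trx_per_sec:.0f} trx/s")
```

Under those assumptions that is about 2.7PB raw, a bit under 900TB usable, and an average of a little over 400 transactions a second (peaks will of course be higher).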
Barclays
- This was interesting – Looking at graphs for recommendations. It’s the opposite of a recent project I worked on 🙂
- They look at expected degrees of separation between businesses.
- Insane number of paths – becomes a massive matrix
- Some results surprising and not really believable
- They didn’t seem to know who they would sell this data to. Audience asked twice.
- Spark. No graph DB. Using a PageRank algo in there too somehow.
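To make the two ideas above a bit more concrete, here is a minimal sketch of degrees of separation (via breadth-first search) and a plain power-iteration PageRank. It is plain Python on a tiny made-up graph, not Spark and not Barclays' actual method; the business names and edges are invented purely for illustration.

```python
from collections import deque

# Toy, made-up graph: an edge means "these two businesses trade with each other".
EDGES = [
    ("bakery", "mill"),
    ("mill", "farm"),
    ("bakery", "cafe"),
    ("cafe", "roaster"),
]

def build_graph(edges):
    """Build an undirected adjacency map from a list of pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degrees_of_separation(graph, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; every node here has at least one neighbour."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, neighbours in graph.items():
            share = damping * rank[node] / len(neighbours)
            for nxt in neighbours:
                new_rank[nxt] += share
        rank = new_rank
    return rank

graph = build_graph(EDGES)
print(degrees_of_separation(graph, "farm"))
print(pagerank(graph))
```

Running the search from every node gives the all-pairs picture, which is where the "massive matrix" comes from: the number of pairs grows with the square of the number of businesses.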
Google data platform
- Looked at the Street View dataset – initially done for a laugh, but then with image recognition came a tonne of new opportunities
- Ability to do massive crunching like that at scale and low cost
- Pay only for what you need.
- BigQuery now supports full SQL, hurrah. Need to try this against Mondrian…
- Interesting discussion on dealing with the ever problematic issue of late arriving data in a streaming system
- Beam – batch AND stream.
- Mentioned loads of DI tools, inc Talend and various commercial ones.
- Spark cluster can be spun up in 90s.
- Good point about the history of Google: it went from publishing papers, to then making those services available, and is now doing both at the same time.
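On the late arriving data point, the usual idea is to group events by when they happened (event time) rather than when they turned up, and to keep each window open for a grace period. Below is a minimal pure-Python sketch of that idea. It is not the Beam API; the class name, window size and lateness values are made up for illustration, and the watermark is simplified to "the latest event time seen so far".

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # tumbling window size (illustrative value)
ALLOWED_LATENESS = 30   # grace period after a window ends (illustrative value)

class WindowedCounter:
    """Counts events per event-time window, tolerating some lateness."""

    def __init__(self):
        self.counts = defaultdict(int)  # window start -> event count
        self.watermark = 0              # simplified: highest event time seen
        self.dropped = 0                # events that arrived too late

    def add(self, event_time):
        """Add one event; return False if it was too late to be counted."""
        self.watermark = max(self.watermark, event_time)
        window_start = event_time - (event_time % WINDOW_SECONDS)
        window_end = window_start + WINDOW_SECONDS
        if self.watermark > window_end + ALLOWED_LATENESS:
            self.dropped += 1           # window already closed
            return False
        self.counts[window_start] += 1
        return True

counter = WindowedCounter()
for t in [10, 70, 50, 130, 20]:   # 50 is late but tolerated, 20 is too late
    counter.add(t)
print(dict(counter.counts), "dropped:", counter.dropped)
```

The trade-off is the usual one: a longer grace period catches more stragglers but delays the point at which a window's result can be treated as final.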
Smart Cities
- @JonnyVoon
- A great talk. Sadly he was up against a blockchain talk in the other room so a lot of people left.
- It’s time to focus on the present
- Buzzword bingo gives you the excuse to forget about the why
- Example of the @seesense_cc (?) smart bikelight.
- There is something called the London Datastore – This seems worth checking out! ***
Bigstep / Ansible
- 2 minutes, you can have your services deployed on bare metal. (faster)
- Pay per second. Usual AAS stuff
- Usual full app stacks available.
- Interesting use of the “tree” unix command.
- Safe scale down – ability to evict data from data nodes before shutting them down.
- Unclear how they handle security, how do they securely wipe that data?
- Unclear architecture diagrams
- Lots of good learnings (see their slides)
Skyscanner Data Quality
- The only talk of many at this event on data quality. That is not good!
- Must define confidence in your numbers
- Must measure quality
- Huge costs are associated with bad data quality.
- github.com/Quartz/bad-data-guide
- They push all validation failures into their own topic so they can be fixed and re-published
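The "failures go to their own topic" pattern can be sketched roughly as follows. This uses an in-memory dict as a stand-in for a real message broker, and the topic names, record fields and validation rules are all hypothetical; it is not Skyscanner's code.

```python
from collections import defaultdict

# In-memory stand-in for a message broker: topic name -> list of messages.
# Topic names and record fields below are hypothetical.
topics = defaultdict(list)

def validate(record):
    """Return a list of problems; an empty list means the record is good."""
    problems = []
    if not record.get("currency"):
        problems.append("missing currency")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price < 0:
        problems.append("bad price")
    return problems

def publish(record):
    """Route a record to the valid topic, or to the invalid one with reasons."""
    problems = validate(record)
    if problems:
        topics["searches.invalid"].append({"record": record, "problems": problems})
    else:
        topics["searches.valid"].append(record)

def republish_fixed(fix):
    """Apply a fix to every failed record and push it through validation again."""
    failed, topics["searches.invalid"] = topics["searches.invalid"], []
    for item in failed:
        publish(fix(item["record"]))

publish({"price": 120.0, "currency": "GBP"})
publish({"price": 99.0, "currency": ""})             # lands in the invalid topic
republish_fixed(lambda r: {**r, "currency": "GBP"})  # fixed and re-published
```

Keeping the failure reasons alongside the record means the invalid topic doubles as a measure of data quality, which ties in with the "must measure quality" point above.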
Facebook
- After lunch the tech room was re-invigorated and raring to go again!
- Working on the “Workplace” product which was originally an internal Facebook tool
- Scalable metrics platform
- Bemoaning lack of open source data integration tools until recently. Seriously? Are Facebook devs not allowed to use Google?
- Daily/Monthly accounting
- Lots of standardised metrics.
- Good metrics are not necessarily trivial to understand.
- They’re hiring!
- As the Facebook ETL framework is not open sourced, they’re unlikely to open source the metric framework
- No screenshots. Boo.
- They move fast – Faster even than a startup (Guy had startup experience)
TFL
- This was very interesting. Lots of legacy systems, lack of visibility of data, and massive political challenges
- Road space management
- Still using lots of RDBMS
- Pressure to publish data
- Relying on opening data and letting integrators do the rest. IMHO this is risky and in the past I’ve seen it not work at all – but times change.
- 14K traffic sensors, at 4Hz. 400M events/day
- Use a lot of R and have a vibrant community of users
- Check out their arch. diagrams – very interesting.
- Problems are the lack of skills – not technology. Workaround to this: Use hackathons!
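A quick bit of arithmetic on the sensor figures above, taken exactly as noted. It is only a sanity check: the assumption that every sensor reports at the full 4Hz all day is an assumption here, not something stated in the talk.

```python
SENSORS = 14_000
READINGS_PER_SECOND = 4              # 4Hz, assumed constant for every sensor
SECONDS_PER_DAY = 24 * 60 * 60
QUOTED_EVENTS_PER_DAY = 400_000_000  # the events/day figure as noted

# Raw readings if every sensor really reported 4 times a second, all day.
raw_readings_per_day = SENSORS * READINGS_PER_SECOND * SECONDS_PER_DAY

# Per-sensor rate implied by the quoted events/day figure.
quoted_rate_per_sensor = QUOTED_EVENTS_PER_DAY / SENSORS / SECONDS_PER_DAY

print(f"Raw readings/day at a constant 4Hz: {raw_readings_per_day:,}")
print(f"Per-sensor rate implied by 400M/day: {quoted_rate_per_sensor:.2f} Hz")
```

A constant 4Hz across all sensors would come to roughly 4.8 billion raw readings a day, so the 400M events/day presumably counts something narrower than every raw reading, or not every sensor reports at the full rate.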
MapR
- Healthcare in US example
- Anomaly detection
- ROI – $22 for every $1 spent. Nice!
- Aadhaar ID system (in India) – very similar to Aegate drug tracking system (but on the right technology this time!)
- 60% population, 20% reduction in fraud, $50B savings
- pretty much a sales talk. No tech.
Telefonica
- Struggling with so many disparate companies, different data systems, different tech. Created a global corp model
- Hortonworks – they want to use as open tech as possible
- so no SAS – R, Spark, Hadoop
- mentions the title “Data Journalist” – what is that!!?
- also apologises for no tech detail
- Talking about what level people try to connect, e.g. 2g, 3g, 4g, but that’s weird, surely everyone starts at 4g these days and works down.
- they have a team to train local people to implement the global model
- issues with getting data out/in some countries – meaning can’t use cloud
NATS – air traffic control
- some low bandwidth data links from the planes make things interesting
- real time of course
- 70 data scientists, 700 engineers
- future prototypes on mapr, spark and bigstep
- looking at gpu
- 3pb
- Looking at how to monetise the data
- cracked some complicated data. Encodings. Binary. Common in IOT it seems! Radar
Overall summary of the event
- Good talks – would go next year
- Good location
- Good balance of people – business/tech, even the odd sales. And even a balance of women and men.
Finally, here are some links they sent out to some of the presentations/talks – there are lots of goodies within this!
KEYNOTES
- Mark van Rijmenam, Datafloq – Big Data Is Dead, Long Live Big Data
- Josh Partridge, Shazam – How Labels, Radio Stations, and Brands Leverage Shazam Data
- David Walker, Worldpay – Deploying Secure Operational Clusters at Worldpay
TECHNICAL TRACK
- Harry Powell & Raffael Strassnig, Barclays UK – Graph-Based Recommendations
- William Vambenepe, Google – The Next Generation Data Platform
- Jonny Voon, Innovate UK – Smart Cities and the Buzz Word Bingo
- Marius Boeru, Bigstep – How to Automate Big Data with Ansible
- Scott Krueger, Skyscanner – Does More Data Mean Better Decision Making?
- Roland Major, Transport for London – Cloud Search Secured
- Rob Anderson, MapR – Where Big Data Has an Intersection with Everyday Lives
- John Belchamber, Telefonica – New Data, New Strategies, New Opportunities
- Daimon Brown, NATS – Reducing Congestion at One of the World’s Busiest Airports
- Mishal Patel, NHS – Modernising Routine Breast Cancer Screening Using BigData
- Ingrid Funie, Imperial College London – Machine Learning and FPGA-Based Hardware Acceleration
- Chris von Csefalvay, Helioserv – Cats, and What They Tell Us about Big Data and IoT
BUSINESS TRACK
- Nondas Sourlas, Bupa – Big Data in Healthcare
- Martin Goodson, Skimlinks – Ten Reasons Your Data Project Is Going to Fail
- Charlie Ballard, TripAdvisor – Tripadvisor & Constant Change: Building Relationships by Applying Big Data
- John Callan, Boxever Data and Analytics – The Fuel Your Brand, and Your Customers, Deserve
- Alex Bordei, Bigstep – Building Data Labs in the Cloud
- Amjad Zaim, Cognitro Analytics – How Deep Is Your Learning
- Vojta Roček, Trologic – Challenging Big Data
- Wael Elrifai, Pentaho – Big Data-Driven Business Innovation
- Deenar Toraskar, Think Reactive – Fast Data Key to Efficient Capital Management