Unstructured data, Apache Tika and Beer

Haha, that got ya. What?

Before I start – the whole “Unstructured” thing has been going around a while now, and the definition is a bit like “bigdata”: no one really knows what it means. But I think some people would say that a PDF is unstructured data. (I personally see very little unstructured data out there; it’s all fundamentally structured in some way, otherwise it can’t possibly have any meaning.)

So, I came across Apache Tika and was itching to give it a go. Excellent I thought, this can easily be made into a plugin.

Ah, but hang on a minute, Matt has already done it:

So, I gave that a go, but because VFS was upgraded to 2.0 in PDI 6, the current marketplace plugin doesn’t work. So: fork the repo, clone it, install “gradle” (which, err, looks a lot like ant, right?), change the code according to Pedro’s VFS post, and voila! A plugin that works.  Well, sort of – somehow I lost all the resources, but meh, it works.

So what does Tika do? Well, it reads pretty much anything, and for my purposes I fancy reading the DIY DOG PDF that BrewDog recently released.  This is a MASSIVE document with 200+ recipes for the home brewer.

Now, Tika gives you the whole document, as either text, HTML, JSON or XML, in a SINGLE field. Right, OK.  This means be careful with your preview tab, as it struggles to deal with 20 MB in a single cell!

So, first step: split it into separate lines. Then identify the beer, do some grouping, blah.  So what can you find out?

Well, surprisingly, Amarillo is the most popular hop.  Used in 53 of the beers. Here’s a snippet to show a bit more:

[Image: hops]

Clearly I have some data quality issues there, as the first line is blank, and “Hop”, “Dry” or “Aroma” are not particularly useful as hop names.
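For anyone who wants to play with the same idea outside PDI, here is a minimal sketch of the split / identify / group steps in Python. It is not my transformation, and the layout it assumes is made up for illustration: a beer starts at a line like “#1 SOME NAME”, and a hop line looks like “<hop name> <grams>g <stage>”. The real text that Tika produces for the PDF may well look different, so treat both regexes as assumptions to be adjusted.

```python
import re
from collections import Counter

# ASSUMED layout (hypothetical, not taken from the real PDF):
#   "#12 SOME BEER NAME"       starts a new beer
#   "Amarillo 12.5g Start"     is a hop line under the current beer
BEER_HEADER = re.compile(r"^#(\d+)\s+(.+)$")
HOP_LINE = re.compile(r"^(?P<hop>[A-Za-z][A-Za-z ]*?)\s+\d+(?:\.\d+)?\s*g\b")


def hops_by_beer(document_text):
    """Split one big text field into lines and collect the hops under each beer."""
    beers = {}
    current = None
    for raw in document_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines so they never become an empty "hop"
        header = BEER_HEADER.match(line)
        if header:
            current = header.group(2).strip()
            beers[current] = set()
            continue
        hop = HOP_LINE.match(line)
        if hop and current is not None:
            # Normalise case so "amarillo" and "Amarillo" count as one hop
            beers[current].add(hop.group("hop").strip().title())
    return beers


def hop_popularity(beers):
    """Count how many beers use each hop (a hop counts once per beer)."""
    counts = Counter()
    for hops in beers.values():
        counts.update(hops)
    return counts


# Invented sample text, only to show the shape of the input
SAMPLE = """#1 EXAMPLE PALE
Amarillo 12.5g Start
Cascade 10g End

#2 EXAMPLE STOUT
Amarillo 5g Middle
amarillo 20g End
"""

if __name__ == "__main__":
    for hop, n in hop_popularity(hops_by_beer(SAMPLE)).most_common():
        print(hop, n)
```

Skipping blank lines and only accepting lines that carry a weight is one cheap way to keep empty values and stray words out of the hop counts.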

Next steps? Improve the data quality of course, but then you could easily expose this data in a way that makes it much more searchable than it is in the PDF, e.g. show me all the beers within a given IBU range that contain one or more of a selection of hops, etc.
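That kind of search is not much code once the data is clean. A rough sketch in Python, assuming each beer has already been reduced to a name, an IBU value and a set of hops (the `Beer` class, the `find_beers` function and the catalogue in the usage note are all invented for illustration, not real recipes):

```python
from dataclasses import dataclass, field


@dataclass
class Beer:
    # Hypothetical record shape: whatever the cleaned-up extract ends up holding
    name: str
    ibu: float
    hops: set = field(default_factory=set)


def find_beers(beers, ibu_min, ibu_max, any_of_hops):
    """Names of beers with IBU in [ibu_min, ibu_max] that use at least one wanted hop."""
    wanted = {h.lower() for h in any_of_hops}
    return [
        b.name
        for b in beers
        # set intersection is non-empty when the beer uses any wanted hop
        if ibu_min <= b.ibu <= ibu_max and wanted & {h.lower() for h in b.hops}
    ]
```

With a made-up catalogue of `Beer("Example Pale", 35, {"Amarillo", "Cascade"})`, `Beer("Example Stout", 60, {"Fuggles"})` and `Beer("Example Lager", 20, {"Amarillo"})`, asking for IBU 30 to 70 with Amarillo or Simcoe returns just the pale.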

You can find my transformation on GitHub as always!

By the way, Tika seems to do an excellent job of reading the PDF, and it’s pretty quick.  The only downside is that because the whole lot goes into one field, there’s no “streaming” of the data.  If you could say “give me the text line by line”, for example, that might be better for big documents.
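To show what I mean by “line by line”, here is a small Python sketch of the consumer side. It is not Tika’s or the plugin’s API: `reader` is just a stand-in for any text stream with a `read(n)` method, and `iter_lines` is a name I made up. The point is only that the downstream steps see one line at a time and the full document never has to sit in a single field.

```python
import io


def iter_lines(reader, chunk_size=8192):
    """Yield lines from a text stream without holding the whole document in memory."""
    pending = ""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break  # end of stream
        pending += chunk
        # Everything before the last "\n" is complete; the remainder may be a partial line
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line
    if pending:
        yield pending  # last line had no trailing newline


if __name__ == "__main__":
    # io.StringIO stands in for a real streaming source
    for line in iter_lines(io.StringIO("first line\nsecond line\n"), chunk_size=4):
        print(line)
```

This only splits on "\n", so Windows-style line endings would leave a trailing "\r" on each line.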


Pentaho Community Meetup May 16 #iot

So, last night was our quarterly(ish) meetup for the Pentaho community, and we welcomed Wael to the presenters club to talk about IOT and, most interestingly, the new Lumada platform from Hitachi. (Cue many jokes about lambada…)

We then had Ricardo Pires (yes, I got it right in the end) talking about fusion charts:

So, key takeaways:

Wael’s explanation of the 5 Cs of IOT was excellent.  It is documented in a blog here: http://www.pentaho.com/blog/2016/04/29/getting-ahead-iot but it was good to see it explained face to face!  Funnily enough, it seems various people have published the 7 Cs and the 8 Cs of IOT.  Uh huh, this is a thing we do now then, is it?…

 

Fusion charts was also great – I recommend you check out the plugin on the marketplace. Really good integration with CDE and some nice-looking charts – especially the topology ones, very nice work there!  That’s clearly got a great place in the #iot world.

Needless to say, we continued the discussion in the pub afterwards, and it was great to see so many new faces, and so many people joining in.

No date set yet for the next meetup – I’ll get something in the calendar soon. If anyone wants to give a talk then, as usual, let me know!