Haha, that got ya. What?
Before I start: the whole “Unstructured” thing has been going around a while now, and the definition is a bit like “bigdata”. No one really knows what it means. But I think some people may say that a PDF is unstructured data. (I personally see very little unstructured data out there. It’s all fundamentally structured in some way, otherwise it can’t possibly have any meaning.)
So, I came across Apache Tika and was itching to give it a go. Excellent I thought, this can easily be made into a plugin.
Ah, but hang on a minute, Matt has already done it:
Good to see @ApacheTika support in #Pentaho (by @mattyb149) I think that’s worth a try… https://t.co/ul0kcAm25w #hopeitworksin6
— Dan Keeley (@codek1) May 16, 2016
So, I gave that a go, but because VFS was upgraded to 2.0 in PDI 6, the current marketplace plugin doesn’t work. So: fork the repo, clone it, install “gradle” (which, err, looks a lot like ant, right?), change the code according to Pedro’s VFS post, and voila! A plugin that works. Well, sort of. Somehow I lost all the resources, but meh, it works.
So what does Tika do? Well, it reads pretty much any document format, and for my purposes I fancy reading the DIY DOG PDF that BrewDog recently released. This is a MASSIVE document with 200+ recipes for the home brewer.
Now, Tika gives you the whole document, as either text, HTML, JSON or XML, in a SINGLE field. Right, OK. This means be careful with your preview tab, as it struggles to deal with 20 MB in a single cell!
So, first step: split it into separate lines. Then identify the beer, do some grouping, blah. So what can you find out?
Well, surprisingly, Amarillo is the most popular hop. Used in 53 of the beers. Here’s a snippet to show a bit more:
Clearly I have some data quality issues there, as the first line is blank, and “Hop”, “Dry” or “Aroma” are not particularly useful as hop names.
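For anyone who wants to play with the same idea outside PDI, here is a rough Python sketch of the split, identify and group steps. It is not my actual transformation. The "BEER:" and "HOP:" prefixes and the sample text are made up purely for illustration; the real DIY DOG text is laid out differently and is a lot messier.

```python
from collections import Counter

# Hypothetical stand-in for the single text field that comes back from Tika.
# The "BEER:" / "HOP:" prefixes are an assumption made for this sketch only.
SAMPLE_TEXT = """\
BEER: Example Ale One
HOP: Amarillo
HOP: Cascade

BEER: Example Ale Two
HOP: Amarillo
HOP: Amarillo
"""


def hops_by_beer(text):
    """Split the blob into lines and group hop lines under the current beer."""
    beers = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines so they don't turn up as "hops"
        if line.startswith("BEER:"):
            current = line[len("BEER:"):].strip()
            beers[current] = set()
        elif line.startswith("HOP:") and current is not None:
            # A set means a hop listed twice in one recipe counts once.
            beers[current].add(line[len("HOP:"):].strip())
    return beers


def hop_popularity(beers):
    """Count how many beers use each hop."""
    counts = Counter()
    for hops in beers.values():
        counts.update(hops)
    return counts


popularity = hop_popularity(hops_by_beer(SAMPLE_TEXT))
```

Calling popularity.most_common() then gives the hops ordered by how many beers use them. Skipping blank lines up front is one cheap way to avoid the empty first row mentioned above.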
Next steps? Improve the data quality of course, but then you could easily expose this data in a way that makes it much more searchable than it is in the PDF, e.g. show me all the beers within a given IBU range that contain one or more of a selection of hops, etc.
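As a rough idea of what that kind of search could look like once the data is cleaned up, here is a small Python sketch. The Beer record, its fields and the three example beers are all hypothetical, not values taken from the PDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beer:
    # Hypothetical record shape; the field names are assumptions for this sketch.
    name: str
    ibu: float
    hops: frozenset


def find_beers(beers, ibu_min, ibu_max, any_of_hops):
    """Return beers inside the IBU range that use at least one of the given hops."""
    wanted = {hop.lower() for hop in any_of_hops}
    matches = []
    for beer in beers:
        beer_hops = {hop.lower() for hop in beer.hops}
        if ibu_min <= beer.ibu <= ibu_max and wanted & beer_hops:
            matches.append(beer)
    return matches


# Made-up example data, only here to exercise the function.
BEERS = [
    Beer("Example Ale One", 35.0, frozenset({"Amarillo", "Cascade"})),
    Beer("Example Ale Two", 60.0, frozenset({"Chinook"})),
    Beer("Example Ale Three", 45.0, frozenset({"Amarillo"})),
]

mid_ibu_amarillo = find_beers(BEERS, 30, 50, ["amarillo"])
```

The hop comparison is case-insensitive on purpose, since text pulled out of a PDF rarely has consistent capitalisation.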
You can find my transformation on GitHub as always!
By the way, Tika seems to do an excellent job of reading the PDF, and it’s pretty quick. The only downside is that because the whole lot goes into one field, there’s no “streaming” of the data. If you could say “give me the text line by line”, for example, that might be better for big documents.
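To show what I mean by line by line, here is a minimal Python sketch of a consumer that never holds the whole document at once. It assumes some text stream is available to read from; the iter_lines helper is my own illustration and not something the plugin or Tika provides.

```python
import io


def iter_lines(stream, chunk_size=8192):
    """Yield lines from a text stream one at a time, reading in small chunks."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last newline is complete; keep the tail for later.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            yield line
    if buffer:
        yield buffer  # last line when the text does not end with a newline


# io.StringIO stands in for the extracted text; a real source would be a file
# or a network stream.
lines = list(iter_lines(io.StringIO("first\nsecond\nthird"), chunk_size=4))
```

Memory use then depends on the chunk size and the longest line rather than on the size of the document.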