Unstructured data, Apache Tika and Beer

Haha, that got ya. What?

Before I start – the whole “unstructured” thing has been going around a while now, and the definition is a bit like “big data”: no one really knows what it means. But I think some people would say that a PDF is unstructured data. (I personally see very little unstructured data out there; it’s all fundamentally structured in some way, otherwise it can’t possibly have any meaning.)

So, I came across Apache Tika and was itching to give it a go. Excellent, I thought, this can easily be made into a plugin.

Ah, but hang on a minute, Matt has already done it:

Good to see @ApacheTika support in #Pentaho (by @mattyb149) I think that’s worth a try… https://t.co/ul0kcAm25w #hopeitworksin6

— Dan Keeley (@codek1) May 16, 2016

So, I gave that a go, but because VFS was upgraded to 2.0 in PDI 6, the current marketplace plugin doesn’t work. So: fork the repo, clone it, install “gradle” (which, err, looks a lot like ant, right?), change the code according to Pedro’s VFS post, and voila! A plugin that works. Well, sort of – somehow I lost all the resources, but meh, it works.

So what does Tika do? Well, it reads pretty much anything, and for my purposes I fancy reading the DIY DOG PDF that BrewDog recently released. This is a MASSIVE document with 200+ recipes for the home brewer.

Now, Tika gives you the whole document, as either text, HTML, JSON or XML, in a SINGLE field. Right, OK. This means be careful with your preview tab, as it struggles to deal with 20 MB in a single cell!

So, first step: split it into separate lines. Then identify the beer, do some grouping, blah. So what can you find out?
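To make the split / identify / group idea a bit more concrete, here is a minimal Python sketch of the same logic outside PDI. It is not the actual transformation: the `BEER:` and `HOP:` prefixes, the beer names and the sample text are all invented for illustration, and the real DIY DOG text will need its own rules for spotting a beer heading.

```python
from collections import Counter

# Hypothetical sample standing in for the single big text field Tika returns.
# The "BEER:" / "HOP:" markers and the beer names are made up for this sketch.
document = """BEER: Example Ale
HOP: Chinook
HOP: Amarillo
BEER: Example Stout
HOP: Amarillo
HOP: Amarillo
"""


def hops_by_beer(text):
    """Split the text into lines and attach each hop line to the latest beer."""
    beers = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BEER:"):
            current = line[len("BEER:"):].strip()
            beers[current] = set()
        elif line.startswith("HOP:") and current is not None:
            beers[current].add(line[len("HOP:"):].strip())
    return beers


def hop_popularity(beers):
    """Count how many beers use each hop (a hop counts once per beer)."""
    counts = Counter()
    for hops in beers.values():
        counts.update(hops)
    return counts


beers = hops_by_beer(document)
popularity = hop_popularity(beers)
print(popularity.most_common())
```

Using a set per beer means a hop that appears several times in one recipe is still only counted once for that beer, which matches the "used in N of the beers" style of question.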

Well, surprisingly, Amarillo is the most popular hop, used in 53 of the beers. Here’s a snippet to show a bit more:

[Image: hops]

Clearly I have some data quality issues there, as the first line is blank, and “Hop”, “Dry” and “Aroma” are not particularly useful values.
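One possible way to tidy that up, sketched in Python rather than as PDI steps: drop the blank key and the junk labels, and merge names that differ only by case or stray spaces. Treating “Hop”, “Dry” and “Aroma” as a stop list is my assumption here, and the counts in `raw` are invented, not the real figures.

```python
from collections import Counter

# Labels flagged above as noise; using them as a stop list is an assumption.
NOISE = {"", "hop", "dry", "aroma"}


def clean_hop_counts(counts):
    """Drop blank/noise keys and merge keys that differ only by case or spacing."""
    cleaned = Counter()
    for name, n in counts.items():
        key = name.strip()
        if key.lower() in NOISE:
            continue
        cleaned[key.title()] += n
    return cleaned


# Invented counts, only to exercise the function.
raw = Counter({"": 4, "Amarillo": 5, "amarillo ": 1, "Hop": 3, "Dry": 2, "Aroma": 1, "Chinook": 2})
cleaned = clean_hop_counts(raw)
print(cleaned)
```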

Next steps? Improve the data quality of course, but then you could easily expose this data in a way that makes it much more searchable than just in the PDF, e.g. show me all the beers with a given IBU range that contain one or more of a selection of hops, etc.

You can find my transformation on GitHub as always!

By the way, Tika seems to do an excellent job of reading the PDF, and it’s pretty quick. The only downside is that because the whole lot goes into one field, there’s no “streaming” of the data. If you could say “give me the text line by line”, for example, that might be better for big documents.
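As a rough idea of what that kind of search could look like once the data is clean, here is a small Python sketch. The `Beer` class, the `find_beers` function and the catalogue entries are all hypothetical; nothing here comes from the real recipes.

```python
from dataclasses import dataclass, field


@dataclass
class Beer:
    # Hypothetical record shape; the real data may carry many more fields.
    name: str
    ibu: float
    hops: set = field(default_factory=set)


def find_beers(beers, ibu_min, ibu_max, any_of_hops):
    """Return names of beers inside the IBU range that use at least one wanted hop."""
    wanted = {h.lower() for h in any_of_hops}
    return [
        b.name
        for b in beers
        if ibu_min <= b.ibu <= ibu_max and wanted & {h.lower() for h in b.hops}
    ]


# Invented catalogue, for illustration only.
catalogue = [
    Beer("Example Ale", 40, {"Amarillo", "Chinook"}),
    Beer("Example Stout", 65, {"Amarillo"}),
    Beer("Example Lager", 20, {"Saaz"}),
]

matches = find_beers(catalogue, 30, 50, ["amarillo", "Saaz"])
print(matches)
```

The hop comparison is case-insensitive on purpose, since the extracted text is unlikely to be consistent about capitalisation.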

4 thoughts on “Unstructured data, Apache Tika and Beer”

Hi Dan Keeley,

Could you please share some sample code showing how to call Tika in Pentaho?

Thanks
Ramthin Thilakan

codek

November 28, 2018 at 10:26 am

Just use the load text from file step.

- Ramthin Thilakan
  
  November 28, 2018 at 11:56 am
  
  Could you please guide me to where I can find that step? I have installed the plugin from the marketplace and tried to use some JavaScript to execute the code, but I am getting multiple errors related to the lib.
  
  Thanks in advance
  
  Ramthin T

Download it here:

https://s3.amazonaws.com/kettle-neo4j/load-text-from-file-plugin-2.0.1.zip

Codeks Blog

Open source BI Consultant
