Codeks Blog

Open source BI Consultant

#Metadata acquisition (#IOT)

Posted by codek

So, everything is metadata driven these days; the concept is no longer new. This is especially true in the land of IoT, where you have wildly varying sources of data, of all shapes and sizes.

That leaves an age-old problem though: how do you populate your metadata? If your data consists of thousands of different values, this is a significant issue.

Well, in the olden RDBMS days this was easy enough: there’s a simple yet powerful plugin from Roland which exposes the JDBC metadata directly to PDI:

https://github.com/rpbouman/pentaho-pdi-plugin-jdbc-metadata

However, despite years of promises of self-describing systems (the horrendousness that was WSDL, and more), we’ve actually gone backwards: data is getting simpler and going back to basic formats.

That may be as basic as a text file – in which case there is also a plugin to help you – the file meta plugin – this scans your text file and makes an attempt to guess:

  • separator
  • column headers
  • data types

Obviously it can only ever be a guess, but it’s better than nothing. Maybe you’ll even have a spec, and maybe that spec will be up to date (lol, OK sorry, I know, that’s not going to happen).
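To make the idea of guessing concrete, here is a minimal Python sketch of that kind of scan. It is not the file meta plugin’s code and makes no claim about how the plugin works internally: the function names, the candidate separator list and the type names (`Integer`, `Number`, `String`) are all assumptions chosen for illustration.

```python
import csv
import io

# Assumption: these are the only separators worth trying.
CANDIDATE_SEPARATORS = [",", ";", "\t", "|"]


def guess_separator(lines):
    """Pick the candidate that appears the same number of times on every line."""
    best, best_fields = ",", 1
    for sep in CANDIDATE_SEPARATORS:
        counts = {line.count(sep) for line in lines}
        if len(counts) == 1:
            fields = counts.pop() + 1
            if fields > best_fields:
                best, best_fields = sep, fields
    return best


def guess_type(values):
    """Return the narrowest of Integer, Number, String that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "String"
    for name, cast in (("Integer", int), ("Number", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "String"


def guess_file_metadata(sample):
    """Guess separator, column headers and data types from a sample of text."""
    lines = [line for line in sample.splitlines() if line.strip()]
    sep = guess_separator(lines)
    rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=sep))
    first, data = rows[0], rows[1:]
    # Heuristic: the first row is a header if none of its cells look numeric.
    has_header = all(guess_type([cell]) == "String" for cell in first)
    if has_header:
        names = first
    else:
        names = [f"field_{i + 1}" for i in range(len(first))]
        data = rows
    columns = list(zip(*data)) if data else [() for _ in names]
    types = [guess_type(col) for col in columns]
    return {"separator": sep, "has_header": has_header,
            "columns": list(zip(names, types))}
```

Run it over the first few hundred lines of a file rather than the whole thing; as with any such scan, a column that looks like an integer in the sample can still turn out to contain text further down.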

It’s possible you have an XML data source, perhaps from a more elderly system, or more likely these days we’re looking at JSON.

In the PDI XML world the XML input step is pretty mature. It’s easy to use, with great get nodes and get paths buttons. Couldn’t be simpler. Kinda sad though that the step matures just as the format becomes obsolete.

In the JSON world, it’s not so good. The step is less mature and therefore doesn’t have these nice config features. So given a JSON file you know nothing about, what do you do? Well, the answer is to hit Stack Overflow, grab a snippet of code that gives you all the JSONPaths in a JSON file, and execute it in PDI! See the samples here. (On a positive note, the performance of the JSON input step in PDI 6+ is now up to an expected level.)
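For a feel of what such a snippet does, here is a small Python sketch that walks a parsed JSON document and lists every distinct path to a leaf value. It is not the snippet from the samples; the function names are made up, and collapsing array indexes to `[*]` is a design choice of this sketch so that a thousand array elements yield one path rather than a thousand.

```python
import json


def json_paths(node, path="$"):
    """Yield the path of every leaf value, collapsing array indexes to [*]."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from json_paths(value, f"{path}.{key}")
    elif isinstance(node, list):
        for item in node:
            yield from json_paths(item, f"{path}[*]")
    else:
        # Scalars (string, number, boolean, null) are the leaves.
        yield path


def distinct_paths(text):
    """Return the sorted, de-duplicated leaf paths found in a JSON string."""
    return sorted(set(json_paths(json.loads(text))))
```

Two caveats: empty objects and empty arrays produce no paths at all, and keys that themselves contain a dot will give ambiguous paths in this simple dotted notation.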

So great, we’ve understood what’s in our thousands of input files. Job done?

Not quite; the use case is more nuanced than that. Sure, you can use this scan to initialise your metadata repository, which in turn configures your transformation via metadata injection into the JSON input step (a PDI 7 feature, btw). But you can also use this scan to check whether the structure of your incoming files is as expected. You can diff the attributes and, if you have new keys, add them to your metadata library. If you have missing keys, you can raise an alert. It’s important to *look* for change and deal with it rather than waiting to be notified of it. The latter will never happen!
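The diff itself is just set arithmetic. A rough Python sketch, assuming the metadata library is held as a set of path strings and the scan of an incoming file gives you another collection of paths (both function names and the alert wording are hypothetical):

```python
def diff_structure(expected, observed):
    """Compare the known paths with the paths found in an incoming file."""
    expected, observed = set(expected), set(observed)
    return {
        "new": sorted(observed - expected),      # candidates for the library
        "missing": sorted(expected - observed),  # candidates for an alert
    }


def reconcile(library, observed):
    """Add new keys to the library (in place) and return alerts for missing keys."""
    diff = diff_structure(library, observed)
    library.update(diff["new"])
    return [f"expected key not found: {key}" for key in diff["missing"]]
```

In a real pipeline the library would live in a database table and the alerts would go to whatever monitoring you already have; whether a new key should be accepted automatically or held for review is a policy decision this sketch does not try to make.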

Posted in Pentaho

Tagged Metadata, Pentaho

Jul·27
