
Update: metadata overview simplified

Here is an update to my earlier post, where I introduced subcategories for the different types of metadata. In the many discussions around becoming data driven, I have noticed that this overview helps focus the conversation and makes requirements identification easier.

All data starts with business metadata. This is the information we need to actually build a dataset. Someone in the business approved the collection and processing of the data in the first place, and that person also provides requirements and descriptions of what they need. The challenge is that this information is often not maintained over time, which causes the quality of business metadata to decrease.

Become aware of the necessity and value of business metadata: it enables support for data requests and makes data findable and understandable!

When we actually know what the business wants, we can design and implement it in physical form through technical metadata. We can now build the actual application, or buy it off the shelf, and map it to the business metadata.

Now that we know what data we need and what it means, and have a place to store and process data, we can start doing business. This will generate operational metadata. This type of metadata is very valuable in monitoring our data processes: we gain insights into what data is processed, how often, and at what speed. This is great input for analysing the performance of our IT landscape and seeing where improvements can be made. Furthermore, we can monitor access to systems and data. When we take it a step further, we can even start analysing patterns and possibly spot odd behaviour that signals threats to our data, as sketched below.
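
To make this concrete, here is a minimal sketch of spotting odd access behaviour from operational (controlling) metadata. The log entries, user names and the one-standard-deviation threshold are all illustrative assumptions, not a production detection rule:

```python
from statistics import mean, stdev

# Hypothetical access log captured as controlling metadata: (user, dataset).
access_log = [
    ("alice", "customers"), ("alice", "customers"), ("bob", "cashflow"),
    ("alice", "customers"), ("bob", "cashflow"), ("mallory", "customers"),
    ("mallory", "cashflow"), ("mallory", "hr_salaries"), ("mallory", "contracts"),
]

# Count how many distinct datasets each user touches.
datasets_per_user: dict[str, set[str]] = {}
for user, dataset in access_log:
    datasets_per_user.setdefault(user, set()).add(dataset)

counts = {user: len(ds) for user, ds in datasets_per_user.items()}
avg, sd = mean(counts.values()), stdev(counts.values())

# Flag users who touch far more datasets than their peers: a crude signal
# of odd behaviour worth a closer look, not proof of a threat.
for user, n in counts.items():
    if sd and (n - avg) / sd > 1.0:
        print(f"review access pattern of {user}: {n} datasets vs average {avg:.1f}")
```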

Step into the driver's seat by capturing and analysing your operational metadata, and become proactive in controlling your IT landscape!

Finally, we can also take the social metadata as inspiration. This is where the actual value of your data becomes tangible. If value is defined as the benefit the user perceives, then the way the data is used is an indicator of its value. Thus, if we measure which data is used often and by many users, that data must be important and valuable. So let's invest in improving the quality of this data to increase the value created. Behaviour is also a good indicator to measure: how much time is spent on content, and which content is skipped quickly? Apparently the skipped content doesn't match what the user is looking for.

Measure social metadata to analyse what data is used often by many. It is likely to be more valuable than other data.
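
As a minimal sketch of such a measurement, assuming hypothetical usage events with a dataset name, user and viewing time, one could score datasets as follows:

```python
from collections import Counter

# Hypothetical usage events harvested as social metadata:
# (dataset, user, seconds of viewing time).
usage = [
    ("customers", "alice", 420), ("customers", "bob", 380),
    ("customers", "carol", 510), ("cashflow", "alice", 45),
    ("cashflow", "dave", 12), ("suppliers", "erin", 600),
]

views = Counter(dataset for dataset, _, _ in usage)
users: dict[str, set[str]] = {}
seconds = Counter()
for dataset, user, secs in usage:
    users.setdefault(dataset, set()).add(user)
    seconds[dataset] += secs

# A simple value indicator: many distinct users times long average viewing
# time scores high; content that is opened often but skipped quickly scores low.
for dataset, n_views in views.most_common():
    avg_time = seconds[dataset] / n_views
    score = len(users[dataset]) * avg_time
    print(f"{dataset}: {n_views} views, {len(users[dataset])} users, "
          f"avg {avg_time:.0f}s, value score {score:.0f}")
```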

Business metadata

Governance metadata
All metadata required to correctly control the data, such as retention, purpose, classifications and responsibilities.
– Data ownership & responsibilities
– Data retention
– Data sensitivity classifications
– Purpose limitations

Descriptive metadata
All metadata that helps to find, understand and use the data.
– Business terms, data descriptions, definitions and business tags
– Data quality and descriptions of (incidental) events to the data
– Business data models & business lineage

Administrative metadata
All metadata that allows for tracking authorisations on data.
– Metadata versioning & creation
– Access requests, approval & permissions

Technical metadata

Structural metadata
All metadata that relates to the structure of the data itself required to properly process it.
– Data types
– Schemas
– Data Models
– Design lineage

Preservation metadata
All metadata that is required for assurance of the storage & integrity of the data.
– Data storage characteristics
– Technical environment

Connectivity metadata
All metadata that is necessary for exchanging data, like APIs and topics.
– Configurations & system names
– Data scheduling

Operational metadata

Execution metadata
All metadata generated and captured in execution of data processes.
– Data process statistics (record counts, start & end times, error logs, functions applied)
– Runtime lineage & ETL/ actions on data

Monitoring metadata
All metadata that keeps track of the data processing performance & reliability.
– Data processing runtime, performance & exceptions
– Storage usage

Controlling (logging) metadata
All metadata required for security monitoring & proof of operational compliance.
– Data access & frequency, audit logs
– Irregular access patterns

Social metadata

User metadata
All metadata generated by users of the data.
– User provided content
– User tags (groups)
– Ratings & reviews

Behavior metadata
All metadata that can be derived from observing user behaviour.
– Time content viewed
– Number of users/ views/ likes/ shares
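
To make this taxonomy tangible, below is a minimal sketch of how the categories could be modelled in code. All field names are illustrative assumptions; map them onto your own catalog model:

```python
from dataclasses import dataclass, field

@dataclass
class BusinessMetadata:
    owner: str                      # governance: ownership & responsibilities
    retention: str                  # governance: data retention
    sensitivity: str                # governance: sensitivity classification
    purpose: str                    # governance: purpose limitation
    description: str = ""           # descriptive: business description
    business_tags: list[str] = field(default_factory=list)

@dataclass
class TechnicalMetadata:
    schema: str                     # structural: schema / data model
    data_types: dict[str, str] = field(default_factory=dict)
    storage: str = ""               # preservation: storage characteristics
    endpoint: str = ""              # connectivity: API / topic / system name

@dataclass
class Dataset:
    name: str
    business: BusinessMetadata
    technical: TechnicalMetadata
    # Operational and social metadata are generated at runtime and would
    # typically be captured as event streams rather than static fields.
```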

3 points of attention that are underestimated for successful data catalogs

You will be disappointed in your data catalog when the search results do not come close to your expectations. You scroll quickly through the list of results and pause. Then you look for an option to apply filters you recognize. Then, probably within 5 minutes, you close the catalog screen, never to return.

Lost in search results

Data catalogs that I've come across return long lists of technical information. When you enter a search term like 'customer' or 'cashflow', you are overwhelmed by the tremendous number of results.

Imagine you’re in the library:
Are you looking for chapters, paragraphs, words, and font types & sizes, or would you like to see the title, summary, author and source of a book when searching?

Hugo de Gooijer

Make the user journey as easy as possible

Data catalog vendors present their strength as collecting all the technical metadata from a variety of sources, putting it together in one big pile and perhaps applying algorithms that suggest what sort of data it is. Though I agree that the foundation is complete and accurate technical metadata, this approach does not lead to a widely used data search function for the entire organization, because it passes by the needs of the user.

Here are 3 points of attention for the user experience:

  1. show relevant results – are the best hits shown first?
  2. show reliable results – are the results close to my search terms?
  3. organize the results – do I understand the result categories in business terms?

Below are some criteria to get you started:

Relevancy

– Presentation form: easy visualization
– Precision: result close to my search
– Timeliness: responsiveness and accessibility
– Reasonableness: result related to my search

Reliability

– Completeness: scope & source of datasets
– Accuracy: correctness of metadata
– Consistency: same results each search
– Currency: age, time of release, in sync with data
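
As a small illustration of the relevancy point, here is a sketch that ranks hypothetical catalog entries by how close their business names and tags are to the search term, using Python's standard difflib. The entries and the scoring choice are assumptions, not a vendor's algorithm:

```python
from difflib import SequenceMatcher

# Hypothetical catalog entries with business names and tags; in practice
# these would come from your enterprise data catalog.
catalog = [
    {"name": "Customer master data", "tags": ["customer", "crm"]},
    {"name": "CUST_TBL_V2 extract", "tags": []},
    {"name": "Cashflow forecast", "tags": ["finance", "cashflow"]},
]

def relevancy(entry: dict, term: str) -> float:
    """Score how close the entry's business name or tags are to the search term."""
    candidates = [entry["name"].lower()] + entry["tags"]
    return max(SequenceMatcher(None, term.lower(), c).ratio() for c in candidates)

term = "customer"
# Show the best hits first, so the user is not lost in a long technical list.
for entry in sorted(catalog, key=lambda e: relevancy(e, term), reverse=True):
    print(f"{relevancy(entry, term):.2f}  {entry['name']}")
```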

Result categories

The catalog must support the creation of custom categories. As each organization has its own 'language', datasets are quickly understood when you can put them into those categories. My advice is to use stable labels like business processes, and not department or business domain names, as these may change with a reorganization.

The data marketplace enables the success of your data lake(s)

Many organisations are using, building or considering data lakes as a solution for their data needs. Nowadays they are usually set up in the cloud to meet the scalability needs of high-volume data processing.

How can we make data lakes (more) successful?

First, let's establish what a data lake is. Much has already been written on the subject, so let's look at some references, starting with Forbes:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.

Forbes

And Transforming Data with Intelligence (TDWI):

In its extreme form, a data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. 

TDWI

Both distinguish raw, untouched data straight from the original source from prepared data ready for analysis. Both the raw data from the sources and the data from warehouses used in reporting are valuable in the data lake. In simplified form, that yields the following data distribution pattern.

A data distribution pattern

The value of data lakes is created through the analyses performed with the data for decision making. This requires data analysts and scientists to transform the data in such a way that they can make sense of it. And here lies the challenge: most of the time of data analysts and scientists is spent on finding, understanding and preparing data.

Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing huge amounts of data.

Infoworld

The success of a data lake can be measured through the use of data by its consumers

The strength of a data lake is its ability to serve large amounts of data. The success of a data lake grows when the amount of data ingested by producers increases together with the number of consumers that use the data. However, there is a risk that this increase in data ingestion and consumption becomes disorganised, which limits the usability and reduces the success of the data lake.

Data lakes and data swamps are both data repositories, but data swamps are highly disorganised.

Information Age

The commonly proposed solution for organizing data is to register all data in the lake in a catalog. I have found there are some serious limitations to this approach:

  • It is usually just an automated indexing of data, which generates vast amounts of technical metadata that is not easy to make sense of.
  • Implementing a data catalog on the data lake assumes that all of the organisation's data is already in there.

A data lake needs to maintain an overview of its data

With the increasing amount of data in a lake, it becomes more difficult to find data, which was essentially the issue for data analysts and scientists in the first place. Furthermore, we cannot assume that all data in the organisation is already ingested in the data lake. Thus, if data is not yet in the lake, it cannot be found.

The key to a successful data lake is connecting supply and demand for data

That is why you should start with cataloguing data throughout the organisation. Then you can start ingesting data into a data lake:

  1. The first step therefore is to register the datasets a producer has to offer
  2. These are then published in an Enterprise Data Catalog for data consumers to search
  3. When desired datasets are found, the consumer requests access to the data
  4. When the producer approves, the data is ingested into the data lake (if not already there) and access is granted, as sketched below

The data marketplace is key in enabling value creation
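
Below is a minimal sketch of this four-step flow. The class, the method names and the approval stub are illustrative assumptions, not a reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Marketplace:
    catalog: dict = field(default_factory=dict)   # step 2: the enterprise data catalog
    lake: set = field(default_factory=set)        # datasets already ingested
    grants: list = field(default_factory=list)    # approved (consumer, dataset, purpose)

    def register(self, producer: str, dataset: str, description: str) -> None:
        # Step 1: the producer registers what they have to offer.
        self.catalog[dataset] = {"producer": producer, "description": description}

    def search(self, term: str) -> list[str]:
        # Steps 2-3: consumers search the catalog, not the lake itself.
        return [d for d, meta in self.catalog.items()
                if term in d or term in meta["description"]]

    def request_access(self, consumer: str, dataset: str, purpose: str) -> bool:
        # Step 4: on approval, the data is ingested (if needed) and access granted.
        if self._producer_approves(self.catalog[dataset]["producer"], purpose):
            self.lake.add(dataset)                # ingest only on demand
            self.grants.append((consumer, dataset, purpose))
            return True
        return False

    def _producer_approves(self, producer: str, purpose: str) -> bool:
        return purpose != ""                      # placeholder for a real evaluation

mp = Marketplace()
mp.register("finance", "cashflow_forecast", "monthly cashflow projections")
print(mp.search("cashflow"))
print(mp.request_access("analytics", "cashflow_forecast", "liquidity reporting"))
```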

Distinguish data inventories from the enterprise data catalog

As data is not created in the data lake and needs to be ingested from source applications, a catalog of just the lake will always be incomplete. Therefore 'data catalog' is not an accurate name for the overview of data in the lake. As the data marketplace fulfils the business needs for data, it provides the offerings of the organisation's data; i.e. the data catalog. The data storage locations, like a data lake, should then focus on what they have in stock, which I'd like to call data inventories. With this distinction we have a more loosely coupled design between the business processes on data and the fulfilment needs. The data marketplace can support an essential part of the organisation: governance and compliance.

  • The producer knows their data and adds the necessary metadata, like descriptions, contacts and which data is sensitive (such as privacy data)
  • The consumer can now find the data, gain an overview of the available data in the organisation and easily get in touch with the responsible persons to request access.
  • In the access request, the consumer shares the purpose for which the data will be used, so the producer can evaluate whether this is allowed.
  • The producer is now in control of their data, able to self-service approve data requests, and knows who receives their data.
  • This enables the foundation of privacy by design.

The data marketplace is where value is created

The enterprise data catalog serves the data marketplace and is key to the success of your data lake

With this design for organising data we create a parallel with, for example, Amazon. Amazon provides a platform where producers display their offerings in the webshop for customers. When a customer requests a product, an agreement is made and a delivery order is sent to the warehouse. The fulfilment process starts from the most efficient warehouse that has the product in stock, which ships it to the customer.

So, if you want to make your data lake (more) successful, start with an enterprise data catalog to organise data supply and demand.

5 steps to begin collecting the value of your data

It’s not new that data is called ‘the new oil’. Organisations that are able to collect, organise and combine data effectively are in a good position to start creating new value. The question is when you can collect the value of data and how to get there.


The world's most valuable resource is no longer oil, but data

The Economist

If you are seeking to leverage the value of data for your business, you'll need to start managing your metadata. Metadata is the key enabler that helps you optimise your processes, gain and keep control of risks, and enable new value creation. This blog gives a quick introduction to why you need metadata and where to start.

The value of your data is unlocked through metadata

Enable discovery and sharing of data, shortening search times
Protect your investment in data due to staff turnover and enable reuse
Improve understanding and decision making with high quality data
Mitigate your risks and limit liability
Control what your data is used for and where it goes
Increase effectiveness & efficiency in collaboration
Reduce costs and create new value with faster development & innovation

These examples of value all sound nice, but aren’t easily achieved. Therefore we’ll go into more detail on what we need to organise to build the foundation of this value.


5 steps for managing your metadata

Working with large organizations with complex IT landscapes and data exchange, we've found there are a couple of generic steps to take. When you achieve these in a sustainable manner, you'll be able to collect value from your data more rapidly:

1. Find the data your organisation needs
2. Understand that data
3. Know who is responsible for that data
4. Be able to trace that data to its source and end users
5. Trust the data so it can be used without hesitation

Metadata provides insights
Metadata is a key enabler of data value

So here is a way to approach these steps in practice:

1. Make data findable
As a start, you'll need to set priorities. Determine which data sources are part of your core operational & information processes, and focus on indexing these first. This can be done at a high level and doesn't necessarily mean you need to index all attributes of all data objects. Just start by adding descriptions for data sets and business tags to make them easier to find. Furthermore, people looking for data need to know they have a complete view, or at least know which sources they are browsing.
2. Add descriptions & definitions
The next step is that people need to understand what they are looking at. For the most important data objects and key attributes, you'll need to add descriptions.
3. Make transparent who to contact
When people have found data that they think they can use, they’ll first need to get in touch with the owner to get access to the data. Governance metadata is therefore an essential part in creating value out of data.
4. Document data logistics to gain control
When people start using data, they become dependent on changes in the data production flows. The impact of changes needs to be managed, and thus it is necessary to trace where data comes from and who is consuming it (see the lineage sketch after this list).
5. Analyse data quality to make data usable
Finally, this all adds up to trusting the data. Data quality measurements allow people to assess whether they can use the data for their processes.
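
For step 4, here is a minimal lineage sketch: given documented 'flows into' relations (the datasets and edges are hypothetical), we can trace every downstream consumer impacted by a change at the source:

```python
# Hypothetical lineage metadata: each dataset maps to the datasets it flows into.
lineage = {
    "crm_source": ["customer_master"],
    "customer_master": ["sales_report", "churn_model"],
    "sales_report": [],
    "churn_model": ["retention_dashboard"],
    "retention_dashboard": [],
}

def downstream(dataset: str) -> set[str]:
    """All datasets that consume this one, directly or indirectly."""
    impacted, stack = set(), [dataset]
    while stack:
        for consumer in lineage.get(stack.pop(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                stack.append(consumer)
    return impacted

# Impact analysis for a change in the CRM source system:
print(downstream("crm_source"))
# -> {'customer_master', 'sales_report', 'churn_model', 'retention_dashboard'}
```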

What metadata to collect

Now, to make this work, we'll need to collect metadata that helps with each of these steps. It helps the discussion when you can categorize the different metadata elements into groups. Lean Data suggests the following 4 categories:

  • Technical Metadata
  • Business Metadata
  • Operational Metadata
  • Social Metadata

The technical metadata is the foundation. You'll need to gather and organise this before you can make sense of it (see below for examples). The next step is adding the descriptions and governance metadata as business metadata, for understanding and control. Then you can go to work and improve the data processing by capturing and monitoring the operational metadata. Now we have reached the stage where we can really start building value from data: up to this point we have only enabled our ability to monitor and control day-to-day business operations. The real value of data can be measured and improved when you start collecting the social metadata. Social metadata tells you about actual use, and taking control of it will help you develop your data as a business asset.


The real value of data can be measured through the use by its consumers

Lean Data

The actual use of data is a clear indication of whether the data is valuable. Measuring this social metadata helps you put focus on that data. Furthermore, other valuable data sets that are not effectively used can be given attention to improve their use.

Metadata types within the 4 metadata categories

To get started, below is an overview of metadata types with examples that you could start organising. Start with a (data) process analysis and use your business objectives to determine your metadata needs:

Technical Metadata types

Connectivity metadata
– Source application name, location
Technical metadata
– Technical table & field name
– Data format (e.g. text, SPSS, Stata, Excel, tiff, mpeg, 3D, Java, FITS, CIF)
– Compression or encoding algorithms
– Encryption and decryption keys
– Software (including release number) used to create or update the data
– Hardware on which the data were created
– Operating systems in which the data were created
– Application software in which the data were created
Structural metadata
– File relationships (e.g. child, parent & dataset grouping)
Preservation metadata
– File format (e.g. .txt, .pdf, .doc, .rtf, .xls, .xml, .spv, .jpg, .fits)
– Significant properties
– Technical environment
– Fixity information
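
As an illustration of harvesting such technical metadata, here is a minimal sketch that reads table and column types from a relational source, using Python's built-in sqlite3 module with an in-memory database as a stand-in; the table and fields are hypothetical:

```python
import sqlite3

# Stand-in source system: an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, created DATE)")

technical_metadata = {}
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
for (table,) in tables:
    # PRAGMA table_info returns (cid, name, type, notnull, default, pk) per column.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    technical_metadata[table] = {col[1]: col[2] for col in columns}

print(technical_metadata)
# -> {'customer': {'id': 'INTEGER', 'name': 'TEXT', 'created': 'DATE'}}
```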

Business Metadata types

Business initiative metadata
– Business case (reference, contacts)
– Request purpose
Governance metadata
– Owner of the data
– Data purpose limitations
– Business rules & data retention
– Data classification (AIC & Privacy)
Descriptive metadata
– Name of creator of data set
– Name of author of the data
– Title of document/ data
– Data (as)set name & description
– Object name, description & definition
– Attribute functional name, definition & description
– Location of data
– Size of data
Administrative metadata
– Information about data creation
– Information about subsequent updates, transformation, versioning, summarization
– Descriptions of migration and replication
– Information about other events that have affected the files
– Access rights metadata

Operational Metadata types

Execution metadata
– Whether the process run failed or had warnings
– Which database tables or files were read from, written to, or referenced
– How many rows were read, written to, or referenced
– When the process started and finished
– Which stages and links were used
– The application that executed the process
– Any runtime parameters that were used by the process
– The events that occurred during the run of the process, including the number of rows written and read on the links of the process.
– The invocation ID of the job
– Any notes about running the process
Monitoring metadata
– Actual status of a data processing job (in progress, error, paused)
– Current runtime & estimated end time
– Completeness flag & percentage
– Data Disposal verification
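
As a sketch of capturing execution metadata, the hypothetical wrapper below records record counts, start and end times, status and error logs for any data process; the job and field names are illustrative:

```python
import time
import traceback

def run_with_execution_metadata(job_name, job, records):
    """Run a data process and capture its execution metadata (a hypothetical wrapper)."""
    meta = {"job": job_name, "start": time.time(), "rows_read": len(records)}
    try:
        meta["rows_written"] = len(job(records))
        meta["status"] = "success"
    except Exception:
        meta["status"] = "failed"
        meta["error_log"] = traceback.format_exc()
    meta["end"] = time.time()
    meta["runtime_s"] = round(meta["end"] - meta["start"], 3)
    return meta

# A trivial job: keep only records that have a name filled in.
def clean(rows):
    return [row for row in rows if row.get("name")]

print(run_with_execution_metadata(
    "clean_customers", clean,
    [{"name": "a"}, {"name": ""}, {"name": "b"}]))
```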

Social Metadata types

Use metadata
– Circulation records
– Physical and digital exhibition records
– Content reuse and multiversioning information
– Search logs & parameters
– Data search results, filters and clicks
– Use and user tracking
– Data tags
– Excerpt / summary
– URL
– Number of users & viewing time
– User review & ranking of data
Controlling metadata
– Data access users & time
– Frequency of data access
– Time between data access attempts

Thank you for reading! If you would like support for your organisation, feel free to get in touch!

In another blog I’ll address what solutions and capabilities you’ll need in your architecture.
