Many organisations are using, building, or considering data lakes as a solution for their data needs. Nowadays they are usually set up in the cloud to meet the scalability needs of high-volume data processing.
How can we make data lakes (more) successful?
First, let’s start with what a data lake is. Much has already been written on the subject, so let’s look at some references, such as Forbes:
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. (Forbes)
And Transforming Data with Intelligence (TDWI):
In its extreme form, a data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. (TDWI)
Both distinguish raw, untouched data straight from the original source from prepared data that is ready for analysis. Both the raw data from the sources and the data from warehouses used in reporting are valuable in the data lake. Simplified, this gives the following data distribution pattern.
The value of data lakes is created through the analyses performed with the data for decision making. This requires data analysts and scientists to transform the data in such a way that they can make sense of it. Herein lies the challenge: data analysts and scientists spend most of their time finding, understanding and preparing data.
Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing huge amounts of data. (InfoWorld)
The success of a data lake can be measured through the use of data by its consumers
The strength of a data lake is its ability to serve large amounts of data. The success of a data lake grows when the amount of data ingested by producers increases together with the number of consumers that use the data. However, there is a risk that this increase in data ingestion and consumption becomes disorganised, which limits the usability and reduces the success of the data lake.
Data lakes and data swamps are both data repositories, but data swamps are highly disorganised. (Information Age)
The commonly proposed solution for organising data is to register all data in the lake in a catalog. I have found there are some serious limitations to this approach:
- It is usually just an automated indexing of data, which generates vast amounts of technical metadata. This is not easy to make sense of.
- Implementing a data catalog on the data lake assumes that all data is already in the lake.
With the increasing amount of data in a lake, it becomes more difficult to find data, which was essentially the issue for data analysts and scientists in the first place. Furthermore, we cannot assume that all data in the organisation has already been ingested into the data lake. If it is not yet in the lake, it cannot be found.
The key in a successful data lake is connecting supply and demand for data
That is why you should start by cataloguing data throughout the organisation. Only then do you start ingesting data into the data lake:
- The first step therefore is to register the datasets a producer has to offer
- These are then published in an Enterprise Data Catalog for data consumers to search
- When desired datasets are found, the consumer requests access to the data
- When the producer approves, the data is ingested into the data lake (if not already there) and access is granted
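The four steps above can be sketched in code. This is a minimal, hypothetical illustration, not a real product API: the class and field names (`Dataset`, `EnterpriseDataCatalog`, `in_lake`) are my own, chosen only to show how registration, search and approval connect supply and demand.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    producer: str
    description: str
    in_lake: bool = False  # has this dataset been ingested into the lake yet?


@dataclass
class EnterpriseDataCatalog:
    datasets: dict = field(default_factory=dict)

    def register(self, dataset: Dataset) -> None:
        # Steps 1 and 2: the producer registers a dataset,
        # which is published in the catalog for consumers to search.
        self.datasets[dataset.name] = dataset

    def search(self, term: str) -> list:
        # Step 3: a consumer searches for desired datasets by name or description.
        term = term.lower()
        return [d for d in self.datasets.values()
                if term in d.name.lower() or term in d.description.lower()]

    def approve_request(self, dataset_name: str) -> Dataset:
        # Step 4: on producer approval, the data is ingested into
        # the lake (if not already there) and access is granted.
        dataset = self.datasets[dataset_name]
        if not dataset.in_lake:
            dataset.in_lake = True  # in practice, this would trigger an ingestion job
        return dataset
```

Note that ingestion only happens at the end, driven by an approved request; the catalog itself is independent of what is physically in the lake.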
Distinguish data inventories from the enterprise data catalog
As data is not created in the data lake but needs to be ingested from source applications, a catalog covering just the lake will always be incomplete. "Data catalog" is therefore not an accurate name for the overview of data in the lake. The data market place fulfils the business demand for data, so it is the place that provides the offerings of the organisation's data; i.e. the data catalog. The data storage locations, like a data lake, should focus on what they have in stock, which I'd like to call data inventories. With this distinction we get a more loosely coupled design between the business processes on data and their fulfilment. The data market place can then support an essential part of the organisation: governance and compliance.
- The producer knows his data and adds the necessary metadata, like descriptions, contacts and which data is sensitive (such as privacy-related data)
- The consumer can now find the data, gain an overview of available data in the organisation and easily get in touch with the responsible persons to request access.
- In the access request the consumer shares the purpose for which the data will be used, so the producer can evaluate whether this is allowed.
- The producer is now in control of his data, able to self-service approve data requests and knows who receives his data.
- This enables the foundation of privacy by design.
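This purpose-based approval can also be sketched. Again a hypothetical illustration under my own naming (`DatasetOffering`, `AccessRequest`, `allowed_purposes`): the point is that the producer declares sensitivity and permitted purposes up front, and every access decision checks the consumer's stated purpose against them, which is what makes privacy by design possible.

```python
from dataclasses import dataclass


@dataclass
class DatasetOffering:
    name: str
    producer: str
    contact: str               # responsible person the consumer can reach
    sensitive: bool            # e.g. contains privacy-related data
    allowed_purposes: tuple    # purposes the producer permits, declared up front


@dataclass
class AccessRequest:
    dataset: str
    consumer: str
    purpose: str               # the consumer must state why the data is needed


def evaluate(offering: DatasetOffering, request: AccessRequest) -> bool:
    # The producer stays in control: access is granted only when the stated
    # purpose matches what the producer allows for this (possibly sensitive) data.
    return request.purpose in offering.allowed_purposes
```

In a real market place the producer would review each request rather than rely on a pure rule check, but the principle is the same: no purpose, no access.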
The enterprise data catalog serves the data market place and is key in the success of your data lake
With this design for organising data we create a parallel with, for example, Amazon. They provide a platform where producers display their offerings in the webshop for customers. When a customer orders a product, an agreement is made and a delivery order is sent to the warehouse. The fulfilment process then ships the product to the customer from the most efficient warehouse that has it in stock.
So, if you want to make your data lake (more) successful, start with an enterprise data catalog to organise data supply and demand.