“Our data lake turned into a data swamp!” We are hearing this complaint more and more often. Why is this happening, and what can be done to mitigate it?
Schema-less databases are embarking on the next phase in their classic technology-hype cycle.
Many customers tell us that they are currently on the second downward slope towards disillusionment with their data lakes. Data lakes can grow fast, and this is a sign of their success: without the constraints of a rigid schema, new data can be added very easily. Unfortunately, this growth is often accompanied by entropy. As overlapping data sources accumulate in the lake, the messy reality is that the same data carries multiple, nuanced meanings, and nothing in the lake records which meaning applies.
In the early years of the NoSQL hype cycle (Hadoop, MongoDB, Cassandra and many others) it was common to hear predictions that these exciting new database systems would displace vast numbers of legacy relational databases. Although data lakes are simplifying some warehousing problems, this has generally not happened in practice. Oracle and SQL Server still dominate, and it is dawning on data management circles that data lakes and relational databases need to co-exist rather than one ripping out and replacing the other.
So how can we prevent this entropy caused by rapid addition of data sources in our lakes?
1. Ownership and stewardship
The data management system should assign an owner to every data source in the lake. Just because a data lake is schema-less does not mean it should be owner-less. A person who understands the source of the data needs to steward it in the lake.
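
As a minimal sketch of what such an ownership rule could look like, assume a hypothetical in-memory catalog. The names `DataSource` and `LakeCatalog`, the example path, and the owner name are illustrative and not taken from any specific product. The catalog simply refuses to register a source that has no named steward:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """A data source landed in the lake (hypothetical model)."""
    name: str
    location: str
    owner: str  # the person who understands and stewards the source


class LakeCatalog:
    """Illustrative in-memory catalog that enforces ownership on registration."""

    def __init__(self):
        self._sources = {}

    def register(self, source: DataSource) -> None:
        # Schema-less is fine; owner-less is not.
        if not source.owner.strip():
            raise ValueError(f"source {source.name!r} has no owner")
        self._sources[source.name] = source

    def owner_of(self, name: str) -> str:
        # Look up the steward responsible for a registered source.
        return self._sources[name].owner


catalog = LakeCatalog()
catalog.register(DataSource("crm_contacts", "/lake/raw/crm/contacts", "alice"))
print(catalog.owner_of("crm_contacts"))
```

In a real lake the same check would more likely sit in the ingestion pipeline or the metadata catalog, so that a source cannot land without an owner in the first place.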
2. Apply semantics to the lake
A common complaint from business users is that they do not know what the data in the lake means. Attributes can have multiple meanings from the perspective of different consumers. The data management system needs to manage a semantic layer on top of the physical store.
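
One way to picture a semantic layer is as a mapping from each physical attribute to its business meaning, with optional overrides per consumer. The sketch below is only an illustration under that assumption: `AttributeSemantics`, `semantic_layer`, `describe`, the attribute `cust_dt`, and the consumer names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeSemantics:
    """Business meanings attached to one physical attribute (hypothetical model)."""
    physical_name: str
    default_meaning: str
    # Meaning as seen by a specific consumer group, e.g. "finance".
    meanings_by_consumer: dict = field(default_factory=dict)

    def meaning_for(self, consumer: str) -> str:
        # Fall back to the default meaning when the consumer has no override.
        return self.meanings_by_consumer.get(consumer, self.default_meaning)


# Illustrative semantic layer keyed by physical attribute name.
semantic_layer = {
    "cust_dt": AttributeSemantics(
        physical_name="cust_dt",
        default_meaning="Date the customer record was created",
        meanings_by_consumer={
            "finance": "Date of the customer's first invoice",
        },
    )
}


def describe(attribute: str, consumer: str) -> str:
    """Return what an attribute means from one consumer's perspective."""
    return semantic_layer[attribute].meaning_for(consumer)


print(describe("cust_dt", "finance"))
print(describe("cust_dt", "marketing"))
```

Keeping the definitions separate from the physical store means the lake itself stays schema-less, while business users still have one place to look up what an attribute means to them.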