Data management software vendors are notorious for excessively complicated, jargon-filled or confused explanations of their technology and its benefits. This blog is a succinct, simple summary of how data management has evolved over the years to where we are now.
Data management has been through generations of improvements over decades. At the start of this story, the first iterations of what we may recognise as data management systems came about because of reporting requirements. I know this to be true because, in 1996, as a fresh-faced database developer, I built one of these myself. Many systems had (and still have) rather poor built-in reporting capabilities, so better reports were needed without impacting the production systems.

These early solutions tended to make direct copies of all the data held in different production systems onto a separate database server, usually creating identical copies of the structures. This way, fixed snapshots (say, monthly) could be captured for reporting purposes, so that the month's reports could run on data that was not being changed. Because a second copy was made, there was no risk of production systems being slowed down by crunching reports directly on them. The component that fetched the data from the source systems was hardcoded and called a "Loader"; if a source system's store changed, the loader also had to be altered by a developer.

The big problem with this design surfaces when the business asks for a report that needs data from two or more systems to be blended together. Since the reporting server contains separate copies from each system, the data is not integrated.
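To make the "Loader" idea concrete, here is a minimal sketch of that generation of solution, using SQLite in-memory databases as stand-ins for a production system and a reporting server (the table and column names are invented for illustration):

```python
import sqlite3

# Stand-in for a production system: one table of trades.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE trades (id INTEGER, amount REAL)")
source.executemany("INSERT INTO trades VALUES (?, ?)", [(1, 100.0), (2, 250.5)])

# Stand-in for the separate reporting server.
reporting = sqlite3.connect(":memory:")

def monthly_loader(src, dst, snapshot_label):
    """Hardcoded loader: copies the trades table structure and data verbatim.
    If the source schema changes, a developer must change this code too."""
    dst.execute(f"CREATE TABLE trades_{snapshot_label} (id INTEGER, amount REAL)")
    rows = src.execute("SELECT id, amount FROM trades").fetchall()
    dst.executemany(f"INSERT INTO trades_{snapshot_label} VALUES (?, ?)", rows)

# Capture a fixed monthly snapshot; reports run on the frozen copy,
# never on the live production system.
monthly_loader(source, reporting, "1996_06")
print(reporting.execute("SELECT COUNT(*) FROM trades_1996_06").fetchone()[0])  # 2
```

Note that each source system gets its own identical copy this way, which is exactly why blending data across systems is so hard in this generation.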
The next generation in this line is the Fixed Schema Data Warehouse. These generally huge and expensive solutions are provided with a database structure that is independent of all the providing systems. They usually come supplied with an industry-specific structure intended to support all firms in that industry, and the task becomes how to "map" all the data into the warehouse structure. The loaders are replaced by ETL (Extract-Transform-Load) components, often provided by the big RDBMS vendors (Oracle, Microsoft).

This generation of solution is generally terrible in practice. The schema can be rigid and expensive to change, and getting data in and out of it is extremely slow (sometimes taking hours and hours). These warehouse systems also tend to struggle with data atomicity: for the reports to make sense, the data must always be in a consistent state. For example, all of yesterday's trades, when added up, must equal exactly today's opening positions. Since in the real world data quality is patchy and data generally arrives in batches in different sequences during each day, this requirement creates operational headaches and can often lead to unacceptable costs. These systems made an attempt to let report writers integrate data from different sources, but usually this was achieved by requiring a database programmer to write a "view" on top of the tables. These views, with their dreadful RDBMS-specific stored procedures, are extremely expensive to maintain and agonisingly slow in practice.
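The atomicity requirement described above can be sketched as a simple consistency gate (the account names and figures here are invented for illustration):

```python
def batch_is_consistent(yesterdays_trades, todays_opening_positions):
    """The kind of consistency gate a fixed-schema warehouse demands:
    yesterday's trades, summed per account, must exactly equal
    today's opening positions before any report may run."""
    totals = {}
    for account, amount in yesterdays_trades:
        totals[account] = totals.get(account, 0.0) + amount
    return totals == todays_opening_positions

trades = [("ACC1", 100.0), ("ACC1", -25.0), ("ACC2", 50.0)]
opening = {"ACC1": 75.0, "ACC2": 50.0}
print(batch_is_consistent(trades, opening))  # True

# A late or missing batch breaks the invariant and blocks the whole load:
print(batch_is_consistent(trades[:-1], opening))  # False
```

The operational headache follows directly: one late batch from one source, and the entire day's reporting is held hostage until the invariant holds again.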
The next in line includes Misato Data Hub. Such systems do not have a fixed schema: the data model provided with the system is logically abstracted from the underlying storage and very easily changed. The mapping requirements are simplified by allowing schema-less and unstructured data to be processed alongside traditional rectangular tables. Getting data into the system uses modern Big Data principles, rather than old-fashioned ETL components or feed loaders. These systems can also allow a partial, rather than complete, mapping to be configured, so 100% of the data does not need to be copied into the data model, drastically reducing implementation time. They are event-driven rather than relying on traditional clunky batch processing, and they allow flexible blending of different data sources into composites, driven by user-configured business rules rather than developer-built database views. This solution is also differentiated from previous-generation solutions by being far more user-defined: the demands on IT specialists to set it up are far lower. In a nutshell, these systems empower the people who know the data best.
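The idea of rule-driven blending can be illustrated in a few lines. To be clear, this is a hypothetical sketch, not the actual Misato Data Hub API; the point is that the blending rule is plain configuration a business user could own, rather than a developer-written database view, and that schema-less records allow a partial mapping (unmapped fields simply pass through):

```python
# Hypothetical illustration only -- not a real product API.
def blend(records_a, records_b, rule):
    """Blend two sources into a composite using a user-configured rule.
    Records are schema-less dicts, so fields not named in any mapping
    pass through untouched."""
    key = rule["join_on"]
    by_key = {r[key]: r for r in records_b}
    return [{**r, **by_key.get(r[key], {})} for r in records_a]

# Two sources that previously would have needed a developer-built view:
crm = [{"customer_id": 7, "name": "Acme Ltd"}]
billing = [{"customer_id": 7, "balance": 1200.0}]

# The "rule" is data, configured by a user, not code written by a developer.
rule = {"join_on": "customer_id"}
print(blend(crm, billing, rule))
```

Changing the blend here means editing the rule, not rewriting a view or a stored procedure, which is the essence of the user-defined approach described above.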