To grasp the concept of Data Vault, and specifically DV 2.0, let’s start by setting some context. Organizations – whether they be companies, universities, government institutions, or non-profits – all share the need to better understand their operations and find ways to improve. Companies want to increase profits and market share. Universities want to ensure they are admitting the right students and optimizing their resources to deliver the best education possible. The context, then, is to think about our organizations as businesses running a set of operating processes vis-à-vis customers (whether we are an actual business, a non-profit, or something else, and whether we call our customers “customers,” “members,” “patients,” or some other name). For the sake of discussion, let’s refer to all of our organizations as businesses.
Most, if not all, of the information we need to fully comprehend what is going on in our business is buried in the computer systems we’ve installed and use to run the company. These include CRM, ERP, HR, POS, and a host of other transactional systems. Pulling the relevant information out of all these systems and acting on it is a grand effort fueling a multi-billion-dollar global market that can be summed up in one word: analytics.
Building and deploying a sustainable, productive analytical environment for a dynamic organization with potentially millions of daily transactions, in one form or another, becomes job one for those of us responsible for maintaining the “data platform.” If we do it right, we will maintain a robust, malleable data platform that satisfies the complete stack of reporting, analysis, predictive analytics, and data science needs for those on the business-facing side of the org chart who carry these duties. The platform must be robust and malleable because the needs we must satisfy are ever changing.
Updated Vault Methodology: Robust and Malleable
I am not throwing the words robust and malleable around casually. By robust, I mean reliable, active, prospering. By malleable, I mean extensible, having a capacity for adaptive change. Reliable and extensible. Active and adaptable. That’s it. That’s the data platform we want.
In real terms, we’re up against two challenges.
One: as the business grows and changes, so do the needs of our business analysts and data consumers. One day they need this kind of data, presented that way; the next day they seemingly need the same data presented a different way, or different data altogether. Keeping up with them requires a can-do spirit and a flexible approach to our development and delivery process. Sometimes the analysts don’t even know what they want until they see it. All the greater, then, is the burden to deliver quickly, fail fast when necessary, and get on to the next deliverable. It’s a never-ending quest to keep up with the latest business requirements.
Two: as the business grows and changes, so do the sources of data available for us to manage in our platform. Quarter by quarter, year by year, the business gets more complex. Systems change or are added in Sales, Marketing, Finance, etc. And every time a new system comes along, so does the thirst to better understand the intrinsic data, and that data’s relationship to other data. At each step, we need to consider audit trails and access to trustworthy historical data, not just the transactions of today.
The “Best Practices Greatest Hits” Album
Imagine having the chance to sit down with thousands of data architects who have faced these challenges over the last twenty years. You could ask each one of them, “How did you do it?” Those who succeeded would tell their tale of how they loaded data from transactional systems; how they organized their data into a model of some sort in the analytics data platform; how they kept track of historical data; how they were able to add new sources of data quickly and reliably; and how they kept up with the ever-changing needs of the business-facing data consumers, report writers, and analysts.
Data Vault 2.0 – in its purest form – is a greatest hits album of all the tales of success rendered from these thousands of architects who have come before us.
Data Vault 2.0 answers the call, once and for all, by aggregating the best practices for methodology, architecture, and modeling. It is a tried and tested approach to building and deploying a data platform that is robust and malleable. And like making a fine loaf of sourdough bread, all you have to do is follow the recipe, and you will get a successful outcome.
Invented by Dan Linstedt, Data Vault 2.0 is more an evolution than a single invention. The Data Vault 2.0 Standard (i.e., the recipe) borrows from thought leaders such as Bill Inmon and Ralph Kimball; from mathematical contributions such as set logic and hashing algorithms; and from rational approaches to development such as Agile and the automation of recurring patterns.
Perhaps most compelling, Data Vault 2.0 is complete and comprehensive. It’s not a recipe that leaves out steps (what temperature should I set the oven?) or presents ambiguity (what do you mean, add “a little” salt?). Follow the Data Vault 2.0 standard verbatim and you are destined for success.
The Original Fundamental Invention: The Data Vault Model
Data modeling, for those who practice the art form, is a complex endeavor: capture as much understanding of the business practices and rules of engagement as possible, and reflect those practices in the way data is organized for analytics. There are usually three steps: the conceptual, the logical, and the physical. Entity relationship diagrams are drafted for the first two, leaving the architect to “map” the logical relationships into physical, populated tables in a data platform of their choosing.
One school of thought, historically, is to organize the data platform as a “data warehouse” – following normalization principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. Sounds great, except that the business people who want to use the data usually don’t think this way.
Thus, the Data Mart star schema and its denormalized tables join the party, organizing physical data into facts about the business, analyzed by attributes called dimensions. Think sales data (the fact) analyzed by region or city (the dimensions), and you’ve got the idea. Business people, generally, love it. And this too sounds great, until we realize that all these tables have to be re-engineered every time a change comes along. Robust and malleable? No.
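To make the fact-and-dimension idea concrete before moving on, here is a minimal sketch in Python. The table names, keys, and figures are hypothetical, invented purely for illustration: a small fact table of sales is joined to a region dimension and summed the way an analyst might slice it.

```python
# A minimal, hypothetical sketch of the star-schema idea: a central fact table of
# sales measures surrounded by descriptive dimensions. Names and figures are
# illustrative only.
dim_region = [
    {"region_key": 1, "region": "Northeast", "city": "Boston"},
    {"region_key": 2, "region": "West", "city": "Seattle"},
]

fact_sales = [
    {"region_key": 1, "sale_date": "2024-05-01", "amount": 120.00},
    {"region_key": 2, "sale_date": "2024-05-01", "amount": 75.50},
]

# Analysts "slice" the facts by joining them to a dimension, e.g. sales by region.
sales_by_region: dict[str, float] = {}
for fact in fact_sales:
    region = next(d["region"] for d in dim_region if d["region_key"] == fact["region_key"])
    sales_by_region[region] = sales_by_region.get(region, 0.0) + fact["amount"]

print(sales_by_region)  # {'Northeast': 120.0, 'West': 75.5}
```

The convenience for the analyst is obvious; the cost, as noted above, is that every structural change ripples through these denormalized tables.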
So, enter the Data Vault. The core idea, still practiced today, is to deploy a tidy data model that tempers the chaos. One: let’s make the front-end, business-facing side of the platform look like a star schema, the way the analysts want it. We’ll call that the Information Mart Layer. Two: let’s make the back-end consistent with business concepts, while avoiding the entity relationship diagrams and the mappings from logical to physical. Three: let’s organize the data much like shelves in a supermarket. Going to the store to buy bread? Look for the sign that says “Bread” at the end cap of the aisle and you have a pretty good chance of finding it.
This “key” to finding the bread is the same construct in a Data Vault. We call it a Business Key and organize the keys in a Hub table, rather than on a sign at the end cap of the aisle. The things we want to know about the data, called Attributes, are maintained in a Satellite table. And once a transaction occurs – meaning somebody actually buys a loaf of bread, or a delivery truck drops off a pallet on the back dock – we capture that transactional relationship in a Link table.
That’s it. Three kinds of tables define and encompass the entire business, overlaid with a dimensional layer for easy access by data consumers and analysts. Simple. That’s a Data Vault model at its core.
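To make the three table types concrete, here is a minimal Python sketch. The feed, the column names, and the choice of MD5 hashing are assumptions for illustration only, not a verbatim piece of the Data Vault 2.0 standard: a Hub carries the Business Key, a Satellite carries its Attributes, and a Link records the relationship between keys.

```python
# A minimal, hypothetical sketch of splitting one source record across the three
# Data Vault table types. Names and values are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic surrogate key from one or more business keys."""
    return hashlib.md5("|".join(business_keys).upper().encode("utf-8")).hexdigest()

# A raw record from a hypothetical point-of-sale feed.
source_row = {
    "product_sku": "BREAD-001", "store_id": "S-42",
    "product_name": "Sourdough Loaf", "price": 4.99, "qty_sold": 3,
}
load_ts = datetime.now(timezone.utc)
record_source = "POS"

# Hub: one row per unique Business Key (the "sign at the end of the aisle").
hub_product = {
    "product_hk": hash_key(source_row["product_sku"]),
    "product_sku": source_row["product_sku"],
    "load_ts": load_ts, "record_source": record_source,
}

# Satellite: the descriptive Attributes of that key, historized over time.
sat_product = {
    "product_hk": hub_product["product_hk"],
    "product_name": source_row["product_name"],
    "price": source_row["price"],
    "load_ts": load_ts, "record_source": record_source,
}

# Link: the transactional relationship between Business Keys (a product sold at a
# store). Measures about the transaction itself, such as quantity, would typically
# live in a satellite hanging off the link.
link_sale = {
    "sale_hk": hash_key(source_row["product_sku"], source_row["store_id"]),
    "product_hk": hub_product["product_hk"],
    "store_hk": hash_key(source_row["store_id"]),
    "load_ts": load_ts, "record_source": record_source,
}
```

The point is the separation of concerns: keys, context, and relationships each land in their own kind of table, so a new source can add rows, or a new Satellite, without reshaping what is already there.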
Data Vault 2.0 Takes It To A Whole Other Level
Data Vault 2.0, as the numerical designation implies, is an advanced version of the Data Vault standard. Now not just a data model, Data Vault 2.0 encompasses the architecture and methodology described above as part of the recipe for success. We load data in parallel. We keep track of business rules as they change over time, so we can retroactively run reports on yesterday’s data with today’s rules. Or we can bring forward yesterday’s rules to today’s data. It’s version control combined with keeping true, historical records of everything.
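A small, hypothetical Python sketch of that idea: the satellite keeps the raw history untouched, the business rules live as versioned logic applied downstream, and either version can be replayed over any slice of history. The rule functions and figures below are invented for illustration.

```python
# A minimal, hypothetical sketch of "yesterday's data with today's rules": raw
# history stays untouched in a satellite, and rules are applied only when building
# the information mart, so any rule version can be re-run over any period.
from datetime import date

# Historized satellite rows: raw prices exactly as they were loaded each day.
sat_product_history = [
    {"product_hk": "a1b2c3", "price": 4.99, "load_date": date(2024, 5, 1)},
    {"product_hk": "a1b2c3", "price": 5.49, "load_date": date(2024, 5, 2)},
]

def discount_rule_v1(price: float) -> float:
    """Business rule in effect yesterday: 5% discount."""
    return round(price * 0.95, 2)

def discount_rule_v2(price: float) -> float:
    """Revised business rule in effect today: 10% discount."""
    return round(price * 0.90, 2)

# Because the raw records are preserved, yesterday's data can be reported under
# today's rule (or today's data under yesterday's rule) without re-engineering.
for row in sat_product_history:
    print(row["load_date"], discount_rule_v1(row["price"]), discount_rule_v2(row["price"]))
```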
When new data sources show up to the dance, we quickly discern whether the data and its form fit well into existing Hubs and Satellites, and if not, we create new ones. There is no concept of rework or re-engineering. Best practices, discovered and tested over the last two decades, are now held in one place called Data Vault 2.0.
Enterprise Data, Data Warehousing, and The Punchline
One might argue, “Isn’t it possible to have a successful outcome by building a data platform without Data Vault 2.0?” The answer is yes…well, perhaps. You might employ an architect who is wholly capable of designing and deploying a robust and malleable data platform based on their own experience, biases, and knowledge. But that is an awful lot of pressure to put on your architect. Is the risk warranted, given the greater agenda to deliver business value from the data platform at the least cost and in the least time? Rather than the ideas of one architect, wouldn’t we prefer an aggregation of best practices from thousands? It is on the path to delusion to suggest that one is cleverer than many.
If you are an executive, business manager, analyst, or data consumer, Data Vault 2.0 is important to you for the sheer reason that the proven methodology is the least risky way to deliver a robust and malleable data platform. And if you are a practitioner or manager looking after your company’s data platform, Data Vault 2.0 offers you a prescriptive, complete, comprehensive methodology to build and deploy a functional model, scalable architecture, and physical implementation resilient to, well, just about anything the business can throw at you.