Staging
Staging is essentially a landing zone for the majority of the data that will enter the Data Vault.
It often doesn’t contain any historical data, and its tables mirror the schema of the source systems. We want to ingest data from the source systems as fast as possible, so only hard business rules are applied (i.e. rules that don’t change the content of the data, such as casting data types or adding load metadata).
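As a rough illustration of what that means, a staging table might simply mirror a hypothetical source table and add only technical metadata such as a load timestamp, the record source, and pre-computed hash keys (all table and column names below are made up for the example):

```sql
-- Hypothetical staging table: mirrors the source "customer" table and adds
-- only technical metadata (hard business rules); the content is never transformed.
CREATE TABLE stg_customer (
    customer_id     VARCHAR(50),     -- business key, exactly as received
    customer_name   VARCHAR(200),
    customer_email  VARCHAR(200),
    load_dts        TIMESTAMP,       -- when the row landed in Staging
    record_source   VARCHAR(100),    -- e.g. 'CRM.CUSTOMER'
    hk_customer     CHAR(32),        -- hash of the business key, computed on load
    hashdiff        CHAR(32)         -- hash of all descriptive columns, for change detection
);
```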
The Staging area can also be implemented as what is known as a Persistent Staging Area (PSA). Here, historical data can be kept for some time in case it is needed to resolve issues or for later reference. A PSA is also a great option to use as a foundation for a Data Lake! You won’t want all use-cases hitting your enterprise data warehouse (EDW), so having a PSA/Data Lake is a great capability to enable Data Science, Data Mining, and other Machine Learning use-cases.
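A PSA can then be as simple as an insert-only copy of every staging load, so history accumulates without any updates or deletes. A minimal sketch, reusing the hypothetical customer table from above:

```sql
-- Illustrative PSA load: append every staging batch as-is, never update or delete,
-- so the full history of what the source delivered is preserved.
-- Assumes psa_customer has the same columns as stg_customer.
INSERT INTO psa_customer (customer_id, customer_name, customer_email,
                          load_dts, record_source, hk_customer, hashdiff)
SELECT customer_id, customer_name, customer_email,
       load_dts, record_source, hk_customer, hashdiff
FROM   stg_customer;
```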
Ideally, the pipelines that ingest data into Staging should be generatable and as automated as possible. We shouldn’t be wasting a lot of time ingesting data into the Data Vault. Most of our time should be spent working with the business and implementing their requirements in Information Marts.
Enterprise Data Warehouse
Raw
Raw is where our main Data Vault model lives (Hubs, Links, Satellites).
Data is ingested into the Raw layer directly from the Staging layer, or, when handling real-time data sources, potentially straight into Raw. When ingesting into the Raw layer, no business rules should be applied to the data.
Ingesting data into Raw is a crucial step in the Data Vault architecture and must be done correctly to maintain consistency. As was mentioned earlier for Staging, these Raw ingestion pipelines should be generatable and as automated as possible. We shouldn’t be handwriting SQL statements to do the source-to-target diffs. One incorrect SQL statement and you will have unreliable and inconsistent tables. The genius of Data Vault is that it enables highly repeatable and consistent patterns that can be automated, making our lives a lot easier and more efficient.
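As an example of how repeatable these patterns are, the standard hub-load step is just an insert of business keys the hub hasn’t seen yet, which makes it a natural target for code generation. A minimal sketch, again using hypothetical table names:

```sql
-- Repeatable hub-load pattern: insert any business key from Staging
-- that the hub has not seen before. No business rules, no updates.
INSERT INTO hub_customer (hk_customer, customer_id, load_dts, record_source)
SELECT s.hk_customer,
       s.customer_id,
       MIN(s.load_dts),        -- earliest arrival of this key in the batch
       MIN(s.record_source)
FROM   stg_customer s
WHERE  NOT EXISTS (
         SELECT 1
         FROM   hub_customer h
         WHERE  h.hk_customer = s.hk_customer
       )
GROUP BY s.hk_customer, s.customer_id;
```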
Business Vault
The Business Vault is an optional tier in the Data Vault where the business can define common business entities, calculations, and logic. This could be things like Master Data or business logic that is shared across various Information Marts. Such logic shouldn’t be implemented differently in every Information Mart; it should be implemented once in the Business Vault and reused across the Information Marts.
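One common way to do this is a computed “business” view or satellite that applies a shared rule exactly once, so every Information Mart reads the same result. A rough sketch, with an invented segmentation rule and hypothetical object names:

```sql
-- Hypothetical business-vault view: one shared definition of a customer segment,
-- derived from raw satellite data and reused by every Information Mart downstream.
CREATE VIEW bv_sat_customer_segment AS
SELECT s.hk_customer,
       s.load_dts,
       CASE WHEN s.lifetime_spend >= 10000 THEN 'PREFERRED'
            ELSE 'STANDARD'
       END AS customer_segment
FROM   sat_customer_details s;
```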
Metrics Vault
The Metrics Vault is an optional tier used to hold operational metrics data for the Data Vault ingestion processes. This information can be invaluable when diagnosing potential problems with ingestion. It can also act as an audit trail for all the processes that are interacting with the Data Vault.
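In practice, this can be as simple as an audit table that every load process writes a row into. A sketch of what such a table might hold (columns are illustrative):

```sql
-- Illustrative metrics-vault table: one row per load execution, giving an
-- audit trail and a place to spot failing or slow ingestion processes.
CREATE TABLE mv_load_audit (
    load_id        BIGINT,
    process_name   VARCHAR(200),   -- e.g. 'LOAD_HUB_CUSTOMER'
    started_at     TIMESTAMP,
    finished_at    TIMESTAMP,
    rows_read      BIGINT,
    rows_inserted  BIGINT,
    status         VARCHAR(20)     -- e.g. 'SUCCESS' or 'FAILED'
);
```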
Information Delivery
Information Marts
The Information Marts are where the business users will finally have access to the data. All business rules and logic are applied in these Marts.
For implementing business rules and logic, the Data Vault methodology also leans heavily on SQL Views rather than pipelines. Views enable developers to rapidly implement requirements and iterate with the business when building Information Marts. Having too many pipelines also means more things to maintain and worry about rerunning. Business users can query Views knowing they are always accessing the latest data.
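A minimal sketch of such a mart view, joining a Hub to the latest rows of its Satellite and exposing business-friendly column names (all object names are hypothetical):

```sql
-- Hypothetical Information Mart view: a customer dimension built directly
-- on the Raw Vault, always showing the latest satellite row per customer.
CREATE VIEW im_dim_customer AS
SELECT h.customer_id,
       s.customer_name,
       s.customer_email
FROM   hub_customer h
JOIN   sat_customer_details s
  ON   s.hk_customer = h.hk_customer
WHERE  s.load_dts = (SELECT MAX(s2.load_dts)
                     FROM   sat_customer_details s2
                     WHERE  s2.hk_customer = s.hk_customer);
```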
So does this mean I have to fit all my business logic into Views now?
No. The Data Vault methodology prefers Views, but there are certain things Views aren’t the right fit for (i.e. extremely complex logic, machine learning, etc.). If it feels like a struggle to express your business logic in SQL, a View probably isn’t the right tool. For these cases, a traditional pipeline is going to be your best bet.
If all my business logic is in Views, isn’t that going to slow down my BI reports?
Like anything else, it depends. There are many considerations, from the size and volume of the data and the complexity of the business logic to the capabilities of the database technology.
Most of the time, Views will perform just fine and meet most business needs. If, however, Views aren’t performing for your use-case, the Data Vault methodology offers more advanced structures known as Point In Time (PIT) and Bridge tables, which can greatly improve join performance. As a last resort, the Data Vault methodology allows us to materialize our data (i.e. a materialized view or a new table).
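As a rough sketch of the idea, a PIT table pre-resolves, for each hub key and snapshot date, which satellite row was current at that moment, so mart queries can use simple equi-joins instead of searching for the latest row (names and dates below are illustrative):

```sql
-- Illustrative PIT table: for each customer and snapshot date, store the
-- load timestamp of the satellite row that was current at that point in time.
CREATE TABLE pit_customer (
    hk_customer       CHAR(32),
    snapshot_dts      TIMESTAMP,
    sat_details_ldts  TIMESTAMP   -- load_dts of the applicable sat_customer_details row
);

-- Mart queries can then join on equality instead of hunting for the latest row:
SELECT h.customer_id, s.customer_name
FROM   pit_customer p
JOIN   hub_customer h          ON h.hk_customer = p.hk_customer
JOIN   sat_customer_details s  ON s.hk_customer = p.hk_customer
                              AND s.load_dts    = p.sat_details_ldts
WHERE  p.snapshot_dts = TIMESTAMP '2024-01-01 00:00:00';
```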
The concept of an Information Mart is also a logical boundary. Your Data Vault can also be used to populate other platform capabilities such as NoSQL, graph, and search. These can still be considered a form of Information Mart. Such external tools would typically be populated using ETL pipelines.
Error Marts
Error Marts are an optional layer in the Data Vault that can be useful for surfacing data issues to the business users. Remember that all data, correct or not, should remain as historical data in the Data Vault for audit and traceability.
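An Error Mart can be as simple as a view that surfaces rows failing a data-quality rule, while the offending rows stay untouched in the Raw Vault. A sketch with an invented email-format rule and hypothetical object names:

```sql
-- Hypothetical error-mart view: expose customers with a missing or malformed
-- email address to business users, without deleting anything from the Raw Vault.
CREATE VIEW em_customer_bad_email AS
SELECT h.customer_id,
       s.customer_email,
       s.load_dts
FROM   hub_customer h
JOIN   sat_customer_details s
  ON   s.hk_customer = h.hk_customer
WHERE  s.customer_email IS NULL
   OR  s.customer_email NOT LIKE '%@%';
```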
Metrics Marts
The Metrics Mart is an optional tier used to surface operational metrics for analytical or reporting purposes.
Moving Forward With Your Data Platform
Choosing the right warehousing architecture for your enterprise isn’t only about ease of migration or implementation.
The foundation you build will either support or inhibit business users and drive or limit business value. Utilizing Data Vault may not be traditional, but it could be exactly what you need for your business.
Looking for More Information on Implementing a Data Vault?
We know it can be a challenge to build a data platform that maintains clean data, caters to business users, and drives efficiency.
If you have more questions or don’t quite know where to get started with building or managing a data platform using Data Vault, please reach out and talk to one of our experts!