Wednesday, September 19, 2012

Styles of Data Vault modeling PART I: Overview

Introduction

Sometimes we see that given the same input and process, the end result still differs between situations. This also happens when creating (data) models from requirements, and hence also with Data Vault data models. As long as this is intentional and can be explained this does not have to pose an issue. But often it is an indication of lack of knowledge or common understanding. It is usually quite hard to track down where the assumptions differ based on just the end result, and this is no different with the Data Vault approach. One of the main reasons for this difference lies in the somewhat loose definition of the Data Vault, it allows for some variation. This has led to several (closely related) styles of modeling a Data Vault.

In this post I'll introduce different modeling styles of Data Vault I've found to date. We will focus on the modeling part, and not on the architecture part. So we are not discussing differences between source, business or raw data vaults, but mainly different ways to transform the same starting (modeling) scenario into a Data Vault.

The Data Vault approach

Data Vault is a modeling approach/method that assists with creating the central data repository in an Enterprise Data Warehouse. It works on the basis that there is some hub/spoke EDW architecture defined (e.g. an Inmon DW 2.0 architecture). In this architecture the central repository is there for data management while the outlying spokes (Data marts) are used for data delivery to users. Its basic assumption is that you begin by focusing on getting source data in your EDW (not cleansed and restructured data). 
Data Vault however forces you to adapt all your data to fit into the three basic building blocks that are allowed when creating a Data Vault. This means source data should then be reshaped so that it fits those basic data entity building blocks. These blocks are called Hubs, Links and Satellites. 
These building blocks have a very basic makeup: 

     Hubs are just lists of elementary/independent key values. They implement independent keys. Since Data Vault is interested in the big picture, these are called 'business keys' and they represent common identifiers around the business (invoice number, customer number, employee number).
     Links are associations between hubs
     Satellites track attributes associated with hubs and links through time.

Besides the building blocks there are also loading patterns for your central EDW repository. These follow from the definition of data entities in combination with the requirement of loading the source data unaltered into the Data Vault.

Data Vault modeling Styles

However, this recipe is just a global pattern. For example, it is does not work as a complete and detailed (semantic-equivalent) structural data model transformation recipe if we compare it to the transformation algorithms we use within FCO-IM and other Fact oriented Modeling techniques (FOM’s).
Besides the flexibility in definition and design of the Data Vault standard other factors contribute to variations in the Data Vault approach.
  1. There is a large history on using similar temporal data modeling approaches with OLTP systems like Veldwijk’s HTC.
  2. Additional requirements on flexibility and reuse of data and metadata
  3. The advent of new hardware and software like database appliances and in-memory technology
  4. The rise of similar EDW oriented modeling approaches like Anchor Modeling
This has led to the rise of what can be loosely called different flavors or styles of Data Vault modeling (especially with regards to end dating links). These can be described as varieties in the underlying structural model transformation strategies. What I present here is by no means a complete list, but it attempts to identify the most important structurally different varieties we currently see to date. We can explore these different varieties of the Data Vault by looking at the definition and transformation process.

Classical (aka Dimensional) Data Vault

In the Classical (I describe this as the ‘Dimensional’) Data Vault all links represent n-ary associations between hubs. These links combine all hub references in a source tabel, both hub references that are part of the source table key, as references that are not part of any key. Because of the auditability requirment we need a “driving” key in the link (which is in fact a representation of the source table key) for proper temporal data management. Such a Link can be considered a “proto” fact table, containing all dimensional references (although time independent) but without the measurements. It will usually be pretty wide. It tries to minimize joins at the expense of duplicating rows when non-driving key elements are updated in the link. This style of Data Vault modeling lends itself for easy staging to a dimensional layer, but link re-usability for e.g. a Business Rule Vault is tricky.

Anchorized Data Vault a.k.a. Anchor Vault

The Anchor Vault focuses on modeling Anchors (note, these are related, but not identical to the Anchors in Anchor Modeling). These can be independent (Hubs) and dependent (links). They are purely driven by source key columns. Here we basically put each source key in its own hub or link and its dependent attributes in a satellite. The main reason for doing this is easier transformation, Anchor/key re-usability and easy loading and managing the temporal aspects of your data (e.g. timeline completion and standardization). Those considering complex loading (e.g. Change Data Capture), or abundant re-use of entities for a Business Rule Vault, should consider an Anchor Vault.

Strong Anchor Vault

With the Strong Anchor Vault independent foreign keys (foreign key elements not part of a candidate key) get their own many-2-many link. This means we transform one-to-many relationships (foreign key references) in the source to many to many link in the Data Vault. Naturally many-to-many relationships also get their own link. Changes to cardinalities, higher order relationships and relationship inheritance can be elegantly modeled this way. Because each independent foreign key must have its own link (together with its own transaction satellite) we create considerably more join paths. You also need to change your end dating process. Note that I do not consider horizontally grouping these many-2-many links together is a good idea (but elaborating on that point merits another post). Note that “merging relationships” (actually subtyping relationships)  with different cardinality becomes possible (but not required). Either over time or over similar relationships (through abstraction).
A tool that uses strong anchorization is Quipu (www.datawarehousemanagement.org)

Weak Anchor Vault

The Weak Anchor Vault differs from the Strong Anchor Vault by creating foreign key links (end dated links) for independent foreign keys relationships (they actually resemble satellites holding references to other Hubs). These can either be singleton or grouped. Here we track changes to Anchor references over time. We can group/ungroup independent foreign keys across foreign key Links. This decreases the number join paths and load complexity in comparison with the Strong Anchor Vault, but flexibility is slightly decreased. This and similar styles have been in use for OLTP systems for many decades to implement bi-temporal operational information systems. It has been independently advocated and implemented by several people like RenĂ© Veldwijk (Veldwijk's HTC: http://www.faapartners.com/downloads ), Harm van der Lek (Generic Data Vault) and many others under a variety of different names. BIready (www.BIready.com) is a tool that internally uses a weak Anchor Vault.

Elementary Anchor Vault

The elementary Anchor Vault re-factors each element into a separate entity. This means that each attribute has its own satellite (singleton satellites and singleton foreign key links). This can be done on a Weak or Strong Anchor Vault. The effect is more or less an Anchorized and temporalized version of 6NF. It is still not an Anchor Model because for that you need to refactor all key attributes/relationships into their own entity.

Related modeling techniques

There are several related modeling techniques to Data Vault. They all are concerned with temporalization and (key) separation. Non-temporal (key) information is stored in entities that are used as 'Anchors' for temporal entities. I call this class of temporal modeling techniques 'Anchor Style Temporal Data Modeling' (not to be confused with Anchor Modeling, which is just one variety).

Next
In the next posts I will discuss the different styles in more detail and show some examples.

No comments: