Thursday, April 18, 2013

The Rule & the Raw & the Business (Data) Vault: Data Vault Architecture and Business Rules

Introduction

Every now and then the discussion on derived data, centralization of data logic and the definition of the (Raw) Data Vault pops up. +Daniel Linstedt's supercharge book, +Hans Patrik Hultgren's Agile Data Warehousing book and +Ronald Damhof's EDW papers (see the papers section of his LinkedIn profile) all describe varieties of the concept of storing derived data in a Data Vault oriented EDW. They have also blogged about this subject and the terminology regarding the Raw Vault; see Ronald's (and Dan's) and Hans's blog posts. In this blog post I'll discuss these views and my (and +Ronald Damhof's) take on this architecture pattern.

Centralized Logic

Often it makes sense to centralize not only the storage of data, but also some (all?) of the additional logic required to process the data before it goes out to the users/data marts/BI systems. This logic is often classified as 'Business Rules', 'Calculations' or 'Data Quality Rules'.

Business Rules?

From an information system perspective a formalized business rule is a constraint expressed on data. This means that keys and domain definitions are business rules as well.
We personally prefer to talk about 2 different kinds of business rules: structural and non structural.

Structural Business Rules

Structural business rules are rules that structure elementary information. They contain the basic building blocks for creating data models. 
  1. Elementary structure rules like domain, key, dependency, foreign key/subset constraint and basic entity (relation/fact type).
  2. Model derivation/transformation rules to create new semantically/logically equivalent (elementary) model structures.

Non Structural Business Rules

Non structural business rules define calculations/derivations, constraints & Data Quality rules and structural repair rules. Of those the structural repair rules are the most interesting. These are rules that change/extend data models (create new structural business rules!) based on derived data/calculations. So we have:
  1. Derived data/calculations
  2. Constraints/Data Quality rules
  3. Structural repair rules
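
To make the three kinds a bit more tangible, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Customer` record, its attributes, the plausibility bounds and the household key format are all hypothetical. The structural repair rule in particular shows only one possible reading of the idea, namely deriving a new key and relationship that the source model did not contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # Hypothetical attributes, as they might be read from Data Vault satellites.
    customer_id: str
    birth_year: int
    postcode: str
    house_number: str


def derive_age(customer: Customer, reference_year: int) -> int:
    """Kind 1: derived data/calculation."""
    return reference_year - customer.birth_year


def has_plausible_birth_year(customer: Customer, reference_year: int) -> bool:
    """Kind 2: constraint/Data Quality rule; True means the record passes."""
    return 1900 <= customer.birth_year <= reference_year


def derive_household_links(customers: list[Customer]) -> list[tuple[str, str]]:
    """Kind 3: structural repair rule (one possible reading).

    Derives a new household key from address data and returns
    (customer_id, household_key) pairs: a new entity plus relationship.
    """
    return sorted(
        (c.customer_id, f"{c.postcode}|{c.house_number}") for c in customers
    )


customers = [
    Customer("C1", 1980, "1011AB", "12"),
    Customer("C2", 1985, "1011AB", "12"),
    Customer("C3", 2090, "3511XZ", "7"),
]
ages = [derive_age(c, 2013) for c in customers]
quality_flags = [has_plausible_birth_year(c, 2013) for c in customers]
household_links = derive_household_links(customers)
```

In this sketch the first two kinds only add or judge data inside an existing structure, while the third produces keys and relationships that would need new structural elements (for example a hub and a link) to store them.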

Derived Data and Business rules

We do not see a big difference between handling non-structural business rule data and raw (source system) data. A business rule is just a small (source) data model that happens to derive its data from the Data Vault and not externally. It is (normally) loaded in the same manner and into the same kind of objects as with a raw Data Vault. From a metadata perspective this means we can add traceability information, and we have better control over the rule/data/metadata than over that of externally sourced information systems, but that's it. Dan's "Business Data Vault" suggests the business is in the lead on that data (and business rule definitions). While this is often the case, business rules can also have a more technical origin like performance (aggregation satellites), so the name is a bit suggestive. We prefer the term (Business) Rule Vault instead, to emphasize the importance of the rule processing over the exact business definition, specification and naming.
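
The idea that a rule is "just another source" can be sketched in Python as follows. This is a simplified illustration under our own assumptions, not a reference loader: satellites are plain lists of dicts, and the names `load_satellite`, `order_total_rule`, `crm.orders` and `rule.order_total` are hypothetical.

```python
from datetime import datetime, timezone


def load_satellite(satellite, rows, record_source, load_time):
    """Append new or changed rows to a satellite (a list of dicts).

    `rows` maps a business key to its attribute dict. Returns the number
    of records inserted. Raw and rule-derived input share this routine.
    """
    latest = {}
    for record in satellite:  # records are in load order, so later ones win
        latest[record["key"]] = record["attributes"]
    inserted = 0
    for key, attributes in rows.items():
        if latest.get(key) != attributes:
            satellite.append({
                "key": key,
                "attributes": attributes,
                "record_source": record_source,
                "load_time": load_time,
            })
            inserted += 1
    return inserted


def order_total_rule(order_satellite):
    """Hypothetical business rule: a small 'source' that reads from the vault."""
    return {
        record["key"]: {
            "total": record["attributes"]["quantity"] * record["attributes"]["unit_price"]
        }
        for record in order_satellite
    }


load_time = datetime(2013, 4, 18, tzinfo=timezone.utc)

# Raw load: data arrives from an external (hypothetical) source system.
sat_order = []
load_satellite(sat_order, {"ORD-1": {"quantity": 2, "unit_price": 10.0}}, "crm.orders", load_time)

# Rule load: same loader, but the input is derived from the vault itself.
sat_order_total = []
load_satellite(sat_order_total, order_total_rule(sat_order), "rule.order_total", load_time)
```

The only thing that distinguishes the two loads here is the record source metadata; the loading routine and the target object type are identical.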

The (Business) Rule Vault

For us all DV objects that are only/mainly driven by business rules belong to the Rule Vault. Note that Rule Vault objects can (and often do) directly relate to or depend on entities in the Raw Vault, so you should not see them as 2 completely different and independent (storage) tiers. They are in the same tier, just 2 different logical (sub)layers within the Data Vault. This separation is just a soft one. We can load the same DV entity from a rule or raw source, either in consecutive order or sometimes even at the same time. Our metadata will keep the records auditable according to the source, but assigning DV entities exclusively to the Rule or Raw Vault is not always possible or sensible, especially not over time.
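
A small Python sketch of why the separation is soft: if we classify entities purely from load metadata, some of them end up in both sublayers. The audit records, the entity names and the `rule.` prefix for rule-driven record sources are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical load-audit records: (DV entity, record source).
# The "rule." prefix for rule-driven sources is an assumed naming convention.
load_audit = [
    ("hub_customer", "crm.customers"),
    ("sat_customer_details", "crm.customers"),
    ("sat_customer_segment", "rule.segmentation"),
    ("link_customer_household", "rule.household_match"),
    ("link_customer_household", "crm.households"),
]


def classify_entities(audit):
    """Label each entity 'raw', 'rule' or 'mixed' from the sources that loaded it."""
    kinds = defaultdict(set)
    for entity, record_source in audit:
        kinds[entity].add("rule" if record_source.startswith("rule.") else "raw")
    return {
        entity: next(iter(found)) if len(found) == 1 else "mixed"
        for entity, found in kinds.items()
    }


classification = classify_entities(load_audit)
```

Here `link_customer_household` comes out as `mixed`: individual records remain auditable to their source, but the entity as a whole cannot be assigned exclusively to the Rule or the Raw Vault.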

The Raw (Data) Vault

The source driven but business oriented (Raw) Data Vault, the 'Raw Vault' for short, is the classical Data Vault. It is organized around integration through 'business keys', but is also directly related to, and auditable through to, the source (data). We'll forego the discussion here on (Single) Source Data Vaults driven by the primary/surrogate key and debunk Source Data Vaults, also called Stage Data Vaults, in another post.

The (Complete) Data Vault: Rule Vault + Raw Vault

So for us the (Complete) 'Data Vault' combines the Business Rule (Data) Vault, the 'Rule Vault' for short, and the 'Raw' Data Vault, or 'Raw Vault'.

Business (Data) Vault and Business View Perspectives

But even with the Rule Vault we still need to connect our business rule results to the business. For this we use the "Business View" concept of +Ronald Damhof. It brings together Rule and Raw Vault objects to present a consistent business view on all the data (a 'truth' in classical EDW thinking). Business Views are (conceptual) (business) data models bereft of detailed structural, temporal and other transformations/derivations. These business views can be based on available enterprise, industry or source system information models, greatly reducing overall data and business rule complexity.

Perspectives

These Business Views can be implemented through several perspectives in different structures (Dimensional, Normalized and Data Vault!) and with different temporal aspects (now, point-in-time, period). We define a Business Data Vault as a Business View perspective in Data Vault format. These Business Views sit between our Data Vault and our Data Marts as a decoupling layer. This layer is driven by information models managing our different truths as different business views. Since this layer provides a semantic abstraction, it is still appropriate even if there is no structure change from the Data Vault to the business view perspective. A Business Data Vault is still a complete Data Vault transformation from a business information model, albeit (usually) a fully virtual one! Interestingly, of all the perspectives of a Business View, the Data Vault perspective is usually the first and foremost that is implemented.
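
As a rough illustration of temporal perspectives on a virtual view, the Python sketch below computes a 'now' and a 'point-in-time' perspective on read, without storing anything. The satellite history, the key names and the functions `view_as_of` and `view_now` are hypothetical, and a real implementation would more likely be a database view than application code.

```python
from datetime import date

# Hypothetical satellite history: (business key, load date, attributes).
history = [
    ("CUST-1", date(2013, 1, 1), {"segment": "bronze"}),
    ("CUST-1", date(2013, 3, 1), {"segment": "gold"}),
    ("CUST-2", date(2013, 2, 1), {"segment": "silver"}),
]


def view_as_of(rows, as_of):
    """Point-in-time perspective: latest version per key loaded on or before `as_of`."""
    latest = {}
    for key, load_date, attributes in sorted(rows, key=lambda r: r[1]):
        if load_date <= as_of:
            latest[key] = attributes  # later load dates overwrite earlier ones
    return latest


def view_now(rows):
    """'Now' perspective: the most recent version of every key."""
    return view_as_of(rows, date.max)


mid_february = view_as_of(history, date(2013, 2, 15))
current = view_now(history)
```

Both perspectives are derived from the same stored history, which is the sense in which such a view layer can stay fully virtual.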

Data (Definition) Architecture: Concerning Layers

When we look at the layers of the overall EDW Data Architecture we see a 3 layered approach: a Data Vault layer concerning storage and processing, a Business View layer concerning abstraction and consistency, and an Access layer (Data Mart/Staging Out) concerning access, aggregation and navigation.


Data Logistics Architecture: Tiering your Data

EDW Logistical Functions: Acquiring (staging), Storing (Data Vault) and Access and delivery of data.
Separate from the data layers we also have the data tiers. These are connected to the data logistics functions like acquiring, storing, propagating and data delivery. From the Data Architecture perspective we want to minimize logistics through maximizing virtualization/derivation. We can infer that at minimum we need 1 data tier (apart from e.g. a staging tier). In practice we can create many tiers in all the data layers. We can even have tiers that cross layers. Since most EDW architectures (inadvertently) combine/entwine these two architecture aspects we'll discuss the connection between tiers and layers in a future blog post.

Conclusion

Personally we don't make too much of the distinction between Raw and Rule Vault, except in the metadata. We do think a (formal) business view layer (preferably in Data Vault format) is an important, nay essential, part of any Data Vault Architecture.

©+Martijn Evers and +Ronald Damhof (w. thanks to +Tom Breur)

