Wednesday, November 21, 2012

One (Metadata) Column To Rule Them All

One Column to rule them all,
One Column to find them,
One Column to bring them all  
and to the metadata bind them!

Introduction

There is usually ONE column that appears all over your Data warehouse. It might be called process_id, run_id, step_id or audit_id, and it usually points to a set of (logging) tables that hold all kinds of metadata for your Data warehouse platform. While this column is sometimes seen as mandatory (e.g. with Anchor Modeling or Data Vault), it is generally seen as a good idea for any Data warehouse. There is usually just one such column, but sometimes there are more (a composite metadata key). While this column usually refers to the actual load/ETL metrics for load batches/actions, its implementation and definition vary from implementation to implementation.

The One (Metadata) Column pattern

I think we should try to standardize the semantics of such a column. This is easy if you have a good metadata model. The intersection of the static metadata (models, entities) and the metrics metadata (load runs and load cycles) with the ETL control metadata (ETL modules, ETL step numbers, ETL step order) should always lead you to the metadata at the lowest grain, where you can define your most detailed process information. At this grain I would do my central logging and create the One Metadata Column (as a surrogate key). The metadata column's meaning is then directly related to this logging entity, and should not change during the operation of the Data warehouse. This does imply that a well designed and complete metadata model covering all aspects (definition, operation, automation and management) is essential for having a good, usable metadata column. Since the Column is defined as the surrogate key at the lowest grain of your model, there can only be one. If your metadata model does not have a lowest grain (but e.g. several disparate ones), you apparently have gaps in your model and need to extend/complete it before you can define your One Column.

Such a column has many advantages because it decouples your data and metadata to the maximum extent without losing information. It insulates the data from metadata model changes through the use of a surrogate key, which then becomes the one point of connection between these two models.
A few disadvantages of this approach are the more complex queries needed to tie data and metadata together, and coping with real-time or near real-time data processes.

My preferred name for the column is audit_id, because such a column does not have to reflect a load run or a process run, or even an ETL step; it can refer to an individual action within a load step.
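As a minimal sketch of what this looks like on the data side (table, schema and column names are purely my illustration, and I assume the lowest-grain logging table is called load_action, as in the example model below), every data table simply carries the column as a foreign key:

    -- Illustrative only: a Data Vault satellite carrying the One Metadata Column.
    CREATE TABLE sat_customer_address (
        customer_hsk  CHAR(32)     NOT NULL,            -- hub key
        load_dts      DATETIME     NOT NULL,
        address       VARCHAR(200),
        city          VARCHAR(100),
        audit_id      BIGINT       NOT NULL,            -- the One Metadata Column
        CONSTRAINT pk_sat_customer_address PRIMARY KEY (customer_hsk, load_dts),
        CONSTRAINT fk_sat_customer_address_audit
            FOREIGN KEY (audit_id) REFERENCES meta.load_action (load_action_id)
    );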

I think this pattern is especially important with e.g. Data Vault to maintain auditability, so we not only know when, but exactly what was loaded, how and from where. For DWH Automation efforts I think a metadata model and this pattern are very important aspects to implement consistently.

Metadata Model Example

I've created a minimal metadata model as an example. Here we see a very simple model that can be used in such a pattern. The following tables are in this model:
  1. Data Objects: Here we store source and target tables
  2. Load templates and Load template versions: Here we store which load templates/processes are defined
  3. Load Batches: A Load batch is a collection of load actions/directives
  4. Load Cycles: Each batch is loaded at regular intervals. This is recorded in the load cycles.
  5. Load directives: Records which entity is loaded from which source under a certain batch and with a certain template.
  6. Load Actions: Records on which cycle which load directive was executed.

The column we need here is the Load Action ID. It is the surrogate key of the table with the lowest (widest) granularity in the metadata model. By referring to this column we can answer all our metadata-oriented questions around when, where, who, how and what.
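A hedged sketch of that lowest-grain table and of the kind of query it enables (the column names are my assumptions; the actual model may differ):

    -- Illustrative lowest-grain logging table: one row per executed load action.
    CREATE TABLE meta.load_action (
        load_action_id     BIGINT    NOT NULL PRIMARY KEY,   -- the One Metadata Column
        load_cycle_id      BIGINT    NOT NULL,               -- when (which cycle)
        load_directive_id  BIGINT    NOT NULL,               -- what, how and from where
        start_dts          DATETIME  NOT NULL,
        end_dts            DATETIME,
        rows_read          BIGINT,
        rows_written       BIGINT,
        FOREIGN KEY (load_cycle_id)     REFERENCES meta.load_cycle (load_cycle_id),
        FOREIGN KEY (load_directive_id) REFERENCES meta.load_directive (load_directive_id)
    );

    -- Tracing a data row back to its full load context through the audit_id column.
    SELECT s.customer_hsk,
           c.cycle_start_dts AS loaded_at,
           b.batch_name,
           src.object_name   AS source_table
    FROM   sat_customer_address s
    JOIN   meta.load_action     a   ON a.load_action_id    = s.audit_id
    JOIN   meta.load_cycle      c   ON c.load_cycle_id     = a.load_cycle_id
    JOIN   meta.load_directive  d   ON d.load_directive_id = a.load_directive_id
    JOIN   meta.load_batch      b   ON b.load_batch_id     = d.load_batch_id
    JOIN   meta.data_object     src ON src.data_object_id  = d.source_object_id;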


Conclusion

The One Metadata Column pattern is a useful and often implemented pattern within Data warehousing. I think it would be better if there were a consistent way to define and implement it. I hope this post helps you do this in a more structured fashion.

Tuesday, November 20, 2012

Colors of The Data Vault

Introduction

Most Data Vault modelers color code their Data Vault models. Alas, the chosen colors usually differ between (groups of) modelers. Hans Hultgren was one of the first to start the color coding practice (see http://www.youtube.com/watch?v=kRoDRlj8_YU ), but others like Kasper de Graaf have used this color coding as well.

The first practice, to my knowledge, was the following coding:
  • Hubs: Blue
  • Links: Red
  • Satellites: Yellow/Orange or Yellow/Green or Green
  • Reference Entities: None, but I use an unobtrusive grey/purple
I use shape coding as well:
  • Hubs: square/cube
  • Links: Oblong/oval
  • Sats: Rectangle/flat
  • Reference entity: circle
  • Hierarchical Link: pyramid
  • Same-As/From To link: Arrow like construct

Using Color Coding Practices

Color coding can be done not only on the basic DV entities, but also on attributes, relationships and keys, and hence on any style (Normalized/Dimensional) and other modeling techniques (FOMs like FCO-IM). It provides an interesting visualization of modeling and transformation strategies. Among the best diagramming techniques for color coding are FOMs like FCO-IM, because they let you track your transformation rules and you only need to color code roles and entities (whereas in ER you will be color coding keys, attributes and relationships).

Color Coding Data Models

Color coding data models is in fact a transformation strategy visualization method. It should rely on classification metadata of the underlying data model.
I'll show here some examples of color coding data models.

FCO-IM


In FCO-IM you color the nominalized fact types (hubs, links and cross-reference tables) or all the roles of the fact type. You can also color code the role connectors of uniqueness constraints (UCs).

Data Vault

Here we usually just color code the basic entities.

DV Skeleton

The DV skeleton is just the Data Vault without the satellites.


3NF

In 3NF, each and every entity can produce a hub, a link and a satellite.


DV source system analysis

Here I show part of a detailed color coding of a 3NF source system that needs to be transformed to a Data Vault. Note that not only entities, but also keys, attributes and relationships carry extensive color coding, along with corresponding classification metadata.
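Such classification metadata could be as simple as the following sketch (table and column names are my assumptions, not a standard):

    -- Illustrative classification metadata driving the colors in the diagrams.
    CREATE TABLE meta.model_element_class (
        model_name     VARCHAR(100) NOT NULL,
        element_name   VARCHAR(100) NOT NULL,
        element_type   VARCHAR(20)  NOT NULL,   -- 'ENTITY', 'KEY', 'ATTRIBUTE', 'RELATIONSHIP'
        dv_class       VARCHAR(20)  NOT NULL,   -- 'HUB', 'LINK', 'SAT', 'REF', ...
        diagram_color  VARCHAR(20)  NOT NULL,   -- 'BLUE', 'RED', 'YELLOW', 'GREY', ...
        PRIMARY KEY (model_name, element_name)
    );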

Dimensional Model

A dimensional model shows a link as the basis for a fact table, with measures coming from a satellite. Dimensions are usually hubs with their accompanying satellites.

I hope this short overview shows you how you can use color coding to enhance the usefulness of your data (vault) model diagrams.

Monday, November 19, 2012

Implementation data modeling styles

Introduction

Business Intelligence specialists are often on the lookout for better ways to solve their data modeling issues. This is especially true for Data warehouse initiatives, where performance, flexibility and temporalization are primary concerns. They often wonder which approach to use: should it be Anchor Modeling, Data Vault, Dimensional or still Normalized (or NoSQL solutions, which we will not cover here)? These modeling techniques focus on implementation considerations for Information system development. They usually come packaged with an approach to designing certain classes of information systems (like Data warehouses) or are used in very specific OLTP system design. The techniques focus on physical design issues like performance and data model management, sometimes together with logical/conceptual design issues like standardization, temporalization and inheritance/subtyping.

Implementation Data Modeling

Implementation Data Modeling techniques (also called physical data modeling techniques) come in a variety of forms. What connects them is a desire to impose modeling directives on the implemented data model to overcome several limitations of current SQL DBMSes. While they may also address logical/conceptual considerations, they should not be treated like a conceptual or logical data model; their concern is implementation. Albeit often abstracted from specific SQL DBMS platforms, they nonetheless need to concern themselves with implementation considerations on the main SQL platforms like Oracle and Microsoft SQL Server. These techniques can be thought of as a set of transformations from a more conceptual model (usually envisaged as an ER diagram on a certain 'logical/conceptual' level, but see this post for more info on "logical" data models).

Classes of Data Modeling Styles

It would be good if we could analyze, compare and classify these techniques. This way we can assess their usefulness and understand their mechanics. Apart from the methodology/architecture discussion around information system design and modeling (e.g. for Data warehousing: Kimball, Inmon, Linstedt or a combination), there is not a lot of choice. If we ignore abstractions like Entity/Attribute/Value patterns, grouping (normalization/denormalization), key refactoring, versioning, timestamping and other specific transformation techniques, we end up with nominally three styles of modeling that are being used right now: Anchor style, Normalized and Dimensional. All three relate to an overall model transformation strategy (from a logical/conceptual data model), and all three can come in highly normalized varieties, some even in strongly "denormalized" ones (it is better to talk about grouping, as in FOM/NIAM/FCO-IM), and anything in between. This means that the definition of each style lies in a basic transformation strategy, which is then further customized using additional transformations, especially grouping.

Grouping vs. Normalization

Normalization can be seen from two sides: as an algorithm to decrease unwanted dependencies, going from 1NF up to 6NF, or as a grouping strategy on a highly normalized design (6NF or ONF), grouping down to 5NF or lower. Besides being a model transformation process, it is also used as a name for a family of related data models that use ONLY this model transformation process.

Normalization Style

Standard normalized data can be in anything from 1NF to 6NF (the highest normal form). Normalization is the simplest and most fundamental family of transformation styles. It is currently mainly used for OLTP systems.

Anchor Style Modeling

What I call Anchor style modeling (also called key table modeling) is becoming more and more accepted, especially within Data warehousing. Data Vault and Anchor Modeling are both methods that rely on this family of techniques. The basic idea is to split (groups of) keys into their own entities (called hubs, anchors or key tables) and to split off all other information into additional entities (non-key-table entities called leaves, satellites etc.). Another property of this family is that temporal registration is never done on key tables, but only on the non-key-table entities. From there the different techniques and standards, as well as their goals, diverge, leading to several related techniques that are used to design Data warehouses as well as OLTP systems in different fashions.
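A minimal sketch of this split (Data Vault flavored; all names are my illustration), with the temporal registration kept out of the key table:

    -- Key table (hub/anchor/key table): only the key, no history.
    CREATE TABLE hub_customer (
        customer_id  BIGINT      NOT NULL PRIMARY KEY,   -- surrogate key
        customer_nr  VARCHAR(30) NOT NULL UNIQUE         -- (business) key being anchored
    );

    -- Non-key table (satellite/leaf): descriptive attributes, temporally registered.
    CREATE TABLE sat_customer (
        customer_id  BIGINT      NOT NULL REFERENCES hub_customer (customer_id),
        load_dts     DATETIME    NOT NULL,               -- temporal registration lives here only
        name         VARCHAR(100),
        city         VARCHAR(100),
        PRIMARY KEY (customer_id, load_dts)
    );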

Dimensional Modeling

From a transformation perspective, Dimensional modeling is based on mapping basic 5NF models, seen as directed graphs, to a forest of trees where the root nodes are fact tables and all other nodes are dimensions. All history is kept in the dimensions using basic versioning; fact tables then do a versioned join to the dimensions.
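A hedged sketch of such a versioned join, assuming type-2 style validity columns on the dimension (all names are my illustration):

    -- The dimension keeps history as versions with a validity interval;
    -- the fact resolves the version that was valid at the transaction date.
    SELECT f.order_nr,
           f.amount,
           d.customer_name
    FROM   fact_sales   f
    JOIN   dim_customer d
      ON   d.customer_nr = f.customer_nr
     AND   f.order_date >= d.valid_from
     AND   f.order_date <  d.valid_to;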

Implementation Modeling style properties

Apart from Normalization, implementation modeling styles have mainly been invented to overcome issues in database software, i.e. current mainstream DBMSes. They make (sometimes serious) compromises on aspects like normalization and integrity in exchange for performance and maintainability. Dimensional modeling overcomes some performance and some simple temporal limitations, and also covers some limitations of BI tools (mainly pattern based query generation), while at the same time not hampering usage and understanding too much. Anchor style techniques help out with serious temporalization inside DBMSes, while at the same time adding extra flexibility for changes to the data model schema.

Temporalization and Schema Flexibility

In a future post I'll discuss this classification's effect on temporalization, schema management and schema flexibility in more detail.

One Model to rule Them All

Ideally we would like to have a generic 'super' (conceptual) modeling technique or style that is able to direct and control all these techniques: a base (conceptual) modeling style that allows us to generate/transform to any target physical modeling style. We could use 5NF from the Relational Model, but there are also other considerations, like the fact that Anchor Modeling is more closely related to 6NF. We will explore this notion of a super model in an upcoming post.

Conclusion

Most modeling techniques can be expressed as transformation strategies, which in a generalized form describe families of related styles of implementation data modeling techniques. Together with their corresponding implementation approaches, these three styles cover the backbone of most implementations of current information systems, be they Data warehouses or OLTP systems.

Friday, November 2, 2012

Housebuilding, Data Architecture and Data warehousing

Introduction

Traditional Data warehousing focuses on designing and developing information systems to support an organization's BI initiatives and BI tooling. With the advent of new techniques, technologies, architectures and approaches we see the line between Data warehousing and other classes of information systems (transaction processing, interfacing, analyzing) blurring more and more.

Building a House

Building up your IT structure is a lot like building a house, and the same goes for your data architecture. However, a data architecture is usually not a building built from scratch by an architect, but a building made of a lot of prefab elements like rooms (applications), plumbing (interfaces) and foundations (IT infrastructure). The end result varies, but it is usually more comparable to the Winchester Mystery House, or an Escher drawing, than to a regular building we work or live in.
All the elements we buy are usually standardized by the supplier or builder, but they still need serious adaptation before they fit in your data 'house'. Even then, it won't look pretty, since size, look and makeup will usually differ considerably between elements from different suppliers (and even from the same supplier). 'Window dressing' will only make it palatable to the casual observer, not to the persons living inside.

Data Warehousing

While a Data warehouse can be seen as just another room (application), there are some special considerations. Data warehouses are supporting information systems: more a support and bridging structure for your house. They are usually custom built, bridging gaps between standard elements and supporting new types of rooms. This is different from most of the other elements/rooms of your 'data' house (even if the Data warehouse itself is made up of standard elements). On this new support structure your BI applications (new rooms!) can now be built.
If our other (prefab) elements were well designed and structured in a way to connect seamlessly, we might have no need for this new type of custom support structure, but in practice this level of design is difficult and rarely implemented.

But why focus on a support structure for certain types of rooms (applications)? A normal house has just one integrated support structure that supports all the rooms and the plumbing. Given the current state of IT, this overall support structure is usually built around the (IT) organization and is almost always an exercise in custom building, not an element of your 'data house' you can just buy directly off the shelf.

Building Better 'Data (Ware)houses'

Instead of looking directly for a Data warehouse as a supporting structure, we should look at supporting our total Data Architecture, which could include subjects like Data Warehousing, Federation/Virtualization, Data storage, Business Intelligence, Data Migration or data interfacing. We should focus on building a whole 'Data house' with just one 'supporting structure', of which our familiar 'Data warehouse' can be an integrated part, instead of just another element grafted onto our already baroque 'Data house' landscape.

Tuesday, October 23, 2012

Kinds of Keys: On the Nature of Key Classifications

Introduction

Many data and information modelers talk about all kinds of keys (or identifiers; I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or dimensional keys (for Data Warehousing) and whatnot. Then a debate arises on the use (and misuse) of all these keys. For example, questions that often pop up: is a primary key always a natural key or a surrogate key? Can you have a surrogate key without a natural key? Do we need to define all keys in the DBMS layer? These discussions show that designing and developing data models is in a sad state of affairs. The foremost question we should actually ask ourselves is: can we formally disambiguate kinds of keys (at all)?

The Relational Model

If we take a look at the relational model, we only see candidate keys. They are defined as a minimal set of one or more attributes that are unique for each tuple (record) in a relation. If we want to discern more than just keys, we need to look further. (The relational model also defines superkeys; these do not need to be a minimal set of attributes.) So the basic formal notion is that a key is a (candidate) key, and no other formal distinction is possible. When we talk about different kinds of keys, we base our nomenclature on the properties and behavior of the candidate keys.

Of all kinds of keys, the primary key and the surrogate key have generated the most discussion. Since the RM treats all keys as equal, we formally do not have a primary key; it is a choice we make in the selection, and as such we might treat this key slightly differently from all other available keys in a relation. If we look at 5NF, a lot of relations should have at most one key, while others might have several, but none would be deemed primary. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys.

If we treat a surrogate key as just another arbitrary candidate key, we actually decrease the normalization of a data model, from say 3NF to 2NF. This has led to the special and strict treatment of surrogate keys, so as to prevent this from leading to unwanted issues. A surrogate key should be seen not as an extra key, but as a surrogate for an existing key, whose semantics and behavior are strictly regulated by the DBMS.
In temporal data models surrogate keys are sometimes used as "versioning keys", but the semantics and implementation of such keys are very tricky. Dimensional surrogate keys are even trickier, since they do version tracking in reverse (relative to a fact table).
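A minimal sketch of that strict treatment (names are my illustration): the surrogate stands in for an existing candidate key, so that key stays declared and enforced alongside it:

    CREATE TABLE employee (
        employee_sid  BIGINT       NOT NULL PRIMARY KEY,  -- surrogate: system generated, immutable
        employee_nr   VARCHAR(20)  NOT NULL UNIQUE,       -- the existing key it is a surrogate for
        full_name     VARCHAR(100) NOT NULL
    );
    -- Without the UNIQUE constraint on employee_nr, the surrogate is just an extra,
    -- arbitrary key and duplicate employee numbers can slip in unnoticed.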

Key behavior

Mostly, kinds of keys signify certain behavior or certain properties of the key in question. A primary key is often designated as the key used to implement foreign key constraints. A composite key is a candidate key that consists of at least two attributes, while a dependent key is a candidate key of which a subset is also a foreign key. Surrogate keys are keys that are system generated and (mostly) immutable, and whose visibility and scope lie inside the DBMS. Business keys, on the other hand, are independent keys, not designated as surrogate keys, whose familiarity, uniqueness, visibility and scope span an entire enterprise. This way we try to find, among others, the actual identifiers that represent the reification of real world entities or abstract concepts. Human keys, for example, are actually not keys in the strict sense, but non-enforceable candidate keys that heuristically identify tuples, especially those related to real world entities. They are related to the process of relating (UoD) entities to tuples, e.g. using full name and full address to identify and relate a real world person to the tuple representing that person in a database, where that person is uniquely identified by a Person ID.
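For example, a composite key that is also a dependent key might look like this (a sketch; table and column names are made up for illustration, and order_header is assumed to exist with a unique order_nr):

    CREATE TABLE order_line (
        order_nr    VARCHAR(20) NOT NULL REFERENCES order_header (order_nr),
        line_nr     INT         NOT NULL,
        product_nr  VARCHAR(20) NOT NULL,
        quantity    INT         NOT NULL,
        -- Composite candidate key (two attributes) that is also dependent:
        -- its subset order_nr is a foreign key to order_header.
        PRIMARY KEY (order_nr, line_nr)
    );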

But what happens when a data model actually shows a system generated and (mostly) immutable key that is not a surrogate for another key? Is it still a surrogate key or not? The problem lies in the fact that we can apply any behavior and implement any property for a key; keys can be implemented as hybrids between 'surrogate' and 'natural'. Whether that is good design is of course another matter, but in this post I focus on the current situation we see in our information systems today.

Key properties

Basically, it is a key's behavior, expressed in properties like context, visibility, familiarity, stability, immutability, uniqueness, compactness, auditability, dependency and scope, that really defines its kind. For some keys, additional properties around the process of identification and reification from the Universe of Discourse also play a role: things like the type of control checks, related heuristic/human keys, chance of duplication, chance of doubling, etc. In my opinion, all these properties define a key, rather than some arbitrary "key classification". This is especially important when discussing the pros and cons of certain keys, styles of modeling or key implementation considerations.

Key Ontology

So we can classify keys based on (standardized) properties. Based on certain properties we can state that a key is of a certain kind. This also means a key can be of several kinds at once (e.g. a composite business key is both a business key and a composite key). This way we can describe all kinds of keys, even those kinds we have no naming for. The matrix of kinds of keys and their properties describes a kind of 'key space'. This detailed description of kinds of keys allows us to manage, design and implement keys. It can be used in Data Quality Initiatives, Data Modeling, Data Integration and Master Data Management. Note that an adequate list of properties and hence a good ontology can be quite specific for a data model/system/organization (although a simple generic/standard one could probably be designed).
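One way to make such a key matrix concrete is a simple property store; the sketch below is an assumption on my part, and the property list would be specific to your organization:

    -- Each row scores one key on one property, spanning the 'key space'.
    CREATE TABLE key_property_matrix (
        model_name      VARCHAR(100) NOT NULL,
        table_name      VARCHAR(100) NOT NULL,
        key_name        VARCHAR(100) NOT NULL,
        property_name   VARCHAR(50)  NOT NULL,   -- e.g. 'visibility', 'stability', 'uniqueness scope'
        property_value  VARCHAR(100) NOT NULL,   -- e.g. 'enterprise wide', 'immutable', 'system generated'
        PRIMARY KEY (model_name, table_name, key_name, property_name)
    );
    -- Kinds of keys ('surrogate key', 'business key', ...) then become queries
    -- over this matrix rather than hard coded labels.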

Key Taxonomy

Basically, most data (model) users (including data modelers, database developers etc.) like to have a small, simple, consistent and non-overlapping set of key types (or classes) instead of a detailed ontology where keys can be of several kinds. For this we can try to set up a simple classification scheme based on our ontology: we create a small set of key types and assign each key we have to just one type. Such a classification facilitates easy defining, modeling and managing of key definitions.

For example, within Data Vault analysis, transformation and model generation you can classify the following types of keys: Business Key, Alternate Business Key, Dependent Key, Link Key, Temporal Key, Driving Key, Data Vault ID and Source Surrogate Key. I then try to classify each and every key in the models I use as one of these types. For source systems not under the control of the Data Vault, this is always a difficult matching process between the properties of the source key and the Data Vault classification scheme.

Such a scheme can NOT be used for analyzing the keys, since there is no guarantee a key will neatly fall into one of the existing categories. For that we have the Key Ontology and the key matrix.

Conclusion

It is quite hard to discuss key classifications without good definitions and descriptions. While some kinds of keys might be better understood than others, we have difficulty classifying all kinds of keys we find 'in the wild'. A property and behavior driven Key Ontology and key matrix can be used to define, design and map out our 'key space'. From there we can try to create a consistent and accurate key classification scheme that will aid us in using, comparing and managing keys.


Thursday, October 11, 2012

The structure of the MATTER DWH program

Introduction

In this blog post I'll show the most recent version of our detailed track and course overview. Note that this is still a concept and subject to change (although I hope not significantly). I will not discuss the contents of individual courses or tracks in this post, but instead focus on the overall structure of the program.

"Broad where we must, Deep where we can"

As blogged earlier, the MATTER program has the following tracks:


The program's track design reflects one of the basic principles of our program: "Broad where we must, Deep where we can". Because of this we decided on a three-layered approach in our program.
Layer 1 is broad and teaches basic and advanced methodology/technology tracks for Data Vault, Anchor Modeling and Temporal Data modeling. The Architectural and ETL tracks have been integrated with the Data Vault track because there is a lot of overlap there. We don't assume all students want to follow all the layer 1 courses when doing the program, and some students might only want to follow some of the courses in this layer without following the rest of the program. This layer is probably the most dynamic one as well. We want to offer possibilities to extend knowledge without trying to teach everything. Expect more (optional) courses in the next iteration of the program (basic Data Vault, Dimensional Modeling, Agile, etc.), along with additional partnerships.
In the second layer we go deeper and try to connect knowledge from layer 1. We will still focus on specific methodology and architecture discussions. The main track here is on Information Modeling. The Architecture course also formally belongs to this layer.
The Master Class is where things come together and we tie all kinds of knowledge from the different tracks into a more consistent whole. We will discuss the mechanics of creating modeling approaches by understanding modeling techniques and their place within (DWH) architectures. We will also focus on the relation between architecture, methodology and modeling.

Course Overview

The following table shows all courses of the MATTER program, their intended level, prerequisites and number of days.

ID | Name | Track | Level | Program Prerequisite | Days
TEMP INTRO (I) | Time in the Database | TEMP | 250 | SQL, ER, 3(T)NF | 1
TEMP MOD (II) | Temporal Data Modeling | TEMP | 350 | TEMP INTRO (I) + Dimensional Modeling | 1
AM MOD (I) | Anchor Modeling | AM | 300 | TEMP INTRO (I) | 2
AM ADV (II) | Anchor Modeling Architecture | AM | 350 | AM MOD (I) | 1
AM IMPL (III) | AM Implementation | AM | 400 | AM ADV (II) | 1
AM CERT | AM Certification | AM | | AM ADV (II) | ½
AM ADV CERT | Advanced AM Certification | AM | | AM CERT + AM PROJECT | ½
DV MOD (I) | Advanced DV Modeling | DV | 350 | Professional DV knowledge (e.g. DV Cert, DV course + DV exp) | 2
DV ARCH (II) | DV + DWH Architecture | DV | 350 | DV MOD (I) | 3
DV ADV (III) | DV Advanced Architecture & Modeling | DV | 400 | DV ARCH (II) | 2
FCO-IM INTRO (I) | Hands on FCO-IM | FCO-IM | 250 | Basic database modeling/design | 3
FCO-IM CASE (II) | FCO-IM case | FCO-IM | 300 | FCO-IM INTRO (I) | 3
FCO-IM TRANS (III) | Transformation algorithms + case | FCO-IM | 400 | FCO-IM CASE (II) + Dimensional Modeling + 2 of 3: DV MOD (I), AM MOD (I), TMP MOD (I) | 4
DWH MASTERCLASS | Fact oriented + Temp. DWH Arch. | MSTR | 450 | FCO-IM TRANS (III) + AM ADV (II) or DV ARCH (II) | 2
EXAM & CERT | Examination + CERT | MSTR | | MASTERCLASS | 1

In other blog posts I'll discuss tracks and courses in more detail. (If you have specific questions about or interest in a course, just contact me.)

Course Dependencies

The following diagram shows the detailed dependencies between the tracks and their courses. For example, we assume students following the FCO-IM transformation course understand two out of three data modeling techniques from our program (Anchor Modeling, Data Vault or generic Temporal Data Modeling). As you can see from the dependency arrows, in our masterclass on modeling and automation we combine knowledge gained from the information modeling track and the architecture courses to understand the advanced architectural concepts that should underpin our data warehouse designs.


Course subject matter matrix

We also have a detailed subject matter matrix per course to give an indication of what types of subjects will be taught in each course.
Course\MATTER | Modeling | Metadata | Arch | Autom | Temp | Trans | EDW | RDBMS | Impl | Related
TEMP INTRO (I) | - | - | - | - | ++ | - | - | ++ | ++ | -
TEMP MOD (II) | + | + | + | + | ++ | ++ | - | + | + | -
AM MOD (I) | ++ | - | + | - | - | + | - | + | + | -
AM ADV (II) | + | + | + | - | ++ | + | + | - | - | +
AM IMPL (III) | - | - | + | + | + | - | | ++ | ++ |
AM CERT | ++ | + | + | - | + | + | - | - | - | -
AM ADV CERT | ++ | + | + | - | + | + | + | - | - | -
DV MOD (I) | ++ | + | + | + | ++ | ++ | + | + | ++ | -
DV ARCH (II) | + | ++ | ++ | ++ | + | + | ++ | ++ | + | +
DV ADV (III) | ++ | ++ | ++ | + | + | ++ | ++ | - | + | ++
FCO-IM INTRO (I) | ++ | - | - | - | - | ++ | - | - | - | -
FCO-IM CASE (II) | +++ | - | - | + | - | + | - | + | + | -
FCO-IM TRANS (III) | ++ | + | + | + | + | +++ | - | + | + | -
DWH MASTERCLASS | ++ | ++ | +++ | ++ | + | +++ | ++ | + | - | +++
EXAM and CERT | ++ | + | +++ | + | ++ | +++ | ++ | - | - | +



Course Subjects:

  • Modeling
    We'll do a LOT of (different styles/techniques/approaches) modeling within the courses of the program.

  • Metadata
    We'll discuss the important concepts around metadata modeling and usage.
  • Architecture
    We put current and future (Data warehouse/ETL) architecture and their construction in perspective.
  • Automation
    Covers all subjects related to tooling and automation of Data warehouse design and implementation
  • Temporalization
    All time related subjects like timelines and their implementations.
  • (Model) Transformation
    This subject mainly covers model-to-model transformation, like 3NF to a Data Vault.
  • Enterprise Data Warehouse
  • (R)DBMS systems
    This stands for understanding Relational theory as well as actual DBMS systems like Oracle and SQL Server
  • Implementation
    Here we will focus on implementation details, e.g. in SQL or ETL tools.
  • Related knowledge
    Here we put all related non core subjects that we will address like:

    • Enterprise and Information Architecture
    • Data Quality
    • Data Management and Data Governance
    • Security and Privacy
    • Agile development
    • NoSQL and Big Data
    • Business Rules