Tuesday, October 23, 2012

Kinds of Keys: On the Nature of Key Classifications



Introduction

Many data and information modelers talk about all kinds of keys (or identifiers; I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or dimensional keys (for a Data Warehouse), and whatnot. Then a debate arises on the use (and misuse) of all these keys. For example, questions that often pop up: is a primary key always a natural key or a surrogate key? Can you have a surrogate key without a natural key? Do we need to define all keys in the DBMS layer? These discussions show that designing and developing data models is in a sad state of affairs. The foremost question we should actually ask ourselves is: can we formally disambiguate kinds of keys at all?

The Relational Model

If we take a look at the relational model, we only see candidate keys. They are defined as a minimal set of one or more attributes whose values are unique for each tuple (record) in a relation. If we want to discern more than just keys, we need to look further. (The relational model also defines superkeys; these need not be minimal sets of attributes.) So the basic formal notion is that a key is a (candidate) key, and no other formal distinction is possible. When we talk about different kinds of keys, we base our nomenclature on the properties and behavior of the candidate keys.
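
To make this concrete, here is a minimal sketch (table and column names are hypothetical) of a relation with two candidate keys; to the relational model both are simply keys, with no formal ranking between them:

    -- Hypothetical sketch: a relation with two candidate keys. Each is a
    -- minimal set of attributes that is unique per tuple; the relational
    -- model itself does not rank one above the other.
    CREATE TABLE Employee (
        EmployeeNr INTEGER      NOT NULL,  -- candidate key 1
        SSN        CHAR(9)      NOT NULL,  -- candidate key 2
        FullName   VARCHAR(100) NOT NULL,
        CONSTRAINT ck1_employee UNIQUE (EmployeeNr),
        CONSTRAINT ck2_employee UNIQUE (SSN)
    );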

Of all kinds of keys, the primary key and the surrogate key have generated the most discussion. Since the RM treats all keys as equal, we formally do not have a primary key; it is a choice we make in the selection, and as such we might treat this key slightly differently from all other available keys in a relation. If we look at 5NF, a lot of relations should have at most one key, while others might have several but none would be deemed primary. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and the implementation of surrogate keys.

If we treat a surrogate key as just another arbitrary candidate key, we actually decrease the normalization of a data model, say from 3NF to 2NF. This has led to the special and strict treatment of surrogate keys, to prevent this from leading to unwanted issues. A surrogate key should be seen not as an extra key but as a surrogate for an existing key, whose semantics and behavior are strictly regulated by the DBMS.
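
As an illustration, here is a minimal sketch (names are hypothetical, and the identity syntax varies per DBMS) where the surrogate stands in for an existing key that remains declared and enforced:

    -- Hypothetical sketch: the surrogate is system generated and immutable,
    -- while the key it surrogates stays declared, so normalization is kept.
    CREATE TABLE Customer (
        CustomerSID  INTEGER GENERATED ALWAYS AS IDENTITY,  -- surrogate, DBMS controlled
        CustomerCode VARCHAR(20)  NOT NULL,                 -- the key being surrogated
        CustomerName VARCHAR(100) NOT NULL,
        CONSTRAINT pk_customer PRIMARY KEY (CustomerSID),
        CONSTRAINT ak_customer UNIQUE (CustomerCode)
    );
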
With temporal data models, surrogate keys are sometimes used as "versioning keys", but the semantics and implementation of such keys are very tricky. Dimensional surrogate keys are even trickier, since they do version tracking in reverse (relative to a fact table).
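
A hedged sketch of such a versioning key in a type 2 style dimension (hypothetical names; one surrogate value per version of the business key):

    -- Hypothetical sketch: the surrogate identifies a *version* of a customer,
    -- not the customer itself; facts reference the version current at load time,
    -- which is how the version tracking runs in reverse from the fact table.
    CREATE TABLE DimCustomer (
        CustomerSK   INTEGER GENERATED ALWAYS AS IDENTITY,  -- one value per version
        CustomerCode VARCHAR(20) NOT NULL,                  -- business key, repeated per version
        ValidFrom    DATE NOT NULL,
        ValidTo      DATE,                                  -- NULL for the current version
        CONSTRAINT pk_dimcustomer PRIMARY KEY (CustomerSK),
        CONSTRAINT ak_dimcustomer UNIQUE (CustomerCode, ValidFrom)
    );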

Key behavior

Mostly, kinds of keys signify certain behavior or certain properties of the key in question. A primary key is often designated as the key used to implement foreign key constraints. A composite key is a candidate key that consists of at least two attributes, while a dependent key is a candidate key of which a subset is also a foreign key. Surrogate keys are keys that are system generated and (mostly) immutable, and whose visibility and scope lie inside the DBMS. Business keys, on the other hand, are independent keys, not designated as surrogates, whose familiarity, uniqueness, visibility and scope span an entire enterprise. In this way we try to find, among other things, the actual identifiers that represent the reification of real-world entities or abstract concepts. Human keys, for example, are actually not keys in the strict sense, but non-enforceable candidate keys that heuristically identify tuples, especially those related to real-world entities. They are tied to the process of relating entities in the Universe of Discourse (UoD) to tuples, e.g. using full name and full address to identify and relate a real-world person to the tuple representing that person in a database, where that person is uniquely identified by a Person ID.
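
For instance, a minimal sketch of a composite key that is also a dependent key (hypothetical names):

    -- Hypothetical sketch: (OrderNr, LineNr) is a composite candidate key,
    -- and a dependent key because its subset OrderNr is also a foreign key.
    CREATE TABLE CustomerOrder (
        OrderNr INTEGER NOT NULL,
        CONSTRAINT pk_customerorder PRIMARY KEY (OrderNr)
    );

    CREATE TABLE OrderLine (
        OrderNr  INTEGER NOT NULL,
        LineNr   INTEGER NOT NULL,
        Quantity INTEGER NOT NULL,
        CONSTRAINT pk_orderline PRIMARY KEY (OrderNr, LineNr),
        CONSTRAINT fk_orderline FOREIGN KEY (OrderNr) REFERENCES CustomerOrder (OrderNr)
    );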

But what happens when a data model actually shows a system generated and (mostly) immutable key that is not a surrogate for another key? Is it still a surrogate key or not? The problem lies in the fact that we can apply any behavior and implement any property for a key. Keys can be implemented as hybrids between 'surrogate' and 'natural'. Whether that is good design is of course another matter, but for this post I focus on the current situation we see in our information systems today.
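
A hedged sketch of such a hybrid (hypothetical names): a key generated by the system, like a surrogate, yet visible to and used by the business, like a natural key:

    -- Hypothetical sketch: an invoice number is generated by the system
    -- (surrogate-like) but printed on documents and quoted by customers
    -- (natural-like), so it is neither purely one nor the other.
    CREATE TABLE Invoice (
        InvoiceNr   CHAR(12) NOT NULL,  -- e.g. generated as 'INV' + a sequence value
        CustomerSID INTEGER  NOT NULL,
        InvoiceDate DATE     NOT NULL,
        CONSTRAINT pk_invoice PRIMARY KEY (InvoiceNr)
    );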

Key properties

Basically, it is a key's behavior, expressed in properties like context, visibility, familiarity, stability, immutability, uniqueness, compactness, auditability, dependency and scope, that really defines its kind. For some keys, additional properties around the process of identification and reification from the Universe of Discourse also play a role: things like the type of control checks, related heuristic/human keys, chance of duplication, chance of doubling, etc. IMO, all these properties define a key, rather than an arbitrary "key classification". This is especially important when discussing the pros and cons of certain keys, styles of modeling or key implementation considerations.
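
One way to make these properties tangible is to record them as metadata. A minimal sketch, assuming a simple property set (table, columns and value domains are all hypothetical):

    -- Hypothetical sketch: a property record per key, so each key is
    -- described by its properties rather than by a single type label.
    CREATE TABLE KeyProperty (
        TableName   VARCHAR(128) NOT NULL,
        KeyName     VARCHAR(128) NOT NULL,
        Visibility  VARCHAR(20)  NOT NULL,  -- e.g. 'dbms', 'system', 'enterprise'
        Stability   VARCHAR(20)  NOT NULL,  -- e.g. 'immutable', 'stable', 'volatile'
        Familiarity VARCHAR(20)  NOT NULL,  -- how well known it is to the business
        IsComposite CHAR(1)      NOT NULL,  -- 'Y'/'N': two or more attributes
        IsSurrogate CHAR(1)      NOT NULL,  -- 'Y'/'N': surrogate for another key
        IsDependent CHAR(1)      NOT NULL,  -- 'Y'/'N': a subset is a foreign key
        CONSTRAINT pk_keyproperty PRIMARY KEY (TableName, KeyName)
    );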

Key Ontology

So we can classify keys based on (standardized) properties. Based on certain properties, we can state that a key is of a certain kind. This also means a key can be of several kinds at once (e.g. a composite business key is both a business key and a composite key). This way we can describe all kinds of keys, even kinds we have no name for. The matrix of kinds of keys and their properties describes a kind of 'key space'. This detailed description of kinds of keys allows us to manage, design and implement keys. It can be used in data quality initiatives, data modeling, data integration and Master Data Management. Note that an adequate list of properties, and hence a good ontology, can be quite specific to a data model/system/organization (although a simple generic/standard one could probably be designed).
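
Continuing the metadata sketch above (still hypothetical), kinds can then be derived as predicates over the properties, and one key may satisfy several of them at once:

    -- Hypothetical sketch: each column is a kind predicate; a composite
    -- business key comes out with CompositeKey = 'Y' and BusinessKey = 'Y'.
    CREATE VIEW KeyKind AS
    SELECT TableName, KeyName,
           IsComposite AS CompositeKey,
           CASE WHEN Visibility = 'enterprise' AND IsSurrogate = 'N'
                THEN 'Y' ELSE 'N' END AS BusinessKey,
           IsSurrogate AS SurrogateKey,
           IsDependent AS DependentKey
    FROM KeyProperty;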

Key Taxonomy

Basically, most data (model) users (including data modelers, database developers etc.) like to have a small, simple, consistent and non-overlapping set of key types (or classes) instead of a detailed ontology where keys can be of several kinds. For this we can try to set up a simple classification scheme based on our ontology: we create a small set of key types and assign each key we have to just one type. This classification facilitates easy definition, modeling and management of keys.
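
As a hedged sketch of such a scheme on top of the earlier (hypothetical) property matrix, a precedence rule can force every key into exactly one type:

    -- Hypothetical sketch: the CASE precedence guarantees exactly one
    -- type per key, which is what a taxonomy (unlike the ontology) needs.
    CREATE VIEW KeyTaxonomy AS
    SELECT TableName, KeyName,
           CASE
               WHEN IsSurrogate = 'Y'         THEN 'Surrogate Key'
               WHEN IsDependent = 'Y'         THEN 'Dependent Key'
               WHEN Visibility = 'enterprise' THEN 'Business Key'
               ELSE 'Alternate Key'
           END AS KeyType
    FROM KeyProperty;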

For example: within Data Vault analysis, transformation and model generation you can classify the following types of keys: Business Key, Alternate Business Key, Dependent Key, Link Key, Temporal Key, Driving Key, Data Vault ID and Source Surrogate Key. I then try to assign each and every key in the models I use to one of these types. For source systems not under control of the Data Vault, this is always a difficult matching process between the properties of the source key and the Data Vault classification scheme.

Such a scheme can NOT be used for analyzing the keys, since there is no guarantee a key will fall neatly into one of the existing categories. For that we have the Key Ontology and the key matrix.

Conclusion

It is quite hard to discuss key classifications without good definitions and descriptions. While some kinds of keys might be better understood than others, we have difficulty classifying all the kinds of keys we find 'in the wild'. A property- and behavior-driven Key Ontology and Key Matrix can be used to define, design and map out our 'key space'. From there we can try to create a consistent and accurate key classification scheme that will aid us in using, comparing and managing keys.


Thursday, October 11, 2012

The structure of the MATTER DWH program

Introduction

In this blog post I'll show the most recent version of our detailed track and course overview. Note that this is still a concept and subject to change (although I hope not significantly). I will not discuss the contents of individual courses or tracks in this post, but instead focus on the overall structure of the program.

"Broad where we must, Deep where we can"

As blogged earlier, the MATTER program has the following tracks:


The program's track design reflects one of the basic principles of our program: "Broad where we must, Deep where we can". Because of this we decided on a three-layered approach in our program.
Layer 1 is broad and teaches basic and advanced methodology/technology tracks for Data Vault, Anchor Modeling and Temporal Data Modeling. The Architecture and ETL tracks have been integrated with the Data Vault track because there is a lot of overlap there. We don't assume all students want to follow all layer 1 courses when doing the program, and some students might only want to follow some of the courses in this layer without following the rest of the program. This layer is probably the most dynamic one as well. We want to offer possibilities to extend knowledge without trying to teach everything. Expect more (optional) courses in the next iteration of the program (basic Data Vault, Dimensional Modeling, Agile etc.), along with additional partnerships.
In the 2nd layer we go deeper and try to connect knowledge from layer 1. We will still focus on specific methodology and architecture discussions. The main track here is on Information Modeling; the Architecture course also formally belongs to this layer.
The Master Class is where things come together and we tie the knowledge from the different tracks into a more consistent whole. We will discuss the mechanics of creating modeling approaches by understanding modeling techniques and their place within (DWH) architectures. We will also focus on the relation between architecture, methodology and modeling.

Course Overview

The following table shows all courses of the MATTER program, their intended level, prerequisites and number of days.

ID                 | Name                                      | Track  | Level | Program Prerequisite                                                                   | Days
TEMP INTRO (I)     | Time in the Database                      | TEMP   | 250   | SQL, ER, 3(T)NF                                                                        | 1
TEMP MOD (II)      | Temporal Data Modeling                    | TEMP   | 350   | TEMP INTRO (I) + Dimensional Modeling                                                  | 1
AM MOD (I)         | Anchor Modeling                           | AM     | 300   | TEMP INTRO (I)                                                                         | 2
AM ADV (II)        | Anchor Modeling Architecture              | AM     | 350   | AM MOD (I)                                                                             | 1
AM IMPL (III)      | AM Implementation                         | AM     | 400   | AM ADV (II)                                                                            | 1
AM CERT            | AM Certification                          | AM     |       | AM ADV (II)                                                                            | ½
AM ADV CERT        | Advanced AM Certification                 | AM     |       | AM CERT + AM PROJECT                                                                   | ½
DV MOD (I)         | Advanced DV Modeling                      | DV     | 350   | Professional DV knowledge (e.g. DV Cert, DV course + DV experience)                    | 2
DV ARCH (II)       | DV + DWH Architecture                     | DV     | 350   | DV MOD (I)                                                                             | 3
DV ADV (III)       | DV Advanced Architecture & Modeling       | DV     | 400   | DV ARCH (II)                                                                           | 2
FCO-IM INTRO (I)   | Hands-on FCO-IM                           | FCO-IM | 250   | Basic database modeling/design                                                         | 3
FCO-IM CASE (II)   | FCO-IM Case                               | FCO-IM | 300   | FCO-IM INTRO (I)                                                                       | 3
FCO-IM TRANS (III) | Transformation Algorithms + Case          | FCO-IM | 400   | FCO-IM CASE (II) + Dimensional Modeling + 2 of 3: DV MOD (I), AM MOD (I), TMP MOD (I)  | 4
DWH MASTERCLASS    | Fact Oriented + Temporal DWH Architecture | MSTR   | 450   | FCO-IM TRANS (III) + AM ADV (II) or DV ARCH (II)                                       | 2
EXAM & CERT        | Examination + Certification               | MSTR   |       | DWH MASTERCLASS                                                                        | 1

In other blog posts I'll discuss tracks and courses in more detail. (If you have specific questions about or interest in a course, just contact me.)

Course Dependencies

The following diagram shows the detailed dependencies between the tracks and their courses. For example, we assume students following the FCO-IM transformation course understand two out of three data modeling techniques from our program (Anchor Modeling, Data Vault or generic Temporal Data Modeling). As you can see from the dependency arrows, the masterclass on modeling and automation combines knowledge gained from the information modeling track and the architecture courses, in order to understand the advanced architectural concepts that should underpin our data warehouse designs.


Course subject matter matrix

We also have a detailed subject matter matrix per course, giving an indication of the types of subjects that will be taught in each course.
Course\MATTER      | Modeling | Metadata | Arch | Autom | Temp | Trans | EDW | RDBMS | Impl | Related
TEMP INTRO (I)     | -        | -        | -    | -     | ++   | -     | -   | ++    | ++   | -
TEMP MOD (II)      | +        | +        | +    | +     | ++   | ++    | -   | +     | +    | -
AM MOD (I)         | ++       | -        | +    | -     | -    | +     | -   | +     | +    | -
AM ADV (II)        | +        | +        | +    | -     | ++   | +     | +   | -     | -    | +
AM IMPL (III)      | -        | -        | +    | +     | +    | -     |     | ++    | ++   |
AM CERT            | ++       | +        | +    | -     | +    | +     | -   | -     | -    | -
AM ADV CERT        | ++       | +        | +    | -     | +    | +     | +   | -     | -    | -
DV MOD (I)         | ++       | +        | +    | +     | ++   | ++    | +   | +     | ++   | -
DV ARCH (II)       | +        | ++       | ++   | ++    | +    | +     | ++  | ++    | +    | +
DV ADV (III)       | ++       | ++       | ++   | +     | +    | ++    | ++  | -     | +    | ++
FCO-IM INTRO (I)   | ++       | -        | -    | -     | -    | ++    | -   | -     | -    | -
FCO-IM CASE (II)   | +++      | -        | -    | +     | -    | +     | -   | +     | +    | -
FCO-IM TRANS (III) | ++       | +        | +    | +     | +    | +++   | -   | +     | +    | -
DWH MASTERCLASS    | ++       | ++       | +++  | ++    | +    | +++   | ++  | +     | -    | +++
EXAM & CERT        | ++       | +        | +++  | +     | ++   | +++   | ++  | -     | -    | +


Course Subjects:

  • Modeling
    We'll do a LOT of modeling (different styles, techniques and approaches) within the courses of the program.
  • Metadata
    We'll discuss the important concepts around metadata modeling and usage.
  • Architecture
    We put current and future (Data Warehouse/ETL) architectures and their construction in perspective.
  • Automation
    Covers all subjects related to tooling and automation of Data Warehouse design and implementation.
  • Temporalization
    All time-related subjects, like timelines and their implementations.
  • (Model) Transformation
    This subject is mainly model-to-model transformation, like 3NF to Data Vault.
  • Enterprise Data Warehouse
  • (R)DBMS systems
    This stands for understanding relational theory as well as actual DBMS systems like Oracle and SQL Server.
  • Implementation
    Here we will focus on implementation details, e.g. in SQL or ETL tools.
  • Related knowledge
    Here we put all related non-core subjects that we will address, like:

    • Enterprise and Information Architecture
    • Data Quality
    • Data Management and Data Governance
    • Security and Privacy
    • Agile development
    • NoSQL and Big Data
    • Business Rules