Tuesday, October 23, 2012

Kinds of Keys: On the Nature of Key Classifications



Introduction

Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them they talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing)  and whatnot. Then a debate rises on the use (and misuse) of all these keys. For example, questions that often pop up, is a primary key always a natural key or a surrogate key? Can you have a surrogate key without a natural key? Do we need to define all keys in the DBMS layer? These discussions show that designing and developing data models is in a sad state of affairs. The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)?

The Relational Model

If we take a look at the relational model, we only see candidate keys. They are defined as a minimal set of 1 or more attributes that are unique for each tuple (record) in a relation. If we want to discern more than just keys, we need to look further. (The relational model also defines Superkeys. These do not need to contain a minimal set of attributes). So the basic formal notion is that a key is a (candidate) key, and no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys.

Of all kinds of keys, the primary key and the surrogate key gained the most discussion. Since the RM treats all keys as equal we formally do not have a primary key, it is a choice we make in the selection, and as such we might treat this key slightly different from all other available keys in a relation. If we look at 5NF, a lot of relations should only have 1 key at most, while others might have several but none would be deemed primary. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys.

If we treat surrogate keys as an arbitrary candidate key, we actually decrease the normalization of a data model from say 3NF to 2NF. This has led to the special and strict treatment we have for surrogate keys as to prevent this from leading to unwanted issues. Surrogate keys should be seen as not an extra key but as a surrogate for an existing key whose semantics and behavior is strictly regulated by the DBMS system.
With temporal data models surrogate keys are sometimes used as "versioning keys", but semantics and implementation of such keys is very tricky. Dimensional surrogate keys are even trickier since they do version tracking in reverse (related to a fact table).

Key behavior

Mostly kinds of keys signify certain behavior or certain properties of the key in question. A primary key is often designated as the key to implement foreign key constraints. A Composite key is a candidate key that consists of at least 2 attributes, while a dependent key is a candidate key where a subset is also a foreign key. Surrogate keys are keys that are system generated and (mostly) immutable and whose visibility and scope lies inside the DBMS system. Business Keys on the other hand are independent keys that are not designated as a surrogate key whose familiarity, uniqueness, visibility and scope span an entire enterprise. This way we try to find amongst others the actual identifiers that represent the reification of real world entities or abstract concepts. Human keys for example are actually not keys in the strict sense, but are non-enforceable candidate keys that heuristically identify tuples, especially those related to real world entities. They are related to the process of relating (UoD) entities to tuples. E.g. using full name and full address to identify and relate a real world person to a tuple representing that person in a database and uniquely identifying that person by a Person ID.

But what happens when a data model actually shows a system generated and (mostly) immutable key that is not a surrogate for another key? Is it still a surrogate key or not? The problem lies in the fact that we can apply any behavior and implement any property for a key. Keys can be implemented as hybrid between 'surrogate' and 'natural'. The question of good design is of course another matter, but for this post I focus on the current situation we see in our information systems today.

Key properties

Basically a keys behavior, expressed in the keys properties like context, visibility, familiarity, stability, immutability, uniqueness, compactness, auditability, dependency and its scope that are really defining it's kind. For some keys additional properties around the process of identification and reification from the  Universe of Discourse also play a role. These are things like type of control checks, related heuristic/human keys, Chance of duplication, chance of doubling, etc. IMO All these properties define a key instead of arbitrary "key classification". This is especially important when discussing pro and contra on certain keys, styles of modeling or key implementation considerations.

Key Ontology

So we can classify keys based on (standardized) properties. Based on certain properties we can state that a key is of a certain kind. This also means a key can be of several kinds at once (e.g. a composite business key is both a business key and a composite key). This way we can describe all kinds of keys, even those kinds we have no naming for. The matrix of kinds of keys and their properties describes a kind of 'key space'. This detailed description of kinds of keys allows us to manage, design and implement keys. It can be used in Data Quality Initiatives, Data Modeling, Data Integration and Master Data Management. Note that an adequate list of properties and hence a good ontology can be quite specific for a data model/system/organization (although a simple generic/standard one could probably be designed).

Key Taxonomy

Basically, most data (model) users (including data modelers, database developers etc.) like to have a small, simple, consistent  and non overlapping set of key types (or classes) instead of a detailed ontology where keys can be of several kinds. For this we can try to set up a simple classification scheme based on our ontology. Here we create a small set key types and assign each key we have to just 1 type. This classification facilitates easy defining, modeling and managing of key definitions. 

For example: within a Data Vault analysis, transformation and model generation you can classify the following types of keys: Business Key, Alternate Business Key, Dependent Key, Link Key, Temporal Key, Driving Key, Data Vault ID and Source Surrogate Key. For each key in any of the models I use I then try to classify each and every key to one of these types. For the source systems not under control of the Data Vault this is always a difficult matching process between the properties of the source key and the Data Vault classification scheme.

Such a scheme can NOT be used for analyzing the keys, since there is no guarantee a key will neatly fall in one of the existing categories of keys. For this we have the Key Ontology and the key matrix.

Conclusion

It is quite hard to discuss key classifications without good definitions and descriptions. While some kind of keys might be better understood than others, we have difficulty classifying all kinds of keys we find 'in the wild'. A property and behavior driven Key Ontology and Key matrix can be used to define, design and map out our 'key space'. From there we can try to create a consistent and accurate key classification scheme that will aid us in using, comparing and managing keys.