Data Matters (Unseen): 2014

Friday, June 27, 2014

Short cutting Temporal Joins

Introduction

When you create temporal data models, whether with Data Vault, Anchor Modeling or any other method, you end up with several tables with timelines in them. These timelines often need to be combined on the way to Data Marts or other processing. This combining is done with a temporal JOIN.

Example

Table Product
Table Product Cost
Table Product Price
Calculating Product Revenue = Sales Price - Manufacturing Cost

Conceptually the query will look like this (using the Allen operators):

SELECT "product cost".cost                            AS Cost,
       "product price".price                          AS Price,
       "product price".price - "product cost".cost    AS Revenue,
       "product cost".period INTERSECTS "product price".period AS Period
FROM   "product cost"
       INNER JOIN "product price"
               ON "product price".id = "product cost".id
WHERE  "product cost".period OVERLAPS "product price".period

The issues here are twofold: how to implement this in plain SQL, and what to define for Revenue for the omissions, the periods in which either the cost or the price is missing. Here we could state that the revenue is undefined, so in this case the first period is not available. Alas, this is not always the correct result.

SQL Implementation

There are a lot of ways to construct an SQL query that implements an intersection of 2 timelines. The simplest is to check for overlap between each pair of periods. This quickly creates unwieldy queries when more timelines need to be intersected. A more sophisticated approach is a rewrite. Given that x, y and z are periods, rewrite

x OVERLAPS y OVERLAPS z to NOT EMPTY(x INTERSECTS y INTERSECTS z)

Say a = (x INTERSECTS y INTERSECTS z), then a runs from

max(x.start_date,y.start_date,z.start_date) to min(x.end_date,y.end_date,z.end_date)

and a is not empty when this is a valid period, which means:
max(x.start_date,y.start_date,z.start_date) < min(x.end_date,y.end_date,z.end_date)

Basic Greatest/Least SQL Query
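A minimal sketch of such a query for the product example, written with Oracle-style GREATEST and LEAST. The table and column names (product_cost, product_price, id, start_date, end_date) are assumptions made for illustration, and the periods are taken to be [closed, open).

```sql
-- Sketch only: table and column names are assumed, not taken from a real model.
-- The intersection of two periods runs from the greatest start date
-- to the least end date; it is kept only when that is a valid period.
SELECT pc.id,
       pc.cost,
       pp.price,
       pp.price - pc.cost                      AS revenue,
       GREATEST(pc.start_date, pp.start_date)  AS start_date,
       LEAST(pc.end_date, pp.end_date)         AS end_date
FROM   product_cost pc
       INNER JOIN product_price pp
               ON pp.id = pc.id
WHERE  GREATEST(pc.start_date, pp.start_date) < LEAST(pc.end_date, pp.end_date);
```

Adding a third timeline only takes one more join and one more argument to each function, instead of an extra overlap check per pair of tables.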

Timeline Alignment

All these kinds of queries only work correctly when the different tables have aligned their timelines (the same first start date and last end date for each key). For the end date this works out as long as a 'high end date' is used. For start dates, additional records need to be created with empty/default/null values to align on a 'low start date' (say 1753-01-01).
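One way to create those additional start records could look like the sketch below (T-SQL). PRODUCT_PRICE and its columns are hypothetical names, and PRICE is assumed to be nullable.

```sql
-- Sketch only: PRODUCT_PRICE(DV_ID, START_DT, END_DT, PRICE) is an assumed table.
-- For every key whose timeline starts after the low start date,
-- add a filler record with a NULL price that covers the gap.
INSERT INTO PRODUCT_PRICE (DV_ID, START_DT, END_DT, PRICE)
SELECT FIRST_REC.DV_ID,
       CAST('17530101' AS datetime),
       FIRST_REC.FIRST_START_DT,
       NULL
FROM  (SELECT DV_ID, MIN(START_DT) AS FIRST_START_DT
       FROM   PRODUCT_PRICE
       GROUP  BY DV_ID) AS FIRST_REC
WHERE  FIRST_REC.FIRST_START_DT > CAST('17530101' AS datetime);
```

This only aligns keys that already have at least one record in the table. Keys that are missing completely would need a filler record that spans the whole timeline.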

DBMS implementation

These types of queries can be made to work in Oracle, Teradata and SQL Server. In Oracle and Teradata they can be used directly because the functions greatest and least are defined. In SQL Server, however, this is not the case.

SQL Server (T-SQL) implementation

To create this type of query in SQL Server we need both a Least and a Greatest function. We can either code a .NET assembly for this, or a user defined function (UDF). Alas, a scalar UDF won't perform at all. A trick here is to rewrite the scalar UDF as an inline table-valued function (an inline view UDF). These get optimized and perform OK. We only need to (CROSS) APPLY them in the query to get the wanted result.
One of the issues here is that you cannot use the helper functions in indexed views. There you need to substitute the CASE expression itself, and duplicate it in both the SELECT and the WHERE clause.
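For two timelines, the substituted CASE expressions could look like the sketch below. It uses the SAT1 and SAT2 tables from the usage pattern further on, assumes non-null dates, and shows only the query body, not a complete indexed view definition.

```sql
-- Sketch only: greatest and least for two timelines written out as CASE expressions.
-- The same expressions appear in both the SELECT list and the WHERE clause.
select
  S1.DV_ID,
  case when S1.START_DT >= S2.START_DT then S1.START_DT else S2.START_DT end as START_DT,
  case when S1.END_DT <= S2.END_DT then S1.END_DT else S2.END_DT end as END_DT
from dbo.SAT1 as S1
inner join dbo.SAT2 as S2 on S1.DV_ID = S2.DV_ID
where
  case when S1.START_DT >= S2.START_DT then S1.START_DT else S2.START_DT end
  < case when S1.END_DT <= S2.END_DT then S1.END_DT else S2.END_DT end -- [closed, open) period
```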

Helper functions

Here are the SQL Server helper functions greatest and least and their usage pattern.

-- Inline table-valued function: least of up to five datetime values.
-- The rotated COALESCEs replace each NULL argument by the next non-NULL one,
-- so the result is NULL only when all arguments are NULL.
CREATE FUNCTION [dbo].[ivf_least] (@1 datetime = null, @2 datetime = null, @3 datetime = null, @4 datetime = null, @5 datetime = null)
RETURNS TABLE WITH SCHEMABINDING
RETURN
(WITH coal(a1,a2,a3,a4,a5)
AS (SELECT COALESCE(@1,@2,@3,@4,@5),COALESCE(@2,@3,@4,@5,@1),COALESCE(@3,@4,@5,@1,@2),COALESCE(@4,@5,@1,@2,@3),COALESCE(@5,@1,@2,@3,@4))
SELECT case when a1<=a2 and a1<=a3 and a1<=a4 and a1<=a5 then a1
when a2<=a3 and a2<=a4 and a2<=a5 then a2
when a3<=a4 and a3<=a5 then a3
when a4<a5 then a4
else a5 end as out from coal);
GO
-- Inline table-valued function: greatest of up to five datetime values.
-- NULL arguments are handled in the same way as in ivf_least.
CREATE FUNCTION [dbo].[ivf_greatest] (@1 datetime = null, @2 datetime = null, @3 datetime = null, @4 datetime = null, @5 datetime = null)
RETURNS TABLE WITH SCHEMABINDING
RETURN
(WITH coal(a1,a2,a3,a4,a5)
AS (SELECT COALESCE(@1,@2,@3,@4,@5),COALESCE(@2,@3,@4,@5,@1),COALESCE(@3,@4,@5,@1,@2),COALESCE(@4,@5,@1,@2,@3),COALESCE(@5,@1,@2,@3,@4))
SELECT case when a1>=a2 and a1>=a3 and a1>=a4 and a1>=a5 then a1
when a2>=a3 and a2>=a4 and a2>=a5 then a2
when a3>=a4 and a3>=a5 then a3
when a4>a5 then a4
else a5 end as out from coal);
GO

Use these functions in the following way:

CREATE TABLE SAT1 (DV_ID int NOT NULL, START_DT datetime NOT NULL, END_DT datetime, PRIMARY KEY (DV_ID,START_DT))
GO
CREATE TABLE SAT2 (DV_ID int NOT NULL, START_DT datetime NOT NULL, END_DT datetime, PRIMARY KEY (DV_ID,START_DT))
GO
-- Intersect the two timelines per key: latest start date, earliest end date.
select
SAT1.DV_ID,
GREATEST_DT.out as START_DT,
LEAST_DT.out as END_DT
from SAT1 inner join SAT2 on SAT1.DV_ID=SAT2.DV_ID
cross apply dbo.ivf_greatest(SAT1.START_DT,SAT2.START_DT,NULL,NULL,NULL) as GREATEST_DT
cross apply dbo.ivf_least(SAT1.END_DT,SAT2.END_DT,NULL,NULL,NULL) as LEAST_DT
where
GREATEST_DT.out < LEAST_DT.out -- assuming a [closed, open) period
GO
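
The same pattern should extend to more timelines by filling in more of the five arguments. A sketch with a hypothetical third satellite SAT3 that has the same columns as SAT1 and SAT2:

```sql
-- Sketch only: SAT3 is an assumed table with the same layout as SAT1 and SAT2.
select
SAT1.DV_ID,
GREATEST_DT.out as START_DT,
LEAST_DT.out as END_DT
from SAT1
inner join SAT2 on SAT2.DV_ID=SAT1.DV_ID
inner join SAT3 on SAT3.DV_ID=SAT1.DV_ID
cross apply dbo.ivf_greatest(SAT1.START_DT,SAT2.START_DT,SAT3.START_DT,NULL,NULL) as GREATEST_DT
cross apply dbo.ivf_least(SAT1.END_DT,SAT2.END_DT,SAT3.END_DT,NULL,NULL) as LEAST_DT
where
GREATEST_DT.out < LEAST_DT.out -- one validity check replaces three pairwise overlap checks
GO
```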