Friday, June 27, 2014

Short cutting Temporal Joins

Introduction

When you create temporal data models, either with Data Vault, Anchor modeling or any other way, you end up with several tables with timelines in them. These timelines often need to be combined for going to Data Marts or other processing. This combining is done with a temporal JOIN

Example

  • Table Product
  • Table Product Cost
  • Table Product Price
  • Calculating Product Revenue = Sales Price- Manufacturing Cost



Conceptually the query will look like this (using the Allen operator)

SELECT "product cost".cost                            Cost,

       "product price".price                          AS Price,

       "product price".price - "product cost".cost    AS Revenue,

       "product cost".period 
INTERSECTS "product price".period AS Period

FROM   "product cost"

       INNER JOIN "product price"

               ON "product price".id = "product cost".id

WHERE  "product cost".period 
OVERLAPS "product price".period


The issues here are twofold, how to implement this in plain SQL, and what to define for Revenue for the omissions. Here we could state that the revenue is undefined, so in this case the first period is not available. Alas, this is not always the correct result.


SQL Implementation

There are a lot of ways to construct an SQL query that implements an intersection of 2 timelines. The most simplest are checking overlap between each 2 periods. This will quickly create unwieldy queries when more timelines need to be intersected. A more sophisticated approach rewrite, given x,y,z are periods:
y OVERLAPS y OVERLAPS y to NOT EMPTY(x INTERSECTS y INTERSECTS z) and to
say a = (x INTERSECTS y INTERSECTS z) then a.start_data

max(x.start_date,y.start_date,z.start_date) to min(x.end_date,y.end_date,z.end_date) is a valid period, which means:
max(x.start_date,y.start_date,z.start_date) < min(x.end_date,y.end_date,z.end_date)

Basic Greatest/Least  SQL Query

Timeline Alignment

All these kinds of queries only work correctly when the different tables have aligned their timelines (have the same start and end dates for each key). For the end date this works ou as long am 'high end date' is used. For start dates additional records need to be created with empty/default/null values to align on a 'low start date' (say 1753-01-01).

DBMS implementation

These types of queries work in Both Oracle, Teradata and SQL Server. In Oracle and Teradata these queries can be used because the functions greatest and least are defined. In SQL Server however this is not the case.

SQL Server (T-SQL) implementation

To create this type of query in SQL Server we need both a Least and Greatest function. we can either code a .NET assembly for this, or an user defined function. Alas the udf won't perform at all. A trick here is to rewrite a scalar udf to an inline view udf. These will get optimized and peform OK. We only need to (CROSS) APPLY them to the query to get our wanted result.
One of the issues here is that you cannot use the helper functions in indexed views, you need to substitue the case statement, and duplicate it both in the select and where clause.

Helper functions

Here are the SQL Server helper functions greatest and least and their usage pattern.
  1. CREATE FUNCTION [dbo].[ivf_least] (@1 datetime=null,@2 datetime = null ,@3 datetime = null,@4 datetime=null,@5datetime = null)
  2. RETURNS TABLE WITH SCHEMABINDING
  3. RETURN
  4. (WITH coal(a1,a2,a3,a4,a5)
  5. AS (SELECT COALESCE(@1,@2,@3,@4,@5),COALESCE(@2,@3,@4,@5,@1),COALESCE(@3,@4,@5,@1,@2),COALESCE(@4,@5,@1,@2,@3),COALESCE(@5,@1,@2,@3,@4))
  6. SELECT case when a1<=a2 and a1<=a3 and a1<=a4 and a1<=a5 then a1  
  7. when a2<=a3 and a2<=a4 and a2<=a5 then a2
  8. when a3<=a4 and a3<=a5 then a3
  9. when a4<a5 then a4
  10. else a5 end as out from coal);
  11. GO
  12. CREATE FUNCTION [dbo].[ivf_greatest] (@1 datetime=null,@2 datetime = null ,@3 datetime = null,@4 datetime=null,@5datetime = null)
  13. RETURNS TABLE WITH SCHEMABINDING
  14. RETURN
  15. (WITH coal(a1,a2,a3,a4,a5)
  16. AS (SELECT COALESCE(@1,@2,@3,@4,@5),COALESCE(@2,@3,@4,@5,@1),COALESCE(@3,@4,@5,@1,@2),COALESCE(@4,@5,@1,@2,@3),COALESCE(@5,@1,@2,@3,@4))
  17. SELECT case when a1>=a2 and a1>=a3 and a1>=a4 and a1>=a5 then a1  
  18. when a2>=a3 and a2>=a4 and a2>=a5 then a2
  19. when a3>=a4 and a3>=a5 then a3
  20. when a4>a5 then a4
  21. else a5 end as out from coal);
  22. GO

Using this functions in the following way:


  1. CREATE TABLE SAT1 (DV_ID int,START_DT datetime,END_DT datetime) PRIMARY KEY (DV_ID,START_DT)
  2. GO
  3. CREATE TABLE SAT2 (DV_ID int,START_DT datetime,END_DT datetime) PRIMARY KEY (DV_ID,START_DT)
  4. GO
  5. select
  6.         SAT1.DV_ID,
  7.         GREATEST_DT.out as START_DT,
  8.         LEAST_DT.out as END_DT
  9. from SAT1 inner join SAT2 on SAT1.DV_ID=SAT2.DV_ID
  10. cross apply ivf_greatest(SAT1.START_DT,SAT2.START_DT,NULL,NULL,NULL) as GREATEST_DT
  11. cross apply ivf_least(SAT1.END_DT,SAT2.END_DT,NULL,NULL,NULL) as LEAST_DT
  12. where
  13. GREATEST_DT.out < LEAST_DT.out -- assuming an [Closed,Open) period
  14. GO