Sunday, November 25, 2018

Easy running totals with windowing functions in PostgreSQL




Back in the pre windowing function days, if you wanted to do a running count, you either had to run a subquery or you could use a variable. This was slow because for each row the query that did the sum would be executed. With windowing functions in PostgreSQL, this is now running much faster. 

Let's take a look, first create the following table


CREATE Temp TABLE test(Id int,SomeDate date, Charge decimal(20,10));


insert into test
values( 1,to_date('20120101','YYYYMMDD'),1000);
insert into test
values( 1,to_date('20120401','YYYYMMDD'),200);
insert into test
values( 1,to_date('20120501','YYYYMMDD'),300);
insert into test
values( 1,to_date('20120601','YYYYMMDD'),600);
insert into test
values( 2,to_date('20120101','YYYYMMDD'),100);
insert into test
values( 2,to_date('20120101','YYYYMMDD'),500);
insert into test
values( 2,to_date('20120101','YYYYMMDD'),-800);
insert into test
values( 3,to_date('20120101','YYYYMMDD'),100);


let's check that data we just inserted into the temporary table


SELECT * FROM test


The output looks like this

Id SomeDate Charge
1 2012-01-01 1000.0000000000
1 2012-04-01 200.0000000000
1 2012-05-01 300.0000000000
1 2012-06-01 600.0000000000
2 2012-01-01 100.0000000000
2 2013-01-01 500.0000000000
2 2014-01-01 -800.0000000000
3 2012-01-01 100.0000000000


What we want is the following

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

For each row, we want to have the date that the row starts on and also the date when it ends, we also want a running total as well. If there is no row after the current row for that id, we want the end date to be 9999-12-31.

So we will use a couple of functions. The first one is LEAD, LEAD accesses data from a subsequent row in the same result set without the use of a self-join. So the LEAD part looks like this

LEAD((SomeDate + -1 * INTERVAL '1 day' ),1,'99991231') OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,

What we are doing is subtracting 1 from the date in the subsequent row (SomeDate + -1 * INTERVAL '1 day' )
We are using 1 as the offset since we want to apply this to the next row. Finally if there is no subsequent row, we want to use the date 9999-12-31 instead of NULL

To do the running count, we will do the following

SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          AS RunningTotal

What this means in English is for each id ordered by date, sum up the charge values for the rows between the preceding rows and the current row. Here is what all that stuff means.

ROWS BETWEEN
Specifies the rows that make up the range to use as implied by

UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can only be specified as window starting point.

CURRENT ROW
Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE.
CURRENT ROW can be specified as both a starting and ending point.

And here is the query


SELECT id, someDate as StartDate,
LEAD((SomeDate + -1 * INTERVAL '1 day' ),1,'99991231')
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM test
  ORDER BY id, SomeDate


And running that query, gives us the running count as well as the end dates 

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2011-12-31 100.0000000000 100.0000000000
2 2012-01-01 2011-13-31 500.0000000000 600.0000000000
2 2012-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000


Here is what it looks like if you execute the query in PGAdmin



If you don't want the last row to have the end date filled in, just omit the default value in the LEAD function. Instead of

LEAD((SomeDate + -1 * INTERVAL '1 day' ),1,'99991231')

Make it

LEAD((SomeDate + -1 * INTERVAL '1 day' ),1)

Here is the whole query again


SELECT id, someDate as StartDate,
LEAD((SomeDate + -1 * INTERVAL '1 day' ),1)
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM test
  ORDER BY id, SomeDate
And here is what the output looks like after we made the change
As you can see the rows for an id that doesn't have a row with a date greater than the current date will have a null end date
That is all for this post

No comments: