Showing posts with label PostgreSQL. Show all posts
Showing posts with label PostgreSQL. Show all posts

Tuesday, April 30, 2019

A quick and easy way to count the percentage of nulls without a where clause in PostgreSQL


This came up the other day, someone wanted to know the percentage of NULL values in a column

Then I said "I bet you I can run that query without using a NULL in the WHERE clause, as a matter of fact, I can run that query without a WHERE clause at all!!"

I created a SQL Server version of this post here: How to count NULLS without using IS NULL in a WHERE clause, this is the PostgreSQL version


To start, first create this table and verify you have 9 rows

CREATE TABLE foo(bar int);
INSERT INTO foo values(1),(null),(2),(3),(4),
 (null),(5),(6),(7);

SELECT * FROM foo;

Here is what the output should look like



To get the NULL values and NON NULL values, you can do something like this


SELECT COUNT(*) as CountAll FROM foo WHERE bar IS NOT NULL;
SELECT COUNT(*) as CountAll FROM foo WHERE bar IS  NULL;

However, there is another way

Did you know that COUNT behaves differently if you use a column name compared to when you use *

Take a look

SELECT COUNT(*) as CountAll, 
  COUNT(bar) as CountColumn
FROM foo;

If you ran that query, the result is the following

CountAll    CountColumn
----------- -----------
9           7

Here is what it looks like in pgAdmin3




Let's see what the PostgreSQL documentation has to say


COUNT(*) number of input rows

COUNT(expression) number of input rows for which the value of expression is not null

This is thus indeed documented behavior, count(*) counts all rows, count(column) skips the rows with NULLS

So now, lets change our query to return the percentage of non null values in the column

Here is what the query looks like

SELECT COUNT(*) as CountAll, 
  COUNT(bar) as CountColumn, 
  (COUNT(bar)/COUNT(*))*100 as PercentageOfNonNullValues 
FROM foo;

Running that gives us this output

CountAll    CountColumn percentageOfNonNullValues
----------- ----------- ---------------------------------------
9           7                  0

Here is what it looks like in pgAdmin3



As you can see, we have a 0 in the percentage column. The reason is that if you do math with integers, you will get an integer back


We can fix this easy by multiplying one of the integers by 1.0, this will convert the integer to a numeric data type . So instead of this line

 (COUNT(bar)/COUNT(*))*100 as PercentageOfNonNullValues 

We will change it to be like this

(COUNT(bar)*1.0/COUNT(*))*100 as PercentageOfNonNullValues


Here is the changed query

SELECT COUNT(*) as CountAll, 
  COUNT(bar) as CountColumn, 
  (COUNT(bar)*1.0/COUNT(*))*100 as PercentageOfNonNullValues 
FROM foo;

Here is the output

CountAll    CountColumn percentageOfNonNullValues
----------- ----------- ---------------------------------------
9           7           77.7777777777800


Here is what it looks like in pgAdmin3



Instead of doing * 1.0, you can also explicitly cast the integer like this

(CAST (COUNT(bar) as numeric(20,10))/COUNT(*))*100 as PercentageOfNonNullValues

The query now looks like this


SELECT COUNT(*) as CountAll, 
  COUNT(bar) as CountColumn, 
  (CAST (COUNT(bar) as numeric(20,10))/COUNT(*))*100 as PercentageOfNonNullValues
FROM foo;

Here is what it looks like in pgAdmin3






That's it for this short post.. hopefully you knew this, if not, then you know it now  :-)

Monday, February 18, 2019

Calculating Sexy Primes, Prime Triplets and Sexy Prime Triplets in PostgreSQL


The other day I was reading something on Hackernews and someone posted a link to a Sexy Primes wikipedia article.  I looked at that and then decided to do this in SQL Server because.. why not? Then I decided to see how different this would be to do in PostgreSQL.  For the first method to create the prime numbers it's different. For the method with the CTE it is very similar


From the Sexy Primes wikipedia link: https://en.wikipedia.org/wiki/Sexy_prime


In mathematics, sexy primes are prime numbers that differ from each other by six. For example, the numbers 5 and 11 are both sexy primes, because 11 minus 5 is 6.

The term "sexy prime" is a pun stemming from the Latin word for six: sex.

If p + 2 or p + 4 (where p is the lower prime) is also prime, then the sexy prime is part of a prime triplet.

Ok I did a couple of versions of this and over the weekend.. here is what I ended up with

So first we need a table that will just have the prime numbers

I decided to populate a table with numbers from 2 till 500 and then use the sieve of Eratosthenes method to delete the non primes

This will look like this

Create this table

CREATE  TABLE  PrimeNumbers  (N INT);


In one window run this to create the function/proc

CREATE OR REPLACE FUNCTION MakePrime() RETURNS void AS $$
DECLARE I integer := 2;
BEGIN
WHILE I <= SQRT(500) LOOP
    DELETE FROM PrimeNumbers WHERE N % I = 0 AND N > I;
    I := I + 1 ; 
END LOOP;

END;

$$ LANGUAGE plpgsql;


In a another window populate the table by making the call to the function
 

INSERT  INTO PrimeNumbers(n)
SELECT N
 FROM (SELECT generate_series(2,500) as n) x;


SELECT MakePrime() ; -- Yes that is a proc call

SELECT * FROM PrimeNumbers


Thinking about it a little more I decided to do it with a CTE instead of a loop with delete statements, if your tables will be big then the delete method is probably better... it's for you to test that out :-)

What we are doing is a NOT EXISTS query against the same cte and we are filtering out numbers that are greater than the number in the current row and are not divisible by the current number


CREATE TABLE IF NOT EXISTS  PrimeNumbers  (N INT);


;WITH cte AS (
  SELECT * FROM generate_series( 2, 500 )  n
  )

INSERT INTO PrimeNumbers
SELECT n
FROM cte
WHERE NOT EXISTS (
  SELECT n FROM  cte as cte2
WHERE cte.n > cte2.n AND cte.n % cte2.n = 0)
;

SELECT * FROM PrimeNumbers;

If we run that last select statement, we should have 95 rows

2
3
5
7
 .....
 .....
463
467
479
487
491
499

Now that we have our table filled with prime numbers till 500, it's time to run the queries

Sexy prime pairs
The sexy primes (sequences OEIS: A023201 and OEIS: A046117 in OEIS) below 500 are:

(5,11), (7,13), (11,17), (13,19), (17,23), (23,29), (31,37), (37,43), (41,47), (47,53), (53,59), (61,67), (67,73), (73,79), (83,89), (97,103), (101,107), (103,109), (107,113), (131,137), (151,157), (157,163), (167,173), (173,179), (191,197), (193,199), (223,229), (227,233), (233,239), (251,257), (257,263), (263,269), (271,277), (277,283), (307,313), (311,317), (331,337), (347,353), (353,359), (367,373), (373,379), (383,389), (433,439), (443,449), (457,463), (461,467).


Here is that query for the sexy prime pairs

-- 46 rows.. sexy primes
SELECT t1.N,t2.N 
 FROM PrimeNumbers t1
join PrimeNumbers t2 on t2.N - t1.N = 6 
order by 1

It's very simple.. a self join that returns rows where the number from one table alias and the number from the other table alias differ by 6




Prime triplets
The first prime triplets below 500 (sequence A098420 in the OEIS) are

(5, 7, 11), (7, 11, 13), (11, 13, 17), (13, 17, 19), (17, 19, 23), (37, 41, 43), (41, 43, 47), (67, 71, 73), (97, 101, 103), (101, 103, 107), (103, 107, 109), (107, 109, 113), (191, 193, 197), (193, 197, 199), (223, 227, 229), (227, 229, 233), (277, 281, 283), (307, 311, 313), (311, 313, 317), (347, 349, 353), (457, 461, 463), (461, 463, 467)

A prime triplet contains a pair of twin primes (p and p + 2, or p + 4 and p + 6), a pair of cousin primes (p and p + 4, or p + 2 and p + 6), and a pair of sexy primes (p and p + 6).

So we need to check that the 1st and 3rd number have a difference of 6, we also check that that difference between number 1 and 2 is 2 or 4.  That query looks like this


-- 22 rows.. Prime Triplets
SELECT t1.N AS N1,t2.N AS N2, t3.N AS N3
 FROM PrimeNumbers t1
join PrimeNumbers t2 on t2.N > t1.N 
join PrimeNumbers t3 on t3.N - t1.N = 6
and t3.N > t2.N
and t2.n - t1.n IN (2,4)
order by 1

Here is what it looks like from pgAdmin






Sexy prime triplets
Triplets of primes (p, p + 6, p + 12) such that p + 18 is composite are called sexy prime.  p p, p+6 and p+12 are all prime, but p+18 is not

Those below 500 (sequence OEIS: A046118) are:

(7,13,19), (17,23,29), (31,37,43), (47,53,59), (67,73,79), (97,103,109), (101,107,113), (151,157,163), (167,173,179), (227,233,239), (257,263,269), (271,277,283), (347,353,359), (367,373,379)


The query looks like this.. instead of a self join, we do a triple self join, we also check that p + 18 is not a prime number in the line before the order by

-- 14 rows.. Sexy prime triplets
SELECT t1.N AS N1,t2.N AS N2, t3.N AS N3
 FROM PrimeNumbers t1
join PrimeNumbers t2 on t2.n - t1.n = 6
join PrimeNumbers t3 on t3.N - t1.N = 12
and t3.N > t2.N
AND NOT EXISTS( SELECT null FROM PrimeNumbers p WHERE p.n = t1.n +18)
order by 1



And that's it for this post.  If you are interested in the SQl Server version, you can find it here: Calculating Sexy Primes, Prime Triplets and Sexy Prime Triplets in SQL Server


More PostgreSQL posts can be found here:  /label/PostgreSQL

Monday, December 3, 2018

Using PostgreSQL's Interval to mimic SQL Server's DATEADD function




I wanted to do some date calculations in PostgreSQL and was doing some research on if something like the DATEDIFF function that exists in SQL Server is available in PostgreSQL. These notes are mostly for me so I can refer back to them..but maybe they are useful for someone else as well

In SQL Server to add dates, minutes and other fractions of a date to a date, you can use the DATEADD function

Here are some quick examples, if you want to run this in SQL Server, create this table with one row first

CREATE  TABLE test(SomeDate date);

INSERT INTO test
VALUES('20120101');


And here is a simple DATEADD query, that adds 1 or 2 days to a date by using both the datepart and the abbreviated datepart

SELECT SomeDate,
  DATEADD(dd,1,SomeDate) as Interval1dd,
  DATEADD(dd,2,SomeDate) as Interval2dd,
  DATEADD(day,1,SomeDate) as Interval1Day,
  DATEADD(day,2,SomeDate) as Interval2Day
FROM test


That will give us the following output

SomeDate Interval1dd Interval2dd Interval1Day Interval2Day
2012-01-01 2012-01-02 2012-01-03 2012-01-02 2012-01-03


If you want to go negative, all you have to do is place a minus sign in front of the number

SELECT SomeDate,
  DATEADD(dd,-1,SomeDate) as Interval1dd,
  DATEADD(dd,-2,SomeDate) as Interval2dd,
  DATEADD(day,-1,SomeDate) as Interval1Day,
  DATEADD(day,-2,SomeDate) as Interval2Day
FROM test

Here is the output of that query

SomeDate Interval1dd Interval2dd Interval1Day Interval2Day
2012-01-01 2011-12-31 2011-12-30 2011-12-31 2011-12-30

Here are all the valid datepart arguments in SQL Server


datepart Abbreviations
year yy, yyyy
quarter qq, q
month mm, m
dayofyear dy, y
day dd, d
week wk, ww
weekday dw, w
hour hh
minute mi, n
second ss, s
millisecond ms
microsecond mcs
nanosecond ns


Now let's take a look at PostgreSQL

In PostgreSQL, there is no DATEPART function, but you can use interval literals to accomplish something that behaves the same

Let's do something similar like we did with the SQL Server queries, first create this temp table with one row


CREATE Temp TABLE test(SomeDate date);


INSERT INTO test
VALUES( to_date('20120101','YYYYMMDD'));



Now let's run this query

SELECT SomeDate,(SomeDate + 1 * INTERVAL '1 Day' ) as Interval1Day,
  (SomeDate + 1 * INTERVAL '1 D')  as IntervalD,
  (SomeDate + 2 * INTERVAL '1 Day' ) as Interval2Times1Day,
  (SomeDate + 1 * INTERVAL '2 Day' ) as Interval2Days,
  (SomeDate + 2 * INTERVAL '2 Days' ) as Interval2Times2Days,
  (SomeDate +  INTERVAL '1 Day' )    as IntervalD
FROM test



Here is the output

"2012-01-01";"2012-01-02 00:00:00";"2012-01-02 00:00:00";"2012-01-03 00:00:00";"2012-01-03 00:00:00";"2012-01-05 00:00:00";"2012-01-02 00:00:00"

When you copy from pgAdmin , you don't get the column aliases, so below is a screenshot of what it looks like(Click on the image for a bigger sized image)





As you can see there are two parts where you can supply a number

I think I prefer the top one from the query below, since it resembles the DATEPART function more

(SomeDate + 2 * INTERVAL '1 Day' ) as Interval2Times1Day,
(SomeDate + 1 * INTERVAL '2 Day' ) as Interval2Days,



But as you saw,it was possible to add 4 days by using a one in both places like shown below

(SomeDate + 2 * INTERVAL '2 Days' ) as Interval2Times2Days,

And of course if you want, you can just use the number inside the string like in this example below

(SomeDate +  INTERVAL '1 Day' )    as IntervalD


It's up to you, but I don't like changing numbers inside a string

To do negative numbers, you just change the positive number to a negative number, here is the same query from before but now with negative numbers

SELECT SomeDate,(SomeDate + -1 * INTERVAL '1 Day' ) as Interval1Day,
  (SomeDate + -1 * INTERVAL '1 D')    as IntervalD,
  (SomeDate + -2 * INTERVAL '1 Day' ) as Interval2Times1Day,
  (SomeDate + -1 * INTERVAL '2 Day' ) as Interval2Days,
  (SomeDate + -2 * INTERVAL '-2 Days')as Interval2Times2Days,
  (SomeDate +  INTERVAL '-1 Day' )    as IntervalD
FROM test

Here is the output
"2012-01-01";"2011-12-31 00:00:00";"2011-12-31 00:00:00";"2011-12-30 00:00:00";"2011-12-30 00:00:00";"2012-01-05 00:00:00";"2011-12-31 00:00:00"

As you can see that all works as expected, did you notice that the we get +4 when we do -2 * -2?

Here is the output also from pgAdmin so that you can see the column aliases


Besides using days, you can also use these parts of a date


Abbreviation Meaning
Y Years
M Months (in the date part)
W Weeks
D Days
H Hours
M Minutes (in the time part)
S Seconds


Let's take a look by using some of these in a query


SELECT SomeDate,(SomeDate + 1 * INTERVAL '1 Day' ) as Interval1Day,
  (SomeDate + 1 * INTERVAL '1 Week' ) as Interval1Week,
  (SomeDate + 1 * INTERVAL '1 Month' ) as Interval1Month,
  (SomeDate + 1 * INTERVAL '1 Year' ) as Interval1Year
  
FROM test
UNION ALL
SELECT SomeDate,(SomeDate + 1 * INTERVAL '1 D' ) as Interval1Day,
  (SomeDate + 1 * INTERVAL '1 W' ) as Interval1Week,
  (SomeDate + 1 * INTERVAL '1 M' ) as Interval1Month,
  (SomeDate + 1 * INTERVAL '1 Y' ) as Interval1Year
  
FROM test

Here is the output

"2012-01-01";"2012-01-02 00:00:00";"2012-01-08 00:00:00";"2012-02-01 00:00:00";"2013-01-01 00:00:00"
"2012-01-01";"2012-01-02 00:00:00";"2012-01-08 00:00:00";"2012-01-01 00:01:00";"2013-01-01 00:00:00"


Here is the output also from pgAdmin so that you can see the column aliases



As you can see when using M it used minute not month. I would recommend to always use the full name and not the abbreviated part so as not to create confusion

That's all for this post

Sunday, November 25, 2018

Easy running totals with windowing functions in PostgreSQL




Back in the pre windowing function days, if you wanted to do a running count, you either had to run a subquery or you could use a variable. This was slow because for each row the query that did the sum would be executed. With windowing functions in PostgreSQL, this is now running much faster. 

Let's take a look, first create the following table


CREATE Temp TABLE test(Id int,SomeDate date, Charge decimal(20,10));


insert into test
values( 1,to_date('20120101','YYYYMMDD'),1000);
insert into test
values( 1,to_date('20120401','YYYYMMDD'),200);
insert into test
values( 1,to_date('20120501','YYYYMMDD'),300);
insert into test
values( 1,to_date('20120601','YYYYMMDD'),600);
insert into test
values( 2,to_date('20120101','YYYYMMDD'),100);
insert into test
values( 2,to_date('20120101','YYYYMMDD'),500);
insert into test
values( 2,to_date('20120101','YYYYMMDD'),-800);
insert into test
values( 3,to_date('20120101','YYYYMMDD'),100);


let's check that data we just inserted into the temporary table


SELECT * FROM test


The output looks like this

Id SomeDate Charge
1 2012-01-01 1000.0000000000
1 2012-04-01 200.0000000000
1 2012-05-01 300.0000000000
1 2012-06-01 600.0000000000
2 2012-01-01 100.0000000000
2 2013-01-01 500.0000000000
2 2014-01-01 -800.0000000000
3 2012-01-01 100.0000000000


What we want is the following

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

For each row, we want to have the date that the row starts on and also the date when it ends, we also want a running total as well. If there is no row after the current row for that id, we want the end date to be 9999-12-31.

So we will use a couple of functions. The first one is LEAD, LEAD accesses data from a subsequent row in the same result set without the use of a self-join. So the LEAD part looks like this

LEAD((SomeDate + -1 * INTERVAL '1 day' ),1,'99991231') OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,

What we are doing is subtracting 1 from the date in the subsequent row (SomeDate + -1 * INTERVAL '1 day' )
We are using 1 as the offset since we want to apply this to the next row. Finally if there is no subsequent row, we want to use the date 9999-12-31 instead of NULL

To do the running count, we will do the following

SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          AS RunningTotal

What this means in English is for each id ordered by date, sum up the charge values for the rows between the preceding rows and the current row. Here is what all that stuff means.

ROWS BETWEEN
Specifies the rows that make up the range to use as implied by

UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can only be specified as window starting point.

CURRENT ROW
Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE.
CURRENT ROW can be specified as both a starting and ending point.

And here is the query


SELECT id, someDate as StartDate,
LEAD((SomeDate + -1 * INTERVAL '1 day' ),1,'99991231')
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM test
  ORDER BY id, SomeDate


And running that query, gives us the running count as well as the end dates 

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2011-12-31 100.0000000000 100.0000000000
2 2012-01-01 2011-13-31 500.0000000000 600.0000000000
2 2012-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000


Here is what it looks like if you execute the query in PGAdmin



If you don't want the last row to have the end date filled in, just omit the default value in the LEAD function. Instead of

LEAD((SomeDate + -1 * INTERVAL '1 day' ),1,'99991231')

Make it

LEAD((SomeDate + -1 * INTERVAL '1 day' ),1)

Here is the whole query again


SELECT id, someDate as StartDate,
LEAD((SomeDate + -1 * INTERVAL '1 day' ),1)
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM test
  ORDER BY id, SomeDate
And here is what the output looks like after we made the change
As you can see the rows for an id that doesn't have a row with a date greater than the current date will have a null end date
That is all for this post

Friday, November 23, 2018

Happy Fibonacci day, here is how to generate a Fibonacci sequence in PostgreSQL


Image by Jahobr - Own work, CC0, Link


Since today is Fibonacci day I decided to to a short post about how to do generate a Fibonacci sequence in PostgreSQL. But first let's take a look at what a Fibonacci sequence actually is.

In mathematics, the Fibonacci numbers are the numbers in the following integer sequence, called the Fibonacci sequence, and characterized by the fact that every number after the first two is the sum of the two preceding ones:

 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

Often, especially in modern usage, the sequence is extended by one more initial term:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

November 23 is celebrated as Fibonacci day because when the date is written in the mm/dd format (11/23), the digits in the date form a Fibonacci sequence: 1,1,2,3.

So here is how you can generate a Fibonacci sequence in PostgreSQL, you can do it by using s recursive table expression.  Here is what it looks like if you wanted to generate the Fibonacci sequence to up to a value of 1 million

;WITH RECURSIVE Fibonacci (Prev, Next) as
(
     SELECT 0, 1
     UNION ALL
     SELECT Next, Prev + Next
     FROM Fibonacci
     WHERE Next < 1000000
)
SELECT Prev as Fibonacci
     FROM Fibonacci
     WHERE Prev < 1000000




That will generate a Fibonacci sequence that starts with 0 if you need the Fibonacci sequence to start at 1, all you have to do is replace the 1 to 0 in the first select statement

;WITH RECURSIVE Fibonacci (Prev, Next) as
(
     SELECT 1, 1
     UNION ALL
     SELECT Next, Prev + Next
     FROM Fibonacci
     WHERE Next < 1000000
)
SELECT Prev as Fibonacci
     FROM Fibonacci
     WHERE Prev < 1000000


Here is what it looks like in PGAdmin when you run the query



Happy Fibonacci day!!


Here is pretty much the same post that I created for SQL Server: Happy Fibonacci day, here is how to generate a Fibonacci sequence in SQL