Solved: Re: Data step first.variable equivalent in Proc SQL

Suzy_Cat · Posted 02-18-2021 12:31 AM

Hi wonderful helpers,

Can anyone help me to get the equivalent Proc SQL code to achieve what first.variable can get in data steps?

The idea is to only take the first record for each PID (no matter how many obs, and how they been sorted) using Proc SQL code.

Here is the sample code to generate the datasets to play:

data have;    
   infile datalines;  
   input PID $1-7 code $8 +1 date date9.; 
   datalines;          
AAAACd	1	13Aug2012
AAAACd	3	13Aug2012
AAAACd	5	13Aug2012
bbbbbd	1	14Dec2018
bbbbbd	2	14Dec2018
ggggec	0	23Apr2012
ggggec	4	23Apr2012
;                          

data have;
set have;
format date date9.;
run;

data want;
set have;
if first.PID;
by pid;
run;

Kurt_Bremser · Posted 02-22-2021 07:00 AM

As was shown, MONOTONIC() is unreliable when used in conjunction with a HAVING clause.

By splitting the SQL into two steps, it works, but just look at this:

data Test;
do I=1 to 1e7;
output;
output;
end;
run;

data Test_first;
set Test;
by I;
if first.I;
run;

proc sql;
create table Test_monotonic as 
  select I, monotonic() as M
  from test 
;
create table test_m2 as
  select i, m
  from test_monotonic
  group by i
  having m = min(m)
  order by i, m
;
quit;

Log:

 73         data Test;
 74         do I=1 to 1e7;
 75         output;
 76         output;
 77         end;
 78         run;
 
 NOTE: The data set WORK.TEST has 20000000 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.78 seconds
       cpu time            0.65 seconds
       
 
 79         
 80         data Test_first;
 81         set Test;
 82         by I;
 83         if first.I;
 84         run;
 
 NOTE: There were 20000000 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.TEST_FIRST has 10000000 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           2.05 seconds
       cpu time            1.92 seconds
       
 
 85         
 86         proc sql;
 87         create table Test_monotonic as
 88           select I, monotonic() as M
 89           from test
 90         ;
 NOTE: Table WORK.TEST_MONOTONIC created, with 20000000 rows and 2 columns.
 
 91         create table test_m2 as
 92           select i, m
 93           from test_monotonic
 94           group by i
 95           having m = min(m)
 96           order by i, m
 97         ;
 NOTE: The query requires remerging summary statistics back with the original data.
 NOTE: Table WORK.TEST_M2 created, with 10000000 rows and 2 columns.
 
 98         quit;
 NOTE:  Verwendet wurde: PROZEDUR SQL - (Gesamtverarbeitungszeit):
       real time           27.45 seconds
       cpu time            22.22 seconds

As I added to my other post, SQL quickly deteriorates to performance worse by order of magnitudes.

Forcing SQL to do something that it is not designed to do is not a good idea.

Now, if you do something similar in explicit passthrough to a database, the results might be different, as dedicated DB systems have much more powerful optimization.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

andreas_lds · Posted 02-18-2021 12:34 AM

Afaik there is no equivalent to first.variable in proc sql, due to the nature of the procedure. And this has been discussed already, please use the search-function.

ChrisNZ · Posted 02-18-2021 02:16 AM

SQL is not suited to handling data in order.

SQL actually does not guarantee that a query will read the data in any given order.

One way around this is to use (very costly in terms of performance) window functions. Sadly SAS still does not support this non-ANSI-compliant feature.

High-Performance SAS Coding - Third Edition

Suzy_Cat · Posted 02-18-2021 03:20 PM

Thank you Chris, for explaining

Kurt_Bremser · Posted 02-18-2021 02:32 AM

If you have a criterium identifying the "first" observation within the group, you can use it with GROUP BY in a HAVING clause::

proc sql;
create table cwant as
  select *
  from have
  group by pid
  having code = min(code)
;
quit;

but that won't work in your case, as code is of type character, so the MIN function can't be used.

So we come back to Maxim 14: Use the Right Tool, which means the data step for this task.

Edit: I correct myself. MAX() and MIN() do work on character variables in SAS SQL, so this code creates your wanted result:

data have;    
input PID $ code $ date :date9.;
format date yymmdd10.;
datalines;          
AAAACd  1 13Aug2012
AAAACd  3 13Aug2012
AAAACd  5 13Aug2012
bbbbbd  1 14Dec2018
bbbbbd  2 14Dec2018
ggggec  0 23Apr2012
ggggec  4 23Apr2012
;                          

proc sql;
create table want as
  select *
  from have
  group by pid
  having code = min(code)
;
quit;

Note that, with large datasets, you will be MUCH better off with the data step in terms of performance. Up to orders of magnitude.

So it still boils down to Maxim 14.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Suzy_Cat · Posted 02-18-2021 03:25 PM

Thought of this solution too. It is the closest to get there.

If there is no certain criterium identifying the "first" observation within the group, then in order to keep whatever the order is, an extra step required to create a column to hold the order info for PID.

PhilC · Posted 02-18-2021 01:17 PM

I hate doing this in PROC SQL but here.

proc sql;
  create table want(drop=n) as 
  select *, monotonic() as n
    from have
    group by PID
    having (calculated n) = min(calculated n)
;
quit;

Know that monotonic is considered experimental and undocumented and may not be supported in the future by SAS. Also know such SQL queries will give the following note in the SAS log. If you are using a having clause with aggregation functions you will, in general, be doing two passes on your data, which can be slow if your data is big.

NOTE: The query requires remerging summary statistics back with the original data.

Suzy_Cat · Posted 02-18-2021 03:31 PM

Hey Phil,

This is exactly what I wanted to achieve!!
Learnt how to use monotonic() !!
Amazing! Perfact!!

It does the magic!

Suzy_Cat · Posted 02-18-2021 03:44 PM

Hi All,

Thanks for chipping in your knowledge and your wonderful thoughts!

Found this place really cool to learn and grow.

In case anyone is interested, here is some very useful info for monotonic:

https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752

ChrisNZ · Posted 02-18-2021 08:15 PM

Note the many limitations of your chosen solution:

1. It is very expensive as shown by the the note in the log. Sorting and remerging of the data may be needed.

2. Function monotonic(0 is not production is should not be used in production jobs, as mentioned.

3. The record selected by the HAVING clause can vary from one run to the next as SQL can process data in any order.

4. For all these reasons, and as @Kurt_Bremser mentioned, you should use the right tool, which here is the data step.

High-Performance SAS Coding - Third Edition

Suzy_Cat · Posted 02-18-2021 09:16 PM

Thank you Chris for commenting -- they help me to understand the whole thing here. Really appreciated.

1. It is very expensive as shown by the the note in the log. Sorting and remerging of the data may be needed.

I don't see any big problems with remerging info in the log. Can you advise what do you mean by expensive ? (they sound like problematic )

2. Function monotonic(0 is not production is should not be used in production jobs, as mentioned.

I understand monotonic() is not documented and thank you for restate and suggest that it should not be used in production.

This is just my exploration in replacing data steps as we might need shift current work and using DBT in the future.

3. The record selected by the HAVING clause can vary from one run to the next as SQL can process data in any order.

I thought as long as the input data is properly sorted, we would expect to get the data as wanted.

I have tested different order combination of the source data, it looks like it produces what is targeted.

4. For all these reasons, and as @Kurt_Bremser mentioned, you should use the right tool, which here is the data step.

No problems at all, I am using data step in most cases. As mentioned earlier, it is just an exploration in preparation moving to DBT.

After all, would love to hear any suggestions, ideas, experience, that any one would love to share in migrating current SAS code to DBT ...

ChrisNZ · Posted 02-18-2021 10:09 PM

1. The log could show notes regarding remerging the data, and using threaded sorts.

This is expensive in term of resources. Instead of reading and writing the data sequentially, the whole table is fully sorted and reprocessed.

The real time doubles in my small example. The CPU time quadruples.

3. Because threads can be used, the order of the data will change depending on how threads synchronise. Using options NOTHREADS prevents that.

Also when sorting occurs with no full information on the sort order (which data comes first in each BY group?), the order in each BY group can change.

This is prevented by option SORTEQUALS for proc sort. I don't think this option controls proc sql.

See here:

30 data T; do I=1 to 1e7; output; output; end; run;

NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: The data set WORK.T has 20000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.71 seconds
cpu time 0.71 seconds

31 data T1; set T; by I; if first.I; run;

NOTE: Compression was disabled for data set WORK.T1 because compression overhead would increase the size of the data set.
NOTE: There were 20000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T1 has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.27 seconds
cpu time 2.26 seconds

32 proc sql;
32 ! create table T2 as select monotonic() as M, I from T group by I having M=min(M);
NOTE: SAS threaded sort was used.
NOTE: Compressing data set WORK.T2 increased size by 58.33 percent.
Compressed is 114 pages; un-compressed would require 72 pages.
NOTE: Table WORK.T2 created, with 579164 rows and 2 columns.

32 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.32 seconds
cpu time 10.06 seconds

33 proc sql;
33 ! create table T3 as select monotonic() as M, I from T having M=min(M);
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Compressing data set WORK.T3 increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.T3 created, with 1 rows and 2 columns.

33 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.37 seconds
cpu time 5.37 seconds

Also, i just noticed the volumes are not as expected. This SQL solution may not even do what you think it does.

The random-order issue can be seen here:

proc sql; create table T4 as select monotonic() as M, I from T order by I; quit;

which yields:

M	I
1	1
2	1
4	2
3	2
6	3
5	3

High-Performance SAS Coding - Third Edition

Suzy_Cat · Posted 02-21-2021 03:45 AM

Hi Chris,

Thank you for taking time helping me understand the underlying problem here.

I have tested each step as described, and found the result really scary as it totally not what expected. I can not figure out why there are some records not being selected 😞

data Test; do I=1 to 1e7; output; output; end; run;

data Test_first; set Test; by I; if first.I; run;

proc sql;
	create table Test_monotonic as 
	select  I, monotonic() as M
	from test 
	group by I 
	having M=min(M)
	order by i,M
;Quit;

Suzy_Cat · Posted 02-21-2021 03:46 AM

the log shows

NOTE: Compression was disabled for data set WORK.TEST_MONOTONIC because compression overhead would increase the size of the data 
      set.
NOTE: Table WORK.TEST_MONOTONIC created, with 627507 rows and 2 columns.

ChrisNZ · Posted 02-21-2021 04:39 PM

Well it looks like monotonic() confuses the hell out of proc SQL.

The result is different depending on whether another variable is added or not.

data TEST; do I=1 to 1e7; N+1; output; N+1; output; end; run;

31 proc sql;
32 create table TEST_MONOTONIC2 as select *, min(N), min(M) as a from(
33 select I, N, monotonic() as M
34 from TEST )
35 group by I
36 /* having M=min(M)*/
37 order by I, M
38 ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: SAS threaded sort was used.

NOTE: Table WORK.TEST_MONOTONIC2 created, with 20000000 rows and 5 columns.

41 proc sql;
42 create table TEST_MONOTONIC3 as select *, min(M) as a from(
43 select I, monotonic() as M
44 from TEST )
45 group by I
46 /* having M=min(M)*/
47 order by I, M
48 ;
NOTE: SAS threaded sort was used.

NOTE: Table WORK.TEST_MONOTONIC3 created, with 10000000 rows and 3 columns.

We can't even call this a defect as the function is unsupported.

This is why staying away from non-production features is advised: They have not been fully tested.

Heck even some new production features escape into production while still hiding behaviours from the QA process.

Plus really, one of the strengths of the SAS language is that it allows leveraging sorted tables for best performance. Make use of that.

High-Performance SAS Coding - Third Edition

Registration is open

SAS Training: Just a Click Away