BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Suzy_Cat
Pyrite | Level 9

Hi wonderful helpers,

 

Can anyone help me to get the equivalent Proc SQL code to achieve what  first.variable can get in data steps?

The idea is to only take the first record for each PID (no matter how many obs, and how they been sorted) using Proc SQL code.

 

have.PNG           Capture.PNG

 

Here is the sample code to generate the datasets to play:

data have;    
   infile datalines;  
   input PID $1-7 code $8 +1 date date9.; 
   datalines;          
AAAACd	1	13Aug2012
AAAACd	3	13Aug2012
AAAACd 5 13Aug2012 bbbbbd 1 14Dec2018 bbbbbd 2 14Dec2018 ggggec 0 23Apr2012 ggggec 4 23Apr2012 ; data have; set have; format date date9.; run; data want; set have; if first.PID; by pid; run;

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

As was shown, MONOTONIC() is unreliable when used in conjunction with a HAVING clause.

By splitting the SQL into two steps, it works, but just look at this:

data Test;
do I=1 to 1e7;
output;
output;
end;
run;

data Test_first;
set Test;
by I;
if first.I;
run;

proc sql;
create table Test_monotonic as 
  select I, monotonic() as M
  from test 
;
create table test_m2 as
  select i, m
  from test_monotonic
  group by i
  having m = min(m)
  order by i, m
;
quit;

Log:

 73         data Test;
 74         do I=1 to 1e7;
 75         output;
 76         output;
 77         end;
 78         run;
 
 NOTE: The data set WORK.TEST has 20000000 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.78 seconds
       cpu time            0.65 seconds
       
 
 79         
 80         data Test_first;
 81         set Test;
 82         by I;
 83         if first.I;
 84         run;
 
 NOTE: There were 20000000 observations read from the data set WORK.TEST.
 NOTE: The data set WORK.TEST_FIRST has 10000000 observations and 1 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           2.05 seconds
       cpu time            1.92 seconds
       
 
 85         
 86         proc sql;
 87         create table Test_monotonic as
 88           select I, monotonic() as M
 89           from test
 90         ;
 NOTE: Table WORK.TEST_MONOTONIC created, with 20000000 rows and 2 columns.
 
 91         create table test_m2 as
 92           select i, m
 93           from test_monotonic
 94           group by i
 95           having m = min(m)
 96           order by i, m
 97         ;
 NOTE: The query requires remerging summary statistics back with the original data.
 NOTE: Table WORK.TEST_M2 created, with 10000000 rows and 2 columns.
 
 98         quit;
 NOTE:  Verwendet wurde: PROZEDUR SQL - (Gesamtverarbeitungszeit):
       real time           27.45 seconds
       cpu time            22.22 seconds

As I added to my other post, SQL quickly deteriorates to performance worse by order of magnitudes.

Forcing SQL to do something that it is not designed to do is not a good idea.

 

Now, if you do something similar in explicit passthrough to a database, the results might be different, as dedicated DB systems have much more powerful optimization.

View solution in original post

25 REPLIES 25
andreas_lds
Jade | Level 19

Afaik there is no equivalent to first.variable in proc sql, due to the nature of the procedure. And this has been discussed already, please use the search-function.

ChrisNZ
Tourmaline | Level 20

SQL is not suited to handling data in order.

SQL actually does not guarantee that a query will read the data in any given order.

One way around this is to use (very costly in terms of performance) window functions. Sadly SAS still does not support this non-ANSI-compliant feature.

Suzy_Cat
Pyrite | Level 9

Thank you Chris, for explaining 

Kurt_Bremser
Super User

If you have a criterium identifying the "first" observation within the group, you can use it with GROUP BY in a HAVING clause::

proc sql;
create table cwant as
  select *
  from have
  group by pid
  having code = min(code)
;
quit;

but that won't work in your case, as code is of type character, so the MIN function can't be used.

So we come back to Maxim 14: Use the Right Tool, which means the data step for this task.

 

 

Edit: I correct myself. MAX() and MIN() do work on character variables in SAS SQL, so this code creates your wanted result:

data have;    
input PID $ code $ date :date9.;
format date yymmdd10.;
datalines;          
AAAACd  1 13Aug2012
AAAACd  3 13Aug2012
AAAACd  5 13Aug2012
bbbbbd  1 14Dec2018
bbbbbd  2 14Dec2018
ggggec  0 23Apr2012
ggggec  4 23Apr2012
;                          

proc sql;
create table want as
  select *
  from have
  group by pid
  having code = min(code)
;
quit;

Note that, with large datasets, you will be MUCH better off with the data step in terms of performance. Up to orders of magnitude.

So it still boils down to Maxim 14.

Suzy_Cat
Pyrite | Level 9
Thought of this solution too. It is the closest to get there.

If there is no certain criterium identifying the "first" observation within the group, then in order to keep whatever the order is, an extra step required to create a column to hold the order info for PID.
PhilC
Rhodochrosite | Level 12

I hate doing this in PROC SQL but here.  

proc sql;
  create table want(drop=n) as 
  select *, monotonic() as n
    from have
    group by PID
    having (calculated n) = min(calculated n)
;
quit;

Know that monotonic is considered experimental and undocumented and may not be supported in the future by SAS. Also know such SQL queries will give the following note in the SAS log.  If you are using a having clause with aggregation functions you will, in general, be doing two passes on your data, which can be slow if your data is big.

NOTE: The query requires remerging summary statistics back with the original data.
Suzy_Cat
Pyrite | Level 9
Hey Phil,

This is exactly what I wanted to achieve!!
Learnt how to use monotonic() !!
Amazing! Perfact!!

It does the magic!
Suzy_Cat
Pyrite | Level 9

Hi All,

 

Thanks for chipping in your knowledge and your wonderful thoughts!

 

Found this place really cool to learn and grow.

 

In case anyone is interested, here is some very useful info for monotonic:

https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752

 

ChrisNZ
Tourmaline | Level 20

Note the many limitations of your chosen solution:

 

1. It is very expensive as shown by the the note in the log. Sorting and remerging of the data may be needed.

 

2. Function monotonic(0 is not production is should not be used in production jobs, as mentioned.

 

3. The record selected by the HAVING clause can vary from one run to the next as SQL can process data in any order.

 

4. For all these reasons, and as @Kurt_Bremser mentioned, you should use the right tool, which here is the data step.

 

 

 

Suzy_Cat
Pyrite | Level 9

Thank you Chris for commenting -- they help me to understand the whole thing here. Really appreciated.

 

1. It is very expensive as shown by the the note in the log. Sorting and remerging of the data may be needed.

I don't see any big problems with remerging info in the log. Can you advise what do you mean by expensive ? (they sound like problematic )

 

2. Function monotonic(0 is not production is should not be used in production jobs, as mentioned.

I understand monotonic() is not documented and thank you for restate and suggest that it should not be used in production.

This is just my exploration in replacing data steps as we might need shift current work and using DBT in the future.

 

3. The record selected by the HAVING clause can vary from one run to the next as SQL can process data in any order.

I thought as long as the input data is properly sorted, we would expect to get the data as wanted.

I have tested different order combination of the source data, it looks like it produces what is targeted. 


4. For all these reasons, and as @Kurt_Bremser mentioned, you should use the right tool, which here is the data step.

No problems at all, I am using data step in most cases. As mentioned earlier, it is just an exploration in preparation moving to DBT. 

 

After all, would love to hear any suggestions, ideas, experience, that any one would love to share in migrating current SAS code to DBT ...

ChrisNZ
Tourmaline | Level 20

1. The log could show notes regarding remerging the data, and using threaded sorts.

   This is expensive in term of resources. Instead of reading and writing the data sequentially, the whole table is fully sorted and reprocessed.

  The real time doubles in my small example. The CPU time quadruples.

 

3. Because threads can be used, the order of the data will change depending on how threads synchronise. Using options NOTHREADS prevents that.

  Also when sorting occurs with no full information on the sort order (which data comes first in each BY group?), the order in each BY group can change.

  This is prevented by option SORTEQUALS for proc sort. I don't think this option controls proc sql.

 

See here:

30 data T; do I=1 to 1e7; output; output; end; run;

NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: The data set WORK.T has 20000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.71 seconds
cpu time 0.71 seconds

 

31 data T1; set T; by I; if first.I; run;

NOTE: Compression was disabled for data set WORK.T1 because compression overhead would increase the size of the data set.
NOTE: There were 20000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T1 has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.27 seconds
cpu time 2.26 seconds

 

32 proc sql;
32 ! create table T2 as select monotonic() as M, I from T group by I having M=min(M);
NOTE: SAS threaded sort was used.
NOTE: Compressing data set WORK.T2 increased size by 58.33 percent.
Compressed is 114 pages; un-compressed would require 72 pages.
NOTE: Table WORK.T2 created, with 579164 rows and 2 columns.

32 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.32 seconds
cpu time 10.06 seconds

 

33 proc sql;
33 ! create table T3 as select monotonic() as M, I from T having M=min(M);
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Compressing data set WORK.T3 increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.T3 created, with 1 rows and 2 columns.

33 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.37 seconds
cpu time 5.37 seconds

 

Also, i just noticed the volumes are not as expected. This SQL solution may not even do what you think it does.

 

 

The random-order issue can be seen here:

proc sql; create table T4 as select monotonic() as M, I from T order by I; quit;

which yields:

M I
1 1
2 1
4 2
3 2
6 3
5 3

 

Suzy_Cat
Pyrite | Level 9

Hi Chris,

 

Thank you for taking time helping me understand the underlying problem here.

 

I have tested each step as described, and found the result really scary as it totally not what expected. I can not figure out why there are some records not being selected 😞

 

data Test; do I=1 to 1e7; output; output; end; run;

data Test_first; set Test; by I; if first.I; run;

proc sql;
	create table Test_monotonic as 
	select  I, monotonic() as M
	from test 
	group by I 
	having M=min(M)
	order by i,M
;Quit;

Suzy_Cat_0-1613897107438.png

 

Suzy_Cat
Pyrite | Level 9

the log shows 

NOTE: Compression was disabled for data set WORK.TEST_MONOTONIC because compression overhead would increase the size of the data 
      set.
NOTE: Table WORK.TEST_MONOTONIC created, with 627507 rows and 2 columns.
ChrisNZ
Tourmaline | Level 20

Well it looks like monotonic() confuses the hell out of proc SQL.

The result is different depending on whether another variable is added or not.


data TEST; do I=1 to 1e7; N+1; output; N+1; output; end; run;

 

31 proc sql;
32 create table TEST_MONOTONIC2 as select *, min(N), min(M) as a from(
33 select I, N, monotonic() as M
34 from TEST )
35 group by I
36 /* having M=min(M)*/
37 order by I, M
38 ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: SAS threaded sort was used.

NOTE: Table WORK.TEST_MONOTONIC2 created, with 20000000 rows and 5 columns.


41 proc sql;
42 create table TEST_MONOTONIC3 as select *, min(M) as a from(
43 select I, monotonic() as M
44 from TEST )
45 group by I
46 /* having M=min(M)*/
47 order by I, M
48 ;
NOTE: SAS threaded sort was used.

NOTE: Table WORK.TEST_MONOTONIC3 created, with 10000000 rows and 3 columns.

 

We can't even call this a defect as the function is unsupported.

This is why staying away from non-production features is advised: They have not been fully tested.

Heck even some new production features escape into production while still hiding behaviours from the QA process. 

 

Plus really, one of the strengths of the SAS language is that it allows leveraging sorted tables for best performance. Make use of that.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 8499 views
  • 17 likes
  • 5 in conversation