Hi wonderful helpers,
Can anyone help me to get the equivalent Proc SQL code to achieve what first.variable can get in data steps?
The idea is to only take the first record for each PID (no matter how many obs, and how they been sorted) using Proc SQL code.
Here is the sample code to generate the datasets to play:
data have;
infile datalines;
input PID $1-7 code $8 +1 date date9.;
datalines;
AAAACd 1 13Aug2012
AAAACd 3 13Aug2012
AAAACd 5 13Aug2012
bbbbbd 1 14Dec2018
bbbbbd 2 14Dec2018
ggggec 0 23Apr2012
ggggec 4 23Apr2012
;
data have;
set have;
format date date9.;
run;
data want;
set have;
if first.PID;
by pid;
run;
As was shown, MONOTONIC() is unreliable when used in conjunction with a HAVING clause.
By splitting the SQL into two steps, it works, but just look at this:
data Test;
do I=1 to 1e7;
output;
output;
end;
run;
data Test_first;
set Test;
by I;
if first.I;
run;
proc sql;
create table Test_monotonic as
select I, monotonic() as M
from test
;
create table test_m2 as
select i, m
from test_monotonic
group by i
having m = min(m)
order by i, m
;
quit;
Log:
73 data Test; 74 do I=1 to 1e7; 75 output; 76 output; 77 end; 78 run; NOTE: The data set WORK.TEST has 20000000 observations and 1 variables. NOTE: Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit): real time 0.78 seconds cpu time 0.65 seconds 79 80 data Test_first; 81 set Test; 82 by I; 83 if first.I; 84 run; NOTE: There were 20000000 observations read from the data set WORK.TEST. NOTE: The data set WORK.TEST_FIRST has 10000000 observations and 1 variables. NOTE: Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit): real time 2.05 seconds cpu time 1.92 seconds 85 86 proc sql; 87 create table Test_monotonic as 88 select I, monotonic() as M 89 from test 90 ; NOTE: Table WORK.TEST_MONOTONIC created, with 20000000 rows and 2 columns. 91 create table test_m2 as 92 select i, m 93 from test_monotonic 94 group by i 95 having m = min(m) 96 order by i, m 97 ; NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.TEST_M2 created, with 10000000 rows and 2 columns. 98 quit; NOTE: Verwendet wurde: PROZEDUR SQL - (Gesamtverarbeitungszeit): real time 27.45 seconds cpu time 22.22 seconds
As I added to my other post, SQL quickly deteriorates to performance worse by order of magnitudes.
Forcing SQL to do something that it is not designed to do is not a good idea.
Now, if you do something similar in explicit passthrough to a database, the results might be different, as dedicated DB systems have much more powerful optimization.
Afaik there is no equivalent to first.variable in proc sql, due to the nature of the procedure. And this has been discussed already, please use the search-function.
SQL is not suited to handling data in order.
SQL actually does not guarantee that a query will read the data in any given order.
One way around this is to use (very costly in terms of performance) window functions. Sadly SAS still does not support this non-ANSI-compliant feature.
Thank you Chris, for explaining
If you have a criterium identifying the "first" observation within the group, you can use it with GROUP BY in a HAVING clause::
proc sql;
create table cwant as
select *
from have
group by pid
having code = min(code)
;
quit;
but that won't work in your case, as code is of type character, so the MIN function can't be used.
So we come back to Maxim 14: Use the Right Tool, which means the data step for this task.
Edit: I correct myself. MAX() and MIN() do work on character variables in SAS SQL, so this code creates your wanted result:
data have;
input PID $ code $ date :date9.;
format date yymmdd10.;
datalines;
AAAACd 1 13Aug2012
AAAACd 3 13Aug2012
AAAACd 5 13Aug2012
bbbbbd 1 14Dec2018
bbbbbd 2 14Dec2018
ggggec 0 23Apr2012
ggggec 4 23Apr2012
;
proc sql;
create table want as
select *
from have
group by pid
having code = min(code)
;
quit;
Note that, with large datasets, you will be MUCH better off with the data step in terms of performance. Up to orders of magnitude.
So it still boils down to Maxim 14.
I hate doing this in PROC SQL but here.
proc sql;
create table want(drop=n) as
select *, monotonic() as n
from have
group by PID
having (calculated n) = min(calculated n)
;
quit;
Know that monotonic
is considered experimental and undocumented and may not be supported in the future by SAS. Also know such SQL queries will give the following note in the SAS log. If you are using a having clause with aggregation functions you will, in general, be doing two passes on your data, which can be slow if your data is big.
NOTE: The query requires remerging summary statistics back with the original data.
Hi All,
Thanks for chipping in your knowledge and your wonderful thoughts!
Found this place really cool to learn and grow.
In case anyone is interested, here is some very useful info for monotonic:
https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752
Note the many limitations of your chosen solution:
1. It is very expensive as shown by the the note in the log. Sorting and remerging of the data may be needed.
2. Function monotonic(0 is not production is should not be used in production jobs, as mentioned.
3. The record selected by the HAVING clause can vary from one run to the next as SQL can process data in any order.
4. For all these reasons, and as @Kurt_Bremser mentioned, you should use the right tool, which here is the data step.
Thank you Chris for commenting -- they help me to understand the whole thing here. Really appreciated.
1. It is very expensive as shown by the the note in the log. Sorting and remerging of the data may be needed.
I don't see any big problems with remerging info in the log. Can you advise what do you mean by expensive ? (they sound like problematic )
2. Function monotonic(0 is not production is should not be used in production jobs, as mentioned.
I understand monotonic() is not documented and thank you for restate and suggest that it should not be used in production.
This is just my exploration in replacing data steps as we might need shift current work and using DBT in the future.
3. The record selected by the HAVING clause can vary from one run to the next as SQL can process data in any order.
I thought as long as the input data is properly sorted, we would expect to get the data as wanted.
I have tested different order combination of the source data, it looks like it produces what is targeted.
4. For all these reasons, and as @Kurt_Bremser mentioned, you should use the right tool, which here is the data step.
No problems at all, I am using data step in most cases. As mentioned earlier, it is just an exploration in preparation moving to DBT.
After all, would love to hear any suggestions, ideas, experience, that any one would love to share in migrating current SAS code to DBT ...
1. The log could show notes regarding remerging the data, and using threaded sorts.
This is expensive in term of resources. Instead of reading and writing the data sequentially, the whole table is fully sorted and reprocessed.
The real time doubles in my small example. The CPU time quadruples.
3. Because threads can be used, the order of the data will change depending on how threads synchronise. Using options NOTHREADS prevents that.
Also when sorting occurs with no full information on the sort order (which data comes first in each BY group?), the order in each BY group can change.
This is prevented by option SORTEQUALS for proc sort. I don't think this option controls proc sql.
See here:
30 data T; do I=1 to 1e7; output; output; end; run;
NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: The data set WORK.T has 20000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.71 seconds
cpu time 0.71 seconds
31 data T1; set T; by I; if first.I; run;
NOTE: Compression was disabled for data set WORK.T1 because compression overhead would increase the size of the data set.
NOTE: There were 20000000 observations read from the data set WORK.T.
NOTE: The data set WORK.T1 has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.27 seconds
cpu time 2.26 seconds
32 proc sql;
32 ! create table T2 as select monotonic() as M, I from T group by I having M=min(M);
NOTE: SAS threaded sort was used.
NOTE: Compressing data set WORK.T2 increased size by 58.33 percent.
Compressed is 114 pages; un-compressed would require 72 pages.
NOTE: Table WORK.T2 created, with 579164 rows and 2 columns.
32 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.32 seconds
cpu time 10.06 seconds
33 proc sql;
33 ! create table T3 as select monotonic() as M, I from T having M=min(M);
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Compressing data set WORK.T3 increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.T3 created, with 1 rows and 2 columns.
33 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.37 seconds
cpu time 5.37 seconds
Also, i just noticed the volumes are not as expected. This SQL solution may not even do what you think it does.
The random-order issue can be seen here:
proc sql; create table T4 as select monotonic() as M, I from T order by I; quit;
which yields:
M | I |
---|---|
1 | 1 |
2 | 1 |
4 | 2 |
3 | 2 |
6 | 3 |
5 | 3 |
Hi Chris,
Thank you for taking time helping me understand the underlying problem here.
I have tested each step as described, and found the result really scary as it totally not what expected. I can not figure out why there are some records not being selected 😞
data Test; do I=1 to 1e7; output; output; end; run;
data Test_first; set Test; by I; if first.I; run;
proc sql;
create table Test_monotonic as
select I, monotonic() as M
from test
group by I
having M=min(M)
order by i,M
;Quit;
the log shows
NOTE: Compression was disabled for data set WORK.TEST_MONOTONIC because compression overhead would increase the size of the data
set.
NOTE: Table WORK.TEST_MONOTONIC created, with 627507 rows and 2 columns.
Well it looks like monotonic() confuses the hell out of proc SQL.
The result is different depending on whether another variable is added or not.
data TEST; do I=1 to 1e7; N+1; output; N+1; output; end; run;
31 proc sql;
32 create table TEST_MONOTONIC2 as select *, min(N), min(M) as a from(
33 select I, N, monotonic() as M
34 from TEST )
35 group by I
36 /* having M=min(M)*/
37 order by I, M
38 ;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.TEST_MONOTONIC2 created, with 20000000 rows and 5 columns.
41 proc sql;
42 create table TEST_MONOTONIC3 as select *, min(M) as a from(
43 select I, monotonic() as M
44 from TEST )
45 group by I
46 /* having M=min(M)*/
47 order by I, M
48 ;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.TEST_MONOTONIC3 created, with 10000000 rows and 3 columns.
We can't even call this a defect as the function is unsupported.
This is why staying away from non-production features is advised: They have not been fully tested.
Heck even some new production features escape into production while still hiding behaviours from the QA process.
Plus really, one of the strengths of the SAS language is that it allows leveraging sorted tables for best performance. Make use of that.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.