Hello all,
I have a question around using PROC SQL which is probably best explained using the following example.
I have two variables, ID and CODE.
ID CODE
1 3
1 3
1 3
2 76
2 89
2 13
3 4
For each unique ID, I want to select the second distinct CODEvalue.
So the answers should be;
ID1= 3 (we only have 1 distinct value)
ID2= 89 (this is the 2nd distinct value after 76)
ID3= 4 (we only have 1 value so we want to use this)
How do I code this using PROC SQL (if possible?)
Thanks
Thanks
Hi @zishaq Love these fun puzzles
data have;
input ID CODE ;
cards;
1 3
1 3
1 3
2 76
2 89
2 13
3 4
;
proc sql;
create table want(drop=_:) as
select *,count(distinct code) as _c
from (select *,monotonic() as _rn from have)
group by id
having _c>1 and _rn=min(_rn)+1 or _c=1 and min(_rn)=_rn
order by id, _rn;
quit;
Very easy to do using a DATA step; probably much harder to code in SQL.
By the way, are the data are used in the order shown? There is no need to sort the data?
data want;
set have;
by id;
if first.id then count=0;
count+1;
if count=2 or (count=1 and last.id) then output;
run;
Yes, data is already sorted (should have mentioned this before).
Thanks
@zishaq wrote:
Hello all,
I have a question around using PROC SQL which is probably best explained using the following example.
I have two variables, ID and CODE.
ID CODE
1 3
1 3
1 3
2 76
2 89
2 13
3 4
For each unique ID, I want to select the second distinct CODEvalue.
So the answers should be;
ID1= 3 (we only have 1 distinct value)
ID2= 89 (this is the 2nd distinct value after 76)
ID3= 4 (we only have 1 value so we want to use this)
How do I code this using PROC SQL (if possible?)
Thanks
Thanks
In general SQL is a poor choice for attempting specific order processing of a set that is already in order. The DATA step would be a better way to go.
data have; input ID CODE ; datalines; 1 3 1 3 1 3 2 76 2 89 2 13 3 4 ; data want; set have; by id; if first.id then count=1; else count+1; if count=2 or (count=1 and last.id) then output; drop count; run;
Hi @zishaq Love these fun puzzles
data have;
input ID CODE ;
cards;
1 3
1 3
1 3
2 76
2 89
2 13
3 4
;
proc sql;
create table want(drop=_:) as
select *,count(distinct code) as _c
from (select *,monotonic() as _rn from have)
group by id
having _c>1 and _rn=min(_rn)+1 or _c=1 and min(_rn)=_rn
order by id, _rn;
quit;
I am not sure if this is possible in Proc SQL as SQL does not process data one row at a time in the order it is found, rather it takes chuncks of data.
If SQL is not required the the solution is possible - first remove duplicate rows then use first. and last. to flag if it is the first row of the group. Skip that row and output the next row using the lag function.
Something like this should work:
data test;
input ID CODE;
datalines;
1 3
1 3
1 3
2 76
2 89
2 13
3 4
run;
/*noduprec will remove duplicates of id and code*/
proc sort data=test noduprec;
by id;
run;
data second;
set test;
by id;
if first.id and last.id then output; /*output row for single row groups*/
else if first.id then flag=1;
if lag(flag)=1 then output;/*output the row after the first if more than one*/
drop flag;
run;
I see a lot of people recommending the DATA Step. As a hardcore advocate and fan of PROC SQL, here are two links that provide alternatives (both of which I have used with a great deal of success in my own work):
Subqueries
https://stackoverflow.com/questions/43191746/sql-to-proc-sql-partition-by-alternative-min-case
or
Hash Tables
Both are relatively easy (subqueries are significantly easier than hash tables). Post back if you need more help using SQL!
Chris
With regards to solutions using MONOTONIC() in SQL, this is unsupported and could provide incorrect results in certain situations (as explained here: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752) and may provide incorrect solutions in other situations not listed in the article.
I am always uncomfortable advising newcomers to use SQL in situations where a DATA step works just as well, or better (and is easier and more straightforward to program); see Maxim 10 and Maxim 14 of "Maxims of Maximally Efficient SAS Programmers". I am uncomfortable providing solutions which use undocumented features of SAS. I am uncomfortable with the idea of using SQL for "everything". I am uncomfortable with a user who states that the solution has to be in SQL, and I generally ignore that and give what I think is a better solution not using SQL (just in case someone else is reading who is not restricted to SQL).
@PaigeMiller As I mentioned, I'm a hardcore SQL fan (SQL DBA >20 years, SAS programmer~7). Having said that, I'm nodding furiously at your reply.
A) I have often used the DATA step, particularly in merging massive files together. Way more efficient than SQL.
B) A boss of mine used to say that a sign of a good programmer is appreciating the fact that there's a dozen different ways to handle the same problem, and the sign of a great programmer is to know when to use each one.
C) To second your comment re: MONOTONIC, I have used it. I have seen it fail me more often than not, and so avoid it now (I will occasionally use it for testing or dummy data).
This goes back to a point I make to the students I mentor - research, test, research some more. If it's a small dataset, you may be fine with your first solution, but you can't assume it'll work every time.
I hope you have a great day!
Chris
Wise words, and truly makes sense. I can't agree more. I must admit I do not advocate any however it's fun and entertainment for me. Nothing more/nothing less.
I once asked @mkeintz and @hashman(not on the same thread of course) , why do you offer such sophisticated solutions when chances are OP is not likely to at their level and could be confusing but at the same time those solutions inspires blokes like to me to experiment further and lift my spirits up. Oh well, a lot of times their solutions have compelled me to read 2-3 times or more to fully understand.
It's a buffet and it's OP's choice to pick and choose.
I too agree with these observations. And in particular, I would advise limiting use of MONOTONIC to development (or maybe instructional) work
I presume it's an officially "deprecated" usage. And I also presume that once a user has entered the world of parallel data access processing, MONOTONIC should be shunned even for development purposes. It's worth knowing the MONOTONIC exists but ...
data have;
input ID CODE ;
cards;
1 3
1 3
1 3
2 76
2 89
2 13
3 4
;
data temp;
set have ;
by id code notsorted;
if first.id then group=0;
group+first.code;
run;
data temp1;
set temp;
by id group;
if first.group and group in ( 1 2 );
run;
data want;
set temp1;
by id ;
if last.id;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.