BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
zishaq
Obsidian | Level 7

Hello all,

 

I have a question around using PROC SQL which is probably best explained using the following example.

 

I have two variables, ID and CODE.

 

ID       CODE 

1            3

1            3

1            3

2           76

2           89

2           13

3            4

 

For each unique ID, I want to select the second distinct CODEvalue.

So the answers should be;

ID1= 3 (we only have 1 distinct value)

ID2= 89 (this is the 2nd distinct value after 76)

ID3= 4 (we only have 1 value so we want to use this)

 

How do I code this using PROC SQL (if possible?)

 

Thanks

 

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @zishaq   Love these fun puzzles

 



data have;
input ID       CODE ;
cards;
1            3
1            3
1            3
2           76
2           89
2           13
3            4
;

proc sql;
create table want(drop=_:) as
select *,count(distinct code) as _c
from (select *,monotonic() as _rn from have)
group by id
having _c>1 and _rn=min(_rn)+1 or _c=1 and min(_rn)=_rn
order by id, _rn;
quit;

View solution in original post

13 REPLIES 13
PaigeMiller
Diamond | Level 26

Very easy to do using a DATA step; probably much harder to code in SQL.

 

By the way, are the data are used in the order shown? There is no need to sort the data?

 

data want;
    set have;
    by id;
    if first.id then count=0;
    count+1;
    if count=2 or (count=1 and last.id) then output;
run;
--
Paige Miller
zishaq
Obsidian | Level 7

Yes, data is already sorted (should have mentioned this before).

 

Thanks

ballardw
Super User

@zishaq wrote:

Hello all,

 

I have a question around using PROC SQL which is probably best explained using the following example.

 

I have two variables, ID and CODE.

 

ID       CODE 

1            3

1            3

1            3

2           76

2           89

2           13

3            4

 

For each unique ID, I want to select the second distinct CODEvalue.

So the answers should be;

ID1= 3 (we only have 1 distinct value)

ID2= 89 (this is the 2nd distinct value after 76)

ID3= 4 (we only have 1 value so we want to use this)

 

How do I code this using PROC SQL (if possible?)

 

Thanks

 

 

Thanks


In general SQL is a poor choice for attempting specific order processing of a set that is already in order. The DATA step would be a better way to go.

data have;
input ID       CODE  ;
datalines;
1            3 
1            3 
1            3 
2           76 
2           89 
2           13 
3            4
;

data want;
   set have;
   by id;
   if first.id then count=1;
   else count+1;
   if count=2 or (count=1 and last.id) then output;
   drop count;
run;
zishaq
Obsidian | Level 7
Thanks - this is correct although I am looking to use SQL and was wondering if there was a way possible...
novinosrin
Tourmaline | Level 20

Hi @zishaq   Love these fun puzzles

 



data have;
input ID       CODE ;
cards;
1            3
1            3
1            3
2           76
2           89
2           13
3            4
;

proc sql;
create table want(drop=_:) as
select *,count(distinct code) as _c
from (select *,monotonic() as _rn from have)
group by id
having _c>1 and _rn=min(_rn)+1 or _c=1 and min(_rn)=_rn
order by id, _rn;
quit;
zishaq
Obsidian | Level 7
Awesome - thanks for your help
Lucy1
Fluorite | Level 6

I am not sure if this is possible in Proc SQL as SQL does not process data one row at a time in the order it is found, rather it takes chuncks of data.  

If SQL is not required the the solution is possible - first remove duplicate rows then use first. and last. to flag if it is the first row of the group. Skip that row and output the next row using the lag function. 

 

Something like this should work:

 

data test;
input ID CODE; 
datalines;
1 3
1 3
1 3
2 76
2 89
2 13
3 4
run;

/*noduprec will remove duplicates of id and code*/
proc sort data=test noduprec;
  by id;
run;

data second;
  set test;
  by id;
  if first.id and last.id then output; /*output row for single row groups*/
  else if first.id then flag=1;

  if lag(flag)=1 then output;/*output the row after the first if more than one*/
  drop flag;
run;
DarthPathos
Lapis Lazuli | Level 10

I see a lot of people recommending the DATA Step.  As a hardcore advocate and fan of PROC SQL, here are two links that provide alternatives (both of which I have used with a great deal of success in my own work):

 

Subqueries 

https://stackoverflow.com/questions/43191746/sql-to-proc-sql-partition-by-alternative-min-case

 

or

 

Hash Tables

https://communities.sas.com/t5/General-SAS-Programming/Partition-by-equivalent-in-SAS-base-or-proc-s...

 

Both are relatively easy (subqueries are significantly easier than hash tables).  Post back if you need more help using SQL!

Chris

 

Has my article or post helped? Please mark as Solution or Like the article!
PaigeMiller
Diamond | Level 26

With regards to solutions using MONOTONIC() in SQL, this is unsupported and could provide incorrect results in certain situations (as explained here: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752) and may provide incorrect solutions in other situations not listed in the article.

 

I am always uncomfortable advising newcomers to use SQL in situations where a DATA step works just as well, or better (and is easier and more straightforward to program); see Maxim 10 and Maxim 14 of "Maxims of Maximally Efficient SAS Programmers". I am uncomfortable providing solutions which use undocumented features of SAS. I am uncomfortable with the idea of using SQL for "everything". I am uncomfortable with a user who states that the solution has to be in SQL, and I generally ignore that and give what I think is a better solution not using SQL (just in case someone else is reading who is not restricted to SQL).

--
Paige Miller
DarthPathos
Lapis Lazuli | Level 10

@PaigeMiller As I mentioned, I'm a hardcore SQL fan (SQL DBA >20 years, SAS programmer~7).  Having said that, I'm nodding furiously at your reply.  

A) I have often used the DATA step, particularly in merging massive files together.  Way more efficient than SQL.

B) A boss of mine used to say that a sign of a good programmer is appreciating the fact that there's a dozen different ways to handle the same problem, and the sign of a great programmer is to know when to use each one.  

C) To second your comment re: MONOTONIC, I have used it.  I have seen it fail me more often than not, and so avoid it now (I will occasionally use it for testing or dummy data).  

 

This goes back to a point I make to the students I mentor - research, test, research some more.  If it's a small dataset, you may be fine with your first solution, but you can't assume it'll work every time.  

 

I hope you have a great day!
Chris

 

 

Has my article or post helped? Please mark as Solution or Like the article!
novinosrin
Tourmaline | Level 20

Wise words, and truly makes sense. I can't agree more. I must admit I do not advocate any however it's fun and entertainment for me. Nothing more/nothing less. 

 

I once asked @mkeintz  and @hashman(not on the same thread of course) , why do you offer such sophisticated solutions when  chances are OP is not likely to at their level and could be confusing but at the same time those solutions inspires blokes like to me to experiment further and lift my spirits up. Oh well, a lot of times their solutions have compelled me to read 2-3 times or more to fully understand. 

 

It's a buffet and it's OP's choice to pick and choose. 

 

 

 

 

mkeintz
PROC Star

I too agree with these observations.  And in particular, I would advise limiting use of MONOTONIC to development (or maybe instructional) work  

 

I presume it's an officially "deprecated" usage.  And I also presume that once a user has entered the world of parallel data access processing, MONOTONIC should be shunned even for development purposes.  It's worth knowing the MONOTONIC exists but ...

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User

data have;
input ID       CODE ;
cards;
1            3
1            3
1            3
2           76
2           89
2           13
3            4
;
data temp;
 set have ;
 by id code notsorted;
 if first.id then group=0;
 group+first.code;
run;
data temp1;
 set temp;
 by id group;
 if first.group and group in ( 1 2 );
run;
data want;
 set temp1;
 by id ;
 if last.id;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 4657 views
  • 8 likes
  • 8 in conversation