Solved: Re: PROC SQL - Grouping

zishaq · Posted 11-25-2019 11:06 AM

Hello all,

I have a question around using PROC SQL which is probably best explained using the following example.

I have two variables, ID and CODE.

ID CODE

1 3

2 76

2 89

2 13

3 4

For each unique ID, I want to select the second distinct CODEvalue.

So the answers should be;

ID1= 3 (we only have 1 distinct value)

ID2= 89 (this is the 2nd distinct value after 76)

ID3= 4 (we only have 1 value so we want to use this)

How do I code this using PROC SQL (if possible?)

Thanks

novinosrin · Posted 11-25-2019 11:22 AM

Hi @zishaq Love these fun puzzles



data have;
input ID       CODE ;
cards;
1            3
1            3
1            3
2           76
2           89
2           13
3            4
;

proc sql;
create table want(drop=_:) as
select *,count(distinct code) as _c
from (select *,monotonic() as _rn from have)
group by id
having _c>1 and _rn=min(_rn)+1 or _c=1 and min(_rn)=_rn
order by id, _rn;
quit;

View solution in original post

PaigeMiller · Posted 11-25-2019 11:20 AM

Very easy to do using a DATA step; probably much harder to code in SQL.

By the way, are the data are used in the order shown? There is no need to sort the data?

data want;
    set have;
    by id;
    if first.id then count=0;
    count+1;
    if count=2 or (count=1 and last.id) then output;
run;

--
Paige Miller

zishaq · Posted 11-25-2019 11:21 AM

Yes, data is already sorted (should have mentioned this before).

Thanks

ballardw · Posted 11-25-2019 11:20 AM

@zishaq wrote:

Hello all,

I have a question around using PROC SQL which is probably best explained using the following example.

I have two variables, ID and CODE.

ID CODE

1    3

1    3

1    3

2 76

2 89

2 13

3    4

For each unique ID, I want to select the second distinct CODEvalue.

So the answers should be;

ID1= 3 (we only have 1 distinct value)

ID2= 89 (this is the 2nd distinct value after 76)

ID3= 4 (we only have 1 value so we want to use this)

How do I code this using PROC SQL (if possible?)

Thanks

Thanks

In general SQL is a poor choice for attempting specific order processing of a set that is already in order. The DATA step would be a better way to go.

data have;
input ID       CODE  ;
datalines;
1            3 
1            3 
1            3 
2           76 
2           89 
2           13 
3            4
;

data want;
   set have;
   by id;
   if first.id then count=1;
   else count+1;
   if count=2 or (count=1 and last.id) then output;
   drop count;
run;

zishaq · Posted 11-25-2019 11:24 AM

Thanks - this is correct although I am looking to use SQL and was wondering if there was a way possible...

novinosrin · Posted 11-25-2019 11:22 AM

Hi @zishaq Love these fun puzzles



data have;
input ID       CODE ;
cards;
1            3
1            3
1            3
2           76
2           89
2           13
3            4
;

proc sql;
create table want(drop=_:) as
select *,count(distinct code) as _c
from (select *,monotonic() as _rn from have)
group by id
having _c>1 and _rn=min(_rn)+1 or _c=1 and min(_rn)=_rn
order by id, _rn;
quit;

zishaq · Posted 11-25-2019 11:29 AM

Awesome - thanks for your help

Lucy1 · Posted 11-25-2019 11:24 AM

I am not sure if this is possible in Proc SQL as SQL does not process data one row at a time in the order it is found, rather it takes chuncks of data.

If SQL is not required the the solution is possible - first remove duplicate rows then use first. and last. to flag if it is the first row of the group. Skip that row and output the next row using the lag function.

Something like this should work:

data test;
input ID CODE; 
datalines;
1 3
1 3
1 3
2 76
2 89
2 13
3 4
run;

/*noduprec will remove duplicates of id and code*/
proc sort data=test noduprec;
  by id;
run;

data second;
  set test;
  by id;
  if first.id and last.id then output; /*output row for single row groups*/
  else if first.id then flag=1;

  if lag(flag)=1 then output;/*output the row after the first if more than one*/
  drop flag;
run;

DarthPathos · Posted 11-25-2019 11:29 AM

I see a lot of people recommending the DATA Step. As a hardcore advocate and fan of PROC SQL, here are two links that provide alternatives (both of which I have used with a great deal of success in my own work):

Subqueries

https://stackoverflow.com/questions/43191746/sql-to-proc-sql-partition-by-alternative-min-case

or

Hash Tables

https://communities.sas.com/t5/General-SAS-Programming/Partition-by-equivalent-in-SAS-base-or-proc-s...

Both are relatively easy (subqueries are significantly easier than hash tables). Post back if you need more help using SQL!

Chris

Has my article or post helped? Please mark as Solution or Like the article!

PaigeMiller · Posted 11-25-2019 11:39 AM

With regards to solutions using MONOTONIC() in SQL, this is unsupported and could provide incorrect results in certain situations (as explained here: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752) and may provide incorrect solutions in other situations not listed in the article.

I am always uncomfortable advising newcomers to use SQL in situations where a DATA step works just as well, or better (and is easier and more straightforward to program); see Maxim 10 and Maxim 14 of "Maxims of Maximally Efficient SAS Programmers". I am uncomfortable providing solutions which use undocumented features of SAS. I am uncomfortable with the idea of using SQL for "everything". I am uncomfortable with a user who states that the solution has to be in SQL, and I generally ignore that and give what I think is a better solution not using SQL (just in case someone else is reading who is not restricted to SQL).

--
Paige Miller

DarthPathos · Posted 11-25-2019 11:51 AM

@PaigeMiller As I mentioned, I'm a hardcore SQL fan (SQL DBA >20 years, SAS programmer~7). Having said that, I'm nodding furiously at your reply.

A) I have often used the DATA step, particularly in merging massive files together. Way more efficient than SQL.

B) A boss of mine used to say that a sign of a good programmer is appreciating the fact that there's a dozen different ways to handle the same problem, and the sign of a great programmer is to know when to use each one.

C) To second your comment re: MONOTONIC, I have used it. I have seen it fail me more often than not, and so avoid it now (I will occasionally use it for testing or dummy data).

This goes back to a point I make to the students I mentor - research, test, research some more. If it's a small dataset, you may be fine with your first solution, but you can't assume it'll work every time.

I hope you have a great day!
Chris

Has my article or post helped? Please mark as Solution or Like the article!

novinosrin · Posted 11-25-2019 11:57 AM

Wise words, and truly makes sense. I can't agree more. I must admit I do not advocate any however it's fun and entertainment for me. Nothing more/nothing less.

I once asked @mkeintz and @hashman(not on the same thread of course) , why do you offer such sophisticated solutions when chances are OP is not likely to at their level and could be confusing but at the same time those solutions inspires blokes like to me to experiment further and lift my spirits up. Oh well, a lot of times their solutions have compelled me to read 2-3 times or more to fully understand.

It's a buffet and it's OP's choice to pick and choose.

mkeintz · Posted 11-25-2019 12:32 PM

I too agree with these observations. And in particular, I would advise limiting use of MONOTONIC to development (or maybe instructional) work

I presume it's an officially "deprecated" usage. And I also presume that once a user has entered the world of parallel data access processing, MONOTONIC should be shunned even for development purposes. It's worth knowing the MONOTONIC exists but ...

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 11-26-2019 07:21 AM


data have;
input ID       CODE ;
cards;
1            3
1            3
1            3
2           76
2           89
2           13
3            4
;
data temp;
 set have ;
 by id code notsorted;
 if first.id then group=0;
 group+first.code;
run;
data temp1;
 set temp;
 by id group;
 if first.group and group in ( 1 2 );
run;
data want;
 set temp1;
 by id ;
 if last.id;
run;