Hello,
I am trying to remove duplicates from a large dataset and have been having trouble doing that. I was using proc sql to do it but it seems like it does it for one criteria but not the other. Specifically, it will work for the max test_scores but it won't work for " test_scores is null and Pending = 'YES'"
. I did this proc sql separately and it will work for the null condition but I want this done all together. Below is sample code and data what my example data looks like and what I want.
proc sql;
create table WANT as
select*
from HAVE
group by ID
having test_scores = max(test_scores) or test_scores is null and Pending = 'YES';
quit;
Have
ID | date | test_scores | Pending |
2A | 9/5/19 | 100 | YES |
2A | 9/5/19 | NO | |
2A | 9/5/19 | NO | |
3C | 10/21/21 | YES | |
3C | 10/21/21 | NO | |
3C | 10/21/21 | NO |
Want
ID | date | test_scores | Pending |
2A | 9/5/19 | 100 | YES |
3C | 10/21/21 | YES |
Please help us help you and provide your HAVE data via working SAS datastep code that creates the table.
A group by requires an aggregate function in the select clause. Your code likely returns some log message that the group by has been converted to an order by.
data HAVE;
input ID $ date mmddyy10. test_scores Pending $;
format date mmddyy10.;
datalines;
2A 9/5/19 100 YES
2A 9/5/19 . NO
2A 9/5/19 . NO
3C 10/21/21 . YES
3C 10/21/21 . NO
3C 10/21/21 . NO
;
Here one option:
data HAVE;
infile datalines dsd truncover;
input ID $ date :mmddyy10. test_scores Pending $;
format date mmddyy10.;
datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;
proc sql;
create view v_inter as
select *
from have
order by id, test_scores, Pending
;
quit;
/* alternative to above SQL: Simple proc sort */
/*proc sort data=have out=v_inter;*/
/* by id test_scores pending;*/
/*run;*/
data want;
set v_inter;
by id;
if last.id;
run;
proc print data=want;
run;
Form the data you have posted something like
data want;
set have;
where Pending = 'YES';
run;
would create the dataset you expect, but most likely only for the small subset posted.
Can you add some more observations?
What would you expect in the output dataset, if
Look at your log. A SELECT * will invariably cause a "automatic remerge", unless all variables in the dataset are used in the GROUP BY.
Assuming Pending is character, without a special display format, this should do it:
proc sort data=have;
by id test_scores pending;
run;
data want;
set have;
by id;
if last.id;
run;
/*
That is not a good behavior for coding.
Assuming I understood what you need.
*/
data HAVE;
infile datalines dsd truncover;
input ID $ date :mmddyy10. test_scores Pending $;
format date mmddyy10.;
datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;
proc sql;
create table WANT as
select*
from HAVE
group by ID
having (test_scores = max(test_scores)) and
(test_scores is not missing or Pending='YES');
quit;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.