BookmarkSubscribeRSS Feed
rebelde52
Fluorite | Level 6

Hello, 

I am trying to remove duplicates from a large dataset and have been having trouble doing that. I was using proc sql to do it but it seems like it does it for one criteria but not the other. Specifically, it will work for the max test_scores but it won't work for " test_scores is null and Pending = 'YES'" . I did this proc sql separately and it will work for the null condition but I want this done all together. Below is sample code and data what my example data looks like and what I want. 

 

proc sql;
create table WANT as
select*
from HAVE
group by ID
having test_scores = max(test_scores) or test_scores is null and Pending = 'YES';
quit;

Have

IDdatetest_scoresPending
2A9/5/19100YES
2A9/5/19 NO
2A9/5/19 NO
3C10/21/21 YES
3C10/21/21 NO
3C10/21/21 NO

 

Want 

IDdatetest_scoresPending
2A9/5/19100YES
3C10/21/21 YES
6 REPLIES 6
Patrick
Opal | Level 21

Please help us help you and provide your HAVE data via working SAS datastep code that creates the table.

 

A group by requires an aggregate function in the select clause. Your code likely returns some log message that the group by has been converted to an order by.

rebelde52
Fluorite | Level 6
data HAVE;
   input ID $ date mmddyy10. test_scores Pending $;
   format date mmddyy10.;
   datalines;
2A 9/5/19  100 YES
2A 9/5/19  .    NO
2A 9/5/19  .    NO
3C 10/21/21 .  YES
3C 10/21/21 .  NO
3C 10/21/21 .  NO
;
Patrick
Opal | Level 21

Here one option:

data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
  create view v_inter as
  select *
  from have
  order by id, test_scores, Pending
  ;
quit;

/* alternative to above SQL: Simple proc sort */
/*proc sort data=have out=v_inter;*/
/*  by id test_scores pending;*/
/*run;*/

data want;
  set v_inter;
  by id;
  if last.id;
run;

proc print data=want;
run;

Patrick_0-1698111514995.png

 

andreas_lds
Jade | Level 19

Form the data you have posted something like

data want;
  set have;
  where Pending = 'YES';
run;

would create the dataset you expect, but most likely only for the small subset posted.

Can you add some more observations?

What would you expect in the output dataset, if

  • in the second obs test_score is 200
  • id=c3 and test_score = 1 in the last obs?
Kurt_Bremser
Super User

Look at your log. A SELECT * will invariably cause a "automatic remerge", unless all variables in the dataset are used in the GROUP BY.

Assuming Pending is character, without a special display format, this should do it:

proc sort data=have;
by id test_scores pending;
run;

data want;
set have;
by id;
if last.id;
run;
Ksharp
Super User
/*
That is not a good behavior for coding.
Assuming I understood what you need.
*/
data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
create table WANT as
select*
from HAVE
group by ID
having (test_scores = max(test_scores)) and 
       (test_scores is not missing or Pending='YES');
quit;

SAS Innovate 2025: Register Today!

 

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1274 views
  • 0 likes
  • 5 in conversation