BookmarkSubscribeRSS Feed
rebelde52
Fluorite | Level 6

Hello, 

I am trying to remove duplicates from a large dataset and have been having trouble doing that. I was using proc sql to do it but it seems like it does it for one criteria but not the other. Specifically, it will work for the max test_scores but it won't work for " test_scores is null and Pending = 'YES'" . I did this proc sql separately and it will work for the null condition but I want this done all together. Below is sample code and data what my example data looks like and what I want. 

 

proc sql;
create table WANT as
select*
from HAVE
group by ID
having test_scores = max(test_scores) or test_scores is null and Pending = 'YES';
quit;

Have

IDdatetest_scoresPending
2A9/5/19100YES
2A9/5/19 NO
2A9/5/19 NO
3C10/21/21 YES
3C10/21/21 NO
3C10/21/21 NO

 

Want 

IDdatetest_scoresPending
2A9/5/19100YES
3C10/21/21 YES
6 REPLIES 6
Patrick
Opal | Level 21

Please help us help you and provide your HAVE data via working SAS datastep code that creates the table.

 

A group by requires an aggregate function in the select clause. Your code likely returns some log message that the group by has been converted to an order by.

rebelde52
Fluorite | Level 6
data HAVE;
   input ID $ date mmddyy10. test_scores Pending $;
   format date mmddyy10.;
   datalines;
2A 9/5/19  100 YES
2A 9/5/19  .    NO
2A 9/5/19  .    NO
3C 10/21/21 .  YES
3C 10/21/21 .  NO
3C 10/21/21 .  NO
;
Patrick
Opal | Level 21

Here one option:

data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
  create view v_inter as
  select *
  from have
  order by id, test_scores, Pending
  ;
quit;

/* alternative to above SQL: Simple proc sort */
/*proc sort data=have out=v_inter;*/
/*  by id test_scores pending;*/
/*run;*/

data want;
  set v_inter;
  by id;
  if last.id;
run;

proc print data=want;
run;

Patrick_0-1698111514995.png

 

andreas_lds
Jade | Level 19

Form the data you have posted something like

data want;
  set have;
  where Pending = 'YES';
run;

would create the dataset you expect, but most likely only for the small subset posted.

Can you add some more observations?

What would you expect in the output dataset, if

  • in the second obs test_score is 200
  • id=c3 and test_score = 1 in the last obs?
Kurt_Bremser
Super User

Look at your log. A SELECT * will invariably cause a "automatic remerge", unless all variables in the dataset are used in the GROUP BY.

Assuming Pending is character, without a special display format, this should do it:

proc sort data=have;
by id test_scores pending;
run;

data want;
set have;
by id;
if last.id;
run;
Ksharp
Super User
/*
That is not a good behavior for coding.
Assuming I understood what you need.
*/
data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
create table WANT as
select*
from HAVE
group by ID
having (test_scores = max(test_scores)) and 
       (test_scores is not missing or Pending='YES');
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 694 views
  • 0 likes
  • 5 in conversation