- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I am trying to remove duplicates from a large dataset and have been having trouble doing that. I was using proc sql to do it but it seems like it does it for one criteria but not the other. Specifically, it will work for the max test_scores but it won't work for " test_scores is null and Pending = 'YES'"
. I did this proc sql separately and it will work for the null condition but I want this done all together. Below is sample code and data what my example data looks like and what I want.
proc sql;
create table WANT as
select*
from HAVE
group by ID
having test_scores = max(test_scores) or test_scores is null and Pending = 'YES';
quit;
Have
ID | date | test_scores | Pending |
2A | 9/5/19 | 100 | YES |
2A | 9/5/19 | NO | |
2A | 9/5/19 | NO | |
3C | 10/21/21 | YES | |
3C | 10/21/21 | NO | |
3C | 10/21/21 | NO |
Want
ID | date | test_scores | Pending |
2A | 9/5/19 | 100 | YES |
3C | 10/21/21 | YES |
- Tags:
- duplicates
- sql
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please help us help you and provide your HAVE data via working SAS datastep code that creates the table.
A group by requires an aggregate function in the select clause. Your code likely returns some log message that the group by has been converted to an order by.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data HAVE;
input ID $ date mmddyy10. test_scores Pending $;
format date mmddyy10.;
datalines;
2A 9/5/19 100 YES
2A 9/5/19 . NO
2A 9/5/19 . NO
3C 10/21/21 . YES
3C 10/21/21 . NO
3C 10/21/21 . NO
;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here one option:
data HAVE;
infile datalines dsd truncover;
input ID $ date :mmddyy10. test_scores Pending $;
format date mmddyy10.;
datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;
proc sql;
create view v_inter as
select *
from have
order by id, test_scores, Pending
;
quit;
/* alternative to above SQL: Simple proc sort */
/*proc sort data=have out=v_inter;*/
/* by id test_scores pending;*/
/*run;*/
data want;
set v_inter;
by id;
if last.id;
run;
proc print data=want;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Form the data you have posted something like
data want;
set have;
where Pending = 'YES';
run;
would create the dataset you expect, but most likely only for the small subset posted.
Can you add some more observations?
What would you expect in the output dataset, if
- in the second obs test_score is 200
- id=c3 and test_score = 1 in the last obs?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Look at your log. A SELECT * will invariably cause a "automatic remerge", unless all variables in the dataset are used in the GROUP BY.
Assuming Pending is character, without a special display format, this should do it:
proc sort data=have;
by id test_scores pending;
run;
data want;
set have;
by id;
if last.id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
/*
That is not a good behavior for coding.
Assuming I understood what you need.
*/
data HAVE;
infile datalines dsd truncover;
input ID $ date :mmddyy10. test_scores Pending $;
format date mmddyy10.;
datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;
proc sql;
create table WANT as
select*
from HAVE
group by ID
having (test_scores = max(test_scores)) and
(test_scores is not missing or Pending='YES');
quit;