topic Re: How to deduplicate data using Proc Sql with 2 different criteria. in SAS Programming

How to deduplicate data using Proc Sql with 2 different criteria.

rebelde52 — Mon, 23 Oct 2023 23:15:11 GMT

Hello,

I am trying to remove duplicates from a large dataset and have been having trouble doing that. I was using proc sql to do it but it seems like it does it for one criteria but not the other. Specifically, it will work for the max test_scores but it won't work for " test_scores is null and Pending = 'YES'" . I did this proc sql separately and it will work for the null condition but I want this done all together. Below is sample code and data what my example data looks like and what I want.

proc sql;
create table WANT as
select*
from HAVE
group by ID
having test_scores = max(test_scores) or test_scores is null and Pending = 'YES';
quit;

Have

ID	date	test_scores	Pending
2A	9/5/19	100	YES
2A	9/5/19		NO
2A	9/5/19		NO
3C	10/21/21		YES
3C	10/21/21		NO
3C	10/21/21		NO

Want

ID	date	test_scores	Pending
2A	9/5/19	100	YES
3C	10/21/21		YES

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Patrick — Tue, 24 Oct 2023 01:25:35 GMT

Please help us help you and provide your HAVE data via working SAS datastep code that creates the table.

A group by requires an aggregate function in the select clause. Your code likely returns some log message that the group by has been converted to an order by.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

rebelde52 — Tue, 24 Oct 2023 01:34:24 GMT

data HAVE;
   input ID $ date mmddyy10. test_scores Pending $;
   format date mmddyy10.;
   datalines;
2A 9/5/19  100 YES
2A 9/5/19  .    NO
2A 9/5/19  .    NO
3C 10/21/21 .  YES
3C 10/21/21 .  NO
3C 10/21/21 .  NO
;

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Patrick — Tue, 24 Oct 2023 10:24:30 GMT

Here one option:

data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
  create view v_inter as
  select *
  from have
  order by id, test_scores, Pending
  ;
quit;

/* alternative to above SQL: Simple proc sort */
/*proc sort data=have out=v_inter;*/
/*  by id test_scores pending;*/
/*run;*/

data want;
  set v_inter;
  by id;
  if last.id;
run;

proc print data=want;
run;

Re: How to deduplicate data using Proc Sql with 2 different criteria.

andreas_lds — Tue, 24 Oct 2023 06:39:13 GMT

Form the data you have posted something like

data want;
  set have;
  where Pending = 'YES';
run;

would create the dataset you expect, but most likely only for the small subset posted.

Can you add some more observations?

What would you expect in the output dataset, if

in the second obs test_score is 200
id=c3 and test_score = 1 in the last obs?

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Kurt_Bremser — Tue, 24 Oct 2023 08:14:57 GMT

Look at your log. A SELECT * will invariably cause a "automatic remerge", unless all variables in the dataset are used in the GROUP BY.

Assuming Pending is character, without a special display format, this should do it:

proc sort data=have;
by id test_scores pending;
run;

data want;
set have;
by id;
if last.id;
run;

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Ksharp — Tue, 24 Oct 2023 11:26:57 GMT

/*
That is not a good behavior for coding.
Assuming I understood what you need.
*/
data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
create table WANT as
select*
from HAVE
group by ID
having (test_scores = max(test_scores)) and 
       (test_scores is not missing or Pending='YES');
quit;