How to deduplicate data using Proc Sql with 2 different criteria.

rebelde52 · Posted 10-23-2023 06:26 PM

Hello,

I am trying to remove duplicates from a large dataset and have been having trouble doing that. I was using proc sql to do it but it seems like it does it for one criteria but not the other. Specifically, it will work for the max test_scores but it won't work for " test_scores is null and Pending = 'YES'" . I did this proc sql separately and it will work for the null condition but I want this done all together. Below is sample code and data what my example data looks like and what I want.

proc sql;
create table WANT as
select*
from HAVE
group by ID
having test_scores = max(test_scores) or test_scores is null and Pending = 'YES';
quit;

Have

ID	date	test_scores	Pending
2A	9/5/19	100	YES
2A	9/5/19		NO
2A	9/5/19		NO
3C	10/21/21		YES
3C	10/21/21		NO
3C	10/21/21		NO

Want

ID	date	test_scores	Pending
2A	9/5/19	100	YES
3C	10/21/21		YES

Patrick · Posted 10-23-2023 09:24 PM

Please help us help you and provide your HAVE data via working SAS datastep code that creates the table.

A group by requires an aggregate function in the select clause. Your code likely returns some log message that the group by has been converted to an order by.

rebelde52 · Posted 10-23-2023 09:34 PM

data HAVE;
   input ID $ date mmddyy10. test_scores Pending $;
   format date mmddyy10.;
   datalines;
2A 9/5/19  100 YES
2A 9/5/19  .    NO
2A 9/5/19  .    NO
3C 10/21/21 .  YES
3C 10/21/21 .  NO
3C 10/21/21 .  NO
;

Patrick · Posted 10-23-2023 09:37 PM

Here one option:

data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
  create view v_inter as
  select *
  from have
  order by id, test_scores, Pending
  ;
quit;

/* alternative to above SQL: Simple proc sort */
/*proc sort data=have out=v_inter;*/
/*  by id test_scores pending;*/
/*run;*/

data want;
  set v_inter;
  by id;
  if last.id;
run;

proc print data=want;
run;

andreas_lds · Posted 10-24-2023 02:39 AM

Form the data you have posted something like

data want;
  set have;
  where Pending = 'YES';
run;

would create the dataset you expect, but most likely only for the small subset posted.

Can you add some more observations?

What would you expect in the output dataset, if

in the second obs test_score is 200
id=c3 and test_score = 1 in the last obs?

Kurt_Bremser · Posted 10-24-2023 04:14 AM

Look at your log. A SELECT * will invariably cause a "automatic remerge", unless all variables in the dataset are used in the GROUP BY.

Assuming Pending is character, without a special display format, this should do it:

proc sort data=have;
by id test_scores pending;
run;

data want;
set have;
by id;
if last.id;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 10-24-2023 07:26 AM

/*
That is not a good behavior for coding.
Assuming I understood what you need.
*/
data HAVE;
  infile datalines dsd truncover;
  input ID $ date :mmddyy10. test_scores Pending $;
  format date mmddyy10.;
  datalines;
2A,9/5/19,100,YES
2A,9/5/19,.,NO
2A,9/5/19,.,NO
3C,10/21/21,.,YES
3C,10/21/21,.,NO
3C,10/21/21,.,NO
;

proc sql;
create table WANT as
select*
from HAVE
group by ID
having (test_scores = max(test_scores)) and 
       (test_scores is not missing or Pending='YES');
quit;

How to deduplicate data using Proc Sql with 2 different criteria.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Re: How to deduplicate data using Proc Sql with 2 different criteria.

Classroom Training Available!