Solved: Deleting Duplicate Records But Keep the one with (not binary) Specific...

Jay_Aguilar · Posted 05-06-2020 04:59 AM

Hi everyone,

I have a pretty similar problem as in https://communities.sas.com/t5/SAS-Procedures/Deleting-Duplicate-Records-But-Keep-the-one-with-Speci... but I my specific variable is not binary but can have diffrent expressions:

Customer Name Customer ID Address Customer Type

Joe Doe 123 123 Way Online (retain)

Joe Doe 123 123 Way In-Store (delete)

Ken Moore 456 456 Way Online (retain)

Ken Moore 456 456 Way Offline (delete)

Lisa Mae 789 789 Way In-Store (delete)

Lisa Mae 123 123 Way Marketplace (retain)

Paul Hum 456 456 Way Online2 (retain)

Paul Hum 456 456 Way Market (delete)

So I would need the SAS procedure where I can restrict which specific expressions should be kept and which ones can be deleted.

Thank you for your help.

FreelanceReinh · Posted 05-06-2020 06:09 AM

Hi @Jay_Aguilar,

Similar to @ed_sas_member's approach, you could create an informat to define the hierarchy, e.g.:

proc format;
invalue type
'Online'      = 1
'Online2'     = 2
'Marketplace' = 3
'Market'      = 4
'In-Store'    = 5
'Offline'     = 6
other         = 99;
run;

Then, using a PROC SQL view, you don't need an intermediate dataset or a new variable:

proc sql;
create view _tmp as
select * from have
order by id, /* add more key variables if needed */
         input(Customer_type, type.);
quit;

data want;
set _tmp;
by id; /* add more key variables if needed */
if first.id; /* then replace ID with last key variable */
run;

Alternatively, you could define the hierarchy with the WHICHC function and use whichc(Customer_type, ...) instead of input(...) in the ORDER BY clause. The "other" case would then require special consideration (e.g. switching to "if last.id;" with reversed order in the WHICHC arguments) because WHICHC returns 0 for unexpected Customer_type values.

View solution in original post

ed_sas_member · Posted 05-06-2020 05:13 AM

Hi @Jay_Aguilar

I am thinking about something like this -> you need to establish the rules to define "priority"

data have;
	infile datalines dlm=",";
	input Customer_Name:$50. Customer_ID Address:$50. Customer_Type:$50.;
	datalines;
Joe Doe,123,123 Way,Online
Joe Doe,123,123 Way,In-Store
Ken Moore,456,456 Way,Online
Ken Moore,456,456 Way,Offline
Lisa Mae,789,789 Way,In-Store
Lisa Mae,123,123 Way,Marketplace
Paul Hum,456,456 Way,Online2
Paul Hum,456,456 Way,Market
;
run;

data have_priority; /*Check priorities !*/
	set have;
	if 	    Customer_Type = "Online" 	  then Priority = 1;
	else if Customer_Type = "Online2" 	  then Priority = 2;
	else if Customer_Type = "In-Store" 	  then Priority = 3;
	else if Customer_Type = "Marketplace" then Priority = 4;
	else if Customer_Type = "Market"	  then Priority = 5;
	else if Customer_Type = "Offline"     then Priority = 6;
run;

proc sort data=have_priority;
	by Customer_Name Customer_ID priority;
run;

data want;
	set have_priority;
	by Customer_Name Customer_ID priority;
	if first.Customer_Name then output;
	drop priority;
run;

Best,

Kurt_Bremser · Posted 05-06-2020 05:27 AM

Try this:

data want;
set have (where=(customer_type not in ("Market","Offline","In-Store")));
by customer_name notsorted;
if first.customer_name;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 05-06-2020 06:09 AM

Hi @Jay_Aguilar,

Similar to @ed_sas_member's approach, you could create an informat to define the hierarchy, e.g.:

proc format;
invalue type
'Online'      = 1
'Online2'     = 2
'Marketplace' = 3
'Market'      = 4
'In-Store'    = 5
'Offline'     = 6
other         = 99;
run;

Then, using a PROC SQL view, you don't need an intermediate dataset or a new variable:

proc sql;
create view _tmp as
select * from have
order by id, /* add more key variables if needed */
         input(Customer_type, type.);
quit;

data want;
set _tmp;
by id; /* add more key variables if needed */
if first.id; /* then replace ID with last key variable */
run;

Alternatively, you could define the hierarchy with the WHICHC function and use whichc(Customer_type, ...) instead of input(...) in the ORDER BY clause. The "other" case would then require special consideration (e.g. switching to "if last.id;" with reversed order in the WHICHC arguments) because WHICHC returns 0 for unexpected Customer_type values.

Jay_Aguilar · Posted 05-06-2020 07:15 AM

Thank you very much all @Kurt_Bremser @ed_sas_member @FreelanceReinh, actually all 3 solutions work for me.

Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Re: Deleting Duplicate Records But Keep the one with (not binary) Specific Variable

Registration is open

SAS Training: Just a Click Away