Help using Base SAS procedures

Deleting duplicates based on multiple criteria

Accepted Solution Solved
Reply
Contributor
Posts: 32
Accepted Solution

Deleting duplicates based on multiple criteria

How can I delete duplicates of one variable based on a start date using NODUPKEY?

Thanks!


Accepted Solutions
Solution
‎10-16-2017 10:35 AM
Super User
Posts: 23,683

Re: Deleting duplicates based on multiple criteria

Sort it twice. First time with email address and date, second with email address and the NODUPKEY option.

View solution in original post


All Replies
Super User
Posts: 23,683

Re: Deleting duplicates based on multiple criteria

 

Provide more details.


lboyd wrote:

How can I delete duplicates of one variable based on a start date using NODUPKEY?

Thanks!


 

Contributor
Posts: 32

Re: Deleting duplicates based on multiple criteria

I've been using this:
PROC SORT DATA=CMpre DUPOUT=results NODUPKEY ;
BY QID25;
RUN ;

where qid25 is an email address. I need to get rid of duplicate email addresses and I want to keep the one that had the earliest start date. Startdate variable looks something like this:
22JUN17:00:00:00
Solution
‎10-16-2017 10:35 AM
Super User
Posts: 23,683

Re: Deleting duplicates based on multiple criteria

Sort it twice. First time with email address and date, second with email address and the NODUPKEY option.

Contributor
Posts: 32

Re: Deleting duplicates based on multiple criteria

This code gets rid of missing e-mails, is there a way to prevent that?
Contributor
Posts: 32

Re: Deleting duplicates based on multiple criteria

Also some of the e-mails start with an uppercase letter while others start with lower case-is there any way to delete based on both? For instance, if someone said Sam123@gmail.com and also sam123@gmail.com-I'd want one of those deleted from the database.
Super User
Posts: 23,683

Re: Deleting duplicates based on multiple criteria

For instance, if someone said Sam123@gmail.com and also sam123@gmail.com-I'd want one of those deleted from the database.

 

To fix this you need to clean your data first.

 

This code gets rid of missing e-mails, is there a way to prevent that?

 

To deal with this you likely need to do it manually. 

First sort and then use a data step with first/last but coding an exception for the missing emails.

 

proc sort data=have;
by group_var;
run;

data want;
set have;
by group_var;
if first.group_var or missing(group_var);
run;



Contributor
Posts: 32

Re: Deleting duplicates based on multiple criteria

Thank you!
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 241 views
  • 0 likes
  • 2 in conversation