DATA Step, Macro, Functions and more

Removing specific duplicates

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

Removing specific duplicates

[ Edited ]

Sorry if this is redundant but I don't see anything about removing duplicates in here that answers my question.  I've tried IF-THEN, and the Case Where statements.  I want to remove the Var2 obs with a 7 only when Var1=X (character data).  This is a very large consolidated dataset with over 23.5 million rows and 156 variables.  The other variables in the Y and Z rows have different data and I need to retain it.  So I just want to get rid of the X duplicate.

 

What I have:  

 

Var1Var2
X5
X6
X7
Y5
Y6
Y7
Z7
Z9
Z10

 

 

What a want:

 

 

Var1Var2
X2
X3
Y5
Y6
Y7
Z7
Z9
Z10

 

Thank you ahead of time!


Accepted Solutions
Solution
‎04-05-2017 11:26 AM
Super User
Posts: 10,526

Re: Removing specific duplicates

If code like this does not work post your code and log and copies of some of the records it should delete but doesn't:

 

data want;
   set have;
   if var1='X' and var2=7 then delete;
run;

Note the character comparison is case sensitive so if you say var1='ABC' and the actual value is 'Abc' then that is not a match.

 

If that is an issue then you may want to UPCASE the variable: If upcase(var1) = 'ABC'

View solution in original post


All Replies
Solution
‎04-05-2017 11:26 AM
Super User
Posts: 10,526

Re: Removing specific duplicates

If code like this does not work post your code and log and copies of some of the records it should delete but doesn't:

 

data want;
   set have;
   if var1='X' and var2=7 then delete;
run;

Note the character comparison is case sensitive so if you say var1='ABC' and the actual value is 'Abc' then that is not a match.

 

If that is an issue then you may want to UPCASE the variable: If upcase(var1) = 'ABC'

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

Thanks Ballardw for your quick reply.  Yes, I did try that and it seems it should be that simple.  It retains all of the X's that are other than 7, but it eliminates all of the rows with Var2=7, that being the ones that have Var1=Y, and Z.  I'm sorting by Var1 and Var2.  Would this make a different?

Super User
Posts: 10,526

Re: Removing specific duplicates


Randy1 wrote:

Thanks Ballardw for your quick reply.  Yes, I did try that and it seems it should be that simple.  It retains all of the X's that are other than 7, but it eliminates all of the rows with Var2=7, that being the ones that have Var1=Y, and Z.  I'm sorting by Var1 and Var2.  Would this make a different?


Please show the code that you ran as that is not what will happen with the code I provided.

data have;
input Var1 $ Var2 ;
datalines;
X 5 
X 6 
X 7 
Y 5 
Y 6 
Y 7 
Z 7 
Z 9 
Z 10 
;

run;
data want;
   set have;
   if var1='X' and var2=7 then delete;
run;
proc print noobs;
run;

Note that your example want data contains values that were not in the have data.

 

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

[ Edited ]

Unfortunately I'm not allowed to post my actual data so I have to create an example.

Super User
Posts: 10,526

Re: Removing specific duplicates


Randy1 wrote:

Unfortunately I'm not allowed to post my actual data so I have to create an example.


At least post the code that you ran.

The behavior you claim for the "if var1='X' and var2=7 then delete;"

would make me think that you either included an OR some where in the code or left out the AND in the comparison.

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

Ballardw, I removed the BY statement in the code below and it ran.  Do you know why that would mess it up?

 

 

DATA TEMP1;
SET QUERY_FOR_MERGE_STORET_SRBT_0006;
BY Organizati POINT_X POINT_Y DATE YEAR SLID_Rep;
IF Var1='X' and Var2>1 THEN DELETE;

RUN

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

Thanks for all you help everyone. I'll let it at that.
Super User
Posts: 10,526

Re: Removing specific duplicates


Randy1 wrote:

Ballardw, I removed the BY statement in the code below and it ran.  Do you know why that would mess it up?

 

 

DATA TEMP1;
SET QUERY_FOR_MERGE_STORET_SRBT_0006;
BY Organizati POINT_X POINT_Y DATE YEAR SLID_Rep;
IF Var1='X' and Var2>1 THEN DELETE;

RUN


If that code did not yield the result desired I would suspect soem error to appear in the log. Most likely would be something like a sort order error or possibly one of the variables on the by statement. If the log shows something like:

ERROR: BY variables are not properly sorted on data set

then the resulting set would not be replaced. If you have an existing temp1 created incorrectly then that would be what you would see when you examine the values in Temp1. Or

 

ERROR: BY variable <any of your variables on the by statement> is not on input data set

would also not replace an existing Temp1.

 

New Contributor
Posts: 3

Re: Removing specific duplicates

I don't really understand what you want !!

 

If you want to delete duplicates you can use NODUPKEYS in Proc Sort  or an Proc sql with distinct :

 

Proc Sort data = your_table NODUPKEYS ; 

By _All_ ;

Run ; 

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

Thanks Mokben for your quick reply and suggestion.  I'll will give your suggestion a try as I prefer SQL and using SAS EG.  I was afraid my example might not be clear.

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

if i understood correctly.


data new;
input var1 $ var2;
cards;
X 5
X 6
X 7
Y 5
Y 6
Y 7
Z 7
Z 9
Z 10
;
run;
data new1;
set new;
if var1='X' and var2=7 then delete;
run;

Occasional Contributor
Posts: 7

Re: Removing specific duplicates

Thanks for your quick reply Kaverisonu1.  I used the suggested code above (that is essentially the same as yours here)   I removed a sort BY statement and it seems to work now.  Not sure why that would mess anything up, but as long as it work, I'm happy.  Have a good one.

Super User
Posts: 10,526

Re: Removing specific duplicates

Sort order would not affect the results of the posted code. If you have a "sort by" in the data step that would cause an error and not replace an existing incorrect dataset. As I mentioned, posting your code can clarify a number of questions.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 158 views
  • 0 likes
  • 4 in conversation