BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SAS_student_11
Fluorite | Level 6

I have a dataset of 60K rows and I need to remove duplicates that meet certain criteria across multiple variables.

 

The values flagged and removed have to meet all three of the criteria as follows:

Criteria 1: Duplicates of variable 1 

Criteria 2. That contains 2 or more of the same value in variable 2

Criteria 3. That contains the same value (1) across ALL rows (that meets the first and second criteria) in variable 3

 

Below is an example of the data I have, the data I want, and the data I want to exclude. 

 

Example data I have:

Variable1Variable2Variable3
5511
5511
5511
2111
2122
2121
3312
9021
9031
9021
6721
6721
6721
6711
6711
6711
4031
8162
8161
8141
8141
4322
4321
2191
2191
2191
2191
5521
5521

 

Example data of what I want to keep:

Variable1Variable2 Variable3
2111
2122
2121
3312
4031
4322
4321

 

Example data that I need to flag then filter out:

Variable1Variable2 Variable3
5511
5511
5511
9021
9021
6721
6721
6721
6711
6711
6711
8141
8141
2191
2191
2191
2191
5521
5521

 

 

I am new to SAS programming so please forgive my naivete. I searched the forums and found the code below but it did not work, I receive syntax errors. Perhaps I am putting in the variable names incorrectly. Either way, I am not even sure if this code is what I need.  I appreciate your help...

 

PROC SQL;
   CREATE TABLE WORK.Orders1 AS 
   SELECT t1.variableone, 
          t1.variabletwo, 
          t1.variablethree
      FROM WORK.Orders t1
      ORDER BY t1.variableone,
               t1.variabletwo,
               t1.variablethree;
QUIT;

data WORK.Orders2;
set WORK.Orders1;
by FIRST.WORK.Orders1.variableone LAST.WORK.Orders1.variabletwo ;
if not (WORK.orders1.one and WORK.orders1.two) 
   then flag_1=1;      
   else flag_1=0;     
run;
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Sure.

 


proc sort data=have out=have2;by v1 v2;
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;


proc sql;
create table want as
select * , (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) as remove
from have2 
group by n 
;
quit;

View solution in original post

13 REPLIES 13
Ksharp
Super User
data have;
infile cards expandtabs truncover;
input v1-v3;
cards;
55	1	1
55	1	1
55	1	1
21	1	1
21	2	2
21	2	1
33	1	2
90	2	1
90	3	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
40	3	1
81	6	2
81	6	1
81	4	1
81	4	1
43	2	2
43	2	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1
;

data have;
 set have;
 by v1 notsorted;
 n+first.v1;
run;


proc sql;
create table want as
select * from have group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*));
quit;
SAS_student_11
Fluorite | Level 6

@Ksharp This code actually does not fulfill all of the criteria as there are some values that are not excluded but should be see below:

Annotation 2021-06-24 093734.png

 

So I am not sure this would work for my 60K rows. 

SAS_student_11
Fluorite | Level 6

@Ksharp Below are the full results. There are two extra values that should not be on the list.Annotation 2021-06-24 094328.png

Ksharp
Super User

I don't understand your question. Can you explain why v1=81 should be removed ?

 

Or try this one :


data have;
 set have;
 by v1  notsorted;
 n+first.v1;
run;


proc sql;
create table want as
select * from (
select * from have group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) )
group by n,v2
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) 
;
quit;
SAS_student_11
Fluorite | Level 6

@Ksharp 

The below rows should be removed because:rows to remove.png

1. 81 is found repeated in the dataset (criteria1)

AND

2. Among the rows containing repeating 81s, the number 4 is also repeated (criteria 2)

AND

3. Among those rows v3 contains repeating 1s. 

 

Your code also appears to add to the final dataset whereas rows should only be flagged or removed. 

Ksharp
Super User
There are another two 81
81 6 2
81 6 1
81 4 1
81 4 1
Shouldn't be consider together ?
SAS_student_11
Fluorite | Level 6

@Ksharp the three criteria I need to apply are all-or-nothing criteria. 

The 2 (81s) match with 2 (6s) and the other 2 (81s) match with the 2 (4s). Both of these groups fulfill criteria 1 and 2. But the reason why we are not excluding the first 2 (81s) that match with 2 (6s) is because there is no repeated number in v3. Whereas the other 2 (81s) that match with the 2 (4s) also have repeated 1s. Therefore, they need to be excluded. 

Ksharp
Super User

Try this one . if it is what you want ?

 

data have;
infile cards expandtabs truncover;
input v1-v3;
cards;
55	1	1
55	1	1
55	1	1
21	1	1
21	2	2
21	2	1
33	1	2
90	2	1
90	3	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
40	3	1
81	6	2
81	6	1
81	4	1
81	4	1
43	2	2
43	2	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1
;
proc sort data=have out=have2;by v1 v2;
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;


proc sql;
create table want as
select * from have2 group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*));
quit;
SAS_student_11
Fluorite | Level 6

@Ksharp 

Is it possible to flag the rows that will be removed perhaps by creating a binary column (1= remove 0=keep) so that I could keep track of what is removed before it is actually removed?

Ksharp
Super User

Sure.

 


proc sort data=have out=have2;by v1 v2;
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;


proc sql;
create table want as
select * , (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) as remove
from have2 
group by n 
;
quit;
mkeintz
PROC Star

What does "Criteria 2. That contains 2 or more of the same value in variable 2" mean?  

 

Does this mean

   among records with the same value for var1 (criterion 1),   there must be at least 2 distinct values of var2.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SAS_student_11
Fluorite | Level 6

@mkeintz criteria 2 means that within the list of repeats for variable 1 there needs to be repeating value(s) for variable 2. Said differently, if variable 1 contains any value repeated, for example, 12 times then for criteria 2 there needs to be two or more of any single value (e.g 1,1,1,2,2,3,3,3,3,4,5,5; in this case, 11 rows would pass criteria 1 and criteria 2 because value 4 is not a repeat).

Hopefully, this is more clear. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2462 views
  • 2 likes
  • 3 in conversation