Solved: Re: How do I Flag and filter duplicate rows with 3 different variable ...

SAS_student_11 · Posted 06-24-2021 08:29 AM

I have a dataset of 60K rows and I need to remove duplicates that meet certain criteria across multiple variables.

Rows should be flagged and removed only if they meet all three of the following criteria:

Criterion 1: The value in variable 1 is duplicated.

Criterion 2: Among those duplicates, the same value appears 2 or more times in variable 2.

Criterion 3: Variable 3 contains the same value (1) across ALL rows that meet the first and second criteria.

Below is an example of the data I have, the data I want, and the data I want to exclude.

Example data I have:

Variable1	Variable2	Variable3
55	1	1
55	1	1
55	1	1
21	1	1
21	2	2
21	2	1
33	1	2
90	2	1
90	3	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
40	3	1
81	6	2
81	6	1
81	4	1
81	4	1
43	2	2
43	2	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1

Example data of what I want to keep:

Variable1	Variable2	Variable3
21	1	1
21	2	2
21	2	1
33	1	2
40	3	1
43	2	2
43	2	1

Example data that I need to flag then filter out:

Variable1	Variable2	Variable3
55	1	1
55	1	1
55	1	1
90	2	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
81	4	1
81	4	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1

I am new to SAS programming, so please forgive my naivete. I searched the forums and found the code below, but it did not work; I receive syntax errors. Perhaps I am entering the variable names incorrectly. Either way, I am not even sure this code is what I need. I appreciate your help...

PROC SQL;
   CREATE TABLE WORK.Orders1 AS 
   SELECT t1.variableone, 
          t1.variabletwo, 
          t1.variablethree
      FROM WORK.Orders t1
      ORDER BY t1.variableone,
               t1.variabletwo,
               t1.variablethree;
QUIT;

data WORK.Orders2;
set WORK.Orders1;
by FIRST.WORK.Orders1.variableone LAST.WORK.Orders1.variabletwo ;
if not (WORK.orders1.one and WORK.orders1.two) 
   then flag_1=1;      
   else flag_1=0;     
run;

Ksharp · Posted 06-24-2021 09:02 AM

data have;
infile cards expandtabs truncover;
input v1-v3;
cards;
55	1	1
55	1	1
55	1	1
21	1	1
21	2	2
21	2	1
33	1	2
90	2	1
90	3	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
40	3	1
81	6	2
81	6	1
81	4	1
81	4	1
43	2	2
43	2	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1
;

data have;
 set have;
 by v1 notsorted;
 n+first.v1;
run;


proc sql;
create table want as
select * from have group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*));
quit;

SAS_student_11 · Posted 06-24-2021 09:41 AM

@Ksharp This code does not actually fulfill all of the criteria: some values are not excluded but should be. See below:

So I am not sure this would work for my 60K rows.

SAS_student_11 · Posted 06-24-2021 09:46 AM

@Ksharp Below are the full results. There are two extra values that should not be on the list.

Ksharp · Posted 06-25-2021 08:21 AM

I don't understand your question. Can you explain why v1=81 should be removed?

Or try this one:


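/* Run this on the original HAVE. If n already exists from the earlier step, */
/* re-create HAVE first; otherwise the stored n is added to and the ids break. */
data have;
 set have;
 by v1  notsorted;
 n+first.v1;
run;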


proc sql;
create table want as
select * from (
select * from have group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) )
group by n,v2
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) 
;
quit;

SAS_student_11 · Posted 06-25-2021 08:42 AM

@Ksharp

The below rows should be removed because:

1. 81 is repeated in the dataset (criterion 1)

AND

2. Among the rows containing the repeated 81s, the number 4 is also repeated (criterion 2)

AND

3. Among those rows, v3 contains repeated 1s (criterion 3).

Your code also appears to add to the final dataset whereas rows should only be flagged or removed.

Ksharp · Posted 06-25-2021 09:07 AM

There are another two rows with 81:
81 6 2
81 6 1
81 4 1
81 4 1
Shouldn't they be considered together?

SAS_student_11 · Posted 06-25-2021 09:49 AM

@Ksharp the three criteria I need to apply are all-or-nothing criteria.

Two of the 81 rows pair with the two 6s in v2, and the other two 81 rows pair with the two 4s. Both of these groups fulfill criteria 1 and 2. We do not exclude the 81/6 group because no value is repeated in v3 there (it holds 2 and 1). In the 81/4 group, v3 also has repeated 1s, so those rows need to be excluded.

Ksharp · Posted 06-26-2021 06:09 AM

Try this one. Is it what you want?

data have;
infile cards expandtabs truncover;
input v1-v3;
cards;
55	1	1
55	1	1
55	1	1
21	1	1
21	2	2
21	2	1
33	1	2
90	2	1
90	3	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
40	3	1
81	6	2
81	6	1
81	4	1
81	4	1
43	2	2
43	2	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1
;
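/* Sort so that rows with the same v1 and v2 are adjacent */
proc sort data=have out=have2;
 by v1 v2;
run;

/* n goes up by 1 at the first row of each (v1, v2) group, giving a group id */
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;


/* Keep only the groups that do NOT meet all three criteria */
proc sql;
create table want as
select * from have2 group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*));
quit;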
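
Because n is only an id for each (v1, v2) combination, the same grouping can probably be written directly in PROC SQL, without the sort and counter step. The following is a minimal sketch, assuming the have data set above with just v1-v3. It is a simplification of the posted code rather than a different method, and the table name want_direct is made up for illustration. The count(distinct v2) test is dropped because v2 is constant within each group.

```sas
proc sql;
  /* One group per (v1, v2) combination; SAS remerges the group
     statistics with the detail rows, as in the code above */
  create table want_direct as
  select *
  from have
  group by v1, v2
  /* drop groups with more than one row where v3 = 1 in every row */
  having not (count(*) > 1 and sum(v3 = 1) = count(*));
quit;
```

Row order in the result is not guaranteed; add an ORDER BY clause if it matters.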

SAS_student_11 · Posted 06-28-2021 03:03 PM

@Ksharp

Is it possible to flag the rows that will be removed, perhaps by creating a binary column (1=remove, 0=keep), so that I can keep track of what is removed before it is actually removed?

Ksharp · Posted 06-29-2021 07:43 AM

Sure.


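/* Sort so that rows with the same v1 and v2 are adjacent */
proc sort data=have out=have2;
 by v1 v2;
run;

/* n goes up by 1 at the first row of each (v1, v2) group, giving a group id */
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;


/* remove=1 when the group has more than one row and v3=1 in every row, else 0 */
proc sql;
create table want as
select * , (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) as remove
from have2 
group by n 
;
quit;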
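
Once the remove column exists, the flag-then-filter workflow might look like the sketch below. It assumes the want table created above; the output names removed_rows and want_clean are hypothetical.

```sas
/* Check how many rows are flagged before dropping anything */
proc freq data=want;
  tables remove;
run;

/* Flagged rows, set aside as an audit trail */
data removed_rows;
  set want;
  where remove = 1;
run;

/* Final data: unflagged rows only, helper columns dropped */
data want_clean;
  set want;
  where remove = 0;
  drop n remove;
run;
```

For the 29-row sample posted in this thread, this logic should flag 19 rows and leave 10.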

SAS_student_11 · Posted 06-29-2021 08:33 AM

@Ksharp

Thank you so much!

mkeintz · Posted 06-24-2021 10:07 AM

What does "Criteria 2. That contains 2 or more of the same value in variable 2" mean?

Does this mean that, among records with the same value for var1 (criterion 1), there must be at least 2 distinct values of var2?

SAS_student_11 · Posted 06-24-2021 10:28 AM

@mkeintz criteria 2 means that within the list of repeats for variable 1 there need to be repeating value(s) for variable 2. Said differently, if variable 1 contains any value repeated, for example, 12 times, then for criteria 2 there need to be two or more of any single value in variable 2 (e.g., 1,1,1,2,2,3,3,3,3,4,5,5; in this case, 11 rows would pass criteria 1 and criteria 2 because value 4 is not a repeat).

Hopefully, this is more clear.
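
The 12-value example above can be checked with a quick count per value. This is only an illustration: the data set demo and the column names n_rows and repeated are made up, and the values are typed in from the example.

```sas
/* The 12 variable-2 values from the example, all sharing one variable-1 value */
data demo;
  input v2 @@;
  datalines;
1 1 1 2 2 3 3 3 3 4 5 5
;

proc sql;
  /* repeated = 1 when a value occurs two or more times */
  select v2,
         count(*)       as n_rows,
         (count(*) > 1) as repeated
  from demo
  group by v2;
quit;
```

Value 4 is the only one that occurs once, so the other 11 rows pass criteria 1 and 2.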
