I have a dataset of 60K rows and I need to remove duplicates that meet certain criteria across multiple variables.
The values flagged and removed have to meet all three of the criteria as follows:
Criteria 1: Duplicates of variable 1
Criteria 2. That contains 2 or more of the same value in variable 2
Criteria 3. That contains the same value (1) across ALL rows (that meets the first and second criteria) in variable 3.
Below is an example of the data I have, the data I want, and the data I want to exclude.
Example data I have:
| Variable1 | Variable2 | Variable3 | 
| 55 | 1 | 1 | 
| 55 | 1 | 1 | 
| 55 | 1 | 1 | 
| 21 | 1 | 1 | 
| 21 | 2 | 2 | 
| 21 | 2 | 1 | 
| 33 | 1 | 2 | 
| 90 | 2 | 1 | 
| 90 | 3 | 1 | 
| 90 | 2 | 1 | 
| 67 | 2 | 1 | 
| 67 | 2 | 1 | 
| 67 | 2 | 1 | 
| 67 | 1 | 1 | 
| 67 | 1 | 1 | 
| 67 | 1 | 1 | 
| 40 | 3 | 1 | 
| 81 | 6 | 2 | 
| 81 | 6 | 1 | 
| 81 | 4 | 1 | 
| 81 | 4 | 1 | 
| 43 | 2 | 2 | 
| 43 | 2 | 1 | 
| 21 | 9 | 1 | 
| 21 | 9 | 1 | 
| 21 | 9 | 1 | 
| 21 | 9 | 1 | 
| 55 | 2 | 1 | 
| 55 | 2 | 1 | 
Example data of what I want to keep:
| Variable1 | Variable2 | Variable3 | 
| 21 | 1 | 1 | 
| 21 | 2 | 2 | 
| 21 | 2 | 1 | 
| 33 | 1 | 2 | 
| 40 | 3 | 1 | 
| 43 | 2 | 2 | 
| 43 | 2 | 1 | 
Example data that I need to flag then filter out:
| Variable1 | Variable2 | Variable3 | 
| 55 | 1 | 1 | 
| 55 | 1 | 1 | 
| 55 | 1 | 1 | 
| 90 | 2 | 1 | 
| 90 | 2 | 1 | 
| 67 | 2 | 1 | 
| 67 | 2 | 1 | 
| 67 | 2 | 1 | 
| 67 | 1 | 1 | 
| 67 | 1 | 1 | 
| 67 | 1 | 1 | 
| 81 | 4 | 1 | 
| 81 | 4 | 1 | 
| 21 | 9 | 1 | 
| 21 | 9 | 1 | 
| 21 | 9 | 1 | 
| 21 | 9 | 1 | 
| 55 | 2 | 1 | 
| 55 | 2 | 1 | 
I am new to SAS programming so please forgive my naivete. I searched the forums and found the code below but it did not work, I receive syntax errors. Perhaps I am putting in the variable names incorrectly. Either way, I am not even sure if this code is what I need. I appreciate your help...
PROC SQL;
   CREATE TABLE WORK.Orders1 AS 
   SELECT t1.variableone, 
          t1.variabletwo, 
          t1.variablethree
      FROM WORK.Orders t1
      ORDER BY t1.variableone,
               t1.variabletwo,
               t1.variablethree;
QUIT;
data WORK.Orders2;
set WORK.Orders1;
by FIRST.WORK.Orders1.variableone LAST.WORK.Orders1.variabletwo ;
if not (WORK.orders1.one and WORK.orders1.two) 
   then flag_1=1;      
   else flag_1=0;     
run;Sure.
proc sort data=have out=have2;by v1 v2;
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;
proc sql;
create table want as
select * , (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) as remove
from have2 
group by n 
;
quit;data have;
infile cards expandtabs truncover;
input v1-v3;
cards;
55	1	1
55	1	1
55	1	1
21	1	1
21	2	2
21	2	1
33	1	2
90	2	1
90	3	1
90	2	1
67	2	1
67	2	1
67	2	1
67	1	1
67	1	1
67	1	1
40	3	1
81	6	2
81	6	1
81	4	1
81	4	1
43	2	2
43	2	1
21	9	1
21	9	1
21	9	1
21	9	1
55	2	1
55	2	1
;
data have;
 set have;
 by v1 notsorted;
 n+first.v1;
run;
proc sql;
create table want as
select * from have group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*));
quit;
@Ksharp This code actually does not fulfill all of the criteria as there are some values that are not excluded but should be see below:
So I am not sure this would work for my 60K rows.
@Ksharp Below are the full results. There are two extra values that should not be on the list.
I don't understand your question. Can you explain why v1=81 should be removed ?
Or try this one :
data have;
 set have;
 by v1  notsorted;
 n+first.v1;
run;
proc sql;
create table want as
select * from (
select * from have group by n 
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) )
group by n,v2
having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) 
;
quit;The below rows should be removed because:
1. 81 is found repeated in the dataset (criteria1)
AND
2. Among the rows containing repeating 81s, the number 4 is also repeated (criteria 2)
AND
3. Among those rows v3 contains repeating 1s.
Your code also appears to add to the final dataset whereas rows should only be flagged or removed.
@Ksharp the three criteria I need to apply are all-or-nothing criteria.
The 2 (81s) match with 2 (6s) and the other 2 (81s) match with the 2 (4s). Both of these groups fulfill criteria 1 and 2. But the reason why we are not excluding the first 2 (81s) that match with 2 (6s) is because there is no repeated number in v3. Whereas the other 2 (81s) that match with the 2 (4s) also have repeated 1s. Therefore, they need to be excluded.
Try this one . if it is what you want ?
data have; infile cards expandtabs truncover; input v1-v3; cards; 55 1 1 55 1 1 55 1 1 21 1 1 21 2 2 21 2 1 33 1 2 90 2 1 90 3 1 90 2 1 67 2 1 67 2 1 67 2 1 67 1 1 67 1 1 67 1 1 40 3 1 81 6 2 81 6 1 81 4 1 81 4 1 43 2 2 43 2 1 21 9 1 21 9 1 21 9 1 21 9 1 55 2 1 55 2 1 ; proc sort data=have out=have2;by v1 v2; data have2; set have2; by v1 v2 ; n+first.v2; run; proc sql; create table want as select * from have2 group by n having not (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)); quit;
Is it possible to flag the rows that will be removed perhaps by creating a binary column (1= remove 0=keep) so that I could keep track of what is removed before it is actually removed?
Sure.
proc sort data=have out=have2;by v1 v2;
data have2;
 set have2;
 by v1 v2 ;
 n+first.v2;
run;
proc sql;
create table want as
select * , (count(*)>1 and count(distinct v2) ne count(*) and sum(v3=1) = count(*)) as remove
from have2 
group by n 
;
quit;Thank you so much!
What does "Criteria 2. That contains 2 or more of the same value in variable 2" mean?
Does this mean
among records with the same value for var1 (criterion 1), there must be at least 2 distinct values of var2.
@mkeintz criteria 2 means that within the list of repeats for variable 1 there needs to be repeating value(s) for variable 2. Said differently, if variable 1 contains any value repeated, for example, 12 times then for criteria 2 there needs to be two or more of any single value (e.g 1,1,1,2,2,3,3,3,3,4,5,5; in this case, 11 rows would pass criteria 1 and criteria 2 because value 4 is not a repeat).
Hopefully, this is more clear.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
