Re: MORE DUPLICATES!

deleted_user · Posted 05-10-2009 09:25 PM

Hi, just when I thought I was ready to merge I found more duplicates within my dataset. I have learned how to pull out duplicates based on one variable, but how can I pull out duplicate values where many conditions are true? For example I want to know when there are duplicate occurences of last name, first name, date of birth, and idnumber (all together). I don't know or care right now whether the other variables surrounding these four are the same or different; I just want to know if those four are equal in more than one observation, and if they are, to create a separate dataset so that I can look at them.

Any ideas?

Thanks 🙂

FredrikE · Posted 05-11-2009 03:06 AM

Try this!

proc sort data = x out = tmp dupout = dup nodupkey;
by last_name first_name date_of_birth idnumber;
run;

Dataset dup will contain your duplicates and tmp the rest...
//Fredrik

deleted_user · Posted 05-11-2009 04:17 PM

Thanks for the help! I may not be doing things right, but when I use the code you provided, I get only one of the duplicates from each 'pair' of duplicates in dataset dup. Is that correct?
What I'd like to see is all the duplicate occurences together so I can see what went wrong.

Is this possible?

For example, if I have John Doe 1/1/01 #333 and John Doe 1/1/01 #333 I'd like to see those two together (along with the rest of the variables surrounding them) to see if it is a duplicate entry or not.

Suggestions?

deleted_user · Posted 05-11-2009 04:36 AM

Any suggestions & issues...
plz feel free to contact Message was edited by: giri2help

deleted_user · Posted 05-11-2009 04:40 AM

Let's consider Table name :MAST_CS;
contains field1,field2,field3,field4.. as fields say...last_name , first_name, date_of_birth, idnumber
1.using same table mapped to conditions as "NOT EQUAL " field values for Duplicates removal
2.using same table mapped to conditions as " EQUAL " field values for displaying only Duplicates
1.OUT1 dataset contains elimination of duplicate entries
proc sql;
create table mynodups as
SELECT distinct Table1.field1,Table1.field2,Table1.field3,Table1.field4
FROM MAST_CS Table1,MAST_CS Table2
WHERE Table1.field1 NOT EQUAL Table2.field1
AND Table1.field2 NOT EQUAL Table2.field2
AND Table1.field3 NOT EQUAL Table2.field3
AND Table1.field4 NOT EQUAL Table2.field 4;

quit;

Data out1;
set work.mynodups;
run;

2.OUT2 dataset contains only duplicate entries

Displaying Duplicates alone:

create table mydupsonly as
SELECT distinct Table1.field1,Table1.field2,Table1.field3,Table1.field4
FROM MAST_CS Table1,MAST_CS Table2

WHERE Table1.field1 == Table2.field1
AND Table1.field2 == Table2.field2
AND Table1.field3 == Table2.field3
AND Table1.field4 == Table2.field 4;

quit;

Data out2;
set work.mydupsonly;
run;

deleted_user · Posted 05-11-2009 04:01 PM

Thanks for the suggestion! I get the same exact table, however, either way with the same number of observations in each? I.e. I get the "no dups" result with both tables.

Also...if I do it this way and it works, will it allow me to see the duplicate pairs? Or just one member of a duplicate pair? Thanks so much!!

Emily

deleted_user · Posted 05-12-2009 02:08 AM

Reg:- Duplicate Pairs
Distinct will produce without duplicates ...so u r getting same in both tables..
check without using distinct keyword... for duplicates...

If u want to see only duplicates with pairs..remove distinct keyword & u can see the variation of results in select query...Hope this is what u expect i think ..

Any clarifications let me know plz...
giridhar

deleted_user · Posted 05-12-2009 03:53 PM

I can't seem to get that to work. Thanks so much for your help, though. I'm trying a different approach, and I think this is a very simple thing to do: how can I pull out 35 observations that meet a certain criteria?

For example, I located the duplicate numbers. I have the list. So I want to create a dataset that includes observations containing variables of certain values only. I want to pull the following IDs:

MN-1233
MN-14354
MN-1515
MN-3535

I can't seem to get the "where" or "if" statements to work with this.....
Help??? This one should be easy....I'm just such a novice and getting frustrated.
Thanks 🙂

Peter_C · Posted 05-13-2009 07:28 AM

this provides one way to select cases from a table.
DATA SELECTED ;
set original ;
where ID in( 'MN-1233', 'MN-14354', 'MN-1515', 'MN-3535' );
run;

This collects duplicates from a table in the key order;
DATA DUPLICATES ;
set original ;
by ID ;
if first.id and last.id then delete ; * removing singletons ;
run ;

hope it helps

MORE DUPLICATES!