Solved: Re: Identifying and deleting duplicates by multiple variables

emcee825 · Posted 10-30-2013 02:39 PM

Hello,

I'm trying to identify any dups in a dataset with 32200 obs (which is a product of two separate datasets). The only reliable way to identify dups in my data is by three different variables. I've tried using IN= and proc freq to identify the dups. I get a different number of dups for each method and when I scan through the supposed dup datasets, they typically aren't true dups.

I'm aware that I can use nodupskey with proc sort but my dataset also contains 169 variables so I don't trust this to get the job done, as using proc sort still doesn't order everything adjacent to each other (I'm aware that noduprecs only identifies dups if the observations are adjacent to each other after sorting but I wasn't sure if this were the case for nodupskey). I've scanned through the dupout dataset that was produced and most of them seemed to be unique observations.

Also, when I tried to use proc freq to look for dups, I get an error message stating that one of my variables is unreferenced. I'm really hoping that I made a silly mistake somewhere by overlooking something but a fresh pair of eyes would really help right now.

My code is as follows for the methods described:

proc sort data=perm.avss_2; by dob lnam disease1; run;

proc sort data=perm.calredie_2; by dob lnam disease1; run;

data dups_goaway;

merge perm.avss_2 (in=a)

perm.calredie_2 (in=b);

by dob lnam disease1;

if a and b;

run;

/*Concatenating*/

data perm.letssee;

set perm.avss_2 perm.calredie_2;

run;

proc sort data=perm.letssee;

by dob lnam disease1;

run;

proc freq data=perm.letssee noprint;

by DOB;

table lnam disease1 / out = dup_obs (keep = DOB lnam disease1 Count

where = (Count > 1)) ;

run;

Thank you!

MumSquared · Posted 10-31-2013 02:59 PM

OK for that problem then I would:

* concatenate ... ;

data perm.letssee;

set perm.avss_2 perm.calredie_2;

run;

* order - assuming event_date is the date of disease event;

proc sort data=perm.letssee;

by dob lnam disease1 event_date;

run;

* should give you a list of all the records in a dataset that have the same dob inam disease1 event_date values ;

data dups;

set perm.avss_2;

by dob lnam disease1 event_date;

if first.event_date and last.event_date then ; /* do nothing */

else output;

run;

* then you need to make a decision on which ones to keep ;

View solution in original post

Reeza · Posted 10-30-2013 02:53 PM

Can you explain what you mean by the following?

I get a different number of dups for each method and when I scan through the supposed dup datasets, they typically aren't true dups.

Also, try the following proc freq instead, I don't think your proc freq will identify duplicates.

proc freq data=perm.letssee noprint;

table dob* lnam*disease1 / out = dup_obs (keep = DOB lnam disease1 Count

where = (Count > 1)) ;

run;

emcee825 · Posted 10-30-2013 04:41 PM

Thanks for your reply, Reeza. What I meant by that sentence was that through the different methods I've tried, the dataset of dups that it produced had a different number of dups, meaning that each method is yielding a different of dups. Because of this lack of consistency, I'm unsure of which method to use. Also, I tried using Method 2 of what was outlined in this article: http://www2.sas.com/proceedings/sugi31/164-31.pdf. I definitely see that I left out the "*" that is needed to cross-tabulate, thank you for pointing that out. However, when I ran it, I got an error message saying, "The requested table is too large to process." Yikes.

MumSquared · Posted 10-30-2013 03:44 PM

When you merge your two datasets do you get a SAS note about multiple by variables in the log?

This happens where there is more than one record in both of the datasets which matches the variables listed in the by statement used in the merge.

I avoid using nodupkey option on sorting (unless just keeping the key variables).

I would check each of the input datasets for the duplicates first.

proc sort data=perm.avss_2; by dob lnam disease1; run;

* should give you a list of all the records in a dataset that have the same dob inam disease1 values ;

data dups;

set perm.avss_2;

by dob lnam disease1;

if first.disease1 and last.disease1 then ; /* do nothing */

else output;

run;

Although in your problem is it not reasonable for someone to have a disease more than once? Do you have an date the disease identified which could be used to identify different instances of the disease? Then you will need to be clear if you are counting occurrences or people. (I had a study where one person and two events in a 21 day period.)

emcee825 · Posted 10-30-2013 04:42 PM

MumSquared, thank you for your reply as well. I don't get an error message about multiple by variables for the merge shown above. The reason why I'm merging/concatenating and then looking for dups is because I'm unsure if there are redundancies between the two datasets. There was a period of a couple years where there was a transition from one data reporting system to another but no one is sure if people reported the same case into both systems. And you do make an excellent point about someone being susceptible to having a disease more than once. I do have a variable that allows me to look at the date of diagnosis. Before determining whether I have redundant observations of the same case for a specific disease, though, I do want to see how many dups of people I have.

MumSquared · Posted 10-31-2013 02:59 PM

OK for that problem then I would:

* concatenate ... ;

data perm.letssee;

set perm.avss_2 perm.calredie_2;

run;

* order - assuming event_date is the date of disease event;

proc sort data=perm.letssee;

by dob lnam disease1 event_date;

run;

* should give you a list of all the records in a dataset that have the same dob inam disease1 event_date values ;

data dups;

set perm.avss_2;

by dob lnam disease1 event_date;

if first.event_date and last.event_date then ; /* do nothing */

else output;

run;

* then you need to make a decision on which ones to keep ;

emcee825 · Posted 11-01-2013 11:43 AM

Thanks again for your response, MumSquared.

That definitely seemed to do the trick! There were a few more dups than I had been expecting. Now, how would you propose that I remove these dups from the dataset while being able to keep one of each as a unique observation?

[Added]: I've actually figured that part out myself. Thanks so much for the help!

Patrick · Posted 10-31-2013 06:37 PM

I believe to understand that your definition of a duplicate is "all values of all variable identical". If so then you need to sort by all variables to find duplicate rows.

proc sort data=perm.letssee noduprecs dupout=dups;

by _all_;

run;

If using DI Studio then use the validation transformation for this (with all variables selected to find duplicates). SAS(R) Data Integration Studio 4.4: User's Guide

M_Elteibi · Posted 10-31-2013 07:52 PM

Hi emcee825,

If i may suggest a different approach. From my understanding you want to join 2 tables produce a new table that only contains records that are matched on dob lnam disease1.

If that is the case i would suggest simply using PROC SQL for this task.

Such as

PROC SQL;

CREATE TABLE NEW_DISTINCT_TABLE AS

SELECT

*

FROM perm.avss_2 as a

JOIN perm.calredie_2 as b

ON a.dob = b.dob

AND a.lnam = b.lnam

AND a.disease1 = b.disease1

;

QUIT;

OR if you have completed the merge and only want to create a new table with unique rows across all variables you could use PROC SQL with DISTINCT as

PROC SQL;

CREATE TABLE only_distinct as

SELECT

DISTINCT

*

FROM my_combined_table

;

QUIT;

Hope this helps.

Mohammad

emcee825 · Posted 11-01-2013 11:46 AM

Thanks for your response, Mohammad. I think that I'm on the right track for this, but as I'm unfamiliar with proc sql, I'll definitely play around with the code you've provided me to become more familiar with it. Thanks again!

SAS Innovate 2025: Register Now