Re: Multiple Duplicate Records

NOA · Posted 03-21-2014 11:23 AM

PRIMARY KEY ADDRESS SEX AGE ZIP DISEASE EVENT DATE

AA12345 1 SAMPLE ST, NY M 20 00000 x 03/01/2008

2 SAMPLE ST, FL F 21 12345 Y 04/01/2008

3 SAMPLE ST, CA F 22 22222 Y 05/01/2008

BD6789 M 19 33333 Z 06/01/2008

AA12345 1 SAMPLE ST, NY M 20 00000 x 03/01/2008

3 SAMPLE ST, CA F 22 22222 Y 05/01/2008

I am facing a dilemma with fixing a data set. The raw data I have received is on SQL file that I have no capability of changing (or at least they are not letting me just to see how I can manage things with SAS). My data set looks like this with around 700,000 observations. The data has a lot of duplicates and I want to get rid of them. As you can see…some have primary keys, some do not. Some have addresses, some do not. The only ones that everyone has are sex, age, zip, disease and event date.

I tried sorting the data with the code below but to no avail. Please help me –the SAS newbie.

PROC SORT;

DATA = NEW NODUPRECS;

BY _ALL_;

RUN;

ballardw · Posted 03-21-2014 01:02 PM

I would say if the data doesn't have a primary key then the field is not a primary key but that's not germaine.

I am intersted in why you say the sort didn't work. We no records removed? Only some? Or is there a criteria on determining which are to be removed that you haven't detailed.

Another option might be:

proc sql;

create table new as

select distinct *

from <your data set name goes here>

;

quit;

Tom · Posted 03-22-2014 10:51 AM

The code you posted should remove the duplicates you highlighted with coloring. Normally you should use NODUPKEY option as the NODUPRECS option only eliminates duplicates that are adjacent, but since you are sorting by _ALL_ all the duplicates will be adjacent.

If some that appear to you to be exact duplicates did not get removed then there is some small difference in the values (leading spaces for example) that cause them to be different but look the same when printed.

Another possibility is that SAS delegated the sorting to your foreign database and it is sorting values differently than SAS would. For example it could be that some records have trailing blanks stored and others do not. SAS ignores trailing spaces but most database systems do not.

If want you really want to do is remove LOGICAL duplicates then you will need to be clearer what are the rules for indicating that something is a duplicate.

pradeepalankar · Posted 03-22-2014 03:02 PM

hi,

as written in above post, code should remove the duplicates: Check the syntex of your code:

DATA TEST;

infile cards dlm='|' dsd;

INPUT PRIMARY_KEY :$ ADDRESS :$20. SEX :$ AGE : ZIP : DISEASE : EVENT :$ DATE ddmmyy10.;

cards;

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008

|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008

BD6789||M|19|33333||Z|06/01/2008

AA12345|1SAMPLE ST,NY|M|20|00000||x|03/01/2008

|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008

;

run;

165 PROC SORT DATA=test NODUPRECS;

166 BY _ALL_;

167 RUN;

NOTE: There were 9 observations read from the data set WORK.TEST.

NOTE: 4 duplicate observations were deleted.

NOTE: The data set WORK.TEST has 5 observations and 8 variables.

Multiple Duplicate Records