Multiple Duplicate Records

Reply
Occasional Contributor NOA
Occasional Contributor
Posts: 8

Multiple Duplicate Records

PRIMARY KEY    ADDRESS                              SEX      AGE        ZIP               DISEASE        EVENT DATE

AA12345              1 SAMPLE ST, NY                   M         20         00000                   x             03/01/2008

AA12345              1 SAMPLE ST, NY                   M         20         00000                   x             03/01/2008

AA12345              1 SAMPLE ST, NY                   M         20         00000                   x             03/01/2008

                           2 SAMPLE ST, FL                    F          21        12345                    Y            04/01/2008

                           2 SAMPLE ST, FL                    F          21        12345                    Y            04/01/2008

                           3 SAMPLE ST, CA                   F          22         22222                   Y            05/01/2008

BD6789                                                               M         19          33333                  Z             06/01/2008

AA12345             1 SAMPLE ST, NY                    M         20         00000                   x             03/01/2008

  3 SAMPLE ST, CA                    F          22        22222                   Y              05/01/2008

I am facing a dilemma with fixing a data set. The raw data I have received is on SQL file that I have no capability of changing (or at least they are not letting me just to see how I can manage things with SAS). My data set looks like this with around 700,000 observations. The data has a lot of duplicates and I want to get rid of them.  As you can see…some have primary keys, some do not. Some have addresses, some do not. The only ones that everyone has are sex, age, zip, disease and event date.

I tried sorting the data with the code below but to no avail. Please help me –the SAS newbie.

PROC SORT;

DATA = NEW NODUPRECS;

BY _ALL_;

RUN;

Super User
Posts: 10,500

Re: Multiple Duplicate Records

I would say if the data doesn't have a primary key then the field is not a primary key but that's not germaine.

I am intersted in why you say the sort didn't work. We no records removed? Only some? Or is there a criteria on determining which are to be removed that you haven't detailed.

Another option might be:

proc sql;

     create table new as

     select distinct *

     from <your data set name goes here>

     ;

quit;

Super User
Super User
Posts: 6,500

Re: Multiple Duplicate Records

The code you posted should remove the duplicates you highlighted with coloring.  Normally you should use NODUPKEY option as the NODUPRECS option only eliminates duplicates that are adjacent, but since you are sorting by _ALL_ all the duplicates will be adjacent.

If some that appear to you to be exact duplicates did not get removed then there is some small difference in the values (leading spaces for example) that cause them to be different but look the same when printed.

Another possibility is that SAS delegated the sorting to your foreign database and it is sorting values differently than SAS would.  For example it could be that some records have trailing blanks stored and others do not. SAS ignores trailing spaces but most database systems do not.

If want you really want to do is remove LOGICAL duplicates then you will need to be clearer what are the rules for indicating that something is a duplicate.

Frequent Contributor
Posts: 106

Re: Multiple Duplicate Records

hi,

as written in above post, code should remove the duplicates: Check the syntex of your code:

DATA TEST;

infile cards dlm='|' dsd;

INPUT PRIMARY_KEY :$ ADDRESS :$20. SEX :$ AGE : ZIP : DISEASE : EVENT :$ DATE ddmmyy10.;

cards;

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008

|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008

|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008

BD6789||M|19|33333||Z|06/01/2008

AA12345|1SAMPLE ST,NY|M|20|00000||x|03/01/2008

|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008

;

run;

165  PROC SORT DATA=test NODUPRECS;

166  BY _ALL_;

167  RUN;

NOTE: There were 9 observations read from the data set WORK.TEST.

NOTE: 4 duplicate observations were deleted.

NOTE: The data set WORK.TEST has 5 observations and 8 variables.

Ask a Question
Discussion stats
  • 3 replies
  • 228 views
  • 1 like
  • 4 in conversation