BookmarkSubscribeRSS Feed
NOA
Calcite | Level 5 NOA
Calcite | Level 5

PRIMARY KEY    ADDRESS                              SEX      AGE        ZIP               DISEASE        EVENT DATE

AA12345              1 SAMPLE ST, NY                   M         20         00000                   x             03/01/2008

AA12345              1 SAMPLE ST, NY                   M         20         00000                   x             03/01/2008

AA12345              1 SAMPLE ST, NY                   M         20         00000                   x             03/01/2008

                           2 SAMPLE ST, FL                    F          21        12345                    Y            04/01/2008

                           2 SAMPLE ST, FL                    F          21        12345                    Y            04/01/2008

                           3 SAMPLE ST, CA                   F          22         22222                   Y            05/01/2008

BD6789                                                               M         19          33333                  Z             06/01/2008

AA12345             1 SAMPLE ST, NY                    M         20         00000                   x             03/01/2008

  3 SAMPLE ST, CA                    F          22        22222                   Y              05/01/2008

I am facing a dilemma with fixing a data set. The raw data I have received is on SQL file that I have no capability of changing (or at least they are not letting me just to see how I can manage things with SAS). My data set looks like this with around 700,000 observations. The data has a lot of duplicates and I want to get rid of them.  As you can see…some have primary keys, some do not. Some have addresses, some do not. The only ones that everyone has are sex, age, zip, disease and event date.

I tried sorting the data with the code below but to no avail. Please help me –the SAS newbie.

PROC SORT;

DATA = NEW NODUPRECS;

BY _ALL_;

RUN;

3 REPLIES 3
ballardw
Super User

I would say if the data doesn't have a primary key then the field is not a primary key but that's not germaine.

I am intersted in why you say the sort didn't work. We no records removed? Only some? Or is there a criteria on determining which are to be removed that you haven't detailed.

Another option might be:

proc sql;

     create table new as

     select distinct *

     from <your data set name goes here>

     ;

quit;

Tom
Super User Tom
Super User

The code you posted should remove the duplicates you highlighted with coloring.  Normally you should use NODUPKEY option as the NODUPRECS option only eliminates duplicates that are adjacent, but since you are sorting by _ALL_ all the duplicates will be adjacent.

If some that appear to you to be exact duplicates did not get removed then there is some small difference in the values (leading spaces for example) that cause them to be different but look the same when printed.

Another possibility is that SAS delegated the sorting to your foreign database and it is sorting values differently than SAS would.  For example it could be that some records have trailing blanks stored and others do not. SAS ignores trailing spaces but most database systems do not.

If want you really want to do is remove LOGICAL duplicates then you will need to be clearer what are the rules for indicating that something is a duplicate.

pradeepalankar
Obsidian | Level 7

hi,

as written in above post, code should remove the duplicates: Check the syntex of your code:

DATA TEST;

infile cards dlm='|' dsd;

INPUT PRIMARY_KEY :$ ADDRESS :$20. SEX :$ AGE : ZIP : DISEASE : EVENT :$ DATE ddmmyy10.;

cards;

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008

|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008

|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008

|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008

BD6789||M|19|33333||Z|06/01/2008

AA12345|1SAMPLE ST,NY|M|20|00000||x|03/01/2008

|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008

;

run;

165  PROC SORT DATA=test NODUPRECS;

166  BY _ALL_;

167  RUN;

NOTE: There were 9 observations read from the data set WORK.TEST.

NOTE: 4 duplicate observations were deleted.

NOTE: The data set WORK.TEST has 5 observations and 8 variables.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 565 views
  • 1 like
  • 4 in conversation