PRIMARY KEY ADDRESS SEX AGE ZIP DISEASE EVENT DATE
AA12345 1 SAMPLE ST, NY M 20 00000 x 03/01/2008
AA12345 1 SAMPLE ST, NY M 20 00000 x 03/01/2008
AA12345 1 SAMPLE ST, NY M 20 00000 x 03/01/2008
2 SAMPLE ST, FL F 21 12345 Y 04/01/2008
2 SAMPLE ST, FL F 21 12345 Y 04/01/2008
3 SAMPLE ST, CA F 22 22222 Y 05/01/2008
BD6789 M 19 33333 Z 06/01/2008
AA12345 1 SAMPLE ST, NY M 20 00000 x 03/01/2008
3 SAMPLE ST, CA F 22 22222 Y 05/01/2008
I am facing a dilemma with fixing a data set. The raw data I have received is on SQL file that I have no capability of changing (or at least they are not letting me just to see how I can manage things with SAS). My data set looks like this with around 700,000 observations. The data has a lot of duplicates and I want to get rid of them. As you can see…some have primary keys, some do not. Some have addresses, some do not. The only ones that everyone has are sex, age, zip, disease and event date.
I tried sorting the data with the code below but to no avail. Please help me –the SAS newbie.
PROC SORT;
DATA = NEW NODUPRECS;
BY _ALL_;
RUN;
I would say if the data doesn't have a primary key then the field is not a primary key but that's not germaine.
I am intersted in why you say the sort didn't work. We no records removed? Only some? Or is there a criteria on determining which are to be removed that you haven't detailed.
Another option might be:
proc sql;
create table new as
select distinct *
from <your data set name goes here>
;
quit;
The code you posted should remove the duplicates you highlighted with coloring. Normally you should use NODUPKEY option as the NODUPRECS option only eliminates duplicates that are adjacent, but since you are sorting by _ALL_ all the duplicates will be adjacent.
If some that appear to you to be exact duplicates did not get removed then there is some small difference in the values (leading spaces for example) that cause them to be different but look the same when printed.
Another possibility is that SAS delegated the sorting to your foreign database and it is sorting values differently than SAS would. For example it could be that some records have trailing blanks stored and others do not. SAS ignores trailing spaces but most database systems do not.
If want you really want to do is remove LOGICAL duplicates then you will need to be clearer what are the rules for indicating that something is a duplicate.
hi,
as written in above post, code should remove the duplicates: Check the syntex of your code:
DATA TEST;
infile cards dlm='|' dsd;
INPUT PRIMARY_KEY :$ ADDRESS :$20. SEX :$ AGE : ZIP : DISEASE : EVENT :$ DATE ddmmyy10.;
cards;
AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008
AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008
AA12345|1 SAMPLE ST,NY|M|20|00000| |x|03/01/2008
|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008
|2 SAMPLE ST,FL|F|21|12345||Y|04/01/2008
|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008
BD6789||M|19|33333||Z|06/01/2008
AA12345|1SAMPLE ST,NY|M|20|00000||x|03/01/2008
|3 SAMPLE ST,CA|F|22|22222||Y|05/01/2008
;
run;
165 PROC SORT DATA=test NODUPRECS;
166 BY _ALL_;
167 RUN;
NOTE: There were 9 observations read from the data set WORK.TEST.
NOTE: 4 duplicate observations were deleted.
NOTE: The data set WORK.TEST has 5 observations and 8 variables.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.