Hello Experts,
I am working with a large dataset of around 100 variables and need to check for exact duplicate observations. I used the code below, but it returns 0 observations, however, when I run the code specifying 2 or 3 variables it returns observations that are exact duplicates. I tried nodupkey as well, same result.
proc sort data=me.in out=me.out dupout=me.dup nodup;
by _all_;
run;
I used this and got the desired result. Thanks for the feedback, I really appreciate it
proc sort data=me.in
out=me.out
nodup;
by ID;
run;
The NODUP or NODUPKEY option check for same combination of variables' list value and
not duplicate per a variable. So having less variables in the BY statement the possibility to find duplicate combination is greater.
Usually the use of searching duplicates is to deal with a set of observations having same ID,
either ID is one variable or a combination some variables.
What ls the logic to look for duplicate values per each variable in a dataset ?
If you reformat your data-set by transpose or by a data step into an observation made of
3 variables each: ID, Time, Value - then run next code to find dupliactes:
proc sort data=reformated ; by ID Time Value; run;
data dups unique;
set reformated;
by ID Time;
if not (first.Time and last.Time) then output dups;
else output unique;
run;
otherwise, please post your data-set format - are there more then one value (different meaning) per ID Time in the same observation? Do have a set of variables VAR1-VARn ? what is the relations between those variables ?
Sort and deduping works best when there are categorical variables, dates, character data and integers. It is very expensive and not very efficient for floats and not give correct results due to precision around storing values. Please check if that is the case.
I tried this on a smaller but identical dataset with only categorical variables and the result was the same.
To avoid misunderstandings what you have and what you need ideally post some representative but as simple as possible sample data (via a SAS data step creating such data) where you can show us what's not working for you right now and how your desired result should look like.
Please see the sample data attached. The observations in yellow are exact duplicates but observations in green are different despite having the same ID, I hope this helps. I would like to maintain only one record of the same observations (yellow) while keeping all observations in green.
Code 1(below) returns 0 observations in the output file
proc sort data=me.in nodupkey dupout=me.out;
by _all_;
run;
Code 2(below), includes the a number of the same observations (like ones in yellow)
proc sort data=me.in nodupkey dupout=me.out;
by ID Year;
run;
The variables are much so I used _all_ instead of listing all of them.
Thank you.
@michokwu wrote:
Hello Experts,
I am working with a large dataset of around 100 variables and need to check for exact duplicate observations. I used the code below, but it returns 0 observations, however, when I run the code specifying 2 or 3 variables it returns observations that are exact duplicates. I tried nodupkey as well, same result.
proc sort data=me.in out=me.out dupout=me.dup nodup; by _all_; run;
Please explain in more detail what you are trying to do and how your different attempts behave differently.
It sounds like you are trying to create OUT with no duplicate observations. If you ran the same thing without listing ALL of the variables then it is possible for duplicate observations to written to OUT. That is because of how the NODUP (aka NODUPREC) works. It compares the observations as it is writing them and will only eliminate a duplicate observations that just happen to be directly adjacent. When you sort by ALL of the variables then it will insure that the duplicates are sorted next to each other so that they can be detected and removed.
Consider this simple Example:
KEY OTHER
1 A
1 B
1 A
If you sort by KEY OTHER then the identical records will be found and removed. But if you sort by just KEY then the two duplicate records will both be written since neither is the same as the one in the middle.
It is much easier for others to help if you just post data as TEXT. Posting a data step is even better.
data have ;
infile cards dsd dlm='|' truncover ;
input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.;
cards;
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|PRIVATE CAR
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|PRIVATE CAR
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
;
data want ;
infile cards dsd dlm='|' truncover ;
input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.;
cards;
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|POOLED
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
;
The BY _ALL_ should do what you want. There are 22 observations in your WANT dataset and 22 unique observations in your HAVE dataset.
375 proc sort data=want;
376 by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
377 run;
NOTE: There were 22 observations read from the data set WORK.WANT.
NOTE: The data set WORK.WANT has 22 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
378
379 proc sort data=have nodup out=unique_sort ;
380 by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
381 run;
NOTE: There were 24 observations read from the data set WORK.HAVE.
NOTE: 2 duplicate observations were deleted.
NOTE: The data set WORK.UNIQUE_SORT has 22 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error?
Number of Observations in Common: 22.
Total Number of Observations Read from WORK.UNIQUE_SORT: 22.
Total Number of Observations Read from WORK.WANT: 22.
Number of Observations with Some Compared Variables Unequal: 3.
Number of Observations with All Compared Variables Equal: 19.
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 8.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 3.
Maximum Difference: 0.
Variables with Unequal Values
Variable Type Len Ndif MaxDif
MODE CHAR 20 3
Value Comparison Results for Variables
__________________________________________________________
|| Base Value Compare Value
Obs || MODE MODE
________ || ____________________ ____________________
||
11 || RAIL PRIVATE CAR
12 || PRIVATE CAR RAIL
13 || PRIVATE CAR POOLED
__________________________________________________________
@michokwu wrote:
Thank you. The by _all_did not delete all the duplicates.
But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error? It’s on purpose to show the person may have had multiple jobs and commuted to them through different modes of transportation
That makes no sense.
How can you output rows that don't exist in the input?
What is the rule you used to convert your input to your output?
Do you want to merge multiple records into one? If so then how is the SAS program supposed to know how to do that?
Sorry, I don't understand what you mean by output rows that don't exist in the input. The observations in the WANT table are from the HAVE table. Some of the observations (ID) have multiple values
Thank you,
HAVE | WANT | ||||||||||||||||||
ID | YEAR | SEX | RACE | SCHOOL | EDUC | INCOME | OCCUPATION | MODE OF TRANSPORT | ID | YEAR | SEX | RACE | SCHOOL | EDUC | INCOME | OCCUPATION | MODE OF TRANSPORT | ||
00001 | 2017 | 1 | 1 | 2 | 7 | 20000 | 1 | PRIVATE CAR | 00001 | 2017 | 1 | 1 | 2 | 7 | 20000 | 1 | PRIVATE CAR | ||
00002 | 2017 | 1 | 4 | 1 | 6 | 25000 | 2 | PRIVATE CAR | 00002 | 2017 | 1 | 4 | 1 | 6 | 25000 | 2 | PRIVATE CAR | ||
00003 | 2017 | 2 | 4 | 2 | 6 | 30000 | 3 | RAIL | 00003 | 2017 | 2 | 4 | 2 | 6 | 30000 | 3 | RAIL | ||
00004 | 2017 | 1 | 2 | 1 | 6 | 40000 | 4 | PRIVATE CAR | 00004 | 2017 | 1 | 2 | 1 | 6 | 40000 | 4 | PRIVATE CAR | ||
00005 | 2017 | 2 | 3 | 1 | 6 | 45000 | 5 | POOLED | 00005 | 2017 | 2 | 3 | 1 | 6 | 45000 | 5 | POOLED | ||
00006 | 2017 | 2 | 2 | 1 | 6 | 18000 | 6 | RAIL | 00006 | 2017 | 2 | 2 | 1 | 6 | 18000 | 6 | PRIVATE CAR | ||
00007 | 2017 | 2 | 3 | 1 | 7 | 22000 | 1 | BUS | 00006 | 2017 | 2 | 2 | 1 | 6 | 38000 | 4 | RAIL | ||
00008 | 2017 | 1 | 1 | 1 | 6 | 26000 | 2 | BICYCLE | 00007 | 2017 | 2 | 3 | 1 | 7 | 22000 | 1 | BUS | ||
00009 | 2017 | 2 | 4 | 1 | 8 | 30000 | 3 | POOLED | 00008 | 2017 | 1 | 1 | 1 | 6 | 26000 | 2 | BICYCLE | ||
00010 | 2017 | 2 | 1 | 1 | 6 | 34000 | 4 | WALK | 00009 | 2017 | 2 | 4 | 1 | 8 | 30000 | 3 | POOLED | ||
00001 | 2017 | 1 | 1 | 2 | 7 | 20000 | 1 | PRIVATE CAR | 00010 | 2017 | 2 | 1 | 1 | 6 | 34000 | 4 | WALK | ||
00006 | 2017 | 2 | 2 | 1 | 6 | 38000 | 4 | PRIVATE CAR | 00001 | 2018 | 1 | 1 | 2 | 7 | 25000 | 1 | PRIVATE CAR | ||
00001 | 2018 | 1 | 1 | 2 | 7 | 25000 | 1 | PRIVATE CAR | 00002 | 2018 | 1 | 4 | 1 | 6 | 27000 | 2 | PRIVATE CAR | ||
00002 | 2018 | 1 | 4 | 1 | 6 | 27000 | 2 | PRIVATE CAR | 00003 | 2018 | 2 | 4 | 2 | 6 | 32000 | 3 | RAIL | ||
00003 | 2018 | 2 | 4 | 2 | 6 | 32000 | 3 | RAIL | 00004 | 2018 | 1 | 2 | 1 | 6 | 42000 | 4 | PRIVATE CAR | ||
00004 | 2018 | 1 | 2 | 1 | 6 | 42000 | 4 | PRIVATE CAR | 00005 | 2018 | 2 | 3 | 1 | 6 | 47000 | 5 | POOLED | ||
00005 | 2018 | 2 | 3 | 1 | 6 | 47000 | 5 | POOLED | 00006 | 2018 | 2 | 2 | 1 | 6 | 20000 | 6 | PRIVATE CAR | ||
00006 | 2018 | 2 | 2 | 1 | 6 | 20000 | 6 | PRIVATE CAR | 00007 | 2018 | 2 | 3 | 1 | 7 | 24000 | 1 | BUS | ||
00007 | 2018 | 2 | 3 | 1 | 7 | 24000 | 1 | BUS | 00008 | 2018 | 1 | 1 | 1 | 6 | 28000 | 2 | BICYCLE | ||
00008 | 2018 | 1 | 1 | 1 | 6 | 28000 | 2 | BICYCLE | 00009 | 2018 | 2 | 4 | 1 | 8 | 32000 | 3 | POOLED | ||
00009 | 2018 | 2 | 4 | 1 | 8 | 32000 | 3 | POOLED | 00010 | 2018 | 2 | 1 | 1 | 6 | 36000 | 4 | WALK | ||
00010 | 2018 | 2 | 1 | 1 | 6 | 36000 | 4 | WALK | 00010 | 2018 | 2 | 1 | 1 | 6 | 27000 | 2 | BICYCLE | ||
00010 | 2018 | 2 | 1 | 1 | 6 | 27000 | 2 | BICYCLE | |||||||||||
00004 | 2018 | 1 | 2 | 1 | 6 | 42000 | 4 | PRIVATE CAR |
You have OBSERVATIONS (rows) in your WANT dataset that do not appear in your HAVE dataset.
Here are the 6 observations in question, 3 from each source.
439 data check; 440 length indsname source $41 ; 441 set unique_sort want indsname=indsname ; 442 by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ; 443 source=indsname; 444 if (first.mode and last.mode); 445 run; NOTE: There were 22 observations read from the data set WORK.UNIQUE_SORT. NOTE: There were 22 observations read from the data set WORK.WANT. NOTE: The data set WORK.CHECK has 6 observations and 10 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.03 seconds 446 447 proc print; 448 run; NOTE: There were 6 observations read from the data set WORK.CHECK. NOTE: PROCEDURE PRINT used (Total process time): real time 0.07 seconds cpu time 0.00 seconds
Obs source ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE 1 WORK.WANT 00006 2017 2 2 1 6 18000 6 PRIVATE CAR 2 WORK.UNIQUE_SORT 00006 2017 2 2 1 6 18000 6 RAIL 3 WORK.UNIQUE_SORT 00006 2017 2 2 1 6 38000 4 PRIVATE CAR 4 WORK.WANT 00006 2017 2 2 1 6 38000 4 RAIL 5 WORK.WANT 00006 2018 2 2 1 6 20000 6 POOLED 6 WORK.UNIQUE_SORT 00006 2018 2 2 1 6 20000 6 PRIVATE CAR
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.