michokwu
Quartz | Level 8

Hello Experts,

 

I am working with a large dataset of around 100 variables and need to check for exact duplicate observations. I used the code below, but it returns 0 observations. However, when I run the code specifying 2 or 3 variables, it returns observations that are exact duplicates. I tried NODUPKEY as well, with the same result.

 

proc sort data=me.in out=me.out dupout=me.dup nodup;
by _all_;
run;
ACCEPTED SOLUTION
michokwu
Quartz | Level 8

I used this and got the desired result. Thanks for the feedback; I really appreciate it.

proc sort data=me.in out=me.out nodup;
  by ID;
run;


15 REPLIES
Shmuel
Garnet | Level 18

The NODUP and NODUPKEY options check for duplicates of the combination of values of the BY variables, not for duplicates of each variable separately. So with fewer variables in the BY statement, the chance of finding a duplicate combination is greater.

 

Usually the point of searching for duplicates is to deal with a set of observations having the same ID, where the ID is either a single variable or a combination of several variables.
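For illustration, a minimal sketch with hypothetical data showing how a shorter BY list flags more rows as duplicate keys:

/* Hypothetical data: three rows, two of them sharing the same ID and Year */
data demo;
  input ID $ Year Value;
cards;
A 2017 10
A 2017 20
A 2018 10
;

/* NODUPKEY by ID alone keeps one row per ID and sends two rows to DUPOUT */
proc sort data=demo out=one_per_id dupout=dups_by_id nodupkey;
  by ID;
run;

/* NODUPKEY by ID and Year keeps one row per ID-Year pair and sends only one row to DUPOUT */
proc sort data=demo out=one_per_pair dupout=dups_by_id_year nodupkey;
  by ID Year;
run;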

 

What is the logic behind looking for duplicate values of every variable in a dataset?

michokwu
Quartz | Level 8
Observations have unique IDs, and since it is a time-series dataset, I use the time period and ID to search.
The same ID can appear multiple times as long as it appears in different time periods. The challenge is that there are observations with the same ID in the same period. When I search by _all_ it returns 0 observations, but when I search by time and ID, it returns these observations. I would like the dupout file to contain all observations with the same values across all variables.
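As a diagnostic, here is a small sketch (assuming the dataset me.in and the variables ID and Year that appear in the code later in this thread) that lists the ID-Year pairs occurring on more than one row, regardless of whether the rows are exact duplicates:

proc sql;
  create table id_year_counts as
  select ID, Year, count(*) as n_rows
  from me.in
  group by ID, Year
  having count(*) > 1;      /* only the repeated ID-Year pairs */
quit;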
Shmuel
Garnet | Level 18

If you reformat your dataset, by transpose or by a data step, into observations made of three variables each (ID, Time, Value), then run the next code to find duplicates:

proc sort data=reformated;
  by ID Time Value;
run;

data dups unique;
  set reformated;
  by ID Time;
  /* an ID-Time group with more than one row goes to DUPS, otherwise to UNIQUE */
  if not (first.Time and last.Time) then output dups;
  else output unique;
run;

Otherwise, please post your dataset's format: is there more than one value (with a different meaning) per ID and Time in the same observation? Do you have a set of variables VAR1-VARn? What are the relations between those variables?
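For reference, a minimal sketch of the data-step reformat described above, assuming the wide numeric variables are named VAR1-VAR100 and the time variable is named Time (both hypothetical names):

/* Turn each wide observation into one long row per variable: ID, Time, VarName, Value */
data reformated;
  set me.in;
  length VarName $32;
  array vars {*} VAR1-VAR100;   /* hypothetical variable list */
  do _i = 1 to dim(vars);
    VarName = vname(vars{_i});  /* name of the original variable */
    Value   = vars{_i};         /* its value on this observation */
    output;
  end;
  keep ID Time VarName Value;
run;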

 

smantha
Lapis Lazuli | Level 10

Sorting and deduplication work best when the variables are categorical variables, dates, character data, and integers. They are expensive and not very efficient for floating-point values, and may not give correct results because of precision issues in how the values are stored. Please check whether that is the case here.
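A minimal sketch (hypothetical data) of the precision issue: two values that print the same at default formats but are stored differently, so NODUP does not treat them as duplicates until they are rounded.

/* Two values that both look like 0.3 but differ in their stored bits */
data demo;
  x = 0.1 + 0.2; output;
  x = 0.3;       output;
run;

/* NODUP keeps both rows because the full-precision values differ */
proc sort data=demo out=deduped nodup;
  by x;
run;

/* Rounding to a sensible precision before sorting makes them compare equal */
data demo_rounded;
  set demo;
  x = round(x, 1e-9);
run;

proc sort data=demo_rounded out=deduped2 nodup;
  by x;
run;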

michokwu
Quartz | Level 8

I tried this on a smaller but otherwise identical dataset with only categorical variables, and the result was the same.

Patrick
Opal | Level 21

@michokwu 

To avoid misunderstandings about what you have and what you need, ideally post some representative but as simple as possible sample data (via a SAS data step creating such data), showing what's not working for you right now and what your desired result should look like.

michokwu
Quartz | Level 8

Please see the sample data attached. The observations in yellow are exact duplicates, but the observations in green are different despite having the same ID; I hope this helps. I would like to keep only one record of the identical observations (yellow) while keeping all of the observations in green.

Code 1 (below) returns 0 observations in the output file.

proc sort data=me.in nodupkey dupout=me.out;
  by _all_;
run;

Code 2 (below) includes a number of the same observations (like the ones in yellow).

proc sort data=me.in nodupkey dupout=me.out;
  by ID Year;
run;

There are too many variables, so I used _all_ instead of listing all of them.

Thank you.
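If listing roughly 100 variables by hand is the obstacle, one sketch (assuming the library ME and member IN from the posted code) is to build the BY list from DICTIONARY.COLUMNS instead of typing it:

proc sql noprint;
  select name into :allvars separated by ' '
  from dictionary.columns
  where libname = 'ME' and memname = 'IN'
  order by varnum;               /* keep the original variable order */
quit;

proc sort data=me.in out=me.out dupout=me.dup nodup;
  by &allvars;
run;

The effect is the same as BY _ALL_, but a %put &allvars.; afterwards shows exactly which variables the sort uses.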

Tom
Super User

@michokwu wrote:

Hello Experts,

 

I am working with a large dataset of around 100 variables and need to check for exact duplicate observations. I used the code below, but it returns 0 observations. However, when I run the code specifying 2 or 3 variables, it returns observations that are exact duplicates. I tried NODUPKEY as well, with the same result.

 

proc sort data=me.in out=me.out dupout=me.dup nodup;
by _all_;
run;

Please explain in more detail what you are trying to do and how your different attempts behave differently.

It sounds like you are trying to create OUT with no duplicate observations. If you ran the same thing without listing ALL of the variables, then it is possible for duplicate observations to be written to OUT. That is because of how NODUP (alias NODUPRECS) works: it compares the observations as it is writing them and will only eliminate duplicate observations that happen to be directly adjacent. When you sort by ALL of the variables, the duplicates are guaranteed to be sorted next to each other so that they can be detected and removed.

Consider this simple Example:

KEY OTHER
1  A
1  B
1  A

If you sort by KEY and OTHER, the identical records end up adjacent, so one is found and removed. But if you sort by just KEY, both of the duplicate records will be written, since each record is compared only with the one written just before it, and neither copy of the duplicate is adjacent to the other.
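A runnable sketch of that example (hypothetical two-variable data):

data demo;
  input KEY OTHER $;
cards;
1 A
1 B
1 A
;

/* Sorting by KEY only (with the default EQUALS option) keeps the original A, B, A
   order within the key, so the two identical records are never adjacent and NODUP keeps both */
proc sort data=demo out=by_key nodup;
  by KEY;
run;

/* Sorting by KEY and OTHER puts the duplicates next to each other, so NODUP removes one */
proc sort data=demo out=by_key_other nodup;
  by KEY OTHER;
run;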

michokwu
Quartz | Level 8
I sorted by _all_ and still had the duplicates.

Thank you,
Tom
Super User

It is much easier for others to help if you just post data as TEXT.  Posting a data step is even better.

data have ;
  infile cards dsd dlm='|' truncover ;
  input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.;
cards;
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|PRIVATE CAR
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|PRIVATE CAR
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
;
data want ;
  infile cards dsd dlm='|' truncover ;
  input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.;
cards;
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|POOLED
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
;

The BY _ALL_ should do what you want.  There are 22 observations in your WANT dataset and 22 unique observations in your HAVE dataset.

 

375   proc sort data=want;
376      by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
377   run;

NOTE: There were 22 observations read from the data set WORK.WANT.
NOTE: The data set WORK.WANT has 22 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


378
379   proc sort data=have nodup out=unique_sort ;
380      by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
381   run;

NOTE: There were 24 observations read from the data set WORK.HAVE.
NOTE: 2 duplicate observations were deleted.
NOTE: The data set WORK.UNIQUE_SORT has 22 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

 

But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error?

Number of Observations in Common: 22.
Total Number of Observations Read from WORK.UNIQUE_SORT: 22.
Total Number of Observations Read from WORK.WANT: 22.

Number of Observations with Some Compared Variables Unequal: 3.

Number of Observations with All Compared Variables Equal: 19.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 8.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 3.
Maximum Difference: 0.


Variables with Unequal Values

Variable    Type  Len  Ndif   MaxDif

MODE        CHAR   20     3

Value Comparison Results for Variables
__________________________________________________________
           ||  Base Value           Compare Value
       Obs ||  MODE                  MODE
 ________  ||  ____________________  ____________________
           ||
       11  ||  RAIL                  PRIVATE CAR
       12  ||  PRIVATE CAR           RAIL
       13  ||  PRIVATE CAR           POOLED
__________________________________________________________
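For reference, the comparison above could have been produced by a call along these lines (the exact PROC COMPARE step was not posted, so this is only a sketch):

/* Compare the deduplicated HAVE (unique_sort) with WANT, observation by observation */
proc compare base=unique_sort compare=want;
run;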
michokwu
Quartz | Level 8
Thank you. The BY _all_ did not delete all the duplicates.
"But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error?" It's on purpose, to show that the person may have had multiple jobs and commuted to them through different modes of transportation.
Tom
Super User

@michokwu wrote:
Thank you. The BY _all_ did not delete all the duplicates.
"But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error?" It's on purpose, to show that the person may have had multiple jobs and commuted to them through different modes of transportation.

That makes no sense. 

How can you output rows that don't exist in the input?

What is the rule you used to convert your input to your output?

Do you want to merge multiple records into one? If so then how is the SAS program supposed to know how to do that? 

michokwu
Quartz | Level 8

Sorry, I don't understand what you mean by output rows that don't exist in the input. The observations in the WANT table are from the HAVE table. Some of the IDs have multiple values.

Thank you,

 

HAVE:
ID|YEAR|SEX|RACE|SCHOOL|EDUC|INCOME|OCCUPATION|MODE OF TRANSPORT
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|PRIVATE CAR
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|PRIVATE CAR
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
00004|2018|1|2|1|6|42000|4|PRIVATE CAR

WANT:
ID|YEAR|SEX|RACE|SCHOOL|EDUC|INCOME|OCCUPATION|MODE OF TRANSPORT
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|PRIVATE CAR
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE

 

Tom
Super User

You have OBSERVATIONS (rows) in your WANT dataset that do not appear in your HAVE dataset.

Here are the 6 observations in question, 3 from each source.

 

439   data check;
440     length indsname source $41 ;
441     set unique_sort want indsname=indsname ;
442     by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
443     source=indsname;
444     if  (first.mode and last.mode);
445   run;

NOTE: There were 22 observations read from the data set WORK.UNIQUE_SORT.
NOTE: There were 22 observations read from the data set WORK.WANT.
NOTE: The data set WORK.CHECK has 6 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds


446
447   proc print;
448   run;

NOTE: There were 6 observations read from the data set WORK.CHECK.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds
Obs   source              ID     YEAR   SEX   RACE   SCHOOL   EDUC   INCOME   OCCUPATION   MODE

 1    WORK.WANT          00006   2017    2      2       1       6     18000        6       PRIVATE CAR
 2    WORK.UNIQUE_SORT   00006   2017    2      2       1       6     18000        6       RAIL
 3    WORK.UNIQUE_SORT   00006   2017    2      2       1       6     38000        4       PRIVATE CAR
 4    WORK.WANT          00006   2017    2      2       1       6     38000        4       RAIL
 5    WORK.WANT          00006   2018    2      2       1       6     20000        6       POOLED
 6    WORK.UNIQUE_SORT   00006   2018    2      2       1       6     20000        6       PRIVATE CAR

 

