Solved: Re: Store Duplicates using multiple/all variables

michokwu · Posted 07-23-2020 07:55 PM

Hello Experts,

I am working with a large dataset of around 100 variables and need to check for exact duplicate observations. I used the code below, but it returns 0 observations, however, when I run the code specifying 2 or 3 variables it returns observations that are exact duplicates. I tried nodupkey as well, same result.

proc sort data=me.in out=me.out dupout=me.dup nodup;
by _all_;
run;

michokwu · Posted 08-03-2020 12:32 PM

I used this and got the desired result. Thanks for the feedback, I really appreciate it

proc sort data=me.in
out=me.out
nodup; 
by ID;
run;

View solution in original post

Shmuel · Posted 07-23-2020 08:23 PM

The NODUP or NODUPKEY option check for same combination of variables' list value and

not duplicate per a variable. So having less variables in the BY statement the possibility to find duplicate combination is greater.

Usually the use of searching duplicates is to deal with a set of observations having same ID,

either ID is one variable or a combination some variables.

What ls the logic to look for duplicate values per each variable in a dataset ?

michokwu · Posted 07-24-2020 02:14 PM

Observations have unique IDs and since it is a time-series data set, I use time period and ID to search.

The same ID can appear multiple times as long as it appears at different time periods. The challenge is that there are observations with same ID in the same period. When I search by _all_ it returns 0 observations but when I search by time and ID, it returns these observations. I would like the dupout file to contain all observations with the same values across all variables.

Shmuel · Posted 07-24-2020 03:08 PM

If you reformat your data-set by transpose or by a data step into an observation made of

3 variables each: ID, Time, Value - then run next code to find dupliactes:

proc sort data=reformated ; by ID Time Value; run;
data dups unique;
 set reformated;
  by ID Time;
      if not (first.Time and last.Time) then output dups;
     else output unique;
run;

otherwise, please post your data-set format - are there more then one value (different meaning) per ID Time in the same observation? Do have a set of variables VAR1-VARn ? what is the relations between those variables ?

smantha · Posted 07-23-2020 11:16 PM

Sort and deduping works best when there are categorical variables, dates, character data and integers. It is very expensive and not very efficient for floats and not give correct results due to precision around storing values. Please check if that is the case.

michokwu · Posted 07-24-2020 02:23 PM

I tried this on a smaller but identical dataset with only categorical variables and the result was the same.

Patrick · Posted 07-25-2020 12:38 AM

@michokwu

To avoid misunderstandings what you have and what you need ideally post some representative but as simple as possible sample data (via a SAS data step creating such data) where you can show us what's not working for you right now and how your desired result should look like.

michokwu · Posted 07-27-2020 12:24 PM

Please see the sample data attached. The observations in yellow are exact duplicates but observations in green are different despite having the same ID, I hope this helps. I would like to maintain only one record of the same observations (yellow) while keeping all observations in green.

Code 1(below) returns 0 observations in the output file

proc sort data=me.in nodupkey dupout=me.out;

by _all_;

run;

Code 2(below), includes the a number of the same observations (like ones in yellow)

proc sort data=me.in nodupkey dupout=me.out;

by ID Year;

run;

The variables are much so I used _all_ instead of listing all of them.

Thank you.

Tom · Posted 07-25-2020 12:58 AM

@michokwu wrote:

Hello Experts,

I am working with a large dataset of around 100 variables and need to check for exact duplicate observations. I used the code below, but it returns 0 observations, however, when I run the code specifying 2 or 3 variables it returns observations that are exact duplicates. I tried nodupkey as well, same result.
proc sort data=me.in out=me.out dupout=me.dup nodup;
by _all_;
run;

Please explain in more detail what you are trying to do and how your different attempts behave differently.

It sounds like you are trying to create OUT with no duplicate observations. If you ran the same thing without listing ALL of the variables then it is possible for duplicate observations to written to OUT. That is because of how the NODUP (aka NODUPREC) works. It compares the observations as it is writing them and will only eliminate a duplicate observations that just happen to be directly adjacent. When you sort by ALL of the variables then it will insure that the duplicates are sorted next to each other so that they can be detected and removed.

Consider this simple Example:

KEY OTHER
1  A
1  B
1  A

If you sort by KEY OTHER then the identical records will be found and removed. But if you sort by just KEY then the two duplicate records will both be written since neither is the same as the one in the middle.

michokwu · Posted 07-27-2020 12:28 PM

I sort by _all_ and still had the duplicates.

Thank you,

Tom · Posted 07-27-2020 12:58 PM

It is much easier for others to help if you just post data as TEXT. Posting a data step is even better.

Spoiler

data have ;
  infile cards dsd dlm='|' truncover ;
  input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.;
cards;
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|PRIVATE CAR
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|PRIVATE CAR
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
;
data want ;
  infile cards dsd dlm='|' truncover ;
  input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.;
cards;
00001|2017|1|1|2|7|20000|1|PRIVATE CAR
00002|2017|1|4|1|6|25000|2|PRIVATE CAR
00003|2017|2|4|2|6|30000|3|RAIL
00004|2017|1|2|1|6|40000|4|PRIVATE CAR
00005|2017|2|3|1|6|45000|5|POOLED
00006|2017|2|2|1|6|18000|6|PRIVATE CAR
00006|2017|2|2|1|6|38000|4|RAIL
00007|2017|2|3|1|7|22000|1|BUS
00008|2017|1|1|1|6|26000|2|BICYCLE
00009|2017|2|4|1|8|30000|3|POOLED
00010|2017|2|1|1|6|34000|4|WALK
00001|2018|1|1|2|7|25000|1|PRIVATE CAR
00002|2018|1|4|1|6|27000|2|PRIVATE CAR
00003|2018|2|4|2|6|32000|3|RAIL
00004|2018|1|2|1|6|42000|4|PRIVATE CAR
00005|2018|2|3|1|6|47000|5|POOLED
00006|2018|2|2|1|6|20000|6|POOLED
00007|2018|2|3|1|7|24000|1|BUS
00008|2018|1|1|1|6|28000|2|BICYCLE
00009|2018|2|4|1|8|32000|3|POOLED
00010|2018|2|1|1|6|36000|4|WALK
00010|2018|2|1|1|6|27000|2|BICYCLE
;

data have ; infile cards dsd dlm='|' truncover ; input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.; cards; 00001|2017|1|1|2|7|20000|1|PRIVATE CAR 00002|2017|1|4|1|6|25000|2|PRIVATE CAR 00003|2017|2|4|2|6|30000|3|RAIL 00004|2017|1|2|1|6|40000|4|PRIVATE CAR 00005|2017|2|3|1|6|45000|5|POOLED 00006|2017|2|2|1|6|18000|6|RAIL 00007|2017|2|3|1|7|22000|1|BUS 00008|2017|1|1|1|6|26000|2|BICYCLE 00009|2017|2|4|1|8|30000|3|POOLED 00010|2017|2|1|1|6|34000|4|WALK 00001|2017|1|1|2|7|20000|1|PRIVATE CAR 00006|2017|2|2|1|6|38000|4|PRIVATE CAR 00001|2018|1|1|2|7|25000|1|PRIVATE CAR 00002|2018|1|4|1|6|27000|2|PRIVATE CAR 00003|2018|2|4|2|6|32000|3|RAIL 00004|2018|1|2|1|6|42000|4|PRIVATE CAR 00005|2018|2|3|1|6|47000|5|POOLED 00006|2018|2|2|1|6|20000|6|PRIVATE CAR 00007|2018|2|3|1|7|24000|1|BUS 00008|2018|1|1|1|6|28000|2|BICYCLE 00009|2018|2|4|1|8|32000|3|POOLED 00010|2018|2|1|1|6|36000|4|WALK 00010|2018|2|1|1|6|27000|2|BICYCLE 00004|2018|1|2|1|6|42000|4|PRIVATE CAR ; data want ; infile cards dsd dlm='|' truncover ; input ID $ YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE :$20.; cards; 00001|2017|1|1|2|7|20000|1|PRIVATE CAR 00002|2017|1|4|1|6|25000|2|PRIVATE CAR 00003|2017|2|4|2|6|30000|3|RAIL 00004|2017|1|2|1|6|40000|4|PRIVATE CAR 00005|2017|2|3|1|6|45000|5|POOLED 00006|2017|2|2|1|6|18000|6|PRIVATE CAR 00006|2017|2|2|1|6|38000|4|RAIL 00007|2017|2|3|1|7|22000|1|BUS 00008|2017|1|1|1|6|26000|2|BICYCLE 00009|2017|2|4|1|8|30000|3|POOLED 00010|2017|2|1|1|6|34000|4|WALK 00001|2018|1|1|2|7|25000|1|PRIVATE CAR 00002|2018|1|4|1|6|27000|2|PRIVATE CAR 00003|2018|2|4|2|6|32000|3|RAIL 00004|2018|1|2|1|6|42000|4|PRIVATE CAR 00005|2018|2|3|1|6|47000|5|POOLED 00006|2018|2|2|1|6|20000|6|POOLED 00007|2018|2|3|1|7|24000|1|BUS 00008|2018|1|1|1|6|28000|2|BICYCLE 00009|2018|2|4|1|8|32000|3|POOLED 00010|2018|2|1|1|6|36000|4|WALK 00010|2018|2|1|1|6|27000|2|BICYCLE ;

The BY _ALL_ should do what you want. There are 22 observations in your WANT dataset and 22 unique observations in your HAVE dataset.

375   proc sort data=want;
376      by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
377   run;

NOTE: There were 22 observations read from the data set WORK.WANT.
NOTE: The data set WORK.WANT has 22 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


378
379   proc sort data=have nodup out=unique_sort ;
380      by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
381   run;

NOTE: There were 24 observations read from the data set WORK.HAVE.
NOTE: 2 duplicate observations were deleted.
NOTE: The data set WORK.UNIQUE_SORT has 22 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error?

Number of Observations in Common: 22.
Total Number of Observations Read from WORK.UNIQUE_SORT: 22.
Total Number of Observations Read from WORK.WANT: 22.

Number of Observations with Some Compared Variables Unequal: 3.

Number of Observations with All Compared Variables Equal: 19.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 8.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 3.
Maximum Difference: 0.


Variables with Unequal Values

Variable    Type  Len  Ndif   MaxDif

MODE        CHAR   20     3

Value Comparison Results for Variables
__________________________________________________________
           ||  Base Value           Compare Value
       Obs ||  MODE                  MODE
 ________  ||  ____________________  ____________________
           ||
       11  ||  RAIL                  PRIVATE CAR
       12  ||  PRIVATE CAR           RAIL
       13  ||  PRIVATE CAR           POOLED
__________________________________________________________

michokwu · Posted 07-27-2020 02:19 PM

Thank you. The by _all_did not delete all the duplicates.
But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error? It’s on purpose to show the person may have had multiple jobs and commuted to them through different modes of transportation

Tom · Posted 07-27-2020 02:24 PM

@michokwu wrote:
Thank you. The by _all_did not delete all the duplicates.
But your WANT dataset seems to have some differences from your HAVE dataset for three of the observations. Is that on purpose or just a typing error? It’s on purpose to show the person may have had multiple jobs and commuted to them through different modes of transportation

That makes no sense.

How can you output rows that don't exist in the input?

What is the rule you used to convert your input to your output?

Do you want to merge multiple records into one? If so then how is the SAS program supposed to know how to do that?

michokwu · Posted 07-27-2020 04:48 PM

Sorry, I don't understand what you mean by output rows that don't exist in the input. The observations in the WANT table are from the HAVE table. Some of the observations (ID) have multiple values

Thank you,

HAVE									WANT
ID	YEAR	SEX	RACE	SCHOOL	EDUC	INCOME	OCCUPATION	MODE OF TRANSPORT	ID	YEAR	SEX	RACE	SCHOOL	EDUC	INCOME	OCCUPATION	MODE OF TRANSPORT
00001	2017	1	1	2	7	20000	1	PRIVATE CAR	00001	2017	1	1	2	7	20000	1	PRIVATE CAR
00002	2017	1	4	1	6	25000	2	PRIVATE CAR	00002	2017	1	4	1	6	25000	2	PRIVATE CAR
00003	2017	2	4	2	6	30000	3	RAIL	00003	2017	2	4	2	6	30000	3	RAIL
00004	2017	1	2	1	6	40000	4	PRIVATE CAR	00004	2017	1	2	1	6	40000	4	PRIVATE CAR
00005	2017	2	3	1	6	45000	5	POOLED	00005	2017	2	3	1	6	45000	5	POOLED
00006	2017	2	2	1	6	18000	6	RAIL	00006	2017	2	2	1	6	18000	6	PRIVATE CAR
00007	2017	2	3	1	7	22000	1	BUS	00006	2017	2	2	1	6	38000	4	RAIL
00008	2017	1	1	1	6	26000	2	BICYCLE	00007	2017	2	3	1	7	22000	1	BUS
00009	2017	2	4	1	8	30000	3	POOLED	00008	2017	1	1	1	6	26000	2	BICYCLE
00010	2017	2	1	1	6	34000	4	WALK	00009	2017	2	4	1	8	30000	3	POOLED
00001	2017	1	1	2	7	20000	1	PRIVATE CAR	00010	2017	2	1	1	6	34000	4	WALK
00006	2017	2	2	1	6	38000	4	PRIVATE CAR	00001	2018	1	1	2	7	25000	1	PRIVATE CAR
00001	2018	1	1	2	7	25000	1	PRIVATE CAR	00002	2018	1	4	1	6	27000	2	PRIVATE CAR
00002	2018	1	4	1	6	27000	2	PRIVATE CAR	00003	2018	2	4	2	6	32000	3	RAIL
00003	2018	2	4	2	6	32000	3	RAIL	00004	2018	1	2	1	6	42000	4	PRIVATE CAR
00004	2018	1	2	1	6	42000	4	PRIVATE CAR	00005	2018	2	3	1	6	47000	5	POOLED
00005	2018	2	3	1	6	47000	5	POOLED	00006	2018	2	2	1	6	20000	6	PRIVATE CAR
00006	2018	2	2	1	6	20000	6	PRIVATE CAR	00007	2018	2	3	1	7	24000	1	BUS
00007	2018	2	3	1	7	24000	1	BUS	00008	2018	1	1	1	6	28000	2	BICYCLE
00008	2018	1	1	1	6	28000	2	BICYCLE	00009	2018	2	4	1	8	32000	3	POOLED
00009	2018	2	4	1	8	32000	3	POOLED	00010	2018	2	1	1	6	36000	4	WALK
00010	2018	2	1	1	6	36000	4	WALK	00010	2018	2	1	1	6	27000	2	BICYCLE
00010	2018	2	1	1	6	27000	2	BICYCLE
00004	2018	1	2	1	6	42000	4	PRIVATE CAR

Tom · Posted 07-27-2020 05:17 PM

You have OBSERVATIONS (rows) in your WANT dataset that do not appear in your HAVE dataset.

Here are the 6 observations in question, 3 from each source.

439   data check;
440     length indsname source $41 ;
441     set unique_sort want indsname=indsname ;
442     by ID YEAR SEX RACE SCHOOL EDUC INCOME OCCUPATION MODE ;
443     source=indsname;
444     if  (first.mode and last.mode);
445   run;

NOTE: There were 22 observations read from the data set WORK.UNIQUE_SORT.
NOTE: There were 22 observations read from the data set WORK.WANT.
NOTE: The data set WORK.CHECK has 6 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds


446
447   proc print;
448   run;

NOTE: There were 6 observations read from the data set WORK.CHECK.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.07 seconds
      cpu time            0.00 seconds

Obs   source              ID     YEAR   SEX   RACE   SCHOOL   EDUC   INCOME   OCCUPATION   MODE

 1    WORK.WANT          00006   2017    2      2       1       6     18000        6       PRIVATE CAR
 2    WORK.UNIQUE_SORT   00006   2017    2      2       1       6     18000        6       RAIL
 3    WORK.UNIQUE_SORT   00006   2017    2      2       1       6     38000        4       PRIVATE CAR
 4    WORK.WANT          00006   2017    2      2       1       6     38000        4       RAIL
 5    WORK.WANT          00006   2018    2      2       1       6     20000        6       POOLED
 6    WORK.UNIQUE_SORT   00006   2018    2      2       1       6     20000        6       PRIVATE CAR

Registration is open

SAS Training: Just a Click Away