Solved: Duplicates list

danhopkinslewis · Posted 10-19-2020 02:15 PM

Hey there,

Novice to SAS so bear with me.

I have a large data set where I am trying to create a new data set of duplicates. I am using postcode as the duplicate but then I want to bring in if the date of birth field also matches on the row.

I tried a proc sort nodupkey but I think it’s just give me a clean list and then a duplicates list. I just want to see the ones which have duplicated on both criteria in a data set.

Any help would be greatly appreciated.

Thanks

ballardw · Posted 10-19-2020 03:25 PM

One way that might get you started:

Proc freq data=have noprint;
  tables var1*var2 / out=want(where=(count>1) drop=percent) ;
run;

The noprint option suppresses the normal proc freq output. The out= creates a data set named Want with the levels of the variable combinations and the count, and only when the count is more than 1, i.e. duplicates.

If you want to treat a missing value as a valid combination add the keyword MISSING after the / .

View solution in original post

PaigeMiller · Posted 10-19-2020 02:18 PM

A small (but realistic) example would be helpful. Please follow these instructions when providing data. Do not provide data as screen captures or as attachments. More context about the problem would also be helpful.

--
Paige Miller

danhopkinslewis · Posted 10-20-2020 03:32 AM

Sorry, unable to submit data as its sensitive.

PaigeMiller · Posted 10-20-2020 06:36 AM

@danhopkinslewis wrote:

Sorry, unable to submit data as its sensitive.

Make up some data, as long as it represents the problem.

--
Paige Miller

Kurt_Bremser · Posted 10-20-2020 06:38 AM

Then make up some fake data that illustrates your issue. Just make sure that variable types and other attributes (length, format) are the same.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 10-19-2020 03:25 PM

One way that might get you started:

Proc freq data=have noprint;
  tables var1*var2 / out=want(where=(count>1) drop=percent) ;
run;

The noprint option suppresses the normal proc freq output. The out= creates a data set named Want with the levels of the variable combinations and the count, and only when the count is more than 1, i.e. duplicates.

If you want to treat a missing value as a valid combination add the keyword MISSING after the / .

danhopkinslewis · Posted 10-20-2020 03:31 AM

Thanks. So if I wanted to bring in all the other variable columns but not include them in the count, how would I do that?

ballardw · Posted 10-27-2020 11:04 AM

@danhopkinslewis wrote:

Thanks. So if I wanted to bring in all the other variable columns but not include them in the count, how would I do that?

Do ALL of the other variables where you have duplicates have the same values?

If not you will need to decide which ones you want.

If the idea is to add a code that the variable combination is a duplicate then you could merge this back on the original data.

Again pseudo code because you haven't mentioned names of data sets or variables: The below renames the count to indicate that it is indeed a duplicate count.

Proc sort data=have;
   by var1 var2;
run;

data final;
   merge have want(  rename=(count=dupecount) );
   by var1 var2;
run;

If you want a simple flag for "this record is part of a duplicate set" you could add something like:

Flag = (dupecount>1);

SAS will treat the result of logical comparisons as a numeric one when true and zero when false.

Other logic could be used depending on what you want.

Duplicates list

Re: Duplicates list

Re: Duplicates list

Re: Duplicates list

Re: Duplicates list

Re: Duplicates list

Re: Duplicates list

Re: Duplicates list

Re: Duplicates list

SAS Innovate 2025: Call for Content

Classroom Training Available!