## How to find out true duplicate records of this data set?

Solved
Super Contributor
Posts: 338

# How to find out true duplicate records of this data set?

Hi SAS Forum;

Could anyone help me on this?

I have this data set which has 12 records.*/

data have;

informat Current_date date9.;

input Bank_number \$ 1-2    Account_number \$ 4-8 Current_date Balance Product    \$ 27-39 Arrears_Band \$ 41-47;

format Current_date date9.;

cards;

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +

10 44     30Apr2010 400   Personal Loan NPNA

10 44     30Apr2010 400   Personal Loan NPNA

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan NPNA

10 44     31May2010 10000 Personal Loan Current

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Personal Loan 1 - 30

;

run;

Per our business logic, following 3 variables jointly constitute an unique record.

i.e.

bank_number Account_number Current_date

/*I wanted to find out if there are duplicate records in the above data set when I consider

the above 3 variables as "by variables" .*/

/*I have taken the following approach */

proc sort data=have out=L;

by  bank_number Account_number Current_date  ;

run;

data dups nodups ;

set L;

by  bank_number Account_number Current_date  ;

if first.Current_date  and last.Current_date then output nodups ;

else output dups ;

run;

/* According to above code (which should be correct), all 12 records of my dataset are duplicates */

I was then prompted to use the following approach to remove duplicates.

proc sort data = have  nodupkey out =want;

by  bank_number account_number  current_date;

run;

Question:

Although the ash colored two records below are identical when we consider

bank_number Account_number Current_date” as by variables, they are in fact not identical in terms of variables “balance “ and “Arrears_Band”.

This means I was mislead by my sas code.

Q: So how to detect true duplicates?

data have;

informat Current_date date9.;

input Bank_number \$ 1-2    Account_number \$ 4-8 Current_date Balance Product    \$ 27-39 Arrears_Band \$ 41-47;

format Current_date date9.;

cards;

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +

10 44     30Apr2010 400   Personal Loan NPNA

10 44     30Apr2010 400   Personal Loan NPNA

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan NPNA

10 44     31May2010 10000 Personal Loan Current

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Personal Loan 1 - 30

;

run;

Thank you for the help.

Mirisage

Accepted Solutions
Solution
‎01-21-2013 02:34 PM
PROC Star
Posts: 8,164

## Re: How to find out true duplicate records of this data set?

My two cents: You weren't misled by SAS .. your business rules aren't really correct!

Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions.  E.g., if one were to have two records like:

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +

Do they represent two personal loans for the same amount or are they duplicates?  Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.

All Replies
Solution
‎01-21-2013 02:34 PM
PROC Star
Posts: 8,164

## Re: How to find out true duplicate records of this data set?

My two cents: You weren't misled by SAS .. your business rules aren't really correct!

Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions.  E.g., if one were to have two records like:

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +

Do they represent two personal loans for the same amount or are they duplicates?  Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.

Super Contributor
Posts: 543

## Re: How to find out true duplicate records of this data set?

Hi;

proc sort data=have out=L nodups;

by  bank_number Account_number Current_date  ;

run;

So then your L data set has the non-duplicates records;

And below is what SAS is doing with this piece of code:

proc sort data=have out=L;

by  bank_number Account_number Current_date  ;

run;

data not_right ;

set L;

by  bank_number Account_number Current_date  ;

*I replaced your IF statement with a bunch of flags;

*      if first.Current_date  and last.Current_date then output nodups ;

if first.Current_date then flag_first = 1;

if last.Current_date then flag_last = 1;

*this is similar to your original IF I am not outputting, just flagging, instead;

if flag_first = 1 & flag_last = 1 then flag_nodups = 1; *----> never true in your case;

run;

Please let me know if I am confusing you more or the PROC SORT NODUPS worked.

Good luck,

Anca.

PROC Star
Posts: 8,164

## Re: How to find out true duplicate records of this data set?

: I don't think that using or recommending nodup or nodups or noduprecs is good practice.  Take a look at:

http://www.qsl.net/kd6ttl/sas/nodups.pdf

I think that the critical issue that faces is defining what is meant by a duplicate and determining whether the data contains sufficient fields to identify such records.

Super Contributor
Posts: 543

## Re: How to find out true duplicate records of this data set?

Thank you, Arthur.

I was not aware of it's downfalls, and now that I scan the paper you referred to it explains some things to me.

Thank you!

Anca

Contributor
Posts: 59

## Re: How to find out true duplicate records of this data set?

I'm reading through several postings on how to extract duplicate records, yet I am reminded that proc sort/nodup is an issue.

What is the best alternative to validate a known set of variables and check for duplicates?

I've seen examples like the following:

data

dups undups;

set

WORK.procsorted;

by

memberid billedproc;

retain

count;

If

Sum (first.memberid, last.billedproc) < 2 Then Output

DUPS;

Else

Output

UNDUPS;

run

;

This doesn't work - it pulls in random lines (in the DUPS table) not matching the above criteria.  Keep in mind I am really new to this SAS thing, but not to data mining and I need to make sure whatever approach I use pulls in accurate data.  I need to be able to remove the dupes but also view which items met that duplicate criteria...Thank you...

Contributor
Posts: 59

## Re: How to find out true duplicate records of this data set?

Copy and paste made it that big - Sorry

PROC Star
Posts: 8,164

## Re: How to find out true duplicate records of this data set?

Derrick: I think you ought to start a new thread, provide an example dataset, and show which records you want to end up with in each of the two new files.

Contributor
Posts: 59

Agreed

Posts: 1,147

## Re: How to find out true duplicate records of this data set?

Hi Mirisage,

yes you get the duplicates and no duplicates out. for this you need to make a small correction to the existing code. if you can see, most of the observation are matching on all the variables except on Arrears_Band and Balance. so in order to get the duplicates, you need to include Arrears_Band in the by variables like below

proc sort data=have out=L;

by  bank_number Account_number Current_date  Arrears_Band;

run;

data dups nodups ;

set L;

by  bank_number Account_number Current_date Arrears_Band ;

if first.Current_date  and last.Current_date then output nodups ;

else output dups ;

run;

Please try the code and let me know if this helped you.

Thanks,

Thanks,
Jag
Super Contributor
Posts: 338

## Re: How to find out true duplicate records of this data set?

Hi Art,

Thank you very much.

I now understand that I should have a clear definition what is meant by a duplicate.

Hi Anca,

Thank you for the alternative coding approach which is a good learning experience for me.

Your code suggestion is apprecaited. It needs a change (see highlighted in yellow).

proc sort data=have out=L;

by  bank_number Account_number Current_date  Arrears_Band;

run;

data dups nodups ;

set L;

by  bank_number Account_number Current_date Arrears_Band ;

if first.Arrears_Band and last.Arrears_Band then output nodups ;

else output dups ;

run;

Regards

Mirisage

Valued Guide
Posts: 2,191

## Re: How to find out true duplicate records of this data set?

another PROC SORT feature you might find of interest

DUPOUT=

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

will output to a file the rows removed by NODUPKEY or NODUPS options

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

it is not the same as the requested solution because it leaves in the OUT= table, one of the duplicate pairs. However it provides another angle that could be helpful.

Super Contributor
Posts: 338

## Re: How to find out true duplicate records of this data set?

Hi Peter, thank you very much for these inputs.

Regards

Mirisage

🔒 This topic is solved and locked.