Solved: Re: How to find out true duplicate records of this data set?

Mirisage · Posted 01-21-2013 02:08 PM

Hi SAS Forum;

Could anyone help me on this?

I have this data set which has 12 records.*/

data have;

informat Current_date date9.;

input Bank_number $ 1-2 Account_number $ 4-8 Current_date Balance Product $ 27-39 Arrears_Band $ 41-47;

format Current_date date9.;

cards;

10 44 28Feb2010 3 Personal Loan 90 +

10 44 30Apr2010 400 Personal Loan NPNA

10 44 31May2010 1 Personal Loan Current

10 44 31May2010 1 Personal Loan NPNA

10 44 31May2010 10000 Personal Loan Current

10 11111 30Nov2011 700 Res. Mortgage 1 - 30

10 11111 30Nov2011 700 Personal Loan 1 - 30

;

run;

Per our business logic, following 3 variables jointly constitute an unique record.

i.e.

bank_number Account_number Current_date

/*I wanted to find out if there are duplicate records in the above data set when I consider

the above 3 variables as "by variables" .*/

/*I have taken the following approach */

proc sort data=have out=L;

by bank_number Account_number Current_date ;

run;

data dups nodups ;

set L;

by bank_number Account_number Current_date ;

if first.Current_date and last.Current_date then output nodups ;

else output dups ;

run;

/* According to above code (which should be correct), all 12 records of my dataset are duplicates */

I was then prompted to use the following approach to remove duplicates.

proc sort data = have nodupkey out =want;

by bank_number account_number current_date;

run;

Question:

Although the ash colored two records below are identical when we consider

“bank_number Account_number Current_date” as by variables, they are in fact not identical in terms of variables “balance “ and “Arrears_Band”.

This means I was mislead by my sas code.

Q: So how to detect true duplicates?

data have;

informat Current_date date9.;

input Bank_number $ 1-2 Account_number $ 4-8 Current_date Balance Product $ 27-39 Arrears_Band $ 41-47;

format Current_date date9.;

cards;

10 44 28Feb2010 3 Personal Loan 90 +

10 44 30Apr2010 400 Personal Loan NPNA

10 44 31May2010 1 Personal Loan Current

10 44 31May2010 1 Personal Loan NPNA

10 44 31May2010 10000 Personal Loan Current

10 11111 30Nov2011 700 Res. Mortgage 1 - 30

10 11111 30Nov2011 700 Personal Loan 1 - 30

;

run;

Thank you for the help.

Mirisage

art297 · Posted 01-21-2013 02:34 PM

My two cents: You weren't misled by SAS .. your business rules aren't really correct!

Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions. E.g., if one were to have two records like:

10 44 28Feb2010 3 Personal Loan 90 +

Do they represent two personal loans for the same amount or are they duplicates? Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.

View solution in original post

art297 · Posted 01-21-2013 02:34 PM

My two cents: You weren't misled by SAS .. your business rules aren't really correct!

Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions. E.g., if one were to have two records like:

10 44 28Feb2010 3 Personal Loan 90 +

Do they represent two personal loans for the same amount or are they duplicates? Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.

AncaTilea · Posted 01-21-2013 02:41 PM

Hi;

In your sort you could add a NODUPS;

proc sort data=have out=L nodups;

by bank_number Account_number Current_date ;

run;

So then your L data set has the non-duplicates records;

And below is what SAS is doing with this piece of code:

proc sort data=have out=L;

by bank_number Account_number Current_date ;

run;

data not_right ;

set L;

by bank_number Account_number Current_date ;

*I replaced your IF statement with a bunch of flags;

* if first.Current_date and last.Current_date then output nodups ;

if first.Current_date then flag_first = 1;

if last.Current_date then flag_last = 1;

*this is similar to your original IF I am not outputting, just flagging, instead;

if flag_first = 1 & flag_last = 1 then flag_nodups = 1; *----> never true in your case;

run;

Please let me know if I am confusing you more or the PROC SORT NODUPS worked.

Good luck,

Anca.

art297 · Posted 01-21-2013 02:48 PM

: I don't think that using or recommending nodup or nodups or noduprecs is good practice. Take a look at:

http://www.qsl.net/kd6ttl/sas/nodups.pdf

I think that the critical issue that faces is defining what is meant by a duplicate and determining whether the data contains sufficient fields to identify such records.

AncaTilea · Posted 01-21-2013 02:53 PM

Thank you, Arthur.

I was not aware of it's downfalls, and now that I scan the paper you referred to it explains some things to me.

Thank you!

Anca

Dsrountree · Posted 03-27-2013 01:27 PM

I'm reading through several postings on how to extract duplicate records, yet I am reminded that proc sort/nodup is an issue.

What is the best alternative to validate a known set of variables and check for duplicates?

I've seen examples like the following:

data

dups undups;

set

WORK.procsorted;

by

memberid billedproc;

retain

count;

If

Sum (first.memberid, last.billedproc) < 2 Then Output

DUPS;

Else

Output

UNDUPS;

run

;

This doesn't work - it pulls in random lines (in the DUPS table) not matching the above criteria. Keep in mind I am really new to this SAS thing, but not to data mining and I need to make sure whatever approach I use pulls in accurate data. I need to be able to remove the dupes but also view which items met that duplicate criteria...Thank you...

Dsrountree · Posted 03-27-2013 01:27 PM

Copy and paste made it that big - Sorry

art297 · Posted 03-27-2013 03:30 PM

Derrick: I think you ought to start a new thread, provide an example dataset, and show which records you want to end up with in each of the two new files.

Dsrountree · Posted 03-27-2013 04:06 PM

Agreed

Jagadishkatam · Posted 01-22-2013 04:30 AM

Hi Mirisage,

yes you get the duplicates and no duplicates out. for this you need to make a small correction to the existing code. if you can see, most of the observation are matching on all the variables except on Arrears_Band and Balance. so in order to get the duplicates, you need to include Arrears_Band in the by variables like below

proc sort data=have out=L;

by bank_number Account_number Current_date Arrears_Band;

run;

data dups nodups ;

set L;

by bank_number Account_number Current_date Arrears_Band ;

if first.Current_date and last.Current_date then output nodups ;

else output dups ;

run;

Please try the code and let me know if this helped you.

Thanks,

Jagadish

Thanks,
Jag

Mirisage · Posted 01-25-2013 11:46 PM

Hi Art,

Thank you very much.

I now understand that I should have a clear definition what is meant by a duplicate.

Hi Anca,

Thank you for the alternative coding approach which is a good learning experience for me.

Hi Jagadish,

Your code suggestion is apprecaited. It needs a change (see highlighted in yellow).

proc sort data=have out=L;

by bank_number Account_number Current_date Arrears_Band;

run;

data dups nodups ;

set L;

by bank_number Account_number Current_date Arrears_Band ;

if first.Arrears_Band and last.Arrears_Band then output nodups ;

else output dups ;

run;

Regards

Mirisage

Peter_C · Posted 01-26-2013 05:50 AM

another PROC SORT feature you might find of interest

DUPOUT=

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

will output to a file the rows removed by NODUPKEY or NODUPS options

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

it is not the same as the requested solution because it leaves in the OUT= table, one of the duplicate pairs. However it provides another angle that could be helpful.

Mirisage · Posted 01-29-2013 02:37 PM

Hi Peter, thank you very much for these inputs.

Regards

Mirisage

Registration is open

SAS Training: Just a Click Away