BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi SAS Forum;

Could anyone help me on this?

I have this data set which has 12 records.*/

data have;

informat Current_date date9.;

input Bank_number $ 1-2    Account_number $ 4-8 Current_date Balance Product    $ 27-39 Arrears_Band $ 41-47;

format Current_date date9.;

cards;

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +

10 44     30Apr2010 400   Personal Loan NPNA

10 44     30Apr2010 400   Personal Loan NPNA

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan NPNA

10 44     31May2010 10000 Personal Loan Current

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Personal Loan 1 - 30

;

run;

Per our business logic, following 3 variables jointly constitute an unique record.

i.e.

bank_number Account_number Current_date

/*I wanted to find out if there are duplicate records in the above data set when I consider

the above 3 variables as "by variables" .*/

/*I have taken the following approach */

proc sort data=have out=L;

     by  bank_number Account_number Current_date  ;

run;

data dups nodups ;

  set L;

by  bank_number Account_number Current_date  ;

     if first.Current_date  and last.Current_date then output nodups ;

           else output dups ;

run;

/* According to above code (which should be correct), all 12 records of my dataset are duplicates */

I was then prompted to use the following approach to remove duplicates.

proc sort data = have  nodupkey out =want;

     by  bank_number account_number  current_date;

run;

Question:

Although the ash colored two records below are identical when we consider

bank_number Account_number Current_date” as by variables, they are in fact not identical in terms of variables “balance “ and “Arrears_Band”.

This means I was mislead by my sas code.

Q: So how to detect true duplicates?

data have;

informat Current_date date9.;

input Bank_number $ 1-2    Account_number $ 4-8 Current_date Balance Product    $ 27-39 Arrears_Band $ 41-47;

format Current_date date9.;

cards;

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +

10 44     30Apr2010 400   Personal Loan NPNA

10 44     30Apr2010 400   Personal Loan NPNA

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan Current

10 44     31May2010 1     Personal Loan NPNA

10 44     31May2010 10000 Personal Loan Current

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Res. Mortgage 1 - 30

10 11111  30Nov2011 700   Personal Loan 1 - 30

;

run;

Thank you for the help.

Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

My two cents: You weren't misled by SAS .. your business rules aren't really correct!

Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions.  E.g., if one were to have two records like:

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +


Do they represent two personal loans for the same amount or are they duplicates?  Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.

View solution in original post

12 REPLIES 12
art297
Opal | Level 21

My two cents: You weren't misled by SAS .. your business rules aren't really correct!

Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions.  E.g., if one were to have two records like:

10 44     28Feb2010 3     Personal Loan 90 +

10 44     28Feb2010 3     Personal Loan 90 +


Do they represent two personal loans for the same amount or are they duplicates?  Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.

AncaTilea
Pyrite | Level 9

Hi;

In your sort you could add a NODUPS;


proc sort data=have out=L nodups;

by  bank_number Account_number Current_date  ;

run;

So then your L data set has the non-duplicates records;

And below is what SAS is doing with this piece of code:

proc sort data=have out=L;

     by  bank_number Account_number Current_date  ;

run;


data not_right ;

set L;

by  bank_number Account_number Current_date  ;

*I replaced your IF statement with a bunch of flags;

*      if first.Current_date  and last.Current_date then output nodups ;

    if first.Current_date then flag_first = 1;

    if last.Current_date then flag_last = 1;

*this is similar to your original IF I am not outputting, just flagging, instead;

     if flag_first = 1 & flag_last = 1 then flag_nodups = 1; *----> never true in your case;

run;

Please let me know if I am confusing you more or the PROC SORT NODUPS worked.

Good luck,

Anca.

art297
Opal | Level 21

: I don't think that using or recommending nodup or nodups or noduprecs is good practice.  Take a look at:

http://www.qsl.net/kd6ttl/sas/nodups.pdf

I think that the critical issue that faces is defining what is meant by a duplicate and determining whether the data contains sufficient fields to identify such records.

AncaTilea
Pyrite | Level 9

Thank you, Arthur.

I was not aware of it's downfalls, and now that I scan the paper you referred to it explains some things to me.

Thank you!

Anca

Dsrountree
Obsidian | Level 7

I'm reading through several postings on how to extract duplicate records, yet I am reminded that proc sort/nodup is an issue.

What is the best alternative to validate a known set of variables and check for duplicates?

I've seen examples like the following:

data

dups undups;

set

WORK.procsorted;

by

memberid billedproc;

retain

count;

If

Sum (first.memberid, last.billedproc) < 2 Then Output

DUPS;

Else

Output

UNDUPS;

run

;


This doesn't work - it pulls in random lines (in the DUPS table) not matching the above criteria.  Keep in mind I am really new to this SAS thing, but not to data mining and I need to make sure whatever approach I use pulls in accurate data.  I need to be able to remove the dupes but also view which items met that duplicate criteria...Thank you...

Dsrountree
Obsidian | Level 7

Copy and paste made it that big - Sorry

art297
Opal | Level 21

Derrick: I think you ought to start a new thread, provide an example dataset, and show which records you want to end up with in each of the two new files.

Jagadishkatam
Amethyst | Level 16

Hi Mirisage,

yes you get the duplicates and no duplicates out. for this you need to make a small correction to the existing code. if you can see, most of the observation are matching on all the variables except on Arrears_Band and Balance. so in order to get the duplicates, you need to include Arrears_Band in the by variables like below


proc sort data=have out=L;

     by  bank_number Account_number Current_date  Arrears_Band;

run;


data dups nodups ;

  set L;

by  bank_number Account_number Current_date Arrears_Band ;

     if first.Current_date  and last.Current_date then output nodups ;

           else output dups ;

run;


Please try the code and let me know if this helped you.


Thanks,

Jagadish

Thanks,
Jag
Mirisage
Obsidian | Level 7

Hi Art,

Thank you very much.

I now understand that I should have a clear definition what is meant by a duplicate.

Hi Anca,

Thank you for the alternative coding approach which is a good learning experience for me.

Hi Jagadish,

Your code suggestion is apprecaited. It needs a change (see highlighted in yellow).

proc sort data=have out=L;

     by  bank_number Account_number Current_date  Arrears_Band;

run;

data dups nodups ;

  set L;

by  bank_number Account_number Current_date Arrears_Band ;

     if first.Arrears_Band and last.Arrears_Band then output nodups ;

           else output dups ;

run;

Regards

Mirisage

Peter_C
Rhodochrosite | Level 12

another PROC SORT feature you might find of interest

     DUPOUT=

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

will output to a file the rows removed by NODUPKEY or NODUPS options

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

it is not the same as the requested solution because it leaves in the OUT= table, one of the duplicate pairs. However it provides another angle that could be helpful.

Mirisage
Obsidian | Level 7

Hi Peter, thank you very much for these inputs.

Regards

Mirisage

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1386 views
  • 9 likes
  • 6 in conversation