Hi SAS Forum;
Could anyone help me on this?
I have this data set which has 12 records.
data have;
informat Current_date date9.;
input Bank_number $ 1-2 Account_number $ 4-8 Current_date Balance Product $ 27-39 Arrears_Band $ 41-47;
format Current_date date9.;
cards;
10 44 28Feb2010 3 Personal Loan 90 +
10 44 28Feb2010 3 Personal Loan 90 +
10 44 30Apr2010 400 Personal Loan NPNA
10 44 30Apr2010 400 Personal Loan NPNA
10 44 31May2010 1 Personal Loan Current
10 44 31May2010 1 Personal Loan Current
10 44 31May2010 1 Personal Loan Current
10 44 31May2010 1 Personal Loan NPNA
10 44 31May2010 10000 Personal Loan Current
10 11111 30Nov2011 700 Res. Mortgage 1 - 30
10 11111 30Nov2011 700 Res. Mortgage 1 - 30
10 11111 30Nov2011 700 Personal Loan 1 - 30
;
run;
Per our business logic, the following 3 variables jointly constitute a unique key:
bank_number Account_number Current_date
I wanted to find out whether there are duplicate records in the above data set when I treat these 3 variables as BY variables. I took the following approach:
proc sort data=have out=L;
by bank_number Account_number Current_date ;
run;
data dups nodups ;
set L;
by bank_number Account_number Current_date ;
if first.Current_date and last.Current_date then output nodups ;
else output dups ;
run;
According to the above code (which should be correct), all 12 records of my data set are duplicates.
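That claim can be checked by counting records per key, for example with PROC FREQ (a sketch; KEYCOUNTS is an assumed output name):

proc freq data=have noprint;
*count records for each key combination, writing the counts to a data set;
tables bank_number*account_number*current_date / out=keycounts;
run;

In KEYCOUNTS every key combination has COUNT greater than 1 (the groups have 2, 2, 5, and 3 records), which is why every record lands in DUPS.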
I was then advised to use the following approach to remove duplicates:
proc sort data = have nodupkey out =want;
by bank_number account_number current_date;
run;
Question:
Although some records (for example, the five 31May2010 records for account 44) are identical when we consider bank_number, Account_number, and Current_date as BY variables, they are in fact not identical in terms of the variables Balance and Arrears_Band.
This means I was misled by my SAS code.
Q: So how do I detect true duplicates?
Thank you for the help.
Mirisage
My two cents: You weren't misled by SAS .. your business rules aren't really correct!
Even if you included balance, product and arrears band, you may still be identifying records that look the same, but represent separate transactions. E.g., if one were to have two records like:
10 44 28Feb2010 3 Personal Loan 90 +
10 44 28Feb2010 3 Personal Loan 90 +
Do they represent two personal loans for the same amount or are they duplicates? Without a unique identifier, like transaction number in the data, I don't think you can know when you have duplicates .. only that you have duplicate-appearing information.
Hi;
In your sort you could add NODUPRECS (NODUP also works), which removes adjacent records that are identical on every variable:
proc sort data=have out=L noduprecs;
by bank_number Account_number Current_date ;
run;
So then your L data set has the de-duplicated records.
And below is what SAS is doing with this piece of code:
proc sort data=have out=L;
by bank_number Account_number Current_date ;
run;
data not_right ;
set L;
by bank_number Account_number Current_date ;
*I replaced your IF statement with a bunch of flags;
* if first.Current_date and last.Current_date then output nodups ;
if first.Current_date then flag_first = 1;
if last.Current_date then flag_last = 1;
*this is similar to your original IF, but I am flagging instead of outputting;
if flag_first = 1 & flag_last = 1 then flag_nodups = 1; *----> never true in your case, because every key group has more than one record;
run;
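As a sketch of a variant on the flag idea (assuming the same sorted L data set), a double DOW loop can mark every record that belongs to a duplicate key group, so singletons and duplicates stay distinguishable without splitting the data:

data flagged;
*first pass: count the records in this BY group;
do n = 1 by 1 until (last.Current_date);
set L;
by bank_number Account_number Current_date;
end;
*second pass: re-read the same group and flag every record in it;
do i = 1 to n;
set L;
dup_group = (n > 1); *1 when the key occurs more than once;
output;
end;
drop i;
run;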
Please let me know if I am confusing you more or whether PROC SORT with NODUPRECS worked.
Good luck,
Anca.
I don't think that using or recommending NODUP, NODUPS, or NODUPRECS is good practice. Take a look at:
http://www.qsl.net/kd6ttl/sas/nodups.pdf
I think that the critical issue one faces is defining what is meant by a duplicate and determining whether the data contain sufficient fields to identify such records.
Thank you, Arthur.
I was not aware of its downsides, and now that I have scanned the paper you referred to, it explains some things to me.
Thank you!
Anca
I'm reading through several postings on how to extract duplicate records, yet I am reminded that proc sort/nodup is an issue.
What is the best alternative to validate a known set of variables and check for duplicates?
I've seen examples like the following:
data dups undups;
set WORK.procsorted;
by memberid billedproc;
retain count;
if sum(first.memberid, last.billedproc) < 2 then output dups;
else output undups;
run;
This doesn't work: it pulls random lines into the DUPS table that don't match the above criteria. Keep in mind I am really new to SAS, but not to data mining, and I need to make sure whatever approach I use pulls in accurate data. I need to be able to remove the dupes but also view which items met the duplicate criteria. Thank you.
Copy and paste made it that big - Sorry
Derrick: I think you ought to start a new thread, provide an example dataset, and show which records you want to end up with in each of the two new files.
Agreed
Hi Mirisage,
Yes, you can get both the duplicates and the non-duplicates out; you just need to make a small correction to the existing code. As you can see, most of the observations match on all the variables except Arrears_Band and Balance, so in order to get the duplicates you need to include Arrears_Band in the BY variables, like below:
proc sort data=have out=L;
by bank_number Account_number Current_date Arrears_Band;
run;
data dups nodups ;
set L;
by bank_number Account_number Current_date Arrears_Band ;
if first.Current_date and last.Current_date then output nodups ;
else output dups ;
run;
Please try the code and let me know if this helped you.
Thanks,
Jagadish
Hi Art,
Thank you very much.
I now understand that I should have a clear definition what is meant by a duplicate.
Hi Anca,
Thank you for the alternative coding approach which is a good learning experience for me.
Hi Jagadish,
Your code suggestion is appreciated. It needs one change: the FIRST. and LAST. tests should use the last BY variable, Arrears_Band.
proc sort data=have out=L;
by bank_number Account_number Current_date Arrears_Band;
run;
data dups nodups ;
set L;
by bank_number Account_number Current_date Arrears_Band ;
if first.Arrears_Band and last.Arrears_Band then output nodups ;
else output dups ;
run;
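If the goal is true duplicates, i.e. records identical on every variable, one alternative sketch (assuming the HAVE data set above, with TRUE_DUPS as an assumed output name) is to count full-record occurrences with PROC SQL:

proc sql;
*keep only records whose full combination of values occurs more than once;
create table true_dups as
select *, count(*) as freq
from have
group by bank_number, account_number, current_date,
balance, product, arrears_band
having count(*) > 1;
quit;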
Regards
Mirisage
another PROC SORT feature you might find of interest
DUPOUT=
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
will output to a data set the rows removed by the NODUPKEY or NODUPRECS options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is not the same as the requested solution, because it leaves one record of each duplicate set in the OUT= table. However, it provides another angle that could be helpful.
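For example, a sketch against the HAVE data set above (WANT and REMOVED are assumed output names):

proc sort data=have out=want nodupkey dupout=removed;
*WANT keeps the first record for each key; REMOVED holds every record NODUPKEY dropped;
by bank_number account_number current_date;
run;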
Hi Peter, thank you very much for these inputs.
Regards
Mirisage