Hi Community,
I have the attached data set.
It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.
I needed to make sure it objectively.
What I have done is below.
proc sort data=a.post_this2 out=t;
by account_number current_date bank_number;
run;
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.bank_number and last.bank_number then output nodups;
else output dups ;
run;
Result
It says there are nodups which is correct.
Q: I then changed the yellow place below from bank_number to current_date.
Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.
Could you please let me know what is happening?
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.current_date and last.current_date then output nodups;
else output dups ;
run;
Thanks
Mirisage
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
Hi Mkeintz,
Thank you very much for taking time to illustrate the concept in a nice example.
This is great!
Best regards
Mirisage.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.