Hi Community,
I have the attached data set.
It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.
I needed to make sure it objectively.
What I have done is below.
proc sort data=a.post_this2 out=t;
by account_number current_date bank_number;
run;
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.bank_number and last.bank_number then output nodups;
else output dups ;
run;
Result
It says there are nodups which is correct.
Q: I then changed the yellow place below from bank_number to current_date.
Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.
Could you please let me know what is happening?
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.current_date and last.current_date then output nodups;
else output dups ;
run;
Thanks
Mirisage
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
Hi Mkeintz,
Thank you very much for taking time to illustrate the concept in a nice example.
This is great!
Best regards
Mirisage.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.