Hi Community,
I have the attached data set.
It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.
I needed to make sure it objectively.
What I have done is below.
proc sort data=a.post_this2 out=t;
by account_number current_date bank_number;
run;
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.bank_number and last.bank_number then output nodups;
else output dups ;
run;
Result
It says there are nodups which is correct.
Q: I then changed the yellow place below from bank_number to current_date.
Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.
Could you please let me know what is happening?
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.current_date and last.current_date then output nodups;
else output dups ;
run;
Thanks
Mirisage
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
Hi Mkeintz,
Thank you very much for taking time to illustrate the concept in a nice example.
This is great!
Best regards
Mirisage.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.