- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Community,
I have the attached data set.
It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.
I needed to make sure it objectively.
What I have done is below.
proc sort data=a.post_this2 out=t;
by account_number current_date bank_number;
run;
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.bank_number and last.bank_number then output nodups;
else output dups ;
run;
Result
It says there are nodups which is correct.
Q: I then changed the yellow place below from bank_number to current_date.
Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.
Could you please let me know what is happening?
data dups nodups ;
set t ;
by account_number current_date bank_number;
if first.current_date and last.current_date then output nodups;
else output dups ;
run;
Thanks
Mirisage
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Consider this example (a for account number, c for current date, b for bank).
Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the underscored line below would escape the DUP datasets.
The general rule is this: Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.
a c b first.a last.a first.c last.c first.b last.b
1 01jan2012 1001 1 0 1 0 1 1
1 01jan2012 1002 0 0 0 0 1 1
1 01jan2012 1003 0 0 0 0 1 1
1 01jan2012 1004 0 0 0 1 1 1
1 02jan2012 1004 0 0 1 1 1 1
1 03jan2012 1001 0 0 1 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 0 0 0 1 1
1 03jan2012 1001 0 1 0 1 1 1
2 ....
account_number current_date bank_number f
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Mkeintz,
Thank you very much for taking time to illustrate the concept in a nice example.
This is great!
Best regards
Mirisage.