BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi Community,

I have the attached data set.

It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.

I needed to make sure it objectively.

What I have done is below.

proc sort data=a.post_this2 out=t;

  by  account_number current_date bank_number;

run;

data dups nodups ;

  set t ;

  by  account_number current_date bank_number;

  if first.bank_number and last.bank_number then output nodups;

  else output dups ;

run;

Result

It says there are nodups which is correct.

Q: I then changed the yellow place below from bank_number to current_date.

Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.

Could you please let me know what is happening?

data dups nodups ;

  set t ;

  by  account_number current_date bank_number;

  if first.current_date and last.current_date then output nodups;

  else output dups ;

run;

Thanks

Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Consider this example (a for account number, c for current date, b for bank).

Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date.  Hence in your second program only the underscored line below would escape the DUP datasets.

The general rule is this:  Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.

a                 c       b              first.a last.a   first.c  last.c    first.b   last.b

1   01jan2012    1001             1         0        1        0            1         1

1   01jan2012    1002             0         0        0        0            1         1

1   01jan2012    1003             0         0        0         0            1         1

1   01jan2012    1004             0         0        0         1            1         1

1   02jan2012    1004             0          0      1          1            1          1

1   03jan2012    1001             0         0        1         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         1        0         1            1         1

2  ....

account_number current_date   bank_number     f

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

Consider this example (a for account number, c for current date, b for bank).

Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date.  Hence in your second program only the underscored line below would escape the DUP datasets.

The general rule is this:  Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.

a                 c       b              first.a last.a   first.c  last.c    first.b   last.b

1   01jan2012    1001             1         0        1        0            1         1

1   01jan2012    1002             0         0        0        0            1         1

1   01jan2012    1003             0         0        0         0            1         1

1   01jan2012    1004             0         0        0         1            1         1

1   02jan2012    1004             0          0      1          1            1          1

1   03jan2012    1001             0         0        1         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         1        0         1            1         1

2  ....

account_number current_date   bank_number     f

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Mirisage
Obsidian | Level 7

Hi Mkeintz,

Thank you very much for taking time to illustrate the concept in a nice example.

This is great!

Best regards

Mirisage.

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1390 views
  • 0 likes
  • 2 in conversation