BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi Community,

I have the attached data set.

It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.

I needed to make sure it objectively.

What I have done is below.

proc sort data=a.post_this2 out=t;

  by  account_number current_date bank_number;

run;

data dups nodups ;

  set t ;

  by  account_number current_date bank_number;

  if first.bank_number and last.bank_number then output nodups;

  else output dups ;

run;

Result

It says there are nodups which is correct.

Q: I then changed the yellow place below from bank_number to current_date.

Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.

Could you please let me know what is happening?

data dups nodups ;

  set t ;

  by  account_number current_date bank_number;

  if first.current_date and last.current_date then output nodups;

  else output dups ;

run;

Thanks

Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Consider this example (a for account number, c for current date, b for bank).

Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date.  Hence in your second program only the underscored line below would escape the DUP datasets.

The general rule is this:  Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.

a                 c       b              first.a last.a   first.c  last.c    first.b   last.b

1   01jan2012    1001             1         0        1        0            1         1

1   01jan2012    1002             0         0        0        0            1         1

1   01jan2012    1003             0         0        0         0            1         1

1   01jan2012    1004             0         0        0         1            1         1

1   02jan2012    1004             0          0      1          1            1          1

1   03jan2012    1001             0         0        1         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         1        0         1            1         1

2  ....

account_number current_date   bank_number     f

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

2 REPLIES 2
mkeintz
PROC Star

Consider this example (a for account number, c for current date, b for bank).

Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date.  Hence in your second program only the underscored line below would escape the DUP datasets.

The general rule is this:  Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.

a                 c       b              first.a last.a   first.c  last.c    first.b   last.b

1   01jan2012    1001             1         0        1        0            1         1

1   01jan2012    1002             0         0        0        0            1         1

1   01jan2012    1003             0         0        0         0            1         1

1   01jan2012    1004             0         0        0         1            1         1

1   02jan2012    1004             0          0      1          1            1          1

1   03jan2012    1001             0         0        1         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         1        0         1            1         1

2  ....

account_number current_date   bank_number     f

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Mirisage
Obsidian | Level 7

Hi Mkeintz,

Thank you very much for taking time to illustrate the concept in a nice example.

This is great!

Best regards

Mirisage.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 769 views
  • 0 likes
  • 2 in conversation