Help using Base SAS procedures

Why duplicate records vary depending on “first.variable”?

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

Why duplicate records vary depending on “first.variable”?

Hi Community,

I have the attached data set.

It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.

I needed to make sure it objectively.

What I have done is below.

proc sort data=a.post_this2 out=t;

  by  account_number current_date bank_number;

run;

data dups nodups ;

  set t ;

  by  account_number current_date bank_number;

  if first.bank_number and last.bank_number then output nodups;

  else output dups ;

run;

Result

It says there are nodups which is correct.

Q: I then changed the yellow place below from bank_number to current_date.

Then it says there are duplicates. Also, when changed it to account_number too, it says there are duplicate numbers.

Could you please let me know what is happening?

data dups nodups ;

  set t ;

  by  account_number current_date bank_number;

  if first.current_date and last.current_date then output nodups;

  else output dups ;

run;

Thanks

Mirisage

Attachment

Accepted Solutions
Solution
‎08-17-2012 04:07 PM
Super User
Posts: 768

Re: Why duplicate records vary depending on “first.variable”?

Consider this example (a for account number, c for current date, b for bank).

Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date.  Hence in your second program only the underscored line below would escape the DUP datasets.

The general rule is this:  Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.

a                 c       b              first.a last.a   first.c  last.c    first.b   last.b

1   01jan2012    1001             1         0        1        0            1         1

1   01jan2012    1002             0         0        0        0            1         1

1   01jan2012    1003             0         0        0         0            1         1

1   01jan2012    1004             0         0        0         1            1         1

1   02jan2012    1004             0          0      1          1            1          1

1   03jan2012    1001             0         0        1         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         1        0         1            1         1

2  ....

account_number current_date   bank_number     f

View solution in original post


All Replies
Solution
‎08-17-2012 04:07 PM
Super User
Posts: 768

Re: Why duplicate records vary depending on “first.variable”?

Consider this example (a for account number, c for current date, b for bank).

Notice that any time you have multipe banks a current_date, then you will always have duplicate dates, even though you never have duplicate banks within a given account/date.  Hence in your second program only the underscored line below would escape the DUP datasets.

The general rule is this:  Any time a particular BY variable changes its first. and last. dummies will change as will all first. and last. dummies to the right of it in the by list - i.e. it's hierarchical sorting.

a                 c       b              first.a last.a   first.c  last.c    first.b   last.b

1   01jan2012    1001             1         0        1        0            1         1

1   01jan2012    1002             0         0        0        0            1         1

1   01jan2012    1003             0         0        0         0            1         1

1   01jan2012    1004             0         0        0         1            1         1

1   02jan2012    1004             0          0      1          1            1          1

1   03jan2012    1001             0         0        1         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         0        0         0            1         1

1   03jan2012    1001             0         1        0         1            1         1

2  ....

account_number current_date   bank_number     f

Super Contributor
Posts: 338

Re: Why duplicate records vary depending on “first.variable”?

Hi Mkeintz,

Thank you very much for taking time to illustrate the concept in a nice example.

This is great!

Best regards

Mirisage.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 161 views
  • 0 likes
  • 2 in conversation