Solved: Why duplicate records vary depending on “first.variable”?

Mirisage · Posted 08-17-2012 03:42 PM

Hi Community,

I have the attached data set.

It is clear that Bank_number, Account_number and Current_date are the combination of variables that would uniquely identify a record.

I needed to verify this objectively.

What I have done is below.

proc sort data=a.post_this2 out=t;
  by account_number current_date bank_number;
run;

data dups nodups;
  set t;
  by account_number current_date bank_number;
  if first.bank_number and last.bank_number then output nodups;
  else output dups;
run;

Result

It says there are no duplicates, which is correct.

Q: I then changed the highlighted variable below (the one tested in the IF statement) from bank_number to current_date.

Now it says there are duplicates. When I changed it to account_number, it also says there are duplicates.

Could you please let me know what is happening?

data dups nodups;
  set t;
  by account_number current_date bank_number;
  if first.current_date and last.current_date then output nodups;
  else output dups;
run;

Thanks

Mirisage

mkeintz · Posted 08-17-2012 04:07 PM

Consider this example (a for account number, c for current date, b for bank).

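Notice that any time you have multiple banks within a current_date, you will always have duplicate dates, even though you never have duplicate banks within a given account/date. Hence in your second program only the 02jan2012 row below (the one where first.c and last.c are both 1) would escape the DUPS data set.

The general rule is this: any time a particular BY variable changes value, its first. and last. dummies mark that boundary, and so do the first. and last. dummies of every variable to its right in the BY list, i.e. the BY groups are hierarchical. So testing first. and last. on the rightmost BY variable checks the full key, while testing them on current_date only checks whether an account/date pair occurs once.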

a  c          b     first.a  last.a  first.c  last.c  first.b  last.b
1  01jan2012  1001  1        0       1        0       1        1
1  01jan2012  1002  0        0       0        0       1        1
1  01jan2012  1003  0        0       0        0       1        1
1  01jan2012  1004  0        0       0        1       1        1
1  02jan2012  1004  0        0       1        1       1        1
1  03jan2012  1001  0        0       1        0       1        1
1  03jan2012  1002  0        0       0        0       1        1
1  03jan2012  1003  0        1       0        1       1        1
2  ....
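If you want to see these flags for yourself, here is a minimal sketch that rebuilds the example and copies the automatic first./last. variables into regular variables so they can be printed. The data set names (demo, flags, nodups_by_bank, dups_by_bank, nodups_by_date, dups_by_date) and the first_a ... last_b variables are made up for illustration, and the demo rows assume that bank numbers are distinct within each account/date, as described above.

```sas
/* Hypothetical demo data: a = account, c = current date, b = bank.  */
/* Assumption: no bank repeats within an account/date combination.   */
data demo;
  input a c :date9. b;
  format c date9.;
  datalines;
1 01jan2012 1001
1 01jan2012 1002
1 01jan2012 1003
1 01jan2012 1004
1 02jan2012 1004
1 03jan2012 1001
1 03jan2012 1002
1 03jan2012 1003
;
run;

/* BY-group processing requires the data to be sorted by the BY list */
proc sort data=demo;
  by a c b;
run;

/* first./last. variables are temporary, so copy them to keep them */
data flags;
  set demo;
  by a c b;
  first_a = first.a;  last_a = last.a;
  first_c = first.c;  last_c = last.c;
  first_b = first.b;  last_b = last.b;
run;

proc print data=flags noobs;
run;

/* Same split as in the question, done both ways in one pass */
data nodups_by_bank dups_by_bank nodups_by_date dups_by_date;
  set demo;
  by a c b;
  /* rightmost BY variable: tests uniqueness of the full a/c/b key */
  if first.b and last.b then output nodups_by_bank;
  else output dups_by_bank;
  /* middle BY variable: tests uniqueness of the a/c pair only */
  if first.c and last.c then output nodups_by_date;
  else output dups_by_date;
run;
```

With these demo rows, every record should land in nodups_by_bank, while only the 02jan2012 record should land in nodups_by_date, because it is the only account/date pair that occurs once.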




Mirisage · Posted 08-18-2012 04:25 PM

Hi Mkeintz,

Thank you very much for taking the time to illustrate the concept with a nice example.

This is great!

Best regards

Mirisage.
