Help using Base SAS procedures

Conditional processing of a longitudinal dataset

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

Conditional processing of a longitudinal dataset

Hi Forum;

I have 9 distinct accounts in this dataset (this data set can be readily read in SAS).

data this;

informat current_date date9.;

input Bank_number    Account_number  $ 5-7 Current_date   Arrears_Band $19-25;

format current_date date9.;

cards;

10  111    30SEP2010 NPNA

10  111    31OCT2010 Current

10  111    30NOV2010 NPNA

10  111    30JUN2011 NPNA

10  111    01JAN2012 writoff

70  222    31DEC2011 writoff

70  222    31JAN2012 NPNA

40  333    30NOV2010 Current

40  333    31DEC2010 NPNA

50  333    31JAN2011 NPNA

50  333    28FEB2011 writoff

20  333    31MAR2011 NPNA

20  333    30APR2011 NPNA

20  333    31MAY2011 NPNA

20  333    30JUN2011 NPNA

100 111    30APR2011 90 +

100 111    31MAY2011 90 +

100 111    31JUL2011 NPNA

100 111    31AUG2011 NPNA

100 111    04JAN2012 NPNA

40  555    08FEB2010 30 - 60

40  555    31MAR2010 NPNA

40  555    30APR2010 60 - 90

40  555    31MAY2010 NPNA

40  666    03JAN2012 NPNA

10  777    04FEB2010 1 - 30

10  333    03MAR2010 Current

;

run;

I have a Business Rule:

  • The only arrears_band that you can get after NPNA is either "NPNA" or “writoff”.
  • Question: I need to separate all subjects (with its complete list of records) that violate this rule.

Note:

1). If a subject has been exclusively in NPNA arrears_band throughout its life,

then that subject does not violate our condition.

2). if a subject has only a single record and its Arrears_Band = “NPNA”,

then that subject does not violate our condition.

Answer:

10  111    30SEP2010 NPNA

10  111    31OCT2010 Current /*after NPNA you will not get either "NPNA" or “writoff” which violates our rule */

10  111    30NOV2010 NPNA

10  111    30JUN2011 NPNA

10  111    01JAN2012 writoff

40  555    08FEB2010 30 - 60

40  555    31MAR2010 NPNA

40  555    30APR2010 60 - 90 /*after NPNA you will not get either "NPNA" or “writoff” which violates our rule */

40  555    31MAY2010 NPNA

I have just tweaked a code developed by mkeintz for a somewhat similar situation but it doesn’t work for this situation because I cannot correctly incorporate the logic.

/*Sorting by three variables that uniquely identify a subject*/

proc sort data=this out=have;

  by bank_number account_number current_date;

run;

data want;

  set this;

  by bank_number account_number notsorted;

  retain NPNA_encountered 0;

  if first.account_number then NPNA_encountered=0;

  if last.account_number and NPNA_encountered=1 then output;

  if arrears_bank='NPNA' then NPNA_encountered=1;

run;

I really appreciate your help to develop my code.

Thanks

Miris


Accepted Solutions
Solution
‎10-09-2012 09:31 AM
Respected Advisor
Posts: 3,156

Re: Conditional processing of a longitudinal dataset

Hi Mirisage,

My bad. I switched around some statements and failed to put them back to order, so I fell into this lag() caveat, here it goes:

data want;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

          flag=ifn(Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA',1,flag);

              if first.account_number then flag=0;

  end;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

       if flag then output;

  end;

  run;

Haikuo

View solution in original post


All Replies
Respected Advisor
Posts: 3,156

Re: Conditional processing of a longitudinal dataset

Hi,

If you are reading your data from raw and planning to use your input code as is, it won't work. Your main problem can be resolved by 2XDOW:

data this;

input Bank_number:$    Account_number:$ Current_date :date9.   Arrears_Band :&$;

format current_date date9.;

cards;

10  111    30SEP2010 NPNA

10  111    31OCT2010 Current

10  111    30NOV2010 NPNA

10  111    30JUN2011 NPNA

10  111    01JAN2012 writoff

70  222    31DEC2011 writoff

70  222    31JAN2012 NPNA

40  333    30NOV2010 Current

40  333    31DEC2010 NPNA

50  333    31JAN2011 NPNA

50  333    28FEB2011 writoff

20  333    31MAR2011 NPNA

20  333    30APR2011 NPNA

20  333    31MAY2011 NPNA

20  333    30JUN2011 NPNA

100 111    30APR2011 90 +

100 111    31MAY2011 90 +

100 111    31JUL2011 NPNA

100 111    31AUG2011 NPNA

100 111    04JAN2012 NPNA

40  555    08FEB2010 30 - 60

40  555    31MAR2010 NPNA

40  555    30APR2010 60 - 90

40  555    31MAY2010 NPNA

40  666    03JAN2012 NPNA

10  777    04FEB2010 1 - 30

10  333    03MAR2010 Current

;

proc sort data=this out=have;

  by bank_number account_number current_date;

run;

/*2XDOW*/

data want;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

        if first.account_number then flag=0;

        else flag=ifn(Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA',1,flag);

  end;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

       if flag then output;

  end;

  run;

proc print;run;

Haikuo

Super Contributor
Posts: 338

Re: Conditional processing of a longitudinal dataset

Hi Haikuo,

Thank you very much.

This code correctly identifies one account which violates the business rule. This is the account.

Obs     Bank_number     Account_number      Current_date      Arrears_Band     flag (these are variable names)

1

40

555

08FEB2010

30 - 60

1

2

40

555

31MAR2010

NPNA

1

3

40

555

30APR2010

60 - 90

1

4

40

555

31MAY2010

NPNA

1

However, there is another account which violates our business rule. Below shown is that account which is not identified by the code.

10  111    30SEP2010 NPNA

10  111    31OCT2010 Current /*after NPNA you will not get either "NPNA" or “writoff” which violates our rule */

10  111    30NOV2010 NPNA

10  111    30JUN2011 NPNA

10  111    01JAN2012 writoff

I wonder if you have sometime to have a look.

Thank you for your time and expertise.

Mirisage

Solution
‎10-09-2012 09:31 AM
Respected Advisor
Posts: 3,156

Re: Conditional processing of a longitudinal dataset

Hi Mirisage,

My bad. I switched around some statements and failed to put them back to order, so I fell into this lag() caveat, here it goes:

data want;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

          flag=ifn(Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA',1,flag);

              if first.account_number then flag=0;

  end;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

       if flag then output;

  end;

  run;

Haikuo

Super Contributor
Posts: 338

Re: Conditional processing of a longitudinal dataset

Hi Haikuo,

Thank you very much for this code which worked very well.

I am trying to understand what this code is doing and could you please see if I have understood correctly, time permitting.

This is what “ifn” function does based on literature I have reviewed.

/*In the following code, we assign a value of 1 or 0 to flag variable*/

data test;

set test;

if sex= ‘Male’ then      flag =1;

                         else  flag=0;

run;

/*Some programmers use the following code to do the same task above*/

data test;

  set test;

flag=ifn(sex=’Male’,1,0);

run;

This is your code

data want;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

          flag=ifn(Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA',1,flag);

              if first.account_number then flag=0;

  end;

  do until (last.account_number);

     set have;

       by bank_number account_number current_date;

       if flag then output;

  end;

  run;

• Do until starts to read all the records pertinent to first account found in our data set called “this”.

So, these are the records.

Bank_number    Account_number  Current_date   Arrears_Band

10  111    30SEP2010 NPNA

10  111    31OCT2010 Current

10  111    30NOV2010 NPNA

10  111    30JUN2011 NPNA

10  111    01JAN2012 writoff

when do until executes below statement, I think this is what happening.

flag=ifn(Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA',1,flag);

Above statement says:

If Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA’

then a new variable called “flag” is created and its value is returned as 1.

If this condition is not satisfied, then then the value for the variable “flag” is returned as "flag".

(I am not too sure).

Then comes to the next statement:

              if first.account_number then flag=0;

Above statement returns zero for the first account_number.

Now this should be the answer after processing the account_number that do until first hits.

Bank_number   Account_number Current_date Arrears_Band lag(Arrears_Band ) Flag
10 111   30SEP2010NPNA0
10 111 31OCT2010CurrentNPNA1
10 111   30NOV2010NPNACurrentflag
10 111   30JUN2011NPNANPNAflag
10 111   01JAN2012writoffNPNAflag

Then in the next do until pass, all the records whose flag = 1 are outputted.

But then only the last 3 records of the above table should be outputted which is not our intended answer.

But your code generated the intended answer.

How come?

Would appreciate if you could shed some light.

Thank you for your time.

Mirisage

Respected Advisor
Posts: 3,156

Re: Conditional processing of a longitudinal dataset

Hi Mirisage,

Your understanding of my code is largely correct. The key point of my code is A) Unconditionally use lag().  B) When first record of each group, reset 'flag' value=0, while because of A), B) needs to be put after A), otherwise B) will be overridden by A), the order is important here, this is why my first code was not really working.

Now I see what puzzles you is "flag=ifn(Arrears_Band not in ('NPNA','writoff') and lag(Arrears_Band)='NPNA',1,flag)", when condition is not satisfied, flag=flag, meaning flag stays the same, holding the whatever the value it has. So this is an One-way switch, whenever the condition is met, flag will be set to '1', and stays at '1', until the first record of next group. So we will see the following instead:

Bank_number   Account_number Current_date Arrears_Band lag(Arrears_Band ) Flag
10 111   30SEP2010NPNA0
10 111 31OCT2010CurrentNPNA1
10 111   30NOV2010NPNACurrent1
10 111   30JUN2011NPNANPNA1
10 111   01JAN2012writoffNPNA1

By the end of loop (for the use this do-loop, please google "DOW SAS"), the last record of this group will be left in PDV:

10 111   01JAN2012writoffNPNA1

And all we need is flag=1, this flag of '1' will be carried over the second do-loop and thus output every single one of obs in that group. Again, to understand this, some knowledge on DOW will help.

HTH,

Haikuo

Super Contributor
Posts: 338

Re: Conditional processing of a longitudinal dataset

Hi Haikuo,

Thank you very much for this great descrition.

I really appreciate it.

Regards

Mirisage

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 306 views
  • 0 likes
  • 2 in conversation