Solved: Re: Removing "direct jumping" customers in a time series?

Mirisage · Posted 07-03-2012 05:15 PM

Hi Collegues;

I have this data set - note it has 4 accounts.

data a;

informat Current_date date9.;

Input Current_date Account_number $11-13 Arrears_Band $ 15-21;

Format Current_date date9.;

cards;

28FEB2010 111 NPNA

31MAR2010 111 Current

30APR2010 111 30 - 60

28FEB2010 444 Current

31MAR2010 444 30 - 60

30APR2010 444 30 - 60

31MAY2010 444 Current

31OCT2010 444 Current

31DEC2010 444 Current

31JAN2011 444 1 - 30

28FEB2011 444 30 - 60

31MAR2011 444 60 - 90

30JUN2011 444 90 +

31JUL2011 444 NPNA

31AUG2011 444 NPNA

28FEB2010 555 30 - 60

31MAR2010 555 30 - 60

30APR2010 555 60 - 90

31MAY2010 555 NPNA

31AUG2011 555 90 +

30SEP2011 555 NPNA

31MAY2010 666 Current

31AUG2011 666 90 +

30SEP2011 666 Current

;

run;

Question:

I want to remove any Account_number that has fallen at least once from a "non 90 +” Arrears_Band to NPNA

Answer

Based on this condition, Account_number 555 has to be removed.

My appraoch

1. I have used the below code helped by PG Stat to track

how each account has successively fallen across different Arrears_Bands.

2. Then visually spotted the account_numbers that satisfy the

above condition.

3. Finally, I did a "sub setting if" step, using those spotted account_numbers.

But I have a huge data set which takes time to do this.

Could any one suggest an efficient alternative for the last step?

I used this approach.

proc sort data=a;

by account_number Current_date;

run;

data cat(keep=account_number category);

length Category $200;

do until(last.account_number);

do until (last.Arrears_Band);

set a;

by account_number Arrears_Band notsorted;

end;

category = catx(" to ", category, Arrears_Band);

end;

run;

data b;

set a;

if account_number not in ('555');

run;

/*Thanks

Mirisage

Mirisage · Posted 07-06-2012 01:11 PM

Hi Haikuo,

Your revised code is great!

I applied it into my original big data set and it is extremely efficicent.

In fact, I applied your code and PGStat's two codes into my big data set and then compared the final 3 results.

They are identical.

Thank you very much for this support.

Best regards

Mirisage

View solution in original post

Haikuo · Posted 07-03-2012 06:02 PM

Hi, To match your output, one condition needs to be modified: instead of just "non 90 +”, it should be 'non 90 +' and 'non NPNA'.

2X DOWs seems to do the job:

data a;

informat Current_date date9.;

Input Current_date Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010 111 NPNA

31MAR2010 111 Current

30APR2010 111 30 - 60

28FEB2010 444 Current

31MAR2010 444 30 - 60

30APR2010 444 30 - 60

31MAY2010 444 Current

31OCT2010 444 Current

31DEC2010 444 Current

31JAN2011 444 1 - 30

28FEB2011 444 30 - 60

31MAR2011 444 60 - 90

30JUN2011 444 90 +

31JUL2011 444 NPNA

31AUG2011 444 NPNA

28FEB2010 555 30 - 60

31MAR2010 555 30 - 60

30APR2010 555 60 - 90

31MAY2010 555 NPNA

31AUG2011 555 90 +

30SEP2011 555 NPNA

31MAY2010 666 Current

31AUG2011 666 90 +

30SEP2011 666 Current

;

data want;

do until (last.account_number);

set a ; by account_number notsorted;

length _ab $10.;

if not missing(_ab) and _ab not in ('90 +','NPNA') and Arrears_Band='NPNA' then do;

_flag=1;

leave;

end;

_ab=Arrears_Band;

end;

do until (last.account_number);

set a ; by account_number notsorted;

if _flag ne 1 then output;

end;

drop _:;

run;

proc print;run;

Haikuo

Mirisage · Posted 07-05-2012 04:56 PM

Hi Hai,

Many Thanks.

Your code works well for the simplyfiled data set I have provided in the querry.

However, I have come across several other accounts in my original data set which is represented by account_number 777 in the below data set. It is not captured by the code. This account also has fallen directly from an arreres_band other than 90+.

So, account_number 777 also has to be removed because

I want to remove any Account_number that has fallen at least once to NPNA from an Arrears_Band other than 90+.

Thank you very much. Mirisage

data a;

informat Current_date date9.;

Input Current_date Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010 111 NPNA

31MAR2010 111 Current

30APR2010 111 30 - 60

28FEB2010 444 Current

31MAR2010 444 30 - 60

30APR2010 444 30 - 60

31MAY2010 444 Current

31OCT2010 444 Current

31DEC2010 444 Current

31JAN2011 444 1 - 30

28FEB2011 444 30 - 60

31MAR2011 444 60 - 90

30JUN2011 444 90 +

31JUL2011 444 NPNA

31AUG2011 444 NPNA

28FEB2010 555 30 - 60

31MAR2010 555 30 - 60

30APR2010 555 60 - 90

31MAY2010 555 NPNA

31AUG2011 555 90 +

30SEP2011 555 NPNA

31MAY2010 666 Current

31AUG2011 666 90 +

30SEP2011 666 Current

28FEB2010 777 Current

31MAR2010 777 Current

30APR2010 777 Current

31DEC2010 777 Current

31JAN2011 777 NPNA

28FEB2011 777 NPNA

31MAR2011 777 NPNA

;

run;

PGStats · Posted 07-03-2012 08:45 PM

Try this :

data a;
informat Current_date date9.;
Input Current_date   Account_number :$4. Arrears_Band $&10.;
Format Current_date date9.;
cards;
28FEB2010 111 NPNA
31MAR2010 111 Current
30APR2010 111 30 - 60
28FEB2010 444 Current
31MAR2010 444 30 - 60
30APR2010 444 30 - 60
31MAY2010 444 Current
31OCT2010 444 Current
31DEC2010 444 Current
31JAN2011 444 1 - 30
28FEB2011 444 30 - 60
31MAR2011 444 60 - 90
30JUN2011 444 90 +
31JUL2011 444 NPNA
31AUG2011 444 NPNA
28FEB2010 555 30 - 60
31MAR2010 555 30 - 60
30APR2010 555 60 - 90
31MAY2010 555 NPNA
31AUG2011 555 90 +
30SEP2011 555 NPNA
31MAY2010 666 Current
31AUG2011 666 90 +
30SEP2011 666 Current
28FEB2010 777 Current
31MAR2010 777 30 - 60
30APR2010 777 30 - 60
31MAY2010 777 Current
31OCT2010 777 Current
;

proc sort data=a;
by account_number Current_date;
run;

data nonJumpers(keep=account_number);
length lastBand $10;
do until(last.account_number);
     do until (last.Arrears_Band);
          set a;
          by account_number Arrears_Band notsorted;
          end;
     if Arrears_Band = "NPNA" and

lastBand ne "90 +" and

          not missing(lastBand) then drop = 1;
     lastBand = Arrears_Band;
     end;
if drop = 1 then delete;
run;

proc sql;
create table want as
select * from a
where account_number in (select * from nonJumpers);
drop table nonJumpers;
select * from want;
quit;

PG

Mirisage · Posted 07-05-2012 05:09 PM

Hi PGStats,

Thank you very much.

Your code work exactly.

I have added another possible account succession like below into the previous data set and applied this code. The code removes this account too which is correct because I want to remove any Account_number that has fallen at least once to NPNA from an Arrears_Band other than 90+.

data a;

informat Current_date date9.;

Input Current_date Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010 777 Current

31MAR2010 777 Current

30APR2010 777 Current

31DEC2010 777 Current

31JAN2011 777 NPNA

28FEB2011 777 NPNA

31MAR2011 777 NPNA

;

run;

When I applied the second part of your code to my original huge data set, it gives the message “will take a large amount of system resources” and got stuck.

So, I used the following approach which worked but never know my approach is efficient.

proc sort data=a;

by account_number;

run;

proc sort data=nonJumpers;

by account_number;

run;

data want;

merge a (in=a)

nonJumpers (in=b)

;

by account_number;

if a and b;

run;

Thank you again for your support.

Mirisage

Haikuo · Posted 07-05-2012 05:52 PM

Hi,

The code being provided did have a glitch, now try this one:

data a;

informat Current_date date9.;

Input Current_date Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010 111 NPNA

31MAR2010 111 Current

30APR2010 111 30 - 60

28FEB2010 444 Current

31MAR2010 444 30 - 60

30APR2010 444 30 - 60

31MAY2010 444 Current

31OCT2010 444 Current

31DEC2010 444 Current

31JAN2011 444 1 - 30

28FEB2011 444 30 - 60

31MAR2011 444 60 - 90

30JUN2011 444 90 +

31JUL2011 444 NPNA

31AUG2011 444 NPNA

28FEB2010 555 30 - 60

31MAR2010 555 30 - 60

30APR2010 555 60 - 90

31MAY2010 555 NPNA

31AUG2011 555 90 +

30SEP2011 555 NPNA

31MAY2010 666 Current

31AUG2011 666 90 +

30SEP2011 666 Current

28FEB2010 777 Current

31MAR2010 777 Current

30APR2010 777 Current

31DEC2010 777 Current

31JAN2011 777 NPNA

28FEB2011 777 NPNA

31MAR2011 777 NPNA

;

run;

data want;

do until (last.account_number);

set a ; by account_number notsorted;

length _ab $10.;

if first.account_number then call missing(_ab);

if not missing(_ab) and _ab not in ('90 +','NPNA') and Arrears_Band='NPNA' then do;

_flag=1;

end;

_ab=Arrears_Band;

if _flage=1 then leave;

end;

do until (last.account_number);

set a ; by account_number notsorted;

if _flag ne 1 then output;

end;

drop _:;

run;

proc print;run;

Please report back about the efficiency if you decide to use it on your real data.

Regards,

Haikuo

Mirisage · Posted 07-06-2012 01:11 PM

Hi Haikuo,

Your revised code is great!

I applied it into my original big data set and it is extremely efficicent.

In fact, I applied your code and PGStat's two codes into my big data set and then compared the final 3 results.

They are identical.

Thank you very much for this support.

Best regards

Mirisage

PGStats · Posted 07-05-2012 05:53 PM

Good idea. Note that dataset a is already sorted by account_number, and since dataset nonJumbers is created by scanning dataset a, it is also sorted.

But you might be able to get rid of the sorts altogether, as long as the dates are sorted in increasing order within each account_number, you could do:

data a;
informat Current_date date9.;
Input Current_date Account_number :$4. Arrears_Band $&10.;
Format Current_date date9.;
cards;
28FEB2010 111 NPNA
31MAR2010 111 Current
30APR2010 111 30 - 60
28FEB2010 555 30 - 60
31MAR2010 555 30 - 60
30APR2010 555 60 - 90
31MAY2010 555 NPNA
31AUG2011 555 90 +
30SEP2011 555 NPNA
28FEB2010 444 Current
31MAR2010 444 30 - 60
30APR2010 444 30 - 60
31MAY2010 444 Current
31OCT2010 444 Current
31DEC2010 444 Current
31JAN2011 444 1 - 30
28FEB2011 444 30 - 60
31MAR2011 444 60 - 90
30JUN2011 444 90 +
31JUL2011 444 NPNA
31AUG2011 444 NPNA
31MAY2010 666 Current
31AUG2011 666 90 +
30SEP2011 666 Current
28FEB2010 777 Current
31MAR2010 777 Current
30APR2010 777 Current
31DEC2010 777 Current
31JAN2011 777 NPNA
28FEB2011 777 NPNA
31MAR2011 777 NPNA
;

data nonJumpers(keep=account_number);
length lastBand $10;
do until(last.account_number);
     do until (last.Arrears_Band);
          set a;
          by account_number Arrears_Band notsorted;
          end;
     if Arrears_Band = "NPNA" and
          lastBand ne "90 +" and
          not missing(lastBand) then drop = 1;
     lastBand = Arrears_Band;
     end;
if drop = 1 then delete;
run;

/* You can't merge by notsorted but you can do this */

data want;
set nonJumpers(rename=(account_number=act));
do until(account_number=act and last.account_number);
     set a;
     by account_number notsorted;
     if account_number=act then output;
     end;
run;

PG

Mirisage · Posted 07-06-2012 01:15 PM

Hi PGStats,

Your alternative code which avoids sorting step is a magic!

I applied it into my original big data set and it runs pretty fast.

In fact, I applied your two codes and Haikuo's revised code into my big data set and then compared the final 3 results.

They are identical.

Big thank for your help.

Best regards

Mirisage

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away