Help using Base SAS procedures

Removing "direct jumping" customers in a time series?

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

Removing "direct jumping" customers in a time series?

Hi Collegues;

I have this data set - note it has 4 accounts.

data a;

informat Current_date date9.;

Input Current_date   Account_number $11-13 Arrears_Band $ 15-21;

Format Current_date date9.;

cards;

28FEB2010  111  NPNA

31MAR2010  111  Current

30APR2010  111  30 - 60

28FEB2010  444  Current

31MAR2010  444  30 - 60

30APR2010  444  30 - 60

31MAY2010  444  Current

31OCT2010  444  Current

31DEC2010  444  Current

31JAN2011  444  1 - 30

28FEB2011  444  30 - 60

31MAR2011  444  60 - 90

30JUN2011  444  90 +

31JUL2011  444  NPNA

31AUG2011  444  NPNA

28FEB2010  555  30 - 60

31MAR2010  555  30 - 60

30APR2010  555  60 - 90

31MAY2010  555  NPNA

31AUG2011  555  90 +

30SEP2011  555  NPNA

31MAY2010  666  Current

31AUG2011  666  90 +

30SEP2011  666  Current

;

run;

Question:

I want to remove any Account_number that has fallen at least once from a "non 90 +” Arrears_Band to NPNA

Answer

Based on this condition, Account_number 555 has to be removed.

My appraoch

1. I have used the below code helped by PG Stat to track

how each account has successively fallen across different Arrears_Bands.

2. Then visually spotted the account_numbers that satisfy the

above condition. 

3. Finally, I did a "sub setting if" step, using those spotted account_numbers.

But I have a huge data set which takes time to do this.

Could any one suggest an efficient alternative for the last step?

I used this approach

proc sort data=a;

by account_number Current_date;

run;

data cat(keep=account_number category);

length Category $200;

do until(last.account_number);

do until (last.Arrears_Band);

  set a;

  by account_number Arrears_Band notsorted;

end;

category = catx(" to ", category, Arrears_Band);

end;

run;

data b;

set a;

if account_number not in ('555');

run;

/*Thanks

Mirisage


Accepted Solutions
Solution
‎07-06-2012 01:11 PM
Super Contributor
Posts: 338

Re: Removing "direct jumping" customers in a time series?

Hi Haikuo,

Your revised code is great!

I applied it into my original big data set and it is extremely efficicent.

In fact, I applied your code and PGStat's two codes into my big data set and then compared the final 3 results.

They are identical.

Thank you very much for this support.

Best regards

Mirisage

View solution in original post


All Replies
Respected Advisor
Posts: 3,156

Re: Removing "direct jumping" customers in a time series?

Hi, To match your output, one condition needs to be modified: instead of just "non 90 +”, it should be 'non 90 +' and 'non NPNA'.

2X DOWs seems to do the job:


data a;

informat Current_date date9.;

Input Current_date   Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010  111  NPNA

31MAR2010  111  Current

30APR2010  111  30 - 60

28FEB2010  444  Current

31MAR2010  444  30 - 60

30APR2010  444  30 - 60

31MAY2010  444  Current

31OCT2010  444  Current

31DEC2010  444  Current

31JAN2011  444  1 - 30

28FEB2011  444  30 - 60

31MAR2011  444  60 - 90

30JUN2011  444  90 +

31JUL2011  444  NPNA

31AUG2011  444  NPNA

28FEB2010  555  30 - 60

31MAR2010  555  30 - 60

30APR2010  555  60 - 90

31MAY2010  555  NPNA

31AUG2011  555  90 +

30SEP2011  555  NPNA

31MAY2010  666  Current

31AUG2011  666  90 +

30SEP2011  666  Current

;

data want;

  do until (last.account_number);

    set a ;    by account_number notsorted;

    length _ab $10.;

    if not missing(_ab) and _ab not in ('90 +','NPNA') and  Arrears_Band='NPNA' then do;

      _flag=1;

      leave;

    end;

    _ab=Arrears_Band;

    end;

      do until (last.account_number);

    set a ;    by account_number notsorted;

    if _flag ne 1 then output;

    end;

drop _:;  

run;

proc print;run;

Haikuo


Super Contributor
Posts: 338

Re: Removing "direct jumping" customers in a time series?

Hi Hai,

Many Thanks.

Your code works well for the simplyfiled data set I have provided in the querry.

However, I have come across several other accounts in my original data set which is represented by account_number 777 in the below data set. It is not captured by the code. This account also has fallen directly from an arreres_band other than 90+.

So,  account_number 777 also has to be removed because

I want to remove any Account_number that has fallen at least once to NPNA from an Arrears_Band other than 90+.

Thank you very much. Mirisage

data a;

informat Current_date date9.;

Input Current_date   Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010  111  NPNA

31MAR2010  111  Current

30APR2010  111  30 - 60

28FEB2010  444  Current

31MAR2010  444  30 - 60

30APR2010  444  30 - 60

31MAY2010  444  Current

31OCT2010  444  Current

31DEC2010  444  Current

31JAN2011  444  1 - 30

28FEB2011  444  30 - 60

31MAR2011  444  60 - 90

30JUN2011  444  90 +

31JUL2011  444  NPNA

31AUG2011  444  NPNA

28FEB2010  555  30 - 60

31MAR2010  555  30 - 60

30APR2010  555  60 - 90

31MAY2010  555  NPNA

31AUG2011  555  90 +

30SEP2011  555  NPNA

31MAY2010  666  Current

31AUG2011  666  90 +

30SEP2011  666  Current

28FEB2010  777  Current

31MAR2010  777  Current

30APR2010  777  Current

31DEC2010  777  Current

31JAN2011  777  NPNA

28FEB2011  777  NPNA

31MAR2011  777  NPNA

;

run;

Respected Advisor
Posts: 4,927

Re: Removing "direct jumping" customers in a time series?

Try this :

data a;
informat Current_date date9.;
Input Current_date   Account_number :$4. Arrears_Band $&10.;
Format Current_date date9.;
cards;
28FEB2010  111  NPNA
31MAR2010  111  Current
30APR2010  111  30 - 60
28FEB2010  444  Current
31MAR2010  444  30 - 60
30APR2010  444  30 - 60
31MAY2010  444  Current
31OCT2010  444  Current
31DEC2010  444  Current
31JAN2011  444  1 - 30
28FEB2011  444  30 - 60
31MAR2011  444  60 - 90
30JUN2011  444  90 +
31JUL2011  444  NPNA
31AUG2011  444  NPNA
28FEB2010  555  30 - 60
31MAR2010  555  30 - 60
30APR2010  555  60 - 90
31MAY2010  555  NPNA
31AUG2011  555  90 +
30SEP2011  555  NPNA
31MAY2010  666  Current
31AUG2011  666  90 +
30SEP2011  666  Current
28FEB2010  777  Current
31MAR2010  777  30 - 60
30APR2010  777  30 - 60
31MAY2010  777  Current
31OCT2010  777  Current
;

proc sort data=a;
by account_number Current_date;
run;
 
data nonJumpers(keep=account_number);
length lastBand $10;
do until(last.account_number);
     do until (last.Arrears_Band);
          set a;
          by account_number Arrears_Band notsorted;
          end;
     if Arrears_Band = "NPNA" and

          lastBand ne "90 +" and

          not missing(lastBand) then drop = 1;
     lastBand = Arrears_Band;
     end;
if drop = 1 then delete;
run;

 

proc sql;
create table want as
     select * from a
     where account_number in (select * from nonJumpers);
drop table nonJumpers;
select * from want;
quit;

PG

PG
Super Contributor
Posts: 338

Re: Removing "direct jumping" customers in a time series?

Hi PGStats,

Thank you very much.

Your code work exactly.

I have added another possible account succession like below into the previous data set  and applied this code. The code removes this account too which is correct because I want to remove any Account_number that has fallen at least once to NPNA from an Arrears_Band other than 90+.

data a;

informat Current_date date9.;

Input Current_date   Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010 777 Current

31MAR2010 777 Current

30APR2010 777 Current

31DEC2010 777 Current

31JAN2011 777 NPNA

28FEB2011 777 NPNA

31MAR2011 777 NPNA

;

run;

When I applied the second part of your code to my original huge data set, it gives the message “will take a large amount of system resources” and got stuck.

So, I used the following approach which worked but never know my approach is efficient.

proc sort data=a;

by account_number;

run;

proc sort data=nonJumpers;

by account_number;

run;

data want;

merge a (in=a)

       nonJumpers  (in=b)

        ;

by account_number;

if a and b;

run;

Thank you again for your support.

Mirisage

Respected Advisor
Posts: 3,156

Re: Removing "direct jumping" customers in a time series?

Hi,

The code being provided did have a glitch, now try this one:

data a;

informat Current_date date9.;

Input Current_date   Account_number :$4. Arrears_Band $&10.;

Format Current_date date9.;

cards;

28FEB2010  111  NPNA

31MAR2010  111  Current

30APR2010  111  30 - 60

28FEB2010  444  Current

31MAR2010  444  30 - 60

30APR2010  444  30 - 60

31MAY2010  444  Current

31OCT2010  444  Current

31DEC2010  444  Current

31JAN2011  444  1 - 30

28FEB2011  444  30 - 60

31MAR2011  444  60 - 90

30JUN2011  444  90 +

31JUL2011  444  NPNA

31AUG2011  444  NPNA

28FEB2010  555  30 - 60

31MAR2010  555  30 - 60

30APR2010  555  60 - 90

31MAY2010  555  NPNA

31AUG2011  555  90 +

30SEP2011  555  NPNA

31MAY2010  666  Current

31AUG2011  666  90 +

30SEP2011  666  Current

28FEB2010  777  Current

31MAR2010  777  Current

30APR2010  777  Current

31DEC2010  777  Current

31JAN2011  777  NPNA

28FEB2011  777  NPNA

31MAR2011  777  NPNA

;

run;

data want;

  do until (last.account_number);

    set a ;    by account_number notsorted;

    length _ab $10.;

    if first.account_number then call missing(_ab);

    if not missing(_ab) and _ab not in ('90 +','NPNA') and  Arrears_Band='NPNA' then do;

      _flag=1;

    end;

    _ab=Arrears_Band;

    if _flage=1 then leave;

    end;

      do until (last.account_number);

    set a ;    by account_number notsorted;

    if _flag ne 1 then output;

    end;

drop _:; 

run;

proc print;run;

Please report back about the efficiency if you decide to use it on your real data.

Regards,

Haikuo

Solution
‎07-06-2012 01:11 PM
Super Contributor
Posts: 338

Re: Removing "direct jumping" customers in a time series?

Hi Haikuo,

Your revised code is great!

I applied it into my original big data set and it is extremely efficicent.

In fact, I applied your code and PGStat's two codes into my big data set and then compared the final 3 results.

They are identical.

Thank you very much for this support.

Best regards

Mirisage

Respected Advisor
Posts: 4,927

Re: Removing "direct jumping" customers in a time series?

Good idea. Note that dataset a is already sorted by account_number, and since dataset nonJumbers is created by scanning dataset a, it is also sorted.

But you might be able to get rid of the sorts altogether, as long as the dates are sorted in increasing order within each account_number, you could do:

data a;
informat Current_date date9.;
Input Current_date   Account_number :$4. Arrears_Band $&10.;
Format Current_date date9.;
cards;
28FEB2010  111  NPNA
31MAR2010  111  Current
30APR2010  111  30 - 60
28FEB2010  555  30 - 60
31MAR2010  555  30 - 60
30APR2010  555  60 - 90
31MAY2010  555  NPNA
31AUG2011  555  90 +
30SEP2011  555  NPNA
28FEB2010  444  Current
31MAR2010  444  30 - 60
30APR2010  444  30 - 60
31MAY2010  444  Current
31OCT2010  444  Current
31DEC2010  444  Current
31JAN2011  444  1 - 30
28FEB2011  444  30 - 60
31MAR2011  444  60 - 90
30JUN2011  444  90 +
31JUL2011  444  NPNA
31AUG2011  444  NPNA
31MAY2010  666  Current
31AUG2011  666  90 +
30SEP2011  666  Current
28FEB2010  777 Current
31MAR2010  777 Current
30APR2010  777 Current
31DEC2010  777 Current
31JAN2011  777 NPNA
28FEB2011  777 NPNA
31MAR2011  777 NPNA
;

data nonJumpers(keep=account_number);
length lastBand $10;
do until(last.account_number);
     do until (last.Arrears_Band);
          set a;
          by account_number Arrears_Band notsorted;
          end;
     if Arrears_Band = "NPNA" and
          lastBand ne "90 +" and
          not missing(lastBand) then drop = 1;
     lastBand = Arrears_Band;
     end;
if drop = 1 then delete;
run;

 

/* You can't merge by notsorted but you can do this */

data want;
set nonJumpers(rename=(account_number=act));
do until(account_number=act and last.account_number);
     set a;
     by account_number notsorted;
     if account_number=act then output;
     end;
run;

PG

PG
Super Contributor
Posts: 338

Re: Removing "direct jumping" customers in a time series?

Hi PGStats,

Your alternative code which avoids sorting step is a magic!

I applied it into my original big data set and it runs pretty fast.

In fact, I applied your two codes and Haikuo's revised code into my big data set and then compared the final 3 results.

They are identical.

Big thank for your help.

Best regards

Mirisage

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 264 views
  • 3 likes
  • 3 in conversation