Help using Base SAS procedures

Why this hash code works for a single bank but not for 2 banks ?

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

Why this hash code works for a single bank but not for 2 banks ?

Hi SAS Discussion Forum,

Attached longitudinal data set_1 shows how 5 accounts in a single bank (bank_number=10) is

swinging from one delinquency bucket to the other  (original data set has over million

records and 9 banks).

Alpey has generously created the hash code below which cleverly returns the sum of variable "balance" for each arrears_band

grouped by current_date subject to a few conditions (conditions are presented at the end to avoid the reader getting sick).

Hash code (by Alpey, 2012)

data temp;

    if _n_ = 1 then do;

        declare hash h(ordered:'a');

        h.defineKey('Current_Date');

        h.defineData('Current_Date','Bank_Number','Account_Number','Balance','Product','Arrears_Band','NPNA_Start_Month');

        h.defineDone();

        declare hiter hi('h');

    end;

    NPNA_Start_Month = .;

    do until(last.Account_Number);

        set b; /*inputting our data set*/

        where  Product IN ('Personal OD','Personal Loan','Res. Mortgage') and

        Bank_number = 10; /*when have more than one bank, we can remove this condition */

        by Account_Number;

        if Arrears_Band = 'NPNA' and NPNA_Start_Month = . then NPNA_Start_Month = Current_Date;

        else if Arrears_Band ne 'NPNA' then NPNA_Start_Month = .;

        h.replace();

    end;

    Last_NPNA_Start_Month = NPNA_Start_Month; /* Added 6/18/2012 */

    rc = hi.first();

    do while(rc=0);

        Current = 0;

        One_to_30 = 0;

        Thirty_to_60 = 0;

        Sixty_to_90 = 0;

        Ninety_plus = 0;

        NPNA = .;

        if NPNA_Start_Month = . or (

                NPNA_Start_Month = Current_Date and /* NPNA_Start_Month > '28Feb2010'd */

                Last_NPNA_Start_Month ne '28Feb2010'd) then do;

            select (Arrears_Band);

                when ('Current') do;Current = Balance;end;

                when ('1 - 30') do;One_to_30 = Balance;end;

                when ('30 - 60') do;Thirty_to_60 = Balance;end;

                when ('60 - 90') do;Sixty_to_90 = Balance;end;

                when ('90 +') do;Ninety_plus = Balance;end;

                when ('NPNA') do;NPNA = Balance;end;

            end;

        end;

        output;

        rc = hi.next();

    end;

    h.clear();

run;

proc summary data=temp nway missing;

    class Current_Date;

    var Current One_to_30 Thirty_to_60 Sixty_to_90 Ninety_plus NPNA;

    output out=want(drop=_Smiley Happy

    sum=;

run;

Applying above code

Step 1:

I sorted the set_1 data set:

proc sort data=set_1 out = b;

  by account_number  current_date;

run;

Step 2:

Then ran the hash code.

Step 3:

It generated the following results which is very correct given my conditions.

These are the results:

Obs  Current_date    Current    1_to_30     30_to_60      60_to_90     90+     NPNA (these are vairable headings)

1

28FEB2010

24

0

0

0

0

25

2

31MAR2010

25

23

0

0

0

.

3

30APR2010

25

0

23

0

0

.

4

31MAY2010

25

0

0

23

0

.

5

30JUN2010

0

0

0

0

23

.

6

31JUL2010

0

0

0

0

0

85

7

31DEC2010

0

0

0

0

0

.

8

31JAN2011

25

18

0

0

0

.

9

28FEB2011

0

0

0

0

0

43

10

31MAR2011

0

0

0

0

0

.

11

30APR2011

0

0

0

0

0

.

12

31MAY2011

85

0

0

0

0

.

New problem:

I added another bank (bank_number=20) to the set_1 and populated the same data values found in bank=10

in set_1 for the new bank too, and created a new data set (see attached data set_2).

This time I should do the sorting taking "bank_number account_number  current_date" as sorting variables. Why? Because these are the 3 variables that would uniquely identify a record (this is sure!). 

proc sort data=set_2 out = c;

  by bank_number account_number  current_date;

run;

Question:

When I now try to apply the hash code for data set "c", log gives this error message.

"ERROR: BY variables are not properly sorted on data set WORK.C."

Could an expert please help me to tweak this hash code to avoid this problem?

(may be we need to incorporate my new sorting order into hash code which

is "bank_number account_number  current_date"

One more thing, if I just sort the data set_2 by "account_number  current_date" and apply the hash code, then the replicated bank_number 20's data are ignored from summing up.

(Optional reading)

These are the conditions Hash code cleverly captures

Condition 1:

"Balances corresponding to NPNA Arrears_Band have to be summed up for only those accounts that have fallen into NPNA status after

28FEB2010. NPNA balances in 28FEB2010 and in all subsequent months of those accounts that were already in NPNA status

in 28FEB2010 should be omitted from summing up".

  1. E.g. In bank number 10, Account 1111111111 's

NPNA balance should be excluded from summing up

in 28FEB2010 and in all subsequent months because it was already in NPNA status when we begin our modeling period in 28FEB2010 and this account continued to be NPNA throughout its life.

Condition 2:

However, if an account has started with "NPNA" status from 28FEB2010 but it has not continued its NPNA status throughout its life, then these type of accounts should be excluded from the

condition 1 above although

these accounts also have started with "NPNA" status from 28FEB2010 itself.

E.g. A/C 4444444444 has started its life with "arreres_band=NPNA in 28FEB2010 but then improved in 31Jan2011 into “1-30” arrears  bucket.  Again fallen into NPNA delq. bucket in

28FEB2011. Then again improved to "current" in 31May2011.

So, this accounts NPNA balance in 28Feb2010 has to be taken for summing up.

And its NPNA balance in 28Feb2011 has to be taken for summing up.

Your help is greatly apprecaited!

Mirisage

Attachment
Attachment

Accepted Solutions
Solution
‎08-23-2012 11:58 AM
Occasional Contributor
Posts: 9

Re: Why this hash code works for a single bank but not for 2 banks ?

Mirisage:

The code you present has a DO UNTIL (LAST.ACCOUNT_NUMBER) loop in which you accrete the last occuring record for each CURRENT_DATE and then output the balance from each date.

When you add data for a second bank, there is no guarantee the account numbers will be sequential across ALL banks, thus you can receive the "not properly sorted" log message.  Try BY BANK ACCOUNT_NUMBER inside the first loop, or BY ACCOUNT_NUMBER NOTSORTED if there is a guarantee that account numbers are distinct across all banks.

With only a million records, the use of hash may be hiding more of the processing logic than desired.  Additionally the data step is transposing the band data into metadata as a column name.  Many coder prefer highly categorical data and present the transpositions via the reporting procs REPORT or TABULATE, or as a data processing step if use proc TRANSPOSE.

Conversely, if you go with hash, you might want to do the summary in a hash as well.

Why is NPNA missing for 31MAR2011 and 30APR2011.

According to attached SET_1, both accounts 4444 and 5555 fell into NPNA on 28FEB2011

Because the business logic is the most important feature, my recommendation is to rewrite to code without the hash, for example:


* libname MIRISAGE 'C:\Users\Richard\Downloads';

ods chtml file="%sysfunc(pathname(WORK))\sample.html" style=journal;

* format $band. will be used in tabulate and will control the order in which the categorical columns are layed out;

proc format ;
  value $band (notsorted)
    'Current' = 'Current'
    '1 - 30'  = '1 - 30'
    '30 - 60' = '30 - 60'
    '60 - 90' = '60 - 90'
    '90 +'    = '90 +'
    'NPNA'    = 'NPNA'
  ;
run;

* view performs business logic of transactional sequence evaluation for computing which NPNA balances are summed;

data evaluated_transactions / view=evaluated_transactions;
  length prior_band $20;
  prior_band = '';
  became_npna = .;
  do _n_ = 1 by 1 until (last.account_number);
    set mirisage.set_1;
    by bank_number account_number;
    if current_date > '28feb2010'd and prior_band ne arrears_band then do;
      if prior_band ne '' and arrears_band = 'NPNA' then became_npna = current_date;
      prior_band = arrears_band;
    end;
  end;

  do _n_ = 1 to _n_;
    set mirisage.set_1;
    if arrears_band = 'NPNA' and became_npna <= '28feb2010'd then balance = .;
    output;
  end;

  drop prior_band;
  format became_npna date9.;
run;

* use tabulate to report the data;

proc tabulate data=evaluated_transactions;
  where current_date >= '28feb2010'd;
  class current_date;
  class arrears_band / PRELOADFMT order=data ;
  var balance;

    * tweak this, balance*N shows how many balances were set to missing by business logic;
  table current_date, arrears_band*(N balance*SUM balance*NMISS);

  format arrears_band $band.;
run;

ods _all_ close;

Save the hashing for unusual lookup conditions, high-speed master detail lookups and when necessary 'extreme optimization' requirement situations.

Happy Coding,

Richard A. DeVenezia

View solution in original post


All Replies
Solution
‎08-23-2012 11:58 AM
Occasional Contributor
Posts: 9

Re: Why this hash code works for a single bank but not for 2 banks ?

Mirisage:

The code you present has a DO UNTIL (LAST.ACCOUNT_NUMBER) loop in which you accrete the last occuring record for each CURRENT_DATE and then output the balance from each date.

When you add data for a second bank, there is no guarantee the account numbers will be sequential across ALL banks, thus you can receive the "not properly sorted" log message.  Try BY BANK ACCOUNT_NUMBER inside the first loop, or BY ACCOUNT_NUMBER NOTSORTED if there is a guarantee that account numbers are distinct across all banks.

With only a million records, the use of hash may be hiding more of the processing logic than desired.  Additionally the data step is transposing the band data into metadata as a column name.  Many coder prefer highly categorical data and present the transpositions via the reporting procs REPORT or TABULATE, or as a data processing step if use proc TRANSPOSE.

Conversely, if you go with hash, you might want to do the summary in a hash as well.

Why is NPNA missing for 31MAR2011 and 30APR2011.

According to attached SET_1, both accounts 4444 and 5555 fell into NPNA on 28FEB2011

Because the business logic is the most important feature, my recommendation is to rewrite to code without the hash, for example:


* libname MIRISAGE 'C:\Users\Richard\Downloads';

ods chtml file="%sysfunc(pathname(WORK))\sample.html" style=journal;

* format $band. will be used in tabulate and will control the order in which the categorical columns are layed out;

proc format ;
  value $band (notsorted)
    'Current' = 'Current'
    '1 - 30'  = '1 - 30'
    '30 - 60' = '30 - 60'
    '60 - 90' = '60 - 90'
    '90 +'    = '90 +'
    'NPNA'    = 'NPNA'
  ;
run;

* view performs business logic of transactional sequence evaluation for computing which NPNA balances are summed;

data evaluated_transactions / view=evaluated_transactions;
  length prior_band $20;
  prior_band = '';
  became_npna = .;
  do _n_ = 1 by 1 until (last.account_number);
    set mirisage.set_1;
    by bank_number account_number;
    if current_date > '28feb2010'd and prior_band ne arrears_band then do;
      if prior_band ne '' and arrears_band = 'NPNA' then became_npna = current_date;
      prior_band = arrears_band;
    end;
  end;

  do _n_ = 1 to _n_;
    set mirisage.set_1;
    if arrears_band = 'NPNA' and became_npna <= '28feb2010'd then balance = .;
    output;
  end;

  drop prior_band;
  format became_npna date9.;
run;

* use tabulate to report the data;

proc tabulate data=evaluated_transactions;
  where current_date >= '28feb2010'd;
  class current_date;
  class arrears_band / PRELOADFMT order=data ;
  var balance;

    * tweak this, balance*N shows how many balances were set to missing by business logic;
  table current_date, arrears_band*(N balance*SUM balance*NMISS);

  format arrears_band $band.;
run;

ods _all_ close;

Save the hashing for unusual lookup conditions, high-speed master detail lookups and when necessary 'extreme optimization' requirement situations.

Happy Coding,

Richard A. DeVenezia

Super Contributor
Posts: 338

Re: Why this hash code works for a single bank but not for 2 banks ?

Posted in reply to RichardADeVenezia

Hi Richard,

Thank you so much for this alternative code which works very well except in one "condition" that is yet to be satisfied.

If you could help me to tweak your code to satisfy this condition it would be a great help as I am in a tight deadline.

This was raised in your previous response too like this "Why is NPNA missing for 31MAR2011 and 30APR2011 .

According to attached SET_1, both accounts 4444 and 5555 fell into NPNA on 28FEB2011"

The reason is this.

I provide below two pictures to show how the business logic neccesitates me to treat NPNA balances for summing up across months.

AsTable 6 shows, if an account continues its entire life in NPNA bucket, then all NPNA dollars should be igonred from summing up (I striked off all $ s

in yellow column). - this condition is captured by your code - thanks

Table 6: Continuing NPNAs

Current_date

Delinquency Cycle

Current

1-30

30-60

60-90

90+

NPNA

28-Feb-10

0

0

0

0

0

$ X

31-Mar-10

0

0

0

0

0

$ X

30-Apr-10

0

0

0

0

0

$ X

31-May-10

0

0

0

0

0

$ X

30-Jun-10

0

0

0

0

0

$ X

31-Jul-10

0

0

0

0

0

$ X

31-Aug-10

0

0

0

0

0

$ X

30-Sep-10

0

0

0

0

0

$ X

31-Oct-10

0

0

0

0

0

$ X

30-Nov-10

0

0

0

0

0

$ X

31-Dec-10

0

0

0

0

0

$ X

31-Jan-11

0

0

0

0

0

$ X

28-Feb-11

0

0

0

0

0

$ X

31-Mar-11

0

0

0

0

0

$ X

30-Apr-11

0

0

0

0

0

$ X

31-May-11

0

0

0

0

0

$ X

30-Jun-11

0

0

0

0

0

$ X

31-Jul-11

0

0

0

0

0

$ X

31-Aug-11

0

0

0

0

0

$ X

30-Sep-11

0

0

0

0

0

$ X

31-Oct-11

0

0

0

0

0

$ X

30-Nov-11

0

0

0

0

0

$ X

31-Dec-11

0

0

0

0

0

$ X

31-Jan-12

0

0

0

0

0

$ X

Table 7 shows how to treat NPNA $ s in other situations which is not captured by the code.

Stricken off NPNA $ s in the yellow column should be excluded from summing up while only non-stricekn off

$ s in yellow coulmn should be summed up (this is yet to be incorproated into the code).

I wonder if you could help me to tweak your code to incorparate this condition for NPNA $ summing up (last request).

Table 7:      Continuing NPNAs that cured, fallen into NPNA again, cured and fallen back into NPNA eventually

Current_date

Delinquency Cycle

Current

1-30

30-60

60-90

90+

NPNA

28-Feb-10

0

0

0

0

0

$6248

31-Mar-10

0

0

0

0

0

$6248

30-Apr-10

0

0

0

0

0

$6248

31-May-10

0

0

0

0

0

$6248

30-Jun-10

$4830

0

0

0

0

.

31-Jul-10

0

0

$4830

0

0

.

31-Aug-10

0

0

$4830

0

0

.

30-Sep-10

0

0

0

$4830

0

.

31-Oct-10

0

0

0

0

0

$4831

30-Nov-10

$4831

0

0

0

0

31-Dec-10

0

0

0

0

0

$4831

31-Jan-11

0

0

0

0

0

$4831

28-Feb-11

$2630

0

0

0

0

.

31-Mar-11

0

$2630

0

0

0

.

30-Apr-11

0

0

$2630

0

0

.

31-May-11

0

$2630

0

0

0

.

30-Jun-11

$2630

0

0

0

0

.

31-Jul-11

0

0

0

0

$2630

31-Aug-11

0

0

0

$2630

0

.

30-Sep-11

0

0

0

0

0

$247

31-Oct-11

0

0

0

0

0

$247

30-Nov-11

0

0

0

0

0

$247

31-Dec-11

0

0

0

0

0

$247

Thank you for your time and expertise.

Regards

Mirisage

Occasional Contributor
Posts: 9

Re: Why this hash code works for a single bank but not for 2 banks ?

Mirisage:


I speculate that your updated business logic boils down to two simple processing rules:

  • Only the first of consecutive NPNA banded transactions in an account are reportable.
  • Create a summary report for transactions >= 28feb2010
* data must be sorted by bank_number, account_number and current_date;
data evaluated_transactions / view=evaluated_transactions;
   do _n_ = 1 by 1 until (last.account_number);
     set mirisage.set_1;
     by bank_number account_number arrears_band notsorted;
     if arrears_band = 'NPNA' then do;
       if first.arrears_band then do;
         if current_date < '28feb2010'd then balance = .;
       end;
       else
         balance = .;
     end;
     output;
   end;
run;

Use the same tabulate as shown in prior reply.

Super Contributor
Posts: 338

Re: Why this hash code works for a single bank but not for 2 banks ?

Posted in reply to RichardADeVenezia
Hi RichardADeVenezia,

Thank you very much for this great help.

The code works marvelously!

I am reading many lireature to grasp the expertise you have applied in this discussion series.

Again thanks, for reading my long "posts", comprehending them, and then provising coding support.

Warm regards

Mirisage

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 335 views
  • 3 likes
  • 2 in conversation